This PhD thesis was motivated by practical challenges in implementing and validating a vertical Rasch scale to measure students’ mathematics abilities throughout compulsory school in Northwestern Switzerland. The goal of this vertical scale is to provide third- through ninth-grade students with objective, reliable, and valid assessment reports based on two different assessment instruments. Chapters 1 and 2 introduce the two assessment instruments. By integrating their similarities and differences with the theoretical background on data-collection designs and item calibration within a Rasch framework, a four-step item calibration process is proposed to establish a vertical scale and link the two instruments. Subsequently, three studies are presented that examine specific aspects of the implementation and validation of the vertical scale. The first study (Chapter 3) investigates through simulations whether calibration efficiency under the Rasch model can be enhanced through targeted multistage calibration designs, which consider ability-related background variables and performance when assigning suitable items to students. Furthermore, it evaluates whether uncertainty about item difficulty could impair the assembly of an efficient calibration design. The second study (Chapter 4) shifts the focus from efficient item calibration to efficient ability estimation. Through simulations, the efficiency of a targeted multistage test design is compared with that of a traditional targeted test design and a multistage test design. The study also analyzes the extent to which each design’s efficiency depends on the correlation between the ability-related background variable and students’ true abilities, each student’s ability level and categorization into an ability group, and the length of the starting module. The third study (Chapter 5) is based on data from preliminary calibration assessments for establishing the vertical scale.
The psychometric properties of the scale are examined through item analysis and by comparing concurrent and grade-by-grade calibration procedures. The content-related validity of the scale is evaluated by contrasting the empirical item difficulty estimates with the content-related item difficulties reflected in the underlying competence levels of the curriculum. In conclusion, this PhD thesis supports the justification for an assessment system that offers a unique opportunity to monitor students’ learning trajectories throughout compulsory school.