Doug McRae is a retired educational measurement specialist who has served as an educational testing company executive in charge of design and development of K-12 tests widely used across the United States, as well as an adviser on the initial design and development of California’s STAR assessment system. He’s knows just a little something about assessment development.
He has some concerns with Smarter Balanced he writes in EdSource. An excerpt:
I did the online exercise for grade 3 English Language Arts, and for this grade level and content area traditional multiple-choice questions dominated. In fact, 84 % of the questions were either multiple-choice or “check-the-box” questions that could be electronically scored, and these questions were very similar or identical to traditional “bubble” tests. Only 16 percent of the questions were open-ended questions, which many observers say are needed to measure Common Core standards.
The online exercise used a set of test items with the questions arranged in sequence by order of difficulty, from easy questions to hard questions. The exercise asked the participant to identify the first item in the sequence that a Category 3 or B-minus student would have less than a 50 percent chance to answer correctly. I identified that item after reviewing about 25 percent of the items to be reviewed. If a Category 3 or proficient cut score is set at only 25 percent of the available items or score points for a test that has primarily multiple-choice questions, clearly that cut score invites a strategy of randomly marking the answer sheet. The odds are that if a student uses a random marking strategy, he or she will get a proficient score quite often. This circumstance would result in many random (or invalid and unreliable) scores from the test, and reduce the overall credibility of the entire testing program.
It troubled me greatly that many of the test questions later in the sequence appeared to be far easier than the item I identified as the item marking a Category 3 or proficient cut score, per the directions for the online exercise. I found at least a quarter of the remaining items to be easier, including a cluster of clearly easier items placed about 2/3 of the way into the entire sequence. This calls into question whether or not the sequence of test questions used by Smarter Balanced was indeed in difficulty order from easy to hard items. If the sequence used was not strictly ordered from easy to hard test questions, then the results of the entire exercise have to be called into serious question.
There were several additional concerns about the Smarter Balanced cut-score-setting exercise this October that are too technical for full discussion in this commentary. Briefly, the exercise appeared not to include any use of “consequence” data that typically is included in a robust cut-score-setting process. Consequence data is estimated information on what percent of students will fall in each performance category, given the cut scores being recommended. I also questioned whether the spring 2014 Smarter Balanced field test data were used to guide the exercise in any significant way. Indeed, since the 2014 Smarter Balanced field test was essentially an item-tryout exercise, an exercise designed to qualify test questions for use in final tests, it did not generate the type of data needed for final cut score determinations in a number of significant ways.