When we ask if a test is valid, it is critical to ask: valid for what purpose? The report on the validity of the Florida FSA test was released today. It presented findings from six studies conducted by Alpine Testing Solutions and EdCount LLC.
Because there were six sets of results and conclusions, the researchers did not state that the test as a whole was valid. Let’s look at each study’s conclusions. Even more interesting are the results of the analyses taken together, presented as Cross Study Conclusions.
The study found reasons to be cautious about some of the uses of the scores. The researchers reported that the Florida DOE did not intend to report scale scores or performance standards this first year. Students would receive a percentile ranking, the number of points earned out of the number possible, and the average number of points statewide by category. Interim cut scores for ELA grade 3 and Algebra I would be set by percentile equating based on the FCAT cut scores.
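For illustration, carrying a cut score from an old test to a new one by percentile equating amounts to finding the percentile of the old cut on the old score distribution, then taking the new-test score at that same percentile. Here is a minimal sketch using hypothetical simulated distributions (the actual FCAT-to-FSA procedure is more involved; the data and numbers below are invented for illustration only):

```python
import numpy as np

def percentile_equate(old_scores, new_scores, old_cut):
    """Map a cut score on the old test to the new test by matching percentiles."""
    # Percent of old-test scores falling below the old cut score.
    pct = np.mean(np.array(old_scores) < old_cut) * 100
    # Score on the new test at that same percentile.
    return float(np.percentile(new_scores, pct))

# Hypothetical data: the new test is harder, so raw scores run lower overall.
rng = np.random.default_rng(0)
old_test = rng.normal(60, 10, 5000)  # stand-in for old-test scores
new_test = rng.normal(50, 10, 5000)  # stand-in for new-test scores

equated_cut = percentile_equate(old_test, new_test, old_cut=55)
print(round(equated_cut, 1))
```

Because the new distribution sits lower, the equated cut lands below 55: the same proportion of students falls below each cut on its own test, which is the property percentile equating preserves.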
The final report can be found here.
Here are the key points from each topic studied:
- EVALUATION OF TEST ITEMS. While the test construction procedures were in accordance with professional practice, test items from the Utah field test were not consistently aligned with Florida Standards. These items should be phased out. Florida should conduct an independent study of all items to ensure they are aligned with the standards. (If they are not aligned, scores should not be reported as performance measures of the standards.) Florida should conduct studies of how students engage with the questions. (Their approaches to critical reading and problem solving need to be determined to ensure that the questions are eliciting the appropriate thinking processes.)
- EVALUATION OF FIELD TESTING. Item statistical review procedures were appropriate, but the documentation was incomplete.
- TEST BLUEPRINT AND CONSTRUCTION. Documentation of procedures was incomplete, due in part to time constraints. The cognitive complexity of questions was not aligned with the test blueprints, so the complexity of questions could vary across forms. In addition, the complexity of questions might not align with the individual standards for which they were written. Insufficient evidence was provided regarding test consequences and the corresponding review of score-reporting materials.
- TEST ADMINISTRATION. Test administration was problematic: it did not meet the rigor and standardization expected of a high-stakes assessment program like the FSA.
- SCALING, EQUATING AND SCORING. Procedures were appropriate, but documentation was incomplete.
- SPECIFIC PSYCHOMETRIC CONCERNS. While the equating of the grade 10 ELA FSA to the FCAT was nonstandard, it was an acceptable solution. When passing standards are not set through a formal standard-setting process, their limitations need to be clearly outlined. (Passing standards on a more difficult test may require fewer items correct than on the easier test to which it was equated, which compromises the meaning of the scores with respect to proficiency.)
The validity of using the scores for particular purposes must be based on an analysis of the combined impact of the six studies above. As the cross study conclusions show, conclusions about individual student scores, and about group scores used for teacher and school grades, carry some limitations.
CROSS STUDY CONCLUSIONS
Paper-and-pencil administration scores can be used for individual students. Computer-based FSA scores should not be used as the sole measure for promotion, graduation, or placement in remediation courses. The possible impact on group scores should be evaluated in situations where a number of students had invalid tests. (For example, a teacher evaluation based on scores from a group of students whose test administrations were compromised could be invalidated.)
ITEM CONTENT MATCH WITH STANDARDS. The conclusions of the six individual studies are limited by the lack of data available within the short time frame. Substantial information about test content, however, was available. Due to time constraints, item reviews for bias, complexity, and content were done initially by Florida DOE staff. The researchers conducted a separate independent review of how well the test items aligned with the standards. If the panels of content specialists were not able to match items with the correct standard, then scores reported for that standard would not have meaning.
Across grade levels, 65% to 76% of ELA items corresponded to the appropriate standards, with the highest percentage in grade six. In mathematics, agreement between items and standards ranged from 79% to 94% across grade levels, with the highest match rate on the grade four test.
COGNITIVE COMPLEXITY RATING MATCH. The FSA was designed to be a test of critical thinking and problem solving skills. The researchers evaluated the cognitive complexity of test questions on a one to four scale and compared their ratings to the intended level of complexity established by the Florida DOE. The intended complexity level was called Depth of Knowledge (DOK). The results were:
DOK ratings were slightly lower than intended in math because a number of items intended to reflect level 2 were rated at level 1. Thirty-six percent of the math items were rated below the intended level.
DOK ratings were slightly higher than intended in ELA because many items intended to reflect level 2 were rated at level 3. Thirty-seven percent of the ELA items were rated above the intended level.
Despite the differences between the DOE’s ratings and the independent panelists’, the researchers state that agreement on the ratings was strong. Nevertheless, they qualified their recommendations on the use of the scores, citing item content ratings, a problematic test administration, and the lack of documentation for scaling, equating, and scoring.