Test Scores and Teacher Evals: A Complex Controversy Explained
What started the movement to evaluate teacher performance by test scores?
Has the move to evaluate teachers based on test scores created a backlash?
Are all teachers now evaluated on test scores?
What are value-added measures, or VAM?
Are teachers who don’t receive VAM scores still evaluated based on tests?

Group measures of performance, in which teachers are evaluated based on test scores of students or subjects they don’t teach. A common example is teachers being judged on the entire school’s math or English score even if they teach, say, art. This occurred in Florida, Tennessee, New Mexico, and New York and generated significant controversy.

Student learning objectives (SLOs), in which teachers set goals for student performance on a test, either one they create themselves or a standardized one. The goals are approved by their supervisor, who then assesses the teacher based on how well the students meet those goals. One study of schools in Austin, Texas, found no correlation between a teacher’s SLO score and his or her VAM score, while another study in Denver, Colorado, found a moderate correlation. These results may arise because SLOs and VAMs assess different aspects of teacher quality, but they might also call into question whether SLOs are valid measures of teacher performance.
Are there different types of growth models?
VAMs are among the most common. Another common model is known as student growth percentile, which, like VAM, measures student test score growth, but with a different mathematical technique. These models rank students with similar prior achievement based on how much growth they make. Such models, unlike VAM, often do not include controls for student characteristics like poverty, and so may unfairly disadvantage teachers of at-risk students.
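To make the ranking idea concrete, here is a rough sketch in Python. It is a deliberate simplification, not the method any state actually uses: it groups students into bands of similar prior scores and reports where each student's current score falls among those peers. The function name, the band width, and the sample scores are all hypothetical.

```python
from collections import defaultdict


def growth_percentiles(students, band_width=10):
    """Rank each student's current score against peers with similar prior scores.

    students: list of (name, prior_score, current_score) tuples.
    Returns a dict mapping name -> percentile (0 to 100) within the peer band.
    Simplifying assumption: "similar prior achievement" means the prior score
    falls in the same fixed-width band.
    """
    bands = defaultdict(list)
    for name, prior, current in students:
        bands[prior // band_width].append((name, current))

    result = {}
    for peers in bands.values():
        scores = [current for _, current in peers]
        for name, current in peers:
            below = sum(1 for s in scores if s < current)
            equal = sum(1 for s in scores if s == current)
            # Mid-rank percentile: ties count as half below.
            result[name] = 100.0 * (below + 0.5 * equal) / len(scores)
    return result


# Hypothetical data: (name, last year's score, this year's score)
sample = [
    ("a", 41, 50),
    ("b", 45, 60),
    ("c", 48, 55),
    ("d", 82, 85),
    ("e", 88, 80),
]
percentiles = growth_percentiles(sample)
for name in sorted(percentiles):
    print(name, round(percentiles[name], 1))
```

In this made-up data, student "e" has a higher current score than student "b" but a lower growth percentile, because each student is compared only with peers who started from a similar place. Note that nothing in the sketch adjusts for student characteristics such as poverty.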
Different VAMs also use different variables and demographic factors to create students’ estimated scores. In general, models that account for more student characteristics do a better job of ensuring a level playing field for teachers of academically challenged students.
Some models compare teachers only to other teachers in the same school, though most compare teachers across a given state.
Generally, different models produce at least somewhat similar results.
What are some potential uses of VAM?
What are some of the arguments for and against using VAM in teacher evaluation?
Is VAM a valid measure of teacher performance?
Note that ‘validity’ here is used in the statistical sense: how well a measure captures what it purports to measure, in this case teacher effectiveness.
Is VAM reliable?
VAM scores can and do fluctuate from year to year, and much of this fluctuation is the result of imprecise measurement (also known as “error”). For example, one study found that 57 percent of teachers who were in the bottom fifth of performance in one year had moved to another level in the subsequent year, and 8 percent of the bottom-level teachers were in the top performance category the following year. In general, the year-to-year correlation ranges between .2 (weak) and .7 (fairly high).^{1}
The reliability tends to be higher for math teachers than for English teachers. Some (but not all) of this instability can be addressed by averaging multiple years of data. The year-to-career correlation of a given teacher’s VAM is significantly higher than the year-to-year correlation, ranging from .55 (medium) to .78 (high) in one study.
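A small simulation can illustrate why averaging helps. The sketch below is not drawn from any of the studies mentioned here. It assumes, purely for illustration, that each teacher has a fixed "true" effect and that each year's score is that effect plus independent random noise. The noise level, number of teachers, and number of years are made-up parameters.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n_teachers=2000, n_years=6, noise_sd=1.5, seed=0):
    """Return (year-to-year, year-to-multiyear-average) correlations.

    Assumption: yearly score = fixed true effect + independent noise.
    """
    rng = random.Random(seed)
    first_year, second_year, later_average = [], [], []
    for _ in range(n_teachers):
        true_effect = rng.gauss(0, 1)
        scores = [true_effect + rng.gauss(0, noise_sd) for _ in range(n_years)]
        first_year.append(scores[0])
        second_year.append(scores[1])
        # Average of all later years, excluding the first year itself.
        later_average.append(sum(scores[1:]) / (n_years - 1))
    return pearson(first_year, second_year), pearson(first_year, later_average)


year_to_year, year_to_average = simulate()
print("one year vs. the next year:      ", round(year_to_year, 2))
print("one year vs. multi-year average: ", round(year_to_average, 2))
```

Under these assumptions, a single year's score tracks the multi-year average more closely than it tracks any other single year, because averaging cancels out part of the random error. Real VAM data may behave differently if a teacher's effectiveness changes over time or if the errors are not independent from year to year.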
Finally, it’s crucial to note that all performance measures have some degree of instability. There is less evidence about the reliability of alternative measures, but what exists generally suggests that principal observations are somewhat more stable over time than VAM, though stability/reliability does not imply validity. In other words, a measure could be consistent over time, like a teacher’s height, but not be a valid basis for judging how well that teacher teaches.
Note that ‘reliability’ here is used in the statistical sense, meaning a measure’s consistency.
1. In statistical terms a correlation coefficient ranges between -1 and 1. A correlation of 0 means there is no association whatsoever; 1 means a perfect correlation; and -1 means a perfectly negative correlation.
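For readers who want to see how such a coefficient is computed, here is a minimal Python sketch of the standard Pearson correlation. The score lists are invented for illustration and do not come from any study cited here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five teachers in two consecutive years.
year_one = [1, 2, 3, 4, 5]
same_order = [2, 4, 6, 8, 10]      # rises in lockstep with year_one
reverse_order = [10, 8, 6, 4, 2]   # falls in lockstep with year_one

perfect_positive = pearson(year_one, same_order)
perfect_negative = pearson(year_one, reverse_order)
print(perfect_positive, perfect_negative)
```

The first pair of lists moves together exactly, giving a correlation of 1. The second pair moves in exactly opposite directions, giving -1. Real year-to-year VAM scores fall somewhere in between.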
Does using tests for high-stakes decisions in teacher evaluation lead to negative unintended consequences? Will it lead to positive consequences?
We don’t know for sure yet, though there’s certainly a possibility that it will, and there is some evidence suggesting both positive and negative outcomes.
There is research showing that holding schools accountable for student test scores has led to cheating and teaching to the test. At the same time, there is evidence that test-based accountability for schools has in many circumstances increased student achievement both on high-stakes tests, like the yearly standardized tests, and on low-stakes exams, like the National Assessment of Educational Progress test given every two years.
However, the gains on the low-stakes tests are often not as dramatic as those on the high-stakes exams, which again raises the question of whether teachers are teaching to the high-stakes tests or cheating on them.
Schools can adopt policies that reduce cheating, and there may be ways of designing tests to make teaching to them difficult.
Didn’t the American Statistical Association (ASA) say that VAM should not be used?
Has the use of VAM led to improved results for students?
There have been relatively few studies on how the use of VAM in districts and schools affects students. The few pieces of research that do exist offer reasons for both caution and optimism.

A study found that providing districts with value-added data did not lead to improved student outcomes (relative to similar districts that did not have access to such data).

A study that offered teachers with high VAM scores a $20,000 bonus for transferring to a high-poverty school found significant student achievement gains in elementary grades but no effect in middle school.

A study of New York City’s tenure system — which was made more rigorous, partly by using VAM scores — found that the reforms likely led to improvements in teacher quality.

A study in which a group of New York City principals were given VAM scores found small improvements in student achievement (relative to students of principals who were not given such data).
What do teachers unions say about using test scores in teacher evaluations?
Where can I find additional information about VAM?

Carnegie Knowledge Network on Value-Added Measures in Education: http://www.carnegieknowledgenetwork.org/

Economic Policy Institute, “Problems with use of student test scores to evaluate teachers”: http://www.epi.org/publication/bp278/

Brookings, “Evaluating Teachers: The Important Role of Value-Added”: http://www.brookings.edu/research/reports/2010/11/17evaluatingteachers

Brookings, “New Evidence Requires New Thinking”: http://www.brookings.edu/research/papers/2013/10/23valueaddedteacherevaluationdebatekane

American Statistical Association (ASA) Statement on VAM: https://www.amstat.org/policy/pdfs/ASA_VAM_Statement.pdf

Response to ASA Statement: http://obs.rc.fas.harvard.edu/chetty/ASA_discussion.pdf

Shanker Institute, “Value-Added Versus Observations: Reliability and Validity”

Shanker Institute, “About Value-Added and ‘Junk Science’”: http://www.shankerinstitute.org/blog/aboutvalueaddedandjunkscience

American Enterprise Institute, “Teacher Quality 2.0”: http://www.joycefdn.org/assets/1/7/fromteachereducationtostudentprogressteacherqualitysincenclb.pdf

Doug Harris, Value-Added Measures in Education: http://www.amazon.com/ValueAddedMeasuresEducationEveryEducator/dp/1612500005