Texas’ annual reading test adjusted its difficulty every year, masking whether students are improving

  • Texas’ annual reading test showed flat performance from 2012 to 2021, despite increased spending on K-12 education, because its design adjusts the test’s difficulty level every year.
  • The test’s norm-referenced design means it assesses students relative to their peers rather than against a fixed standard, making it impossible to determine whether students are improving.
  • The scoring system keeps scores flat through practices like omitting easier questions and adjusting scores to cancel out gains from better teaching, which hides the impact of increased education spending.
  • The test’s design affects not just students but also schools and communities, because high-stakes test scores determine school resources, accreditation and even home values.
  • A bill passed by the Texas Senate in May 2025 would eliminate the STAAR test, but because its replacement is also norm-referenced, it is unlikely to address the underlying design issues that mask student performance.

Millions of Americans take high-stakes exams every year. Caiaimage/Chris Ryan/iStock via Getty Images

Texas children’s performance on an annual reading test was basically flat from 2012 to 2021, even as the state spent billions of additional dollars on K-12 education.

I recently did a peer-reviewed deep dive into the test design documentation to figure out why the reported results weren’t showing improvement. I found the flat scores were at least in part by design. According to policies buried in the documentation, the agency administering the tests adjusted their difficulty level every year. As a result, roughly the same share of students failed the test over that decade, regardless of how much better students objectively performed relative to previous years.

From 2008 to 2014, I was a bilingual teacher in Texas. Most of my students’ families hailed from Mexico and Central America and were learning English as a new language. I loved seeing my students’ progress.

Yet, no matter how much they learned, many failed the end-of-year tests in reading, writing and math. My hunch was that these tests were unfair, but I could not explain why. This, among other things, prompted me to pursue a Ph.D. in education to better understand large-scale educational assessment.

Ten years later, in 2024, I completed a detailed exploration of Texas’ exam, currently known as the State of Texas Assessments of Academic Readiness, or STAAR. I found an unexpected trend: The share of students who correctly answered each test question was extraordinarily steady across years. Where we would expect to see fluctuation from year to year, performance instead appears artificially flat.

The STAAR’s technical documents reveal that the test is designed much like a norm-referenced test – that is, it assesses students relative to their peers rather than against a fixed standard. In other words, a norm-referenced test cannot tell us whether students meet the fixed criteria or grade-level standards set by the state.
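To make the distinction concrete, here is a minimal sketch in Python. It is purely illustrative: the raw scores and the cut value are invented, and this is not the agency’s actual algorithm. It contrasts a criterion-referenced judgment, a raw score checked against a fixed cut, with a norm-referenced one, a percentile rank within the cohort:

```python
# Illustrative only: invented scores, not STAAR's actual scoring algorithm.
raw_scores = [12, 18, 22, 25, 28, 31, 34, 38, 41, 45]  # one hypothetical cohort

# Criterion-referenced: pass/fail against a fixed standard, independent of peers.
FIXED_CUT = 30  # hypothetical "meets grade level" raw score
criterion_pass = [score >= FIXED_CUT for score in raw_scores]

# Norm-referenced: standing is relative to the cohort, so the same raw score
# can "pass" one year and "fail" the next if the cohort shifts.
def percentile_rank(score, cohort):
    """Share of the cohort scoring at or below this score, in percent."""
    return 100 * sum(c <= score for c in cohort) / len(cohort)

norm_ranks = [percentile_rank(score, raw_scores) for score in raw_scores]
```

A percentile rank only locates a student within that year’s cohort; it says nothing about whether the student meets a grade-level standard.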

In addition, norm-referenced tests are designed so that a certain share of students always fail, because success is gauged by one’s position on the “bell curve” in relation to other students. Following this logic, STAAR developers use practices like omitting easier questions and adjusting scores to cancel out gains due to better teaching.
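Pegging failure to the curve has a mechanical consequence that a few lines make plain. In this sketch, with invented numbers standing in for whatever cut-setting procedure the developers actually use, the cut score is re-derived each year at the 25th percentile of that year’s cohort, so the failing share stays the same even when every student improves:

```python
import statistics

# Illustrative only: invented numbers, not STAAR's actual cut-setting procedure.
year_1 = [12, 18, 22, 25, 28, 31, 34, 38, 41, 45]
year_2 = [score + 10 for score in year_1]  # every student objectively improves

def failing_share(scores):
    """Share of students below a cut re-derived from that year's cohort."""
    cut = statistics.quantiles(scores, n=4)[0]  # 25th percentile of that cohort
    return sum(score < cut for score in scores) / len(scores)

print(failing_share(year_1))  # 0.2
print(failing_share(year_2))  # 0.2 again: the cut rose along with the cohort
```

In year 2 every student scored 10 points higher, yet exactly the same share falls below the cut, because the cut moved up with the cohort.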

Ultimately, the STAAR tests over this time frame – taken by students every year from grade 3 to grade 8 in language arts and math, and less frequently in science and social studies – were not designed to show improvement. Since the test is designed to keep scores flat, it’s impossible to know for sure if a lack of expected learning gains following big increases in per-student spending was because the extra funds failed to improve teaching and learning, or simply because the test hid the improvements.
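One way scores end up flat is in the scaling step itself. The sketch below assumes a simple linear transformation onto a fixed reference scale, a simplified stand-in for the more elaborate equating in STAAR’s technical documentation, and shows how a genuine cohort-wide gain in raw scores vanishes after rescaling:

```python
import statistics

def rescale(scores, ref_mean=50, ref_sd=10):
    """Map a cohort's raw scores onto a fixed reference scale.

    A simplified linear scaling, not STAAR's actual equating: any
    cohort-wide gain in the raw mean is absorbed by the transformation.
    """
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [ref_mean + ref_sd * (score - m) / sd for score in scores]

year_1 = [12, 18, 22, 25, 28, 31, 34, 38, 41, 45]
year_2 = [score + 10 for score in year_1]   # genuine improvement in raw scores
print(statistics.mean(rescale(year_1)))     # 50.0 (within float rounding)
print(statistics.mean(rescale(year_2)))     # also 50.0: the gain disappears
```

Under any transformation of this family, the reported mean is fixed by construction, so year-over-year improvement in raw performance cannot show up in the scaled scores.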

Why it matters

Ever since the federal education policy known as No Child Left Behind went into effect in 2002 and tied students’ test performance to rewards and sanctions for schools, achievement testing has been a primary driver of public education in the United States.

Texas’ educational accountability system has been in place since 1980, and it is well known in the state that the stakes and difficulty of Texas’ academic readiness tests increase with each new version, which typically comes out every five to 10 years. What the Texas public may not know is that the tests have been adjusted each and every year – at the expense of really knowing who should “pass” or “fail.”

The test’s design affects not just students but also schools and communities. High-stakes test scores determine school resources, the state’s takeover of school districts and accreditation of teacher education programs. Home values are even driven by local schools’ performance on high-stakes tests.

Students who are marginalized by racism, poverty or language have historically tended to underperform on standardized tests. STAAR’s design makes this problem worse.

On May 28, 2025, the Texas Senate passed a bill that would eliminate the STAAR test and replace it with a different norm-referenced test. As best as I can tell, this wouldn’t address the problems I uncovered in my research.

What still isn’t known

I plan to investigate whether other states or the federal government use similarly designed tests to evaluate students.

My deep dive into Texas’ test focused on STAAR before its 2022 redevelopment. The latest iteration has changed the test format and question types, but there appears to be little change to the way the test is scored. Without substantive revisions to the scoring calculations “under the hood” of the STAAR test, it is likely Texas will continue to see flat performance.

The Texas Education Agency, which administers the STAAR tests, didn’t respond to a request for comment.

This article was updated to include a new bill passed by the Texas Senate.

The Research Brief is a short take on interesting academic work.

The Conversation

Jeanne Sinclair receives funding from the Social Science and Humanities Research Council (SSHRC) of Canada.

Q. Why did Texas’ annual reading test scores remain flat from 2012 to 2021?
A. The test’s difficulty level was adjusted every year, which masked whether students were improving.

Q. What is a norm-referenced test, and how does it affect student performance?
A. A norm-referenced test assesses students relative to their peers rather than against a fixed standard. Success is gauged by one’s position on the “bell curve” in relation to other students, so a certain share of students always fail.

Q. Why did many of the author’s bilingual students fail the end-of-year tests despite showing progress?
A. Because the test’s norm-referenced design ensures that a certain share of students always fail: success is gauged by position on the bell curve relative to other students, not by how much an individual student has learned.

Q. What role does the STAAR test play in Texas’ educational accountability system?
A. Its high-stakes scores determine school resources, the state’s takeover of school districts and the accreditation of teacher education programs; local schools’ performance on the test even drives home values.

Q. How does the STAAR test design affect marginalized students?
A. Students marginalized by racism, poverty or language have historically tended to underperform on standardized tests, and STAAR’s norm-referenced design makes this problem worse.

Q. What is the author’s concern about the latest iteration of the STAAR test?
A. Although the 2022 redevelopment changed the test format and question types, there appears to be little change to the way the test is scored, so Texas will likely continue to see flat performance.

Q. Would the bill the Texas Senate passed to eliminate the STAAR test fix these problems?
A. Probably not. The bill, passed on May 28, 2025, would replace STAAR with a different norm-referenced test, which as best the author can tell would not address the problems uncovered in the research.

Q. What is the author planning to investigate next?
A. The author plans to investigate if other states or the federal government use similarly designed tests to evaluate students.

Q. Did the Texas Education Agency respond to these findings?
A. No. The agency, which administers the STAAR tests, did not respond to a request for comment.