Standardised tests are extremely useful. They provide an external reference of a pupil’s attainment, indicating whether it is below, above or broadly in line with the national average; and give an idea of how pupils are progressing relative to other pupils nationally. They can help us understand what pupils do and don’t know whilst providing important test practice; and give common meaning to assessment, allowing us to more reliably compare the performance of cohorts, groups and schools across a range of subjects.
Yes, standardised tests are extremely useful, but there are several flaws, misconceptions and limitations that we need to be aware of to make best use of them.
Standardised tests are not necessarily well aligned with the school’s curriculum
Standardised tests assume a certain curriculum coverage* and testing knowledge of what has not yet been taught is likely to result in low scores. These low scores are not a reflection of a pupil’s ability; they are simply a reflection of the fact that pupils have not sufficiently covered the content of the test and therefore do not compare well to a sample of pupils that have. Low scores resulting from poor curriculum alignment should be treated as false negatives, and any measure of progress derived from such unreliable start points should be viewed with extreme caution. This is a particular problem with mid-year tests and schools should carefully consider the value of using such products if validity is in doubt.
*unless they are adaptive assessments, which route pupils through the test based on answers to previous questions.
Standardised scores are not the same as scaled scores
The DfE could have chosen any scale they liked for KS1 and KS2 tests. Unfortunately, they chose one that was remarkably similar to standardised scores already used in schools. Standardised tests are norm referenced with pupils’ scores normally distributed about the average score of 100: 50% of pupils nationally have scores of 100 or more and 50% of pupils have scores below 100. KS2 tests are criterion referenced with the pass mark represented by a score of 100. If pupils reach the pass mark in the test they get a scaled score of 100 or more, and around 75% did so in the 2018 reading and maths tests. Unlike standardised tests, the national average KS2 scaled score is not a fixed entity; it changes every year, increasing from 103 in 2016 to 105 in 2018. We must therefore bear in mind that standardised scores and scaled scores are different. We can crudely convert one to the other using percentile rank, where such data exists, but we must be aware that one score does not directly translate into the other.
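The crude percentile-rank conversion mentioned above can be sketched as follows. This is only an illustration: it assumes standardised scores follow a normal distribution with mean 100 and standard deviation 15 (the common convention; check your provider’s documentation), and the scaled-score percentile table is entirely hypothetical, not real DfE data.

```python
# Sketch: crude standardised-to-scaled conversion via percentile rank.
# Assumes standardised scores are normal (mean 100, SD 15); the scaled
# score percentile table below is illustrative only, not real DfE data.
from statistics import NormalDist

norms = NormalDist(mu=100, sigma=15)

# Hypothetical cumulative table: percentage of pupils at or below each
# KS2 scaled score in some year (illustrative values only).
scaled_percentiles = [
    (90, 10), (95, 18), (100, 25), (105, 50),
    (110, 75), (115, 90), (120, 99),
]

def standardised_to_scaled(score: float) -> int:
    """Return the first scaled score whose cumulative percentile
    reaches the standardised score's percentile rank.
    Crude by design: the two ranks rarely line up exactly."""
    pct = norms.cdf(score) * 100
    for scaled, cum_pct in scaled_percentiles:
        if cum_pct >= pct:
            return scaled
    return scaled_percentiles[-1][0]

# A standardised score of 100 (50th percentile) maps to the scaled
# score at the 50th percentile of our hypothetical table.
print(standardised_to_scaled(100))
```

Note that even in this toy example the two "100"s do not map onto each other, which is precisely the point of the paragraph above.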
Some tests are more standardised than others
Let’s clear one thing up: pupils’ results are not compared to other pupils that have taken that test that year; they are compared to a sample of pupils that took the test at some point in the past. The sample is selected to be representative of the national population, and by comparing each pupil’s score to the results of the sample we gain an idea of where their attainment places them on the national bell curve. Essentially, it’s a ranking system. Test providers can’t use the results of the actual test to do this because that is a self-selecting sample that may not be truly representative. Even less meaningful would be, say, for a MAT to attempt to produce a standardised score based on test results of pupils within the MAT. It would show the pupil’s position in relation to other pupils in the MAT, but it wouldn’t tell us anything about where that pupil is nationally.
One of the key issues with curriculum tests is the length of time since the sampling was carried out. Because national standards change over time, tests need to be re-standardised regularly to maintain meaning. The longer it has been, the less reliable the data is. Anecdotally, many schools have reported that it is far easier to achieve 100 on some standardised tests than it is to achieve a scaled score of 100 in KS2 SATS in the same subject, which suggests that it is easier to get into the top 50% nationally on the former than it is to get into the top 75% on the latter. That is possible if what is being tested differs greatly from what has been taught, but it’s more likely that the test is indicating that pupils are above the average of a sample taken several years ago and that the sample no longer reflects current national standards.
A change in score is not necessarily a measure of progress
We cannot simply subtract one score from another and call it progress. First, there is the curriculum alignment issue. If we use a mid-year test that is not well aligned with the school’s curriculum then we risk making false inferences about pupils’ ability and claiming increasing scores to be evidence of good progress when in actual fact they are evidence of the increasing validity of subsequent tests. Second, standardised test scores are not pinpoint accurate – they are spiky – and apparent differences between pupils’ test scores are often likely to be statistical noise. To ascertain whether or not changes in test scores are meaningful we would need to know if those changes are out of the ordinary, which is why some test providers show a confidence interval – a margin of error – around individual pupil scores. Where there is no overlap between confidence intervals of pupils’ consecutive test scores, this may suggest that differences are significant but – and this is important – we cannot be certain. Third, it may be that subsequent tests prioritise different topics, so an increase is not necessarily evidence of progress but evidence that pupils are more familiar with the content of one test than another. And finally, no change in score indicates that the pupil is maintaining their position nationally, and that can be interpreted as evidence of good progress.
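The confidence-interval check described above amounts to simple arithmetic: two symmetric intervals overlap whenever the gap between the scores is no more than the two margins combined. A minimal sketch, assuming a symmetric ±7-point margin of error (an illustrative width, not taken from any real provider):

```python
# Sketch: do the confidence intervals around two test scores overlap?
# The +/-7 margin is illustrative; real providers publish their own.
def intervals_overlap(score_a: float, score_b: float, margin: float = 7) -> bool:
    """True if the two intervals [score +/- margin] overlap,
    i.e. the apparent difference may just be statistical noise."""
    return abs(score_a - score_b) <= 2 * margin

# A 5-point rise within +/-7 margins: intervals overlap, so the
# change may be noise rather than progress.
print(intervals_overlap(98, 103))   # True

# A 20-point rise: no overlap, so the difference may be meaningful -
# though, as noted above, we still cannot be certain.
print(intervals_overlap(85, 105))   # False
```

The asymmetry in interpretation matters: overlapping intervals do not prove the scores are the same, and non-overlapping intervals only suggest, rather than prove, a real difference.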
Not all apparently equal changes in score are equal
When analysing standardised scores, it is also worth considering the bell curve and the percentiles that underpin them. It may seem logical to interpret a change of, say, 5 points to be the same from any start point, but that is not the case. A pupil that scores 70 on one test and 75 on the next has moved from the 2nd to the 5th percentile along the bell curve, whereas a pupil that scores 95 and then 100 has moved from the 37th to the 50th percentile. Meanwhile, a high attaining pupil scoring 140 and 145 respectively remains within the top 1%. This is because there are far more pupils in the centre of our bell curve than there are at the extremes, which is why smaller changes in scores in the middle of the range result in bigger percentile shifts than larger changes in score at the extreme ends. Not all apparently equal changes are equal.
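The percentile figures above can be reproduced with a few lines of code. This sketch assumes standardised scores are normally distributed with mean 100 and standard deviation 15, which is the convention most providers use (check yours):

```python
# Sketch: the same 5-point rise produces very different percentile
# shifts depending on where it sits on the bell curve.
# Assumes standardised scores ~ Normal(mean=100, SD=15).
from statistics import NormalDist

norms = NormalDist(mu=100, sigma=15)

def percentile(score: float) -> float:
    """Percentage of pupils nationally scoring below `score`."""
    return norms.cdf(score) * 100

for before, after in [(70, 75), (95, 100), (140, 145)]:
    shift = percentile(after) - percentile(before)
    print(f"{before} -> {after}: "
          f"percentile {percentile(before):.0f} -> {percentile(after):.0f} "
          f"(shift of {shift:.1f} percentile points)")
```

The 95-to-100 move crosses roughly 13 percentile points, while the 140-to-145 move crosses a fraction of one, despite both being 5-point rises.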
Standardised scores are generally more useful than age standardised scores
As described above, standardised scores are based on the results of a large, representative national sample of pupils. Age standardised scores, on the other hand, are adjusted to take account of pupils’ age in years and months. A summer born pupil is likely to have an age standardised score that is higher than their standardised score whilst for autumn born pupils it could be the other way round. This is because summer born pupils tend to be lower attaining and despite comparing favourably with other summer born pupils, they may still be below the overall national average. This does not mean that age standardised scores are not useful – they are if you want to demonstrate that attainment of summer born pupils is in line with that of similar aged pupils nationally – but using age standardised scores to track towards a non-age standardised target, such as KS2 SATS, is problematic.
There is no standardisation of administration
Unlike KS2 tests and GCSE examinations, commercial standardised tests used in schools are not administered in a consistent way. A MAT that uses such tests to monitor the standards of its schools may not have a strict policy on how they are administered, and even within a school different teachers may approach the tests in different ways. Room layout, time of day, whether breaks and interaction between pupils are permissible – all are factors that can affect an outcome and are variables in the way tests are carried out. We therefore have to ask whether it’s appropriate to compare the results of tests that have been administered under such variable conditions.
Tests are not necessarily a good indicator of knowledge
Length and breadth of tests are key issues. Longer tests are more reliable but are less likely to be completed; shorter tests are more likely to be completed but do not gather as much information. Does the test assess enough in enough detail to draw conclusions about a pupil’s knowledge? Is it assessing the right content? Are the questions too challenging? Are they challenging enough? If there are only five questions relating to fractions on a maths test and a pupil only answers two correctly, does that mean they are not very good at fractions? Or had they not covered that topic fully? Or was that pupil having a bad day? Question level analysis at the pupil level needs to be treated with caution and is perhaps not going to tell us as much as we think it will. Of more use perhaps is the aggregated data, which compares the performance of a cohort against national averages at strand or topic level, but even here we must be cautious. How big is the cohort? What is the percentage impact of a right or wrong answer? And, of course, that curriculum alignment issue rears its head again.
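The "percentage impact" question above is worth making concrete. In a cohort-level topic percentage, every individual answer carries a fixed weight, and that weight grows quickly as cohorts and question counts shrink. A minimal sketch (the cohort sizes and question counts are illustrative):

```python
# Sketch: how much one pupil's single answer moves a cohort-level
# topic percentage. Cohort sizes and question counts are illustrative.
def impact_of_one_answer(cohort_size: int, questions_in_topic: int) -> float:
    """Percentage-point swing in the cohort's topic score
    caused by a single right or wrong answer."""
    return 100 / (cohort_size * questions_in_topic)

# 20 pupils, 5 fractions questions: each answer moves the cohort's
# fractions percentage by 1 point.
print(impact_of_one_answer(20, 5))

# 10 pupils, 3 questions: each answer moves it by over 3 points,
# so one pupil having a bad day visibly shifts the "topic gap".
print(impact_of_one_answer(10, 3))
```

With small cohorts and few questions per topic, apparent strengths and weaknesses in aggregated question level analysis can hinge on a handful of individual answers.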
To conclude, let’s go back to the start. Standardised tests are extremely useful, but we need to understand the misconceptions and limitations before we can use them effectively. We also need to select the tests we use with care. Do we want paper-based or online tests? Do we really need to use them three times per year? Are they appropriate for all year groups? Does the impact on learning justify the extra workload, especially when it comes to question level analysis? How reliable are they? How well do they align with the curriculum? Are we testing what we should be testing?
Without standardised assessment data we really have no way of knowing where a pupil’s attainment sits in the national picture; and if we want that information, they are indispensable. They are also pretty much essential for reporting meaningful data to governors and external agencies, for tracking attainment gaps between groups of pupils, and for MATs that want to effectively monitor the performance of their schools. They are a powerful tool, but we need to keep in mind that they represent the attainment of pupils on one particular test on one particular day. Results are noisy and any attempt to measure progress will only amplify that noise.
In short, use them but don’t abuse them.
Further reading: Making Good Progress? by Daisy Christodoulou (especially Chapter 5)