I have a confession: I am not a statistician.

There, I said it.

(Feels good to get that out in the open.)

I actually have a Ph.D. in Geology. I’m not bragging; it’s more like an admission. As anyone who knows anything about Scientists will testify, we Geologists don’t live on the brightest side of the academic village. I spent 4 years wandering around remote, rocky outcrops of Scotland and Newfoundland with only a sledgehammer for company. That sledgehammer did come in handy during one particularly awkward encounter on a Channel Island headland once, but I digress.

My wife is a statistician. She has a degree in statistics and teaches GCSE and A-Level Maths. The rest of this blog is actually written by her.

I’m kidding.

But it probably should be.

So, I thought I’d have a break from ranting about levels; and new levels that aren’t supposed to be levels but are really; and tracking systems that don’t work; and people who don’t really understand how assessment has changed; and have a little look at statistical significance.

I am now standing slightly outside my comfort zone.

Or should I say my confidence interval?

One of the things I regularly do in my work with schools is isolate the issue. By this I mean I remove the ‘causes’ of statistical significance from the data set. My aim is to turn the eye of Sauron away from the data as a whole, towards individual pupils that have, for whatever reason, underperformed; to demonstrate that underperformance of just a few pupils can have an extreme impact on school data. Prepare some case studies on those few pupils and you are on more solid ground.

Obviously it swings both ways and high performance of a few pupils can have a positive effect.

Anyways….

I recently did some work with a very large junior school in London. The school was RI, due an inspection and had dreaded blue boxes for VA. This is a school with over 100 pupils in a cohort but even so, recalculating the VA with just one ‘underperforming’ pupil removed was enough to push the overall VA score above the significance threshold. No more blue box. Just 1 pupil out of 100, that’s all it took. And this was a pupil with a very interesting case study.

If you want to visualise statistical significance and confidence intervals, as used in RAISE, then turn to the first VA page in the RAISE report. You will see the VA score, e.g. 99.3, and a confidence interval of say +/- 0.6. Add that number onto the VA score (99.3 + 0.6) and if the result is still below 100 then the cohort is deemed to be significantly below average, as it is in the example above. If the result is greater than or equal to 100 then it’s below but not significantly below. At the other end, when the score is above 100, subtract the confidence interval. So, for a score of 100.7 with a confidence interval of 0.6, the lower result (100.7 – 0.6) will be above 100 and therefore significantly above. If the original VA score was 100.6 then it won’t be. A difference of 0.1 is all it takes to be ‘significant’. The dreaded blue box or the celebrated green one.
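For anyone who finds it easier to read as code, the rule above can be sketched in a few lines of Python. This is only an illustration of the add-or-subtract check as I have described it, not RAISE's actual calculation; the function name classify_va and the rounding step are my own assumptions.

```python
def classify_va(score, interval, average=100.0):
    """Apply the add-or-subtract check described above to one VA score."""
    # Round to sidestep floating-point noise (e.g. 100.6 - 0.6 landing a hair off 100).
    upper = round(score + interval, 6)
    lower = round(score - interval, 6)
    if upper < average:
        # Even the top of the interval is below average.
        return "significantly below"
    if lower > average:
        # Even the bottom of the interval is above average.
        return "significantly above"
    return "not significantly different"


# The three scores used in the paragraph above, each with an interval of 0.6.
for score in (99.3, 100.7, 100.6):
    print(score, classify_va(score, 0.6))
```

The scores 100.7 and 100.6 end up on opposite sides of the line, which is the 0.1 difference in action.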

1/10th of a point.

1/20th of a sublevel.

1 pupil having a bad day and falling 2 points short of their VA estimate in a cohort of 20.

Scary.
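The last of those three is easy to check with a rough sketch. It assumes the cohort VA score is simply 100 plus the average of each pupil's difference from their estimate, which is a simplification for illustration, and the cohort itself is made up: 19 pupils exactly on estimate and one who falls 2 points short.

```python
def cohort_va(differences, centre=100.0):
    """Cohort score as centre plus the mean pupil difference (an assumed simplification)."""
    return centre + sum(differences) / len(differences)


# Hypothetical cohort of 20: 19 pupils on estimate, one pupil 2 points short.
cohort = [0.0] * 19 + [-2.0]
shift = cohort_va(cohort) - 100.0
print(f"Cohort VA moves by {shift:.1f} points")

# The same pupil in a cohort of 100 moves the score by a fifth as much.
large_cohort = [0.0] * 99 + [-2.0]
print(f"In a cohort of 100 the shift is {cohort_va(large_cohort) - 100.0:.2f} points")
```

One pupil in 20 is enough to move the score by the 0.1 that separates a blue box from no box.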

But it is the misinterpretation of these data and how much importance is placed on them that is even scarier. We commonly hear people saying things like ‘results are significantly below average’ or, even worse, that ‘pupils are making significantly less progress than average’, which is just plain wrong.

So what does it mean? Well, in the case of RAISE, where a 95% confidence interval is used, it means that the confidence interval of the sample (i.e. cohort) will contain the population mean in 95% of cases. If a sample’s confidence interval does not contain the population mean (i.e. is in the 5%) then it is implied that the deviation from the mean did not happen by chance.

And there are of course false positives, where the supposedly significant deviation from the mean has happened by chance, and false negatives, where a real difference fails to show up as significant.
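How often would chance alone trigger a ‘significant’ result? A minimal simulation gives a feel for it. Everything here is an assumption for illustration: pupils are drawn at random from a normal population with a mean of 100 and a standard deviation of 3 points, cohorts have 30 pupils, and the interval is the textbook 1.96 standard errors with the population spread treated as known. None of these figures come from RAISE.

```python
import random


def false_alarm_rate(trials=5000, cohort_size=30, pop_mean=100.0, pop_sd=3.0, seed=1):
    """Share of purely random cohorts whose 95% interval misses the population mean."""
    rng = random.Random(seed)
    # Half-width of a 95% interval when the population spread is known.
    half_width = 1.96 * pop_sd / cohort_size ** 0.5
    misses = 0
    for _ in range(trials):
        cohort = [rng.gauss(pop_mean, pop_sd) for _ in range(cohort_size)]
        cohort_mean = sum(cohort) / cohort_size
        # The interval misses the population mean when the cohort mean is too far away.
        if abs(cohort_mean - pop_mean) > half_width:
            misses += 1
    return misses / trials


rate = false_alarm_rate()
print(f"{rate:.1%} of purely random cohorts were flagged as significant")
```

The flagged share should sit close to 5%, even though nothing but chance separates these imaginary cohorts.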

There are so many things wrong here:

1) it assumes that cohorts of children are random samples of the population. Anyone who knows anything about schools, demographics, catchments and the shenanigans that go on to secure school places knows this is complete fantasy.

2) significance suggests that the deviation from the mean has not happened by chance, but does not tell us the cause. Is it the school’s fault? Is it really down to quality of teaching? Or are external factors having an impact on pupil performance? This is why it’s so important to isolate the issue and get those case studies prepared.

3) false positives (significant deviations from the mean that happen by chance) do occur, as do false negatives (real differences that go unflagged). False positives would be relatively rare if cohorts were drawn at random from the population, but they’re not, so who knows what we’re looking at. It’s a mess.

4) pupils are not retested. A simple analogy: you throw a dice 6 times and the average score is 1.3 against an expected average of 3.5. Is this significant? Do you assume the dice to be faulty? Perhaps there is something wrong with the way you are throwing it. Or perhaps it just happened by chance. Maybe throw it a few more times and test it again. But this is just dealing with a simple thing like a dice, not something complex like a cohort of children with all their inherent variables.

5) a confidence interval is all about uncertainty. There is too much certainty being placed on uncertainty by some people.
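The dice analogy in point 4 can be made concrete by counting outcomes. This sketch assumes a fair six-sided dice and reads the average of 1.3 as a rounded 8/6, i.e. a total of 8 or less from 6 throws; both the function name and that reading are mine.

```python
from itertools import product


def count_low_totals(throws=6, max_total=8):
    """Count the equally likely outcomes of a fair dice whose total is at most max_total."""
    outcomes = list(product(range(1, 7), repeat=throws))
    low = sum(1 for faces in outcomes if sum(faces) <= max_total)
    return low, len(outcomes)


low, total = count_low_totals()
print(f"{low} of {total} outcomes, a probability of about {low / total:.4f}")
```

That works out at 28 of the 46,656 possible outcomes, or roughly 6 in 10,000. Rare, but a perfectly fair dice will still do it now and then, which is exactly why you would throw it a few more times before blaming the dice.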

I recently attended a small meeting with some people from FFT and it was heartening to hear one of their senior statisticians talk about these issues. He was uneasy about the phrase ‘statistical significance’ because educational significance was all too often inferred from it. We discussed alternatives and it’s a discussion that needs to continue. I don’t have a huge problem with particularly high or low data being highlighted in some way but users need to understand its limitations, and absolutely must realise that statistically significant does not necessarily mean educationally significant.

Perhaps we should move away from the on/off switch of statistical significance; this apparent exactitude of uncertainty.

Perhaps what we need is 50 shades of blue and green.


It's government policy to measure all schools by the same yardstick, as this is the stance they take against supposedly using context as an excuse. That's why they cancelled Contextual Value Added. Consequently, arguments against sig testing all schools are really arguments against policy, not statistics.

Sig testing CVA would take individual schools' uniqueness into account.

Oh dear, what a shocking abuse of statistical inference. It seems that whoever imposed this daft method had never heard of the false discovery rate. Perhaps they should watch https://www.youtube.com/watch?v=tRZMD1cYX_c for an easy introduction. Or read the paper on which that's based: http://rsos.royalsocietypublishing.org/content/1/3/140216

And one more thing: tests of significance are valid only for randomised data. When, as in your case, there is no random allocation, tests are bound to find even more false positives.

Thanks for your comment. The more I investigate and learn about this stuff the more shocked I am. It all seems highly dubious, and stakes are very high for schools that are measured and judged on this dodgy data.