We’re all knackered. You’ve all been teaching forever and I’ve visited approximately 1000 schools a week since I became self-employed last November. What I want to do right now is talk to my family, watch the Big Bang Theory, drink some beer and then sod off to France in a couple of weeks and go climbing. The last thing I wanted to do this evening was write a blog.
But then the DfE published this research into the reception baseline.
https://www.gov.uk/government/publications/reception-baseline-research
I skipped the first document (55 pages), speed read the next one, and wasn’t going to bother with the third. It basically sounded like one of those police officers at the scene of an accident: “nothing to see here. Move along.” But I thought I should make the effort. It’s only 12 pages long after all.
And I’m very glad I did. In amongst the flannel and whitewash was this:
The research noted the difference between the scores of the two groups – the teaching & learning group and the accountability group – with the latter having lower scores, suggesting that perhaps when tests are administered for purposes of establishing a baseline for measuring progress (i.e. for accountability reasons) lower scores are given.
Then they appear to have let their guard down.
Read paragraph 3 in the screenshot above:
“The overall result would be statistically significant at the 95% level if the data were from an independent random sample.”
Hang on! What?
Is the data significant? Or isn’t it?
It would appear that the use of a 95% confidence interval is not appropriate in this case because the data is not from a random independent sample. So it is significant at the 95% level but that test is not used due to the nature of the sample. Quite rightly they employ a more appropriate test.
But significance tests in RAISE are carried out using a 95% confidence interval. Either this means that cohorts of pupils are independent random samples or the wrong test is used in RAISE.
This is something that Jack Marwood, myself and others have been trying to get across for a while – that there isn’t a cohort of pupils in England (or maybe anywhere for that matter) that can be considered to be an independent random sample.
Not one.
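To see why the independence bit matters, here's a rough sketch in Python. It is not the calculation RAISE or the DfE uses, just a toy simulation built on my own assumptions: every cohort is drawn around the same national mean, and a made-up share of the variation in scores (`icc`, which I've set to 0.1 purely for illustration) is common to all pupils in a cohort. That shared bit stands for anything the pupils have in common, such as intake or catchment, and not necessarily anything the school did. The function name `flag_rate` and all the numbers are hypothetical.

```python
import math
import random


def flag_rate(icc, n_pupils=30, n_cohorts=5000, seed=0):
    """Share of simulated cohorts flagged by a naive 95% z-test.

    Every cohort is drawn around the same national mean (0) with a total
    variance of 1. `icc` is the assumed share of that variance common to
    all pupils in a cohort; icc=0 means pupils are independent draws.
    """
    rng = random.Random(seed)
    sd_shared = math.sqrt(icc)
    sd_pupil = math.sqrt(1.0 - icc)
    # Standard error of the cohort mean IF pupils were independent.
    naive_se = 1.0 / math.sqrt(n_pupils)

    flagged = 0
    for _ in range(n_cohorts):
        shared = rng.gauss(0.0, sd_shared)  # component common to the cohort
        scores = [shared + rng.gauss(0.0, sd_pupil) for _ in range(n_pupils)]
        z = (sum(scores) / n_pupils) / naive_se
        if abs(z) > 1.96:  # outside the 95% interval
            flagged += 1
    return flagged / n_cohorts


independent = flag_rate(icc=0.0)  # pupils are independent random draws
clustered = flag_rate(icc=0.1)    # pupils share 10% of their variance

print(f"independent pupils: {independent:.1%} of cohorts flagged")
print(f"correlated pupils:  {clustered:.1%} of cohorts flagged")
```

With independent pupils the test should flag roughly 1 cohort in 20, which is what a 95% test promises. Once pupils in a cohort have something in common, the same test flags far more cohorts than that, even though in this toy set-up every cohort sits around the same national mean.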
So if the DfE decides to use a different test for significance in this research on the grounds that the samples are not independent and random, then shouldn’t they do the same in RAISE?
Until cohorts of children are truly independent random samples, does this mean we can discount every blue (and green) box in our RAISE reports?
Well, perhaps not – that would be rather foolhardy. In an email exchange with Dave Thomson of FFT today, he stated that the tests used in RAISE are useful in that they indicate where there is a large deviation from the national mean and significant data should be treated as the starting point for a conversation. He did then point out that no cause can be inferred; that statistical significance is not evidence of ‘school effects’ and that it should not be treated as a judgement.
So, there is some disagreement over the significance of the sentence (pun intended) but I’m still left wondering why a test that is not appropriate here is deemed appropriate for other data that is neither random nor independent.
That sentence may not change everything as I rather excitedly claimed last night, but it does pose big questions about the validity of the tests used in RAISE. This reads like an admission that statistical significance tests applied to groups of pupils are flawed and should be treated with extreme caution. Considering how much faith and importance are invested in the results of these tests by those who use them to judge school performance, perhaps we need to have a proper conversation about their use and appropriateness. It is certainly imperative that users understand the limitations of these data.
So, thank you DfE, in one sentence you’ve helped vindicate my concerns about the application of statistical significance tests in RAISEonline. An unexpected end of year gift.
Have a great summer!
Ah sweet naivety – things are significant when they support the ideology and not when they do not …
Thanks. Yes. True.
I think readers should consider that you and Jack Marwood are fairly isolated in your stance.
The real world almost never allows truly random selection in the test group, so the analyst always has to think about how a test group or school might differ from the comparison to understand what the figures mean. This of course reduces the certainty about conclusions but means results are informative; it is our job to figure out how informative.
Please ask around, but it is really a matter of interpretation how strict you want to be about how random and independent you need a sample to be before you throw away the whole dataset, saying it's contaminated. There is always some bias creeping into samples, so if you identify any you can always say they are non-random and be right, but at some point you have to be practical and learn what you can with messy data. FFT, DfE and every policy statistician I have ever known would agree. I think you should consider why so few pros agree with Jack Marwood.
I'm not trying to get significance tests ditched – not my intention – I'm just trying to rattle some cages. Personally I think they are useful to a point, but dangerous too considering how much importance is placed on them. Had some good chats about this with FFT recently, and Dave Thomson has written an excellent blog on the subject:
http://www.educationdatalab.org.uk/Blog/July-2015/Significance-tests-for-school-performance-indicato.aspx#.VbZiu4qkrCQ
Essentially I want people to understand what they mean: a deviation from the mean that probably did not happen by chance. You cannot infer school effect and they therefore should not be used to judge schools. They are an alert, a starting point for a conversation. Nothing more.
So, I'm just trying to RAISE a little awareness about their limitations and potential for misuse because I don't want to hear things like "you can't be a good school with blue boxes in your RAISE report". The success or failure of a school could hinge on misinterpretation of data.
Yes, significance indicators are useful up to a point. Isn't it worth trying to ensure that everyone using data to evaluate the performance of schools understands that a blue or green box is not necessarily indicative of school effect?
Surely even the pros would agree with that.
Dave Thomson's blog is excellent; it's the reason I am back in the discussion. I'd given up arguing with Jack Marwood and no other stato seemed to be bothered.
The risk that significance tests will be over-interpreted is a problem of implementation and data literacy, not one of whether they should be used.
Data literacy needs to improve, and significance tests are among the most entry-level, elemental concepts in data analytics, so we have to start somewhere.
I am most concerned that this debate has characterised OFSTED, DfE and FFT statisticians as the bad guys: incompetent and, worse, not understanding many important things about education that they should. That sets up an us-against-them vibe when you of all people know us data people really really really just want to help and think we can contribute a lot.
Again, fair point and I hear you. I don't think DfE/FFT statisticians are bad guys. The problem does indeed lie with data literacy. I repeat: I don't think significance tests should be ditched; we just need to improve knowledge of what they can and can't tell us. Schools are getting hammered on the basis of spurious interpretation due to lack of data literacy. This needs to change.