When the fact of testing matters more than the results or the test itself

 

What counts as evidence of the validity of social programmes? Validity in this context isn’t always clearly defined but may be determined by the answers to a number of questions, including: is the programme reaching the people it is intended to help? Can we demonstrate an improvement in people’s lives? To what extent can the programme credibly claim to be the reason for that improvement?

These are difficult questions to answer. People will come to a social programme with a history, and leave with a future, but the programme itself isn’t the bridge over which they pass from one to the other. To stick with this metaphor now that I’ve started on it, most social programmes will be only part of the bridge, a section of brickwork, a pier, a girder, some grouting perhaps, or even the statuary that adorns the parapet. This makes it hard to pinpoint exactly the role that any given programme has played in supporting someone over a stretch of water. The programme may have been essential, and load-bearing, or it may have been merely decorative (and anything in between).

There are two ways to respond to this kind of difficulty. The first is to aim for strict causality through experimental or quasi-experimental studies that are intended to demonstrate the specific ways in which a programme is load-bearing rather than decorative. This is to take the bridge as a collection of component parts which must each prove their singular worth, rather than a structure of interdependent elements that work together and that, when taken apart, aren’t really a bridge at all.

This is where social programmes can fall into a ‘method trap’. It becomes of the utmost importance to demonstrate that one’s programme is a load-bearing pile. Funding depends upon it. It may be almost impossible to do in practice because doing so requires large samples, long trials and matched control groups. Nevertheless, the best causal method, the randomised controlled trial (RCT), becomes the ‘gold standard’ for testing load-bearing-pile-ness. Practical difficulty will mean that in all likelihood the inputs to the RCT will be flawed, and the outputs statistically insignificant and hard to analyse. In consequence, inputs and outputs are put to one side, and what is taken to matter above all is testing as a guarantor of intent. It becomes more important to be able to say ‘I have tested the load-bearing nature of my pile using the gold standard’ than to examine the findings of the test itself, and whether the test was the right test in the first place.

The social sector is not alone in falling into this kind of trap, where the use of a given method becomes a stand-in for well-designed and appropriate research with usable results. Here are two more examples. The first is the use of p-values, as ably recounted by Regina Nuzzo. P-values test a null hypothesis as a ‘straw man’ for a given piece of research. Essentially they tell you how likely you would be to see results at least as extreme as yours if the null hypothesis were true and chance alone were at work, but they give no information about the underlying probability of an effect in the first place. Despite this significant gap, p-values have become a handy sifting mechanism for assessing publishable articles. There’s even an Urban Dictionary term, p-hacking, that describes fiddling with the inputs until you get the desired ‘significance’ for your p-value. P-hacking is thought to explain the glut of p-values around the 0.05 significance mark. A method and its fiddled result stand in for a messier but more accurate reality.
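To make the gap between a p-value and the underlying probability of an effect concrete, here is a minimal sketch in Python. It applies Bayes' rule to ask how likely it is that a ‘significant’ result reflects a real effect. The function name and every number in it (the prior plausibility, the power, the 0.05 threshold) are illustrative assumptions of mine, not figures from any study.

```python
def chance_effect_is_real(prior, power=0.8, alpha=0.05):
    """Probability that a 'significant' result reflects a real effect.

    prior: assumed probability, before testing, that the effect exists
    power: assumed chance the test detects a real effect
    alpha: significance threshold, i.e. the false-positive rate under the null
    All three inputs are hypothetical values chosen for illustration.
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# A long-shot hypothesis (1 in 10) versus a toss-up (1 in 2).
long_shot = chance_effect_is_real(prior=0.1)  # roughly 0.64
toss_up = chance_effect_is_real(prior=0.5)    # roughly 0.94
print(round(long_shot, 2), round(toss_up, 2))
```

Under these assumptions the same ‘significant’ finding is far less trustworthy when the hypothesis was implausible to begin with, and that is exactly the information a p-value on its own does not carry.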

Another example is pointless A/B testing in tech start-ups. A/B testing involves, for example, sending two versions of an email and looking for differences in opened emails, unsubscribes and clicks in order to tailor your marketing. Doing this effectively requires a very large sample size, much larger than most start-ups would usually have. A second stage in this process can even involve producing a doubtful p-value for your statistically insignificant click-rate, thereby compounding poor methodology. Once again, running the tests is more important than understanding them and checking the validity of the results.
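A rough sense of ‘very large’ can be had from a standard sample-size calculation for comparing two proportions. The sketch below uses only the Python standard library. The function name, the open rates (20% against 22%), the 0.05 threshold and the 80% power are all assumptions picked for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def recipients_per_version(rate_a, rate_b, alpha=0.05, power=0.8):
    """Approximate recipients needed per email version to detect the
    difference between two rates with a two-sided two-proportion z-test.
    The default alpha and power are conventional but assumed values."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    pooled = (rate_a + rate_b) / 2
    spread_null = sqrt(2 * pooled * (1 - pooled))
    spread_alt = sqrt(rate_a * (1 - rate_a) + rate_b * (1 - rate_b))
    needed = (z_alpha * spread_null + z_beta * spread_alt) ** 2
    return ceil(needed / (rate_a - rate_b) ** 2)


# Hypothetical open rates: 20% for version A, 22% for version B.
print(recipients_per_version(0.20, 0.22))
```

With these assumed figures the answer runs to several thousand recipients per version, and it grows quickly as the difference you hope to detect shrinks. A mailing list of a few hundred cannot support the comparison, whatever p-value is later attached to it.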

The second way of analysing social programmes – seeing the bridge as a single structure with interdependent elements – shows more appreciation for the complexity of bridge engineering. Considering the history that each person brings to their crossing is equally important, perhaps drawing on statistical models – such as Bayesian inference – that consider prior probabilities rather than null hypotheses. In short, now may be a good moment for the social sector to opt for a wider perspective and contemplate the whole crossing rather than treating individual elements as if they were separable.

Genevieve Maitland Hudson is a researcher and consultant. She works with the consultancy Osca.

