I’m Trish, and I’m a quantophile. The potential of quantitative research to answer seemingly unanswerable questions about our innermost experiences is what first attracted me to psychology. And over a decade later, I’m still in love. But the rose tint is definitely starting to wear off as I see more and more examples of balanced judgement replaced with mindless quantification.
If you have had any formal training in research methods, you will have learned that all measurements should be both reliable and valid. Reliability involves types of consistency: if Doctor 1 and Doctor 2 both administer a structured interview to assess your mental health, they should agree on whether you are clinically depressed, and at what level of severity. You can even use some statistical wizardry to put a number on how ‘reliable’ your measurement is. Your friends and classmates will gasp in wonder as numbers pour satisfyingly into your output file!
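For the curious, that “statistical wizardry” for two raters is often Cohen’s kappa, which measures agreement beyond what chance alone would produce. Here is a minimal sketch computed by hand; the two doctors’ yes/no depression diagnoses are invented purely for illustration:

```python
# Cohen's kappa for two raters' yes/no diagnoses.
# The ratings below are made up for illustration only.
from collections import Counter

doctor1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "no"]
doctor2 = ["yes", "no", "no", "no", "yes", "no", "yes", "no", "yes", "no"]

n = len(doctor1)

# Observed agreement: the proportion of cases where the doctors match.
observed = sum(a == b for a, b in zip(doctor1, doctor2)) / n

# Chance agreement: the probability both say "yes" plus both say "no",
# if each doctor rated independently at their own observed base rate.
c1, c2 = Counter(doctor1), Counter(doctor2)
expected = sum((c1[k] / n) * (c2[k] / n) for k in ("yes", "no"))

# Kappa: agreement achieved beyond chance, as a fraction of the
# agreement beyond chance that was possible.
kappa = (observed - expected) / (1 - expected)
print(round(kappa, 2))  # prints 0.58 for these invented ratings
```

Here the doctors agree on 8 of 10 cases (80%), but because chance alone would produce 52% agreement, kappa lands at a more sobering 0.58.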
Establishing validity is a lot trickier than reliability. It taps into what our measurements and results actually mean, which is particularly difficult when we try to measure a new construct such as “Mental Health Literacy”. For example, in a study by Aromaa, Tolvanen, Tuulari and Wahlbeck published in 2011, agreement with the statement “Antidepressants have plenty of side effects” was used to measure “personal stigma” in relation to depression; endorsing the statement was assumed to reflect a lack of “realistic” views about medication. And yet the Royal College of Psychiatrists’ position statement on antidepressants, released earlier this year, states that reactions to antidepressants can range from “an overall improvement in levels of depression and quality of life, to feeling the benefit of functioning better while suffering adverse side effects, to finding them ineffective with intolerable and harmful side effects”. So a lot hinges on what exactly is implied by the word “plenty”. That seems a pretty soft, subjective bedrock on which to build a firm, objective science of attitude-change intervention. The Association for Psychological Science states that there are “more than 280 different scales for assessing depression” in current use. To paraphrase the (possibly apocryphal) remark attributed to Einstein: if the scale worked, one would be enough.
Even if we do manage to measure our constructs of interest accurately, there are still many pitfalls to beware of in their analysis and interpretation. Take, for example, the hallowed p value. You probably remember that a p value is the chance that your results were a ‘fluke’. It isn’t, though. A p value is the probability of obtaining data at least as extreme as yours if the null hypothesis were true; it doesn’t tell you how likely your results are to be a fluke, how important they are, or anything about your effect size. The p value is so widely misunderstood that Haller and Krauss, in 2002, administered a quiz testing six frequent misinterpretations of the p value and found that not only did 100% of the students sampled make at least one mistake, but so did 80% of instructors.
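It can help to see that definition in action: when the null hypothesis is true by construction, “significant” p values still turn up at (at most) the advertised 5% rate by chance alone. This is my own illustrative simulation, not anything from the studies above, using an exact binomial test on a simulated fair coin:

```python
# What a p value actually is: the probability, *assuming the null
# hypothesis is true*, of data at least as extreme as what you observed.
# We simulate a fair coin (so the null is true by construction) and count
# how often a two-sided test at p < .05 still cries "effect!".
import random
from math import comb

random.seed(1)  # fixed seed so the run is reproducible

def two_sided_p(heads, flips):
    # Exact binomial test against P(heads) = 0.5: sum the probabilities
    # of every outcome at least as extreme as the observed head count.
    extremity = abs(heads - flips / 2)
    return sum(comb(flips, k) * 0.5 ** flips
               for k in range(flips + 1)
               if abs(k - flips / 2) >= extremity)

trials = 2000
false_alarms = sum(
    two_sided_p(sum(random.random() < 0.5 for _ in range(100)), 100) < 0.05
    for _ in range(trials)
)
# The coin is fair, yet some runs look "significant" anyway.
print(false_alarms / trials)
```

Every one of those “significant” runs is a pure fluke, and they arrive at roughly the rate the test promises (a little under 5% here, because the binomial distribution is discrete). The p value guarantees that rate under the null; it cannot tell you whether *your* particular result was one of the flukes.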
Pain is a subjective, complex and mysterious experience, but one that we must nonetheless strive to measure in some objective way for research to take place. We use tools like visual analog scales that allow us to rate pain numerically, by placing a mark on a line, or with smiley faces. These all tap into the relative and continuous nature of pain by asking us to relate our current pain to an absolute absence of pain or “the worst pain imaginable”. This makes intuitive sense. But since pain does not correlate perfectly with tissue damage or any other objectively observable physical signal, research attempting to assess the validity of these pain scales usually relies on reliability assessments (people rating previous pain experiences similarly over time, which is vulnerable to all the biases of recall) and on how much ratings respond to pain relief. Since pain assessment tools are often developed precisely in order to test pain-relief strategies reliably, there is a certain circular logic at play, one I found very irritating while being asked repeatedly to rate my labour contractions on a scale of 1 to 10 to establish that my epidural had not been effective.
It is unrealistic to think we can get by currently without rating scales in psychology or medicine, but we should be aware that a tool that might help with research or provide a useful aggregate with which to compare groups may not be the most useful tool to connect with and understand the person in front of us.
Sense about Science are publishing a Data Science Guide to help the public critically evaluate the sea of seemingly meaningful numbers we are bombarded with daily. Their advice, when looking at claims based on data analysis of any kind, is always to ask yourself:
- Where does it come from?
- What is being assumed?
- Can it bear the weight being put on it?
Sound advice. So, how does psychology as a discipline measure up in applying it? A systematic review of 433 scales reports that around 50% of them cited no evidence whatsoever to support their validity (see para 8). Like the lego pain scale, it seems we are relying on face validity. I would love to hear your thoughts: is this a problem for psychology?
Baby Frazer, by the numbers:
- Birthweight: 4.33 kg
- Overdue by: approximately 252 unusually long hours
- Pain caused: 8–9 on the ‘lego’ scale