How do psychologists get data to test their theories? The traditional method was to recruit a sample of undergraduate students from their university. While this was very convenient, it had one obvious disadvantage: undergraduate students aren’t necessarily representative of the population – let alone all humanity.
In recent years, therefore, psychologists have turned to online survey platforms – the most popular of which is Amazon’s Mechanical Turk. How does it work? Individuals sign up to be ‘MTurk workers’ and can then be recruited to participate in academic studies. They are paid a small fee each time they participate in one. Specifically, they are paid a minimum of $0.01 per ‘assignment’, with a typical study involving many ‘assignments’.
MTurk has become extremely popular among psychologists and other academics. Over 40% of the articles in some journals are based on MTurk data, so we’re talking about thousands of studies. The problem is that most of these studies may be flawed.
Why?
A growing body of evidence indicates that MTurk data is of very low quality, due to the high percentage of MTurk workers who are careless responders. By ‘high percentage’ I mean upwards of 75%. Careless responders may click options at random, or they may engage in what’s called ‘straightlining’, where they click the same option (typically the first one that appears) for every successive question. Both types of responding yield data that is worthless.
In a new preprint, Cameron Kay provides a particularly vivid illustration of the problems with MTurk. His methodology was simple: he recruited a sample of MTurk workers and gave them 27 ‘semantic antonyms’. These are pairs of items that ought to yield opposite answers: for example, “I am an extravert” and “I am an introvert”, or “I talk a lot” and “I rarely talk”.
If most respondents are paying attention and taking the study seriously, the pairs of items will be negatively correlated. There’ll be a tendency for people who agree with “I am an extravert” to disagree with “I am an introvert”. By contrast, if most respondents are straightlining, the pairs of items will be positively correlated. And if most respondents are clicking randomly, the pairs of items won’t be correlated at all.
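To see why, here is a minimal simulation (my own sketch, not taken from Kay’s preprint) of how the three response styles play out for a single pair of antonyms, assuming a 1–5 rating scale and that each straightliner sticks to one column throughout:

```python
# Hypothetical illustration of how response style drives the correlation
# between a pair of semantic antonyms ("I am an extravert" / "I am an introvert").
import numpy as np

rng = np.random.default_rng(42)
n = 1_000  # hypothetical sample size

def r(x, y):
    """Pearson correlation between two response vectors."""
    return np.corrcoef(x, y)[0, 1]

# Attentive responders: both items tap the same underlying trait,
# but in opposite directions, so the answers should disagree.
trait = rng.normal(size=n)
extravert = np.clip(np.round(3 + 1.2 * trait + rng.normal(0, 0.7, n)), 1, 5)
introvert = np.clip(np.round(3 - 1.2 * trait + rng.normal(0, 0.7, n)), 1, 5)
print(f"Attentive:      r = {r(extravert, introvert):+.2f}")  # strongly negative

# Straightliners: the same column is clicked for both items,
# so the two answers move in lockstep.
column = rng.integers(1, 6, size=n)
print(f"Straightlining: r = {r(column, column):+.2f}")  # +1.00

# Random clickers: the two answers are unrelated.
print(f"Random:         r = {r(rng.integers(1, 6, n), rng.integers(1, 6, n)):+.2f}")  # near 0
```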
What did Kay find? 26 out of 27 pairs of items were positively correlated. In other words, there was a tendency for respondents who agreed with “I am an extravert” to agree with “I am an introvert” – complete nonsense. This is shown in the chart below.
To see whether the data could be salvaged, Kay re-ran the numbers after excluding the 47% of respondents who failed one or more checks for careless responding: they answered faster than 2 seconds per item; they gave the same response to half the items in a row; and/or they answered incorrectly on items where the correct answer was stated in the item itself (e.g., ‘choose disagree’). Even after excluding nearly half the respondents, 24 out of 27 pairs of items were still significantly positively correlated – as shown in the light-orange bars above.
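For readers curious what screening of this kind looks like in practice, here is a rough sketch of checks along the lines Kay describes (my own illustration, not his code; the data-frame columns are hypothetical):

```python
import pandas as pd

def longest_run(responses):
    """Length of the longest run of identical consecutive answers."""
    best = run = 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def screen(df: pd.DataFrame, item_cols: list, n_items: int) -> pd.DataFrame:
    """Drop respondents flagged by any of the three careless-responding checks."""
    too_fast = df["seconds_total"] / n_items < 2       # averaged under 2 seconds per item
    straightlined = df[item_cols].apply(               # same answer to half the items in a row
        lambda row: longest_run(list(row)) >= n_items / 2, axis=1
    )
    failed_check = df["instructed_item"] != "disagree" # missed the 'choose disagree' item
    return df[~(too_fast | straightlined | failed_check)]
```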
Interestingly, when he gave the same semantic antonyms to a sample recruited from CloudResearch Connect, another online survey platform, the pairs of items showed the expected negative correlations. These are shown in the blue bars above. So the good news is that the problem of careless responding appears to be specific to MTurk, rather than one that afflicts all online survey platforms.
CloudResearch Connect workers are typically paid more than their counterparts on MTurk and hence may be less inclined to respond carelessly (although in Kay’s study, the two samples were both paid the equivalent of $8 per hour).
There has been much talk of a ‘replication crisis’ in psychology and other disciplines, which has been attributed to a combination of publication bias and questionable research practices. We may soon hear about a ‘second replication crisis’, stemming from the overreliance on MTurk data.