There follows a guest post by Daily Sceptic contributing editor Mike Hearn about the ongoing problem of apparently respectable scientific journals publishing computer-generated ‘research’ papers that are complete gibberish.
The publisher Springer Nature has released an “expression of concern” for more than four hundred papers they published in the Arabian Journal of Geosciences. All these papers supposedly passed through both peer review and editorial control, yet no expertise in geoscience is required to notice the problem:
The paper can’t decide if it’s about organic pollutants or the beauty of Latin dancing, and switches instantly from one to the other halfway through the abstract.
The publisher claims this went through about two months of review, during which time the editors proved their value by assigning it helpful keywords:
If you’re intrigued by this fusion of environmental science and fun hobbies, you’ll be overjoyed to learn that the full article will only cost you about £30. There are many more available if that one doesn’t take your fancy, for example:
- Distribution of earthquake activity in mountain area based on embedded system and physical fitness detection of basketball
- Detection of rare earth elements in groundwater based on SAR imaging algorithm and fatigue intervention of dance training
- Detection of PM2.5 in mountain air based on fuzzy multi-attribute and construction of folk sports activities
- Characteristics of heavy metal pollutants in groundwater based on fuzzy decision making and the effect of aerobic exercise on teenagers
Peer-reviewed science is the type of evidence policymakers respect most. Nonetheless, a frequent topic on this site is scientific reports containing errors so basic that any layman can spot them immediately, leading to the question of whether anyone actually read the papers before publication. An example is the recent article by Imperial College London, published in Nature Scientific Reports, in which the first sentence was a factually false claim about public statistics.
Evidence is now accruing that it’s indeed possible for “peer reviewed” scientific papers to be published which have not only never been reviewed by anybody at all, but might not have even been written by anybody, and that these papers can be published by well-known firms like Springer Nature and Elsevier. In August we wrote about the phenomenon of nonsensical “tortured phrases” that indicate the use of thesaurus-driven paper rewriting programs, probably the work of professional science-forging operations called “paper mills”. Thousands of papers have been spotted using this technique; the true extent of the problem is unknown. In July, I reported on the prevalence of Photoshopped images and Chinese paper-forging efforts in the medical literature. Papers are often found that are entirely unintelligible, for example this paper, or this one whose abstract ends by saying, “Clean the information for the preparation set for finding valuable highlights to speak to the information by relying upon the objective of the undertaking.” – a random stream of words that means nothing.
Where does this kind of text come from?
The most plausible explanation is that these papers are being auto-generated using something called a context-free grammar. The goal is probably to create the appearance of interest in the authors they cite. In academia, promotions are linked to publications and citations, creating a financial incentive to engage in this sort of metric gaming. The signs are all there: inexplicable topic switches halfway through sentences or paragraphs, rampant grammatical errors, the repetitive title structure, citations of real papers and so on. Another sign is the explanation the journal supplied for how it occurred: the editor claims that his email address was hacked.
In this case, something probably went wrong during the production process that caused different databases of pre-canned phrases to be mixed together incorrectly. The people generating these papers are doing it on an industrial scale, so they didn’t notice because they don’t bother reading their own output. The buyers didn’t notice – perhaps they can’t actually read English, or don’t exist. Then the journal didn’t notice because, apparently, it’s enough for just one person to get “hacked” for the journal to publish entire editions filled with nonsense. And finally none of the journal’s readers noticed either, leading to the suspicion that maybe there aren’t any.
The volunteers spotting these papers are uncovering an entire science-laundering ecosystem, hiding in plain sight.
We know randomly generated papers can get published because it’s happened hundreds of times before. Perhaps the most famous example is SCIgen, “a program that generates random Computer Science research papers, including graphs, figures, and citations” using context-free grammars. It was created in 2005 by MIT grad students as a joke, with the aim to “maximize amusement, rather than coherence”. SCIgen papers are buzzword salads that might be convincing to someone unfamiliar with computer science, albeit only if they aren’t paying attention.
Despite this origin, in 2014 over 120 SCIgen papers were withdrawn by leading publishers like the IEEE after outsiders noticed them. In 2020 two professors of computer science observed that the problem was still occurring and wrote an automatic SCIgen detector. Although it’s only about 80% reliable, it nonetheless spotted hundreds more. Their detector is now being run across a subset of new publications and finds new papers on a regular basis.
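For readers unfamiliar with the term, a context-free grammar is just a table of rewrite rules: each placeholder is replaced by one of several alternatives, chosen at random, until only ordinary words remain. The Python sketch below is a toy illustration of that idea and nothing more. It is not SCIgen’s code, nor the method any paper mill is known to use, and the `GRAMMAR` table, the phrase lists and the function names are all invented for this example. Mixing an environmental-science phrase bank with a hobby phrase bank in one rule is enough to produce titles with the same abrupt topic switch.

```python
import random

# Hypothetical toy grammar, invented for illustration only.
# Keys are placeholders (nonterminals); each maps to a list of alternative
# expansions. Any string that is not a key is treated as literal text.
GRAMMAR = {
    "TITLE": [["SUBJECT", "based on", "METHOD", "and", "HOBBY"]],
    "SUBJECT": [
        ["Detection of", "POLLUTANT", "in", "PLACE"],
        ["Distribution of", "POLLUTANT", "in", "PLACE"],
    ],
    "POLLUTANT": [["heavy metals"], ["organic pollutants"], ["fine dust"]],
    "PLACE": [["groundwater"], ["mountain air"], ["coastal soil"]],
    "METHOD": [["fuzzy decision making"], ["an embedded system"]],
    # A phrase bank from an unrelated subject, mixed into the same rule.
    "HOBBY": [["the teaching of dance"], ["fitness testing in basketball"]],
}


def expand(symbol, rng):
    """Recursively rewrite a symbol until only literal text remains."""
    if symbol not in GRAMMAR:
        return symbol  # literal text: nothing left to expand
    production = rng.choice(GRAMMAR[symbol])  # pick one alternative at random
    return " ".join(expand(part, rng) for part in production)


def generate_titles(count, seed=0):
    """Return `count` random titles; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [expand("TITLE", rng) for _ in range(count)]


if __name__ == "__main__":
    for title in generate_titles(5):
        print(title)
```

Every title this produces is grammatical and follows the same template, which is why the repetitive title structure is such a giveaway. A real generator would presumably use a far larger rule table, but the mechanism would be the same.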
Root cause analysis
On its face, this phenomenon is extraordinary. Why can’t journals stop themselves publishing machine-generated gibberish? It’s impossible to imagine any normal newspaper or magazine publishing thousands of pages of literally random text and then blaming IT problems for it, yet this is happening repeatedly in the world of academic publishing.
The surface-level problem is that many scientific journals appear to be almost or entirely automated, including journals that have been around for decades. Once papers are submitted, the reviewing, editorial and publishing process is handled by computers. If the system stops working properly, editors can seem oblivious – they routinely discover they published nonsense only because people who don’t even subscribe to their journal complained about it.
Strong evidence for this comes from the “fixes” journals present when put under pressure. As an explanation for why the 436 “expressions of concern” wouldn’t be repeated, the publisher said:
The dedicated Research Integrity team at Springer Nature is constantly searching for any irregularities in the publication process, supported by a range of tools, including an in-house-developed detection tool.
The same firm also proudly trumpeted in a press release that:
Springer announces the release of SciDetect, a new software program that automatically checks for fake scientific papers. The open source software discovers text that has been generated with the SCIgen computer program and other fake-paper generators like Mathgen and Physgen. Springer uses the software in its production workflow to provide additional, fail-safe checking.
A different journal proposed an even more ridiculous solution: ban people from submitting papers from webmail accounts. The more obvious solution of paying people to read the articles before they get published is apparently unthinkable – the problem of fake auto-generated papers is so prevalent, and the scientific peer review process so useless, that they are resorting to these feeble attempts to automate the editing process.
Diving below the surface, the problem may be that journals face functional irrelevance in the era of search engines. Clearly nobody can be reading the Arabian Journal of Geosciences, including its own editors, yet according to an interesting essay by Prof Igor Pak, “publisher’s contracts with [university] libraries require them to deliver a certain number of pages each year”. What’s in those pages? The editors don’t care because the libraries pay regardless. The librarians don’t care because the universities pay. The universities don’t care because the students and granting bodies pay. The students and granting bodies don’t care because the government pays. The government doesn’t care because the citizens pay, and the citizens DO care – when they find out about this stuff – but generally can’t do anything about it because they’re forced to pay through taxes, student loan laws and a (socially engineered) culture in which people are told they must have a degree or else they won’t be able to get a professional job.
This seems to be zombifying scientific publishing. Journals outside the top tier live on as third-party proof that some work was done, which in a centrally planned economy has value for justifying funding requests to committees. But in any sort of actual market-based economy many of them would have disappeared a long time ago.