By Sue Denim
After Toby published my first and second pieces, Imperial College London (ICL) produced two responses. In this article I will examine them. I’ve also written an appendix with some notes on the C and C++ programming languages, addressing some common confusions observed amongst modellers, which Toby will publish tomorrow.
Attempted replication. On June 1st, ICL published a press release on its website stating that Stephen Eglen, an academic at Cambridge, had been able to reproduce the numbers in ICL’s influential Report 9. I was quite interested to see how that was achieved. As a reminder, Imperial College’s Report 9 modelling drove lockdowns in many countries.
Unfortunately, this press release continues ICL’s rather worrying practice of making misleading statements about its work. The headline is “Codecheck confirms reproducibility of COVID-19 model results”, and the article highlights this quote:
I was able to reproduce the results… from Report 9.
This is an unambiguous statement. However, the press release quotes the report as saying: “Small variations (mostly under 5%) in the numbers were observed between Report 9 and our runs.”
This is an odd definition of “replicate” for the output of a computer program, but it doesn’t really matter because what ICL doesn’t mention is this: the very next sentence of Eglen’s report says:
I observed 3 significant differences:
1. Table A1: R0=2.2, trigger = 3000, PC_CI_HQ_SDOL70, peak beds (in thousands): 40 vs 30, a 25% decrease.
2. Table 5: on trigger = 300, off trigger = 0.75, PC_CI_HQ_SD, total deaths: 39,000 vs 43,000, a 10% increase.
3. Table 5: on trigger = 400, off trigger = 0.75, CI_HQ_SD, total deaths: 100,000 vs 110,000, a 10% increase.
In other words, he wasn’t able to replicate Report 9. There were multiple “significant differences” between what he got and what the British Government based its decisions on.
How significant? The supposedly minor difference in peak bed demand between his run and Report 9 is 10,000 beds, or roughly the size of the entire UK field hospital deployment. This supports the argument that ICL’s model is unusable for planning purposes, even though planning is the entire justification for its existence.
Eglen claims this non-replication is in fact a replication by arguing:
although the absolute values do not match the initial report, the overall trends are consistent with the original report
A correctly written model will be replicable to the last decimal place. When using the same seeds and same input data the expected variance is zero, not 25%. Stephen Eglen should retract his “code check”, as it’s incorrect to claim a model is replicable when nobody can get it to generate the same outputs that other people saw.
Number of simulation runs. ICL have contradicted themselves about how Report 9 was generated. Their staff previously claimed: “Many tens of thousands of runs contributed to the spread of results in report 9.” In Eglen’s report we see a very different claim. He explains some of the difference between his results and ICL’s by saying:
These results are the average of NR=10 runs, rather than just one simulation as used in Report 9
Imperial College’s internal controls are so poor they can’t give a straight accounting of how Report 9 was generated.
The point of stochasticity is to estimate confidence bounds. If incorporating random chance into your simulation changes the output only a little, you can assume random chance won’t affect real-world outcomes much either, and this increases your confidence. Report 9 is notable for not providing any confidence bounds whatsoever. All numbers are given as precise predictions in different scenarios, with no discussion of uncertainty beyond a few possible values of R0. None of the graphs render uncertainty bounds either (unlike e.g. the University of Washington model). The lack of bounds would certainly be explained if the simulation were run only once.
People working on the ICL model have argued that the huge variety of bug reports they received doesn’t matter, because they just run it repeatedly and average the outputs. This argument is nonsense, as discussed previously, but if they didn’t actually run the model multiple times at all, then the argument falls apart on its own terms.
Models vs experiments. The belief that you can just average out model bugs appears to be based on a deep confusion between simulations and reality. A shockingly large number of academics seem to believe that running a program is the same thing as running an experiment, and thus any unexplained variance in output should just be recorded and treated as cosmic uncertainty. However, models aren’t experiments; they are predictions generated by entirely controllable machines. When replicating software-generated predictions, the goal is not to explore the natural world, but to ensure that the program can be correctly tested, and to stop model authors simply cherry-picking outputs to fit their pre-conceived beliefs. As we shall see, that is a vital requirement.
Does replication matter? It does. You don’t have to take my word for it: ask Richard Horton, editor of the Lancet, who in 2015 stated:
The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness. As one participant put it, “poor methods get results”.
Alternatively ask Professor Neil Ferguson, who is a signatory to this open letter to the Lancet requesting retraction of the “hydroxychloroquine is dangerous” paper because of the unreliability of the data it’s based on, supplied by an American health analytics company called Surgisphere. The letter justifies the demand for retraction by saying:
The authors have not adhered to standard practices in the machine learning and statistics community. They have not released their code or data.
ICL should give the authors the benefit of the doubt – maybe Surgisphere just need a couple of months to release their code. They are peer-reviewed experts, after all. And statistics isn’t a sub-field of epidemiology, so according to Imperial College spokespeople that means Ferguson isn’t qualified to criticise it anyway.
Initial response and the British Computer Society. Via its opinion writers, the Daily Telegraph picked up on my analysis. ICL gave them this statement:
A spokesperson for the Imperial College COVID-19 Response Team responded to criticism of its code by saying the Government “has never relied on a single disease model to inform decision-making”.
“Within the Imperial research team we use several models of differing levels of complexity, all of which produce consistent results. We are working with a number of legitimate academic groups and technology companies to develop, test and further document the simulation code referred to. However, we reject the partisan reviews of a few clearly ideologically motivated commentators.”
The first statement – that the Government “has never relied on a single disease model” – is typically misleading. In the SAGE publication from March 9th addressing lockdowns, the British Government was given the conclusions of the SPI-M SAGE subgroup in tables 1 and 2. On page 8, that document states that the tables and assumptions are sourced to a single paper from ICL which has never been published, but from the title and content it seems clear that it was an earlier draft of Report 9. There is no evidence of modelling from any other institution contributing to this report. In other words, it doesn’t appear to be true that the Government has “never” relied on a single model – that’s exactly what it was fed by its own advisory panel.
The second statement – the rejection of “partisan reviews of a few clearly ideologically motivated commentators” – is merely unfortunate. By ideologically motivated commentators they must have meant the vast array of professional software engineers who posted their reactions on Twitter, on GitHub and on this site. The beliefs of the vast majority in the software industry were summarised by the British Computer Society (BCS), a body that represents people working in computing in the UK. The BCS stated:
Computer code used to model the spread of diseases including coronavirus “must meet professional standards” … “the quality of the software implementations of scientific models appear to rely too much on the individual coding practices of the scientists who develop them”
Is Imperial College going to argue that the BCS is partisan and ideologically motivated?
On motivations. It’s especially unfortunate when academics defend themselves by claiming their critics – all of them, apparently – are ideological. Observing that coding standards are much higher in the private sector than in the academy isn’t even controversial, let alone ideological, as shown by the numerous responses from academics agreeing with this point, and stressing that they can’t be expected to produce code up to commercial standards. (They “need more funding”, obviously.)
But in recent days people have observed that “for months, health experts told people to stay home. Now, many are encouraging the public to join mass protests.” The world has watched as over 1,200 American epidemiologists, academics and other public health officials published an open letter which said: “[A]s public health advocates, we do not condemn these gatherings as risky for COVID-19 transmission …. this should not be confused with a permissive stance on all gatherings, particularly protests against stay at home orders.”
According to “the science” the danger posed by this virus depends on the ideological views of whoever is protesting. This is clearly nonsense and explains why Imperial College administrators were so quick to accuse others of political bias: they see it everywhere because academia is riven with it.
To rebuild trust in public science will require a firm policy response. As nobody rational will trust the claims of academic epidemiologists again any time soon, as the UK’s public finances are now seriously damaged by furlough and recession, and as professional modelling firms are attempting to develop reliable epidemic models themselves anyway, it’s unclear why this field should continue to receive taxpayer funding. The modellers with better standards can, and should, advise the Government in future.
Appendix: Common errors when working with C/C++. This section is meant only for modellers. Non-modellers or programmers already familiar with these languages should stop reading here.
The C/C++ programming languages are unlike most others. It’s apparent from talking to some modellers that this isn’t widely appreciated. Some believe that the impact of bugs (any bugs) is always likely to be small relative to errors in assumptions, which isn’t the case. An academic working in molecular biology wrote an open letter in response to my analysis, arguing that the ICL fiasco is the fault of software developers for not putting warning labels on C++:
It’s you, the software engineering community, that is responsible for tools like C++ that look as if they were designed for shooting yourself in the foot. It’s also you, the software engineering community, that has made no effort to warn the non-expert public of the dangers of these tools. Sure, you have been discussing these dangers internally, even a lot. But to outsiders, such as computational scientists looking for implementation tools for their models, these discussions are hard to find and hard to understand.
Blaming professional software engineers for disasters caused by untrained academics is hardly a helpful take, especially given attacks on “armchair epidemiologists”, yet the problem he identifies is clearly a real one. Very few scientists work with C/C++. They mostly prefer to use R or Python. These languages are far better choices and don’t suffer the problems I’m about to outline, but are less efficient than C++. If you want something efficient yet safe, try exploring a more modern language like Kotlin for Data Science.
As these articles have been seen by a lot of scientists, I’ll now provide some quick explanations meant for that audience. If you’re a scientist working with C/C++ be aware of the following things:
- Firstly and most critically, if at all possible don’t use these languages. They are designed for efficiency above all else. The ICL COVID-Sim program has several cases of so-called memory safety errors. Beyond data corruption, memory safety errors can create security vulnerabilities that could lead to your institution getting hacked. Google employs some of the best C++ programmers in the world and has a large industrial infrastructure devoted to catching memory safety errors. Despite that, they routinely ship exploitable bugs to the Chrome userbase. To stop this leading to the sort of “code red” security disasters that were routine in the early years of this century, they built a complete firewall around their own code (the “sandbox”) that assumes they have in fact failed, and which tries to contain any subsequent attacks. They also moved to silent security upgrades that users cannot control. In this analysis the Chrome engineering team show that around 70% of all Chrome security bugs are related to memory safety, and explore moving away from using C++.
- A few modellers who commented believed that memory safety errors would surely cause the program to crash, so if it didn’t crash when producing Report 9, any bugs must have been introduced afterwards. This isn’t the case. Crashing is an intentional process, started by the operating system when a violation of system rules is detected, indicating internal corruption inside a program. It is a best-effort process, because the OS cannot detect every possible corruption. If a bug causes an integer variable to be incorrectly set to zero and you then divide by it, your program will typically crash, because integer division by zero is impossible. If the bug incorrectly sets the variable to anything else, the division might succeed and yield an invalid result. Likewise, allocating a list with five elements and trying to read the tenth will reliably crash in almost every language except C and C++, in which list indexes are not checked, for performance reasons. If you read the tenth element of a five-element list and the OS doesn’t detect that, your variable will be set to an arbitrary value (see the first sketch after this list).
- Memory safety errors do not yield uniformly random values. As I’ve repeatedly stressed, some modellers appeared to believe that memory safety errors don’t matter if you average the results. An out of bounds read like this bug is far more likely to yield some values than others, for example, 0, 1, -1, INT_MAX, INT_MIN and pointers into a heap arena or stack frame. You have no idea what it’ll be and cannot predict it, so don’t try.
- Memory safety errors open up what’s called “undefined behaviour”. The compiler is allowed to assume your program has no memory safety errors in it, even though that may be hard to achieve in practice for large programs. It may change your program in complex ways before you run it, based on that assumption. For example, the compiler may silently delete parts of your code, including important parts like security checks (see the second sketch below). Obviously, if the program you’re actually running silently skips a step in your model, the results are scientifically meaningless, even if they may look plausible.
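To make the out-of-bounds case concrete, here is a minimal, self-contained C++ sketch of the five-element/tenth-element situation described above. It is an illustration only, not code taken from COVID-Sim, and the array name and values are invented. On most compilers and platforms it does not crash; it simply reads whatever happens to sit in memory past the end of the array, and that arbitrary value flows silently into the “result”. The exact output depends on the compiler, optimisation flags and operating system, and a sanitizer build (e.g. -fsanitize=address) would flag it immediately.

```cpp
#include <cstdio>

int main() {
    // Five elements, valid indices 0..4.
    int beds[5] = {10, 20, 30, 40, 50};

    // Out-of-bounds read: index 9 does not exist. Most languages would stop
    // here with an error; in C/C++ this is undefined behaviour, and in
    // practice it usually does NOT crash - it just reads whatever happens to
    // be at that address in memory.
    int tenth = beds[9];

    // The arbitrary value then flows silently into downstream calculations.
    double total = 0.0;
    for (int i = 0; i < 5; ++i) total += beds[i];
    total += tenth;

    std::printf("tenth = %d, total = %f\n", tenth, total);
    return 0;
}
```

Run it a few times, or rebuild it with different optimisation flags, and the printed numbers may change; nothing in the program warns you that anything went wrong.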
A major understanding gap between the software industry and academic science appears to be caused by this last point. Once your program contains undefined behaviour you cannot reason about what it will do or whether the outputs are correct. Common sense logic like “it looks right to me” means nothing because something as trivial as an overnight operating system update could cause the results to change totally. Any chance of reliably replicating your results goes out the window. To a software engineer, a program with memory safety errors could do literally anything at all, which is why it’s seen as pointless to argue about whether such a bug has a significant effect.
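To illustrate the point about the compiler silently deleting code, here is a second minimal sketch (again invented for illustration, not taken from COVID-Sim): a sanity check that relies on signed integer overflow, which is undefined behaviour in C/C++. An optimising compiler is entitled to assume the overflow can never happen, fold the condition to “false” and remove the branch entirely, so the “overflow detected” message may never print even when the overflow really occurs; an unoptimised build of the same source may behave differently. Which outcome you get depends on the compiler and its flags, which is exactly why the output of a program containing undefined behaviour cannot be trusted or reliably replicated.

```cpp
#include <cstdio>

// Intended as a sanity check: "did x + 1 wrap around?"
// Signed overflow is undefined behaviour, so the compiler may assume this
// condition is always false and delete the check altogether.
static bool will_overflow(int x) {
    return x + 1 < x;
}

int main() {
    int x = 2147483647;              // INT_MAX: x + 1 really does overflow
    if (will_overflow(x))
        std::printf("overflow detected\n");
    else
        std::printf("no overflow detected\n");  // typical output when optimised
    return 0;
}
```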
Here are two more issues that can bite you when working with C/C++:
- The default random number generators are often too weak for scientific purposes. COVID-Sim attempted to use its own RNG to solve this, but it was also buggy, so that probably just made things worse. If you need a fast source of pseudo-random numbers for Monte Carlo techniques, use an open-source RNG that has already been written for you and run against a battery of statistical tests. Mersenne Twisters work if you’re careful, but a better algorithm to use is Xorshift+. For example this article gives implementations of xorshift and splitmix (see the first sketch after this list). Treat your RNG with care, especially when splitting the stream in multi-threaded contexts. If you’re working with something safer like Java or Kotlin on the JVM, SplittableRandom exists to help you.
- When doing floating-point calculations in parallel, don’t have each thread add its result to a shared variable at the end of the loop. This can (a) cause lost writes if you forget to use an interlocked exchange, and (b) cause non-deterministic runs, because floating-point arithmetic is not associative, so the order in which threads happen to finish changes the rounding and therefore the result. Instead, give each thread its own accumulator and combine the partial sums in a fixed order (see the second sketch after this list).
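Here is a minimal sketch of the kind of generator I mean: xorshift128+ for the stream, seeded via splitmix64 from a single fixed seed. The constants follow the published reference implementations of these two algorithms; the struct name and the seed value are my own invention for illustration, and for real work you should prefer the vetted, tested implementations linked above rather than typing in your own. With a fixed seed the stream is identical on every run, which is what makes bit-for-bit replication possible.

```cpp
#include <cstdint>
#include <cstdio>

// splitmix64: used only to expand one seed into the generator's state.
static std::uint64_t splitmix64(std::uint64_t& state) {
    std::uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// xorshift128+: a fast pseudo-random generator suitable for Monte Carlo work.
struct Xorshift128Plus {
    std::uint64_t s[2];

    explicit Xorshift128Plus(std::uint64_t seed) {
        s[0] = splitmix64(seed);
        s[1] = splitmix64(seed);
    }

    std::uint64_t next() {
        std::uint64_t x = s[0];
        const std::uint64_t y = s[1];
        s[0] = y;
        x ^= x << 23;
        s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
        return s[1] + y;
    }

    // Uniform double in [0, 1), using the top 53 bits.
    double next_double() {
        return (next() >> 11) * (1.0 / 9007199254740992.0);  // 2^53
    }
};

int main() {
    Xorshift128Plus rng(42);   // fixed, documented seed => reproducible stream
    for (int i = 0; i < 3; ++i)
        std::printf("%.17f\n", rng.next_double());
    return 0;
}
```

If you run simulations on several threads, give each thread its own generator seeded independently from a master seed (splitmix64 is designed for exactly this), rather than sharing one stream between threads.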
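And here is a minimal sketch of the deterministic pattern for parallel summation: each thread accumulates into its own slot, and the partial sums are combined in a fixed order once all the threads have finished. Because the order of additions is the same on every run, the rounding is the same, and the total is bit-for-bit reproducible for a given input and thread count. The function name, data and thread count here are arbitrary illustrations.

```cpp
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` across `num_threads` threads deterministically: each thread
// writes its partial sum into its own slot, and the partial sums are then
// combined in a fixed (index) order on one thread.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end =
                (t + 1 == num_threads) ? data.size() : begin + chunk;
            double local = 0.0;                 // per-thread accumulator
            for (std::size_t i = begin; i < end; ++i)
                local += data[i];
            partial[t] = local;                 // one writer per slot: no race
        });
    }
    for (auto& w : workers) w.join();

    // Combining in index order fixes the rounding, so the same inputs and
    // thread count give bit-for-bit identical totals on every run.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

int main() {
    std::vector<double> data(1000000);
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = 1.0 / static_cast<double>(i + 1);   // arbitrary test values
    std::printf("sum = %.17g\n", parallel_sum(data, 4));
    return 0;
}
```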
Because so many C/C++-specific bugs are avoidable with experience, if you do decide your research needs the performance C or C++ offers, you should try to obtain funding to hire a software engineer who has worked with these languages for several years. Resist the temptation to go it alone: you risk the reputation of your institution by doing so.