I want to take you on a journey – a journey into data. What you take from this is up to you – perhaps you’ll gain a healthy scepticism of statistical data presented by experts; perhaps you’ll gain an understanding of types of tricks that might be used to massage data to favour a particular outcome; or perhaps you’ll even decide that government experts are far more likely to be right than a random person on the internet. All that I ask is that you come on the journey with me.
The journey starts with a statement made by outgoing Mayor of the city of New York, Bill de Blasio:
“We want everyone, right now, as quickly as possible, to get those boosters. And we’re going to make it even better for you with a new incentive and an incentive that is here just in time for the holidays. As of today, we’re announcing a $100 incentive for everyone who goes out and gets a booster here in New York City, between now and the end of the year.”
The former Mayor seems to think that there’s no doubt that the vaccines, and the booster vaccine in particular, offer the way out of the Covid epidemic. Let’s see what the data say.
Taking the Data at Face Value
A quick exploration of data sources finds that New York City publishes a fairly decent collation of all of its Covid case data, and one document in particular contains data on cases and case rates by vaccination status. Do the data on cases in the fully vaccinated versus the boosted support the Mayor’s faith in the booster?
That’s actually rather a surprise – even the NYC official data on case rates by vaccination status show that while the vaccine booster might once have offered some protection against infection with Covid, by late January the protection had been lost, and in recent weeks case rates in those given the booster have been approximately double those in the fully vaccinated but not boosted. It isn’t clear why this is the case – perhaps it is the BA.2 variant of Omicron showing increased vaccine escape; perhaps it is the vaccine protection waning – all we know is that the boosted appear to be far more likely to get infected than the fully vaccinated.
These data suggest that de Blasio really should warn people that NYC official data suggest that in taking his $100 they’re going to be twice as likely to become infected, twice as likely to infect a vulnerable person and will be helping to make each Covid infectious wave far more serious.
But of course, I’m being naughty here – I’m guilty of omitting relevant data. How do the unvaccinated fare?
Ouch. Perhaps the former Mayor should have focused only on the unvaccinated for his $100 bribe – these data suggest that the sweet spot is to be fully vaccinated.
But wait – my spidey-senses are tingling…
The Date of the Data Updates
A quick look at the data for the unvaccinated versus the vaccinated in the graph above shows signs of a simple statistical pitfall. For most of the data points there’s a general agreement in the trend between unvaccinated, fully vaccinated and boosted – they generally all go up together and all go down together, even if there are differences of scale. However, the most recent data points (far right) buck the trend, with the data for the vaccinated showing only a slight rise (below the trend), but the data for the unvaccinated showing a significant jump upwards (above the trend). This reminds me of a study I was involved in many years ago, where the early data that came in supported the goals of the study, but as the rest of the data came in we discovered that they didn’t – our early data were biased. Have the New York infection data suffered a similar fate?
Fortunately, the NYC data are hosted on GitHub, a platform that not only hosts the data but also records each and every update to them – thus we can explore how data are revised after their initial addition to the database. To explore this aspect of the data we can’t use the most recent week (as there’s been no opportunity to update it yet), but we can look for evidence of the effect in prior weeks’ data – I’m going to pick two weeks as examples, data for the weeks of March 12th and March 19th:
Let’s start with the data for March 19th (on the left, above). When the data for this week were originally published it was stated that there were approximately 3,200 cases in those vaccinated with two or more doses and 2,000 cases in the unvaccinated. However, in the following week’s publication the data for the week of the 19th had been updated, to approximately 3,800 cases in those vaccinated with two or more doses and 1,400 cases in the unvaccinated. A glance at the update two weeks after the original publication shows that the figures had found their level – after that first week’s update there is little change in the data. The data for the week of March 12th (on the right, above) confirms the effect – exactly the same pattern is seen.
It is worth pointing out explicitly two aspects of their updating the data in this way:
- The NYC authorities are almost certainly only reporting the most recent data in their headline statistics and in their releases to the press. Thus they will be exaggerating the case rate in the unvaccinated and understating the case rates in the vaccinated. Sure, the update might be reported somewhere, but who wants to go trawling around in the data-mud when all the headline data are so easily found?
- There is only a week’s delay between the original publication of the data and the first update but this is where the bulk of the correction occurs. The data have already been delayed by two weeks before publication, so waiting an additional week to allow the data to settle down to their final state would seem the obvious choice. Are they deliberately publishing the data early to mislead the public?
But wait – I can see that the data shown above are hiding another problem in plain sight…
The Unstated Assumption
If you look at the bar charts of our two example data update periods you should see that in their updates they don’t actually add many new cases – the main impact of the update is to transfer cases from the unvaccinated into the vaccinated. This raises obvious questions, such as what on earth is going on?
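The transfer can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures quoted above for the week of March 19th; the group labels and the `revision_summary` helper are my own names for illustration, not anything taken from the NYC files.

```python
# Approximate case counts for the week of March 19th, as quoted in the text.
first_release = {"vaccinated_2plus": 3_200, "unvaccinated": 2_000}
one_week_later = {"vaccinated_2plus": 3_800, "unvaccinated": 1_400}


def revision_summary(before, after):
    """Return the per-group change and the net change in total cases."""
    change = {group: after[group] - before[group] for group in before}
    net = sum(after.values()) - sum(before.values())
    return change, net


change, net = revision_summary(first_release, one_week_later)
print("change by group:", change)
print("net change in total cases:", net)
```

With these rounded figures the 600 cases that leave the unvaccinated group are exactly the 600 that arrive in the vaccinated group, so the total does not move. The published files will not be this tidy, but the pattern is the same: a reallocation, not an influx of new cases.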
When working with data there are all sorts of assumptions, most of which will be stated somewhere. For example, in the datasets we’ve been working with they define a case of Covid not as ‘all cases’ (as you might expect) but only as ‘cases in those aged over five’. The reasons for this definition of ‘case’ in the data are reasonable enough, but the important point is that they’re stated explicitly in the accompanying documentation. However, there are also unstated assumptions, and these are potentially dangerous. In scientific studies it is very important that all efforts are made to turn these into stated assumptions, so that readers (and team members) know exactly what assumptions are being made. This is often why scientific documents are very dry and boring – all of the ‘obvious’ assumptions really need to be tied down, to remove any source of error arising from different interpretations.
In the case of the NYC data and updates, the data quite clearly show that there are few additional new cases brought into the data set after the initial data release; rather, the update mainly moves cases from the unvaccinated group to the vaccinated groups. This is presumably a result of linking each case to the city’s vaccination database.
It very much looks as though there is an unstated assumption – that the vaccination status of all individuals is ‘unvaccinated’ unless proven otherwise.
Beyond this, it is also likely that there is a deeper issue. Even after the vaccination database links have all been found there will almost certainly be some cases apportioned to the unvaccinated group where the individual concerned was actually vaccinated (i.e., it simply wasn’t possible to link his or her status to the vaccination database). The magnitude of this effect is completely unknown (as they don’t offer the data), but in the U.K. the ‘uncertain status’ cases made up about 5% of all cases and I’m sure New York City would be similar.
It goes without saying that cases where an individual became infected within 14 days of becoming fully vaccinated will almost certainly also be counted as unvaccinated.
There’s yet another complication with this new assumption about what constitutes ‘unvaccinated’ – the NYC data include both cases and the case rate per 100,000, so we can calculate the population that they think is in each of the unvaccinated, fully vaccinated and boosted groups (cases divided by rate, multiplied by 100,000). If we do this and add them all together we get a number that is consistently about 7,000,000. This is some 800,000 short of the approximately 7,800,000 that the city authorities claim live in the city and are aged over five. However, the city’s own data on the percentage partially vaccinated suggest that these number approximately 700,000–750,000 – suspiciously close to our shortfall. This suggests to me that the team counting the cases simply don’t speak with the team computing the case rates…
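For anyone wanting to repeat the check, here is a minimal sketch of the back-calculation. The three population totals are the approximate figures quoted in the text; the inputs to the example call are made up purely to show the formula, and `implied_population` is a name I have chosen for illustration.

```python
def implied_population(cases, rate_per_100k):
    """Population implied by a case count and its published rate per 100,000."""
    return cases / rate_per_100k * 100_000


# Approximate figures quoted in the text.
SUM_OF_THREE_GROUPS = 7_000_000   # unvaccinated + fully vaccinated + boosted
CITY_OVER_FIVE = 7_800_000        # city's own over-five population
PARTIAL_LOW, PARTIAL_HIGH = 700_000, 750_000  # partially vaccinated

shortfall = CITY_OVER_FIVE - SUM_OF_THREE_GROUPS

# Hypothetical inputs, chosen only to show the formula at work.
example = implied_population(cases=1_400, rate_per_100k=100.0)

print(f"shortfall: {shortfall:,}")
print(f"partially vaccinated: {PARTIAL_LOW:,} to {PARTIAL_HIGH:,}")
print(f"example implied population: {example:,.0f}")
```

Run against each status group in the real files, the same function gives the three group sizes whose sum falls short of the city total.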
All this talk of unclear vaccination status highlights another problem with these data – just what has happened to the partially vaccinated?
The Unstated Assumption II
The particular data set we’re working with only reports cases and case rates in the unvaccinated, the ‘fully vaccinated’ and the boosted, but not the partially vaccinated. This in itself is fine, even if it would be better to have been given all the data.
A quick note on definitions – ‘fully vaccinated’ in the U.K. is easy to define, as it is those having taken two doses of vaccine (possibly after seven or 14 days). In the U.S. it means those that have completed their vaccine course, which means two doses for some vaccines, but only one dose for others. Thus in the U.S. there is a group called ‘partially vaccinated’ which is made up of those that have received only one dose of a vaccine that requires two doses to be fully vaccinated.
However, the data remain troublesome. The first publication of the data appears to have some cases declared as ‘unvaccinated’ which are then moved into the fully vaccinated or boosted groups. But surely there will also be among the cases of ‘unknown status’ (but counted as unvaccinated) some who are partially vaccinated. The result of this is that we should see the cases that become identified as being in the partially vaccinated being removed from the data set, as the data set does not include a category for partially vaccinated. This means the first week’s update should see a reduction in total cases. Instead, we see that the data actually show a slight rise. What is going on here?
What we really need is another set of data giving total case numbers – that way we can subtract the total given in our original data set (unvaccinated, fully vaccinated and boosted) to get data for those that are only partially vaccinated. This is made complicated by the fact that the data we’ve been working with are only for those aged over five; most of the data sets available to us are for the whole population. Fortunately, the NYC data page does offer case rates by age. As we know that the population they’re working with is approximately 8,337,000 and that approximately 6.3% of the population of New York City is under five, we can calculate the total number of cases in the over-five population from those data and compare with our data set to obtain infection numbers in the partially vaccinated.
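The arithmetic for this step is simple enough to sketch. The population and under-five share are the approximate figures given above; the helper names are mine, and nothing here assumes how the NYC files are actually laid out.

```python
CITY_POPULATION = 8_337_000  # population the city works with, per the text
UNDER_FIVE_SHARE = 0.063     # approximate share aged under five, per the text

# Population aged over five implied by those two figures.
over_five_population = round(CITY_POPULATION * (1 - UNDER_FIVE_SHARE))


def cases_from_rate(rate_per_100k, population):
    """Convert a rate per 100,000 back into a case count."""
    return rate_per_100k * population / 100_000


def residual_cases(total_over_five, cases_by_status):
    """Cases left over once every reported status group is subtracted."""
    return total_over_five - sum(cases_by_status.values())


print(f"over-five population: {over_five_population:,}")
```

The over-five population comes out at a little over 7.8 million, in line with the city figure used earlier. If the status data set really did exclude the partially vaccinated, `residual_cases` should return a clearly positive number for each week.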
Well, that’s another huge surprise. Even though our data set purports not to include cases in the partially vaccinated, there’s just nowhere else for them to be – our estimates of total cases are very nearly equal to another data set that does include cases in the partially vaccinated. A quick check with the NYC data on total vaccinations given suggests that around 750,000 people are partially vaccinated – thus cases in this group should be large enough to be obvious in the graph above.
And so we return to the definition of ‘unvaccinated’ in our data set. It very much looks like a case in the ‘unvaccinated’ is actually defined as ‘any case where we can’t prove that the person was either fully vaccinated or boosted’. Or perhaps: ‘cases in those that are unvaccinated or partially vaccinated, and cases where we just don’t know the vaccination status’.
That last definition of ‘unvaccinated’ brings us to another problem – exactly how many people are going to be in this group? The New York health authorities will have a fairly good idea of how many people in the city have been vaccinated, and we know that they assume that there are around 8,337,000 people in the city when doing their rates calculations – but how many people are actually in the city? The latest U.S. census estimates that there were approximately 8,804,000 people living in the city in 2020 – really we need to use this number when calculating the unvaccinated population in New York City (it will probably be higher than this now, but the 2020 census estimate will be close enough).
In terms of our calculation of case rates for NYC by vaccination status, our original data set’s estimates of case rates in the fully vaccinated and boosted are probably accurate enough (apart from the most recent week’s data), but the ‘unvaccinated’ group should really be called the ‘not fully vaccinated or boosted’, and the denominator for its case rate should be the total population aged five and over that isn’t fully vaccinated or boosted.
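As a rough sketch of that corrected rate: the census figure is the one quoted above, but applying the same 6.3% under-five share to it is my assumption for illustration, as is the function name. The vaccinated and boosted group sizes have to come from the city’s own data and are left as parameters.

```python
CENSUS_2020 = 8_804_000   # 2020 census estimate quoted in the text
UNDER_FIVE_SHARE = 0.063  # assumption: same under-five share as used earlier


def not_fully_vaccinated_rate(cases, fully_vaccinated, boosted,
                              total_population=CENSUS_2020,
                              under_five_share=UNDER_FIVE_SHARE):
    """Case rate per 100,000 among everyone aged five and over who is
    neither fully vaccinated nor boosted.

    fully_vaccinated: people fully vaccinated but not boosted
    boosted:          people who have had a booster
    """
    eligible = total_population * (1 - under_five_share)
    denominator = eligible - fully_vaccinated - boosted
    if denominator <= 0:
        raise ValueError("vaccinated groups exceed the eligible population")
    return cases / denominator * 100_000
```

Because the census population is larger than the city’s figure, the denominator grows and the rate for this group falls, even before any cases are reallocated.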
We’ve gone a long way on this journey – we’ve seen evidence in NYC’s official Covid data that suggests that:
- The most recent week’s data are unreliable, significantly overstating cases in the unvaccinated and understating cases in the vaccinated.
- Just one week after the original data release they update the data – this moves a substantial number of cases from the unvaccinated group to the vaccinated group.
- Cases are probably defined as ‘unvaccinated’ unless proven to be in the fully vaccinated or boosted.
- That means ‘unvaccinated’ includes the partially vaccinated.
- And it almost certainly also includes cases that can’t be linked to the vaccination database, whether this be because forms weren’t filled in correctly, people vaccinated out of state or people vaccinated without paperwork (such as the homeless or undocumented migrants).
- And it almost certainly includes people within 14 days of becoming fully vaccinated.
- Finally, for Covid purposes the city assumes its population is about 8,300,000, even though everyone else, including the U.S. census, thinks that the population is about 500,000 higher.
Let’s update our original graph of case rates by vaccination status with all the information we’ve gained along the way.
What a difference a little investigation makes – it isn’t that we’ve managed to add a small correction to their data, but that we’ve found evidence to support the actual rate in the ‘unvaccinated’ being substantially different to the official estimated case rate, even dipping below the case rate in the boosted in recent weeks. Note that we’ve not estimated the case rate in the unvaccinated, but in a strange group called ‘everyone not fully vaccinated or boosted’, which also includes every case where they couldn’t link the individual to their vaccination database – our estimate of case rates is almost certainly higher than would be found simply in the unvaccinated and partially vaccinated alone.
And so we can return to where we started our data-journey – Bill de Blasio’s generous offer to give $100 from the taxpayers of New York City to each person being boosted. You can take what you want from our travels, but perhaps you might agree with me that the data used by the city authorities to support their Covid strategy are nowhere near as robust as they believe, and that the $100 strategy is likely to make things worse.
I’ll finish with a question – how come any data analysis that shows the vaccines to be performing badly, making things worse or simply not being necessary at all is ignored by the popular press, while no-one questions official data even when they’re full of holes?
Amanuensis is an ex-academic and senior Government scientist. He blogs at Bartram’s Folly.