COVID-19 Data Analysis Update - Statistical Analysis
For each of the countries we survey, we distinguish different periods of pandemic development based on the respective growth rates for the number of infections recorded. We believe that we can establish general statistical regularities. At the beginning, the growth rate is typically extremely high (dark red), but then weakens. In the final saturation phase, the growth rate becomes so low (turquoise) that the development of the epidemic is essentially under control. Various countries are currently at different stages of development. In countries in which the growth rate is still very high, as is currently the case in Germany, it must be expected that a saturation phase will only occur after much higher case numbers.
In the following section we describe the figures that become available when a country is selected from the forecast table.
Case progression + Case progression (logarithmic)
The first two graphs show the number of infected, deceased and recovered patients over time, once on a linear and once on a base-2 logarithmic scale. They also show the number of active cases, calculated as the number of infected cases minus the recovered and deceased cases. On a logarithmic scale, exponential growth appears as a straight line, as is the case for most countries that have not yet reached the saturation phase. As saturation is reached, the curve of infected cases on the base-2 log scale flattens out (as can be seen in the graph for South Korea).
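The two quantities described above can be illustrated with a minimal sketch. The case counts are hypothetical and chosen so that infections double every day; the variable names are our own and not part of the underlying data set.

```python
import math

# Hypothetical cumulative counts for five consecutive days (illustrative only).
infected = [100, 200, 400, 800, 1600]
recovered = [0, 5, 20, 60, 150]
deceased = [0, 1, 4, 10, 25]

# Active cases = infected minus recovered minus deceased.
active = [i - r - d for i, r, d in zip(infected, recovered, deceased)]

# On a base-2 log scale, a step of 1 corresponds to one doubling.
log2_infected = [math.log2(i) for i in infected]

# Day-to-day steps of the log2 series: constant under pure exponential growth,
# which is why the curve looks like a straight line on the logarithmic graph.
log2_steps = [b - a for a, b in zip(log2_infected, log2_infected[1:])]

print(active)
print(log2_steps)
```

When the steps of the log2 series shrink towards zero, the curve flattens, which is the saturation behaviour visible in the logarithmic graph.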
Regarding the data for Germany:
In the logarithmic case progression there are two jumps, which can likely be explained by changes in the testing system or method and do not necessarily reflect a real spike in the actual number of cases.
Daily infections + Daily deaths
The next two graphs show the number of daily new infections and new deaths on a linear scale. Keep in mind that the infection numbers reflect only detected cases, which depend heavily on the availability of tests and on the testing strategy (broad testing leaves far fewer cases unreported). Also, because epidemic spreading is exponential in nature, even a constant or only slightly increasing number of daily new infections is a positive sign: the base number of infected people is still growing during the unsaturated phase, so a constant daily increase means that the growth rate is falling.
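The point about constant daily numbers can be checked with a small calculation. The sketch below uses a hypothetical cumulative series with exactly 500 new cases per day; the helper names are our own.

```python
def daily_new(cumulative):
    """Daily new cases as differences of the cumulative series."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def relative_growth(cumulative):
    """Daily new cases divided by the previous day's cumulative total."""
    return [(b - a) / a for a, b in zip(cumulative, cumulative[1:])]


# Hypothetical cumulative infections with a constant daily increase.
cumulative = [1000, 1500, 2000, 2500, 3000]

new_cases = daily_new(cumulative)
rates = relative_growth(cumulative)

print(new_cases)  # the same number of new cases every day
print(rates)      # the relative growth rate falls from day to day
```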
Logarithmic growth rate trend + Arithmetic growth rate trend + Double-logarithmic growth rate trend
The next three graphs are constructed on the theoretical basis of SIR-type models (the simplest compartmental model, in which individuals are either susceptible to the disease, infectious, or recovered/deceased) or SIS-type models (in which no long-lasting immunity occurs and individuals are either susceptible or infectious). For a detailed explanation of the use of these models, please read the mathematical background section. These models imply that the logarithmic growth rate of infections depends linearly on the number of infected cases. A linear regression line (in blue, with the standard deviation indicated by dashed parallel lines in red) is fitted to the logarithmic and arithmetic growth rates plotted over the number of infections. Large deviations from this straight line may indicate problems or systematic changes in data collection. The flatter the blue line is, the more slowly the epidemic weakens. A linear extrapolation looks for the intersection of this line with the horizontal axis to determine how many infections are to be expected in total. This extrapolation is then used as a basis for the infection forecast. To account for the fact that growth rates differ between countries, we have inserted colored horizontal lines in graphs 5 through 7 that indicate growth rate levels. For each country, one can therefore see in which range its growth rate has been and where it currently is.
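The regression and extrapolation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind the graphs: the logarithmic growth rate is taken to be ln(N[t+1] / N[t]), it is paired with the case number N[t] of the earlier day, and the test series is synthetic, generated so that the growth rate falls exactly linearly with the case number and reaches zero at K = 10000 cases. All function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def log_growth_rates(cumulative):
    """Logarithmic growth rate between consecutive days (assumed definition)."""
    return [math.log(b / a) for a, b in zip(cumulative, cumulative[1:])]


def extrapolate_total(cumulative):
    """Case number at which the fitted line crosses the horizontal axis."""
    rates = log_growth_rates(cumulative)
    slope, intercept = fit_line(cumulative[:-1], rates)
    if slope >= 0:
        return None  # growth rate is not declining; no intersection ahead
    return -intercept / slope


# Synthetic series: the growth rate declines linearly from r0 to 0 at K cases.
K, r0 = 10000.0, 0.3
series = [10.0]
for _ in range(30):
    n = series[-1]
    series.append(n * math.exp(r0 * (1.0 - n / K)))

estimate = extrapolate_total(series)
print(round(estimate))
```

On real data the points scatter around the line, so the intersection is only as reliable as the fit; a flat line (slope close to zero) pushes the estimated total far out and makes it very sensitive to noise.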
As one can see from these graphs, however, a linear regression does not always work well. One likely reason is that an epidemic infection typically spreads rapidly, with a time lag of 6-7 days. This leads to an exponential stretch of the corresponding coordinate axis. Therefore, the seventh graph also employs a logarithmic scale for the infection numbers. This reveals that the usefulness of linear regression is highly country-dependent.
For predicting the further development of the epidemic in each country, it seems important to determine the point at which the growth rate falls below the yellow line (r<0.1). If the figures from China (although these data are probably systematically distorted) and South Korea can be transferred to other countries, the final number of cases will be about two and a half times the number of cases at this point in time. Of course, this is only a very rough estimate with many uncertainties, and not a reliable prognosis. Before this point in time, it is probably not possible to make any reasonably reliable forecasts. It is also currently unclear to what extent the findings from East Asian countries can be generalized to others. In particular, the actual development will also depend on the measures taken or to be taken to contain the epidemic, on how they are implemented, and on how well the population complies with them.
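The rule of thumb from this paragraph can be written down as a short sketch. It assumes that r is the daily logarithmic growth rate ln(N[t] / N[t-1]); the factor 2.5 and the threshold 0.1 come from the text, while the function names and the case series are hypothetical.

```python
import math


def first_day_below(cumulative, threshold=0.1):
    """Index of the first day whose growth rate is below the threshold."""
    for day in range(1, len(cumulative)):
        rate = math.log(cumulative[day] / cumulative[day - 1])
        if rate < threshold:
            return day
    return None


def rough_final_estimate(cumulative, threshold=0.1, factor=2.5):
    """Very rough final case number: factor times the cases at the crossing."""
    day = first_day_below(cumulative, threshold)
    if day is None:
        return None  # threshold not yet reached; no estimate possible
    return factor * cumulative[day]


# Hypothetical cumulative infections with a slowing growth rate.
cumulative = [100, 150, 210, 270, 320, 350, 370]

print(first_day_below(cumulative))       # 5
print(rough_final_estimate(cumulative))  # 875.0
```

Returning `None` before the threshold is crossed mirrors the caveat above that no reasonably reliable forecast is possible before that point.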
So, we do not simply extrapolate the current growth rates in order to predict, for example, how quickly the number of infections will double. If current growth rates were maintained, practically the entire population would be infected in most countries within a short time. Instead, we try to capture regularities in the change in the growth rate. In general, it seems to be the case that after a strong initial phase, the growth rate slows down and the epidemic finally passes into a saturation phase, where there are relatively few new infections. Our statistical goal is to estimate when this will happen and what the total number of infections will be by then.
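To see why a naive extrapolation of the current growth rate is not useful, consider the following sketch. It assumes a constant daily logarithmic growth rate r, for which the doubling time is ln(2) / r; the population size, case number and rate are hypothetical round numbers, not data.

```python
import math


def doubling_time(rate):
    """Days needed to double at a constant logarithmic growth rate."""
    return math.log(2) / rate


def days_to_reach(current, target, rate):
    """Days until 'target' cases are reached at a constant growth rate."""
    return math.log(target / current) / rate


# Hypothetical example: 50,000 cases, population of 80 million, r = 0.2 per day.
rate = 0.2
print(doubling_time(rate))                        # about 3.5 days
print(days_to_reach(50_000, 80_000_000, rate))    # about 37 days
```

Under these assumptions the whole population would be reached in little more than a month, which is why the change in the growth rate, rather than its current value, is the quantity of interest.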
Regarding the data for Germany:
For the double-logarithmic growth rate trend (where each data point represents a day, but the horizontal axis does not represent days but the logarithm of the infection numbers) there are two downward spikes in recent weeks. We presume that these are due to the nationwide protective measures and regulations put in place, whose effects manifest after a lag of 7-10 days. However, the growth rate then remains approximately at this level and does not decrease further. This has to be assessed as an unfavorable development.
Death rate time development
The last figure shows the death rate over time. Again, the reporting may vary from country to country, as it is not always possible to distinguish patients who die from the coronavirus from those who die with it. It is also possible that death rates are systematically underreported for purposes of political propaganda.
Notice on data integrity
We would like to point out some aspects of the data situation that have emerged from our analyses. At the beginning of the epidemic, strong fluctuations and deviations from the regression line can be seen in every country. This is simply due to the small case numbers. In the Chinese data, there is a sudden sharp jump in the middle of the series. However, this does not seem to be due to such a large sudden increase in the actual number of cases, but rather to a change in data collection. In fact, the Chinese data fail a statistical normality test, and we therefore do not consider them trustworthy. We presume that the actual numbers of infections and deaths are substantially higher than the official data, possibly even by an order of magnitude.
The test density and the classification of the test results vary greatly from country to country, so that the numbers of infected persons cannot be easily compared. Many infected persons are not recorded, and the proportion varies from country to country. It is also possible that in certain countries the official data is falsified by political manipulation. Even the number of deaths reported may vary between countries, as patients with pre-existing conditions may not be counted as COVID-19 victims. Perhaps in some countries only those who have died in hospitals are recorded. In addition, when interpreting the statistical data, it should be borne in mind that there is typically a considerable period of time between infection and death of a patient. Current death figures are therefore correlated more with past than with current infection numbers. We also see sudden increases in death rates in some countries, perhaps because their medical systems become overwhelmed.
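The effect of the delay between infection and death on the computed death rate can be illustrated with a minimal sketch. It assumes that the death rate is the ratio of cumulative deaths to cumulative infections, optionally taking the infections from a fixed number of days earlier; the two-day lag and the case numbers are purely hypothetical and not an estimate of the real delay.

```python
def death_rate(deaths, infected, lag_days=0):
    """Cumulative deaths divided by cumulative infections 'lag_days' earlier.

    Days without a usable reference value yield None.
    """
    rates = []
    for day in range(len(deaths)):
        ref = day - lag_days
        if ref < 0 or infected[ref] == 0:
            rates.append(None)
        else:
            rates.append(deaths[day] / infected[ref])
    return rates


# Hypothetical series: infections double daily, deaths follow two days later.
infected = [100, 200, 400, 800, 1600, 3200]
deaths = [0, 1, 2, 4, 8, 16]

naive = death_rate(deaths, infected)
lagged = death_rate(deaths, infected, lag_days=2)

print(naive[-1])   # 0.005
print(lagged[-1])  # 0.02
```

While case numbers are still growing quickly, the unlagged ratio understates the death rate among those infected, here by a factor of four.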
The different death rates in different countries require explanation. In some countries, such as Germany or the Scandinavian countries, they are quite low, while in Italy in particular they are very high. There are various probable reasons for these differences. Firstly, different population groups may be affected by infections: in Italy mainly older people were infected, while in central and northern Europe the infection predominantly affected tourists returning from skiing holidays. Also, large sporting events or festivals may have triggered waves of infection in some countries. Secondly, it is possible that in some countries the infection rates are significantly underestimated. Thirdly, it may be that, as already mentioned, not all deaths are recorded in all countries, for whatever reason. As shown separately, we observe a disturbing general trend of rising mortality rates during the course of the epidemic.
The numbers of recovered patients are probably not accurate either, because hospital discharges are often not reported to the authorities and those who have recovered at home will usually not report either. It is therefore possible that the epidemic is already under control before the official number of active cases reaches zero.
In particular, for these reasons, the figures officially reported from China, where the epidemic is claimed to be under control, must be used with caution when forecasting the development in other countries.
The spread can also differ greatly, because the social contact networks through which infections occur can be very heterogeneous. In South Korea, the virus appears to have spread mainly within a religious sect. Contact rates within this group were very high, so that the virus spread rapidly there, while contacts with the outside world were much less frequent, so that the infection could be essentially confined to this group. In China, the epidemic was essentially limited to one province, Hubei, by strictly prohibiting and preventing all external contacts. In the Scandinavian countries, we see two peaks in the number of infections, which indicates that there have been two different waves of spread. Either, similar to South Korea, the infection initially spread only within a certain group and only later affected other segments of the population, or there was a second wave of infection independent of the first. In other countries, festivals, sports matches or other major events may also have caused a sudden worsening of the epidemic. Network propagation models must therefore take particular account of network heterogeneity.
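The effect of heterogeneous contacts can be illustrated with a toy model. The sketch below is not a network model and not the model used for the forecasts; it is a deterministic, discrete-time SIR model with two groups, where all parameters (group sizes, contact rates, recovery rate) are hypothetical. Group 0 is small with very intense internal contacts, group 1 is large with weak internal contacts, and the two are only loosely connected.

```python
import math


def simulate_two_groups(sizes, contact, recovery=0.2, days=200):
    """Discrete-time SIR with two groups.

    contact[i][j] is the assumed daily transmission rate from infectious
    members of group j to susceptible members of group i.
    Returns the fraction of each group that was ever infected.
    """
    susceptible = [sizes[0] - 1.0, float(sizes[1])]  # one initial case in group 0
    infectious = [1.0, 0.0]
    for _ in range(days):
        prevalence = [infectious[j] / sizes[j] for j in range(2)]
        new_cases = []
        for i in range(2):
            force = sum(contact[i][j] * prevalence[j] for j in range(2))
            new_cases.append(susceptible[i] * (1.0 - math.exp(-force)))
        for i in range(2):
            susceptible[i] -= new_cases[i]
            infectious[i] += new_cases[i] - recovery * infectious[i]
    return [1.0 - susceptible[i] / sizes[i] for i in range(2)]


sizes = [1000, 100000]
contact = [[1.0, 0.001],    # intense contacts inside the small group
           [0.001, 0.05]]   # weak contacts inside the large group
attack_a, attack_b = simulate_two_groups(sizes, contact)

print(attack_a, attack_b)
```

With these assumed parameters almost the entire small group is infected while the large group is barely affected, which is the qualitative pattern described for South Korea. A single homogeneous population with averaged contact rates would not reproduce it.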