COVID-19 Data Analysis Update - Mathematical background

The SIR model

We start by describing the classical SIR model for epidemic spreading [2], although, as we shall see, the model cannot be taken as such to model the current pandemic, but will need some modifications.

Components of the model

We look at the following quantities that depend on time \(t\).

  • \(S_t\): the number of susceptibles at time \(t\),
  • \(I_t\): the number of active infected cases at time \(t\),
  • \(D_t\): the number of death cases at time \(t\),
  • \(R_t\): the number of recovered cases at time \(t\),
  • \(\Gamma_t\): the total infections at time \(t\), as reported daily. Then \(\Gamma_t = I_t + D_t + R_t\),
  • \(N_t\): the total population at time \(t\). Then \(N_t = S_t + \Gamma_t\) (neglecting normal births and deaths).

The classical SIR model (here augmented by the death rate) then is

\(\frac{dS_t}{dt} = - \beta \frac{S_tI_t}{N_t}\) (1.1)
\(\frac{dI_t}{dt} = \beta \frac{S_tI_t}{N_t} - \mu I_t - \gamma I_t\) (1.2)
\(\frac{dR_t}{dt} = \gamma I_t,\quad \frac{d D_t}{dt} = \mu I_t.\) (1.3)

where linear relations are assumed between the numbers of new deaths, recovered and active infected cases. It is assumed that recovered patients are immune and therefore will not become susceptible again. We assume that we are at a stage of the epidemic where \(D_t \ll N_t\), so that we may replace \(N_t\) by a constant population size \(N\). (Alternatively, one could assume that \(\beta=\beta_t\) is time dependent and that the ratio \(\frac{\beta_t}{N_t}\) remains constant.) Then, formally, there is no longer a difference between \(R_t\) and \(D_t\) because both the recovered and the deceased patients do not reenter the susceptible part \(S_t\) of the population.
Importantly, the model as stated ignores the time delay between infection and recovery/death.

In order to gain some analytical insight, we observe that \(I_t\) appears as a factor in all terms on the right hand side of the model. We therefore rewrite it as

\(\frac{1}{I_t}\frac{dS_t}{dt} = - \frac{\beta}{N}S_t\)
\(\frac{1}{I_t}\frac{dI_t}{dt} = \frac{\beta}{N}S_t - \mu - \gamma\)
\(\frac{1}{I_t}\frac{dR_t}{dt} = \gamma ,\quad \frac{1}{I_t}\frac{d D_t}{dt} = \mu\)

This allows for an explicit solution [3], where we assume that the epidemic starts at time \(t=0\) with \(S_0=N, R_0=D_0=0\).

\(S_t=Ne^{-\xi(t)}\) (1.4)
\(I_t=N-S_t-R_t -D_t\) (1.5)
\(R_t= \frac{\gamma N}{\beta}\xi(t),\quad D_t=\frac{\mu N}{\beta}\xi(t)\) (1.6)


\(\xi(t)=\frac{\beta}{N}\int_0^tI_sds,\) (1.7)

that is, we essentially reparametrize time by the cumulated number of infections.
This yields

\(\frac{d \log I_t}{dt} = \frac{1}{I_t}\frac{dI_t}{dt} = \beta e^{-\xi(t)} -\gamma -\mu\)
\(= \frac{\beta}{N} (N-I_t -R_t -D_t)-\gamma -\mu \quad \text{ since } S_t=N-I_t-R_t-D_t\)
\(= \beta-\gamma -\mu -\frac{\beta}{N}\Gamma_t \quad \quad \text{ since } \Gamma_t= I_t +R_t +D_t,\) (1.8)

and therefore also

\(\frac{d \log I_t}{dt}\le \beta-\gamma -\mu -\frac{\beta}{N}I_t.\) (1.9)

Linear regression method

(1.8) and (1.9) suggest a linear regression, as the growth rate of the number of infections is controlled by a linear function of that infection number itself, with a negative coefficient.
The extrapolation of this linear regression gives us an approximation of the number of total infections at the time when the number of active cases peaks.
In practice, we may not be able to obtain the logarithmic growth rate of \(I\) from (1.8). First of all, the number of infected cases may be underreported, for lack of testing or for political reasons in some countries, and in the latter case even the mortalities \(D\) may be incorrectly reported. Secondly, even if we know \(\Gamma\) and \(D\), recoveries \(R\) may not be faithfully reported so that we do not have an accurate picture for the active cases. Of course, the latter ones are those of concern for predicting the time course of the pandemic, and that is why we have derived the relations (1.8), (1.9) for their growth rate.

In the beginning, the number of infections is small compared to the total population size, and therefore, we may approximate in (1.7)

\(e^{-\xi(t)} \sim 1-\xi(t),\)

which yields the approximation

\(S_t=N -\beta \int_0^tI_sds,\) (1.10)


\(D_t = \mu \int_0^t I_s ds, \quad R_t = \gamma \int_0^t I_s ds,\)
\(I_t \approx (\beta-\mu-\gamma) \int_0^t I_s ds , \quad\) \(\Gamma_t = I_t + D_t + R_t \approx \beta \int_0^t I_s ds.\) (1.11)

Consequently, we get the approximation \(\Gamma_t \approx \frac{\beta}{\beta-\mu-\gamma} I_t\) which, together with (1.8) yields

\(\frac{d \log \Gamma_t}{dt} =\frac{d\Gamma_t}{\Gamma_tdt} \approx \frac{dI_t}{I_tdt} = \beta - \mu - \gamma - \beta \frac{ \Gamma_t}{N}.\) (1.12)

Equation (1.12) gives us a discrete linear relation between the logarithmic growth rate of \(\Gamma_t\) and \(\Gamma_t\) itself. Putting

\(r_t = \log I_{t+1} - \log I_t\) (1.13)

(\(= \log \Gamma_{t+1} - \log \Gamma_t\) under the current approximation), we obtain

\(r_t = \beta - \mu - \gamma - \beta \frac{ \Gamma_t}{N}.\) (1.14)

The extrapolation gives us an approximation \(\bar{\Gamma} = \frac{\beta - \mu - \gamma}{ \beta}\) at the peak of the epidemic.

Problems of the linear regression method

In reality, while \(\mu, \gamma\) can be assumed constant, \(\beta\) cannot since it is the contact rate. In fact, it is precisely the purpose of the social distancing policies to decrease \(\beta\). Therefore, we need to work with a time varying \(\beta_t\) which, among possible other factors, depends on the social distancing policies. The purpose is to let \(\beta_t\) decrease until it is small enough so that

\(\beta - \mu - \gamma \lt 0\) (1.15)

and therefore the number of active cases starts decreasing. On the other hand, there is a delay effect due to the incubation period which stretches the logarithmic growth rate in the horizontal direction. Therefore, the linear regression method only give a very rough approximation of the real development.

Another way to achieve (1.15) would be to develop some efficient treatment to substantially increase the recovery rate. Finally, developing a vaccine would reduce the effective population size \(N\), because vaccinated people are no longer susceptible and therefore need not to be included in \(N\). Because of the factor \(\frac{1}{N}\) in (1.8) and (1.9), reducing the effective population size has a strong effect. Another way to reduce \(N\) would consist in separating the population into smaller groups, without contacts between groups. Then each group would have its own -- small -- effective population size \(N\).

In any case, while we might want to extrapolate the regression line to obtain an estimate for the total number of infected people before the pandemic is under control, linear regression is rather a statistical tool for interpolation. There are some further issues. The relations (1.8), (1.9) derived from linear regression above assume that \(\beta\) be constant, and they are not designed for a time dependent \(\beta_t\). Also, for populations that are as large as those of an entire country, the coefficient \(\frac{\beta}{N}\) is usually very small, so that the effect of the linear decay becomes almost negligible. And it would become even smaller if \(\beta\) is decreased, which of course is not the effect we want to see. It turns out, however, that the comparison of the linear regression plots, in particular when we also use a logarithmic scale for the number of infections, yields a simple scheme to compare the development in different countries and to see both some universal patterns and also on a finer scale the effects of social distance measures. Finally, as we shall see in the sequel, it will offer us a scheme for estimating the parameters of the model.

Coefficients that vary in time

The preceding discussion suggests to make the system coefficients time dependent; these are \(\beta_t\) (social contact rate), \(\mu_t\) (death rate), \(\gamma_t\) (recovery rate). \(\beta_t\) is decreasing in time as the social distancing policies go into effect. \(\mu_t\) may increase due to emerging problems in the public health care system or because the epidemics reaches different age groups at different times, the elderly more slowly than the younger ones. \(\gamma_t\) will increase when a vaccine is found or better medical treatment becomes available.

Since the reports for the absolute case numbers are unreliable and vary considerably between countries, and since the essential quantity to control is the growth rate, we concentrate on the logarithmic growth rate \(r_t = \log \frac{I_{t+1}}{I_t}\). As explained above, our (model based) statistics shows that \(r_t\) is decreasing as a function of the number of reported infections, and we extrapolate this trend by a linear regression line. The death rate, however, is increasing, which raises concerns. But the recovery rate, which is not available in the first phases of the pandemic, also shows a clear sign of increase.

SIR models with delay

To make the model more realistic, however, we also need to include factors like the incubation time that induce delays into the epidemic. In particular, because of the varying incubation time, the number \(I_{t+1} - I_t\) of new infected cases depends on \(\beta_{t-\tau}\), the frequency of social contacts between the susceptible and infected groups at a random time \(t -\tau\) in the past. Likewise, the death rate should be considered as a function of the infections at a time \(t -\tau_D\) in the past, and similarly for the recovery rate.

We therefore propose a variational SIR model which takes into account three time delay factors. They will be considered as random variables, but bounded from above by a non-random constant \(r\). They are

  • the incubation time \(\tau\) with distribution \(\mathbb{P}_\tau\)
  • the death period \(\tau_D\) with distribution \(\mathbb{P}_{\tau_D}\), and
  • the recovery period \(\tau_R\) with distribution \(\mathbb{P}_{\tau_R}\).



We based our observations on the crucial logarithmic growth rate \(r_t = \log \frac{I_{t+1}}{I_t}\). Our statistics shows that

  • The logarithm growth rate of the total infections is decreasing, creating a downtrend towards zero. The growth rate \(r_t\) also decreases with respect to the increase of the total infections \(\Gamma_t\).
  • Currently, the death rate is increasing, which may raise significant concerns for the population.
  • The current recovery rate, which is not available in the first phases of the pandemic, also shows a clear sign of increase.

These phenomena will be explained below from equations (2.3) and (2.4).

Estimating the contact rate

Enriching (1.14) by the stochastic delay effects just described, our analysis suggests the simple approximation

\(r_t \approx \mathbb{E}_{\mathbb{P}_\tau} \beta_{t - \tau} e^{-\tau r_t}.\) (2.1)

When we assume that \(\beta\) remains approximately constant over some time, we can rewrite (2.1) to obtain

\(\beta_{t} \approx \frac{r_t}{\mathbb{E}_{\mathbb{P}_\tau} e^{-\tau r_t}}.\) (2.2)

Hence, the decrease in time of the contact rate \(\beta_t\) due to the social distancing policies could be observed through the decrease of the logarithmic growth rate \(r_t\) of the total infections over time.

Death and recovery mechanism

Thus, let us assume that changes of death and recovered cases over time also depend on some random intervals \(\tau_D\) (distribution \(\mathbb{P}_{\tau_D}\) and bounded by \(r\)) and \(\tau_R\) (distribution \(\mathbb{P}_{\tau_R}\) and bounded by \(r\)). In addition, assume that the number of death cases (recovery cases) with death interval \(\tau_D\) is also proportional to the infected number \(I_{t-\tau_D}\) (respectively \(I_{t-\tau_R}\)) and up to a coefficient \(\mu_{t-\tau_D}\) (respectively \(\gamma_{t-\tau_R}\)) respectively.

Estimating death and recovery rates

By using the piecewise constant property of \(\mu,\gamma\), we derive the following approximations which help to determine the death and recovery rates.

\(\mu_t \approx \frac{D_{t+1} -D_t}{\Gamma_t} \Big[\mathbb{E}_{\mathbb{P}_{\tau_D}} e^{-\tau_D r_t}\Big]^{-1} = \frac{\mathcal{D}_t}{\mathbb{E}_{\mathbb{P}_{\tau_D}} e^{-\tau_D r_t}},\) (2.3)
\(\gamma_t \approx \frac{R_{t+1} -R_t}{\Gamma_t} \Big[\mathbb{E}_{\mathbb{P}_{\tau_R}} e^{-\tau_R r_t}\Big]^{-1} = \frac{\mathcal{R}_t}{\mathbb{E}_{\mathbb{P}_{\tau_R}} e^{-\tau_R r_t}},\)(2.4)

where \(\mathcal{D}_t := \frac{D_{t+1} -D_t}{\Gamma_t}\) and \(\mathcal{R}_t := \frac{R_{t+1} -R_t}{\Gamma_t}\) are the current death and recovery rates. Equations (2.3) and (2.4) show that, as long as \(\mu_t, \gamma_t\) are stable, the decrease in the logarithmic growth rate \(r_t\) leads to increases in the current death and recovery rates.

Estimating the active infected growth rate

To simplify the solution of this equation, we can write \(I_t = e^{\lambda_t t}\) where \(\lambda_t\) is the exponential growth rate of the infected cases. We also assume for simplicity that \(\lambda_t\) is a piecewise constant function. Our analysis shows that \(\lambda_t\) can be determined by solving the equation

\(\lambda_t = \beta_t \mathbb{E}_\mathbb{P} e^{- \lambda_t \tau} - \mu_t \mathbb{E}_{\mathbb{P}_{\tau_D}} e^{- \lambda_t \tau_D} - \gamma_t \mathbb{E}_{\mathbb{P}_{\tau_R}} e^{- \lambda_t \tau_R},\)(2.5)

where \(\beta_t, \mu_t,\gamma_t \) are determined by (2.2), (2.3), (2.4). Then \(\lambda_t\) is also decreasing in \(t\) if we assume that \(\beta_t\) is. If the social contact rate \(\beta_t\) goes to zero in the lock-down scenarios, or the recovery rate increases, then \(\lambda_t\) will eventually become negative so that \(I_t\) goes exponentially down to zero.

Age structures

More realistic models also need to include some spatial structure, since contacts take place locally. This has been abundantly discussed in the literature already, and in particular epidemic spread on networks has been researched in much detail. Here, however, we also see an additional effect. The mortality rate seems to vary considerably between age groups. Younger people may have more contacts, but older ones tend to be more severely affected by the disease. Therefore, it makes a difference whether contacts preferably take place between people in the same age group or between different age groups. Thus, we need to develop a model with an age-structured population. This may then suggest more targeted containment measures that vary between the different age groups.

When one looks at an age-structured population, this is modified as follows (see p.288 in [1]), with all variables and coefficients now becoming dependent on a parameter, the age \(a\).

\(\frac{dS(t,a)}{dt}=-\lambda(t,a)\frac{S(t,a)}{N(t)}+ \rho(a) R(t,a)\) (3.1)
\(\frac{dI(t,a)}{dt}=\lambda(t,a)\frac{S(t,a)}{N(t)}-\mu(a) I(t,a)- \gamma(a)I(t,a)\) (3.2)
\(\frac{dR(t,a)}{dt}=\gamma(a)I(t,a)-\rho(a) R(t,a)\) (3.3)


\(\lambda(t,a)=\int_0^\omega \beta(a,\sigma)I(t,\sigma)d\sigma\) (3.4)

where \(\beta(a,\sigma)\) is the transmission coefficient between susceptible individuals at age \(a\) and infected individuals at age \(\sigma\). Note that in addition to the age structure, here we also allow recovered persons to become susceptible again and then perhaps get reinfected. In fact, it is not clear whether in the current pandemic recovered people are completely immune. It may also be possible that the virus evolves into different strains so that after recovery from one strain, one may still get infected with another strain. Different strains may also be differently lethal.
  Of course, for a more realistic model, the age structure should be combined with the delay effects.


[1]   H.Inaba, Age-structured population dynamics in demography and epidemiology, Springer, 2017
[2]   Kermack, W. O.; McKendrick, A. G. (1927). A Contribution to the Mathematical Theory of Epidemics. Proceedings of the Royal Society A. 115 (772): 700--721. Bibcode:1927RSPSA.115..700K. doi:10.1098/rspa.1927.0118.
[3]   Miller, J.C. (2012). A note on the derivation of epidemic final sizes. Bulletin of Mathematical Biology. 74 (9). section 4.1. doi:10.1007/s11538-012-9749-6. PMC 3506030. PMID 22829179.

Further reading

Stephan Luckhaus - Epidemiology, SIR Models and COVID-19
This article is meant to contribute to the discussion about containment strategies for Corona. Based on the mathematical theory of infectious diseases, we discuss the stability or instability of a state/situation where a fraction \(s_j\) of each age group \(j\) in the population is still susceptible, whereas the rest of the population is already immunized. The simple meaning is, that in an unstable state, the epidemy will breakout again.

In case of an error in this service please contact:
Heiko Schinke (phone: +49 - 341 - 9959 692, email)