## Introduction

The epinowcast package aims to be a modular toolbox for real-time infectious disease surveillance both in outbreak and routine contexts. As such we provide a modular modelling framework that is optimised for a range of common surveillance tasks whilst maintaining flexibility and supporting user extension.

We provide a flexible semi-parametric model for the underlying generative process similar to that implemented in other real-time infectious disease modelling packages[1,2]. This optionally includes a renewal process[3,4] and latent reporting process[1,5,6]. Combined with the appropriate generation time distribution this approach has been shown to correspond to a Susceptible-Exposed-Infected-Recovered (SEIR) model[7] with the addition of reporting delays. Howover, our default model contains minimal mechanism in order to more flexibly fit highly informative data.

Our nowcasting approach is an extension of that proposed by Günther et al.[8] which was itself an extension of the model proposed by Höhle and Heiden[9] and implemented in the surveillance R package[10]. Compared to the model proposed in Günther et al.[8], epinowcast adds support for jointly nowcasting multiple related datasets, a flexible formula interface allowing for the specification of a large range of models, and an optional parametric assumption for the underlying reporting delay.

We also support flexible joint modelling of missing data by assuming that the reporting delay is consistent between reported and unreported observations following the methodology of Lison et al.[11].

Our modelling framework is implemented in the stan probalistic programming language via cmdstanr[12,13] with a focus on computational efficiency and robustness.

In the following sections we outline our modelling methodolgy, current feature set stratified by module, and highlight implementation details. For each module we present both our default implementation as well as the more generic framework we support. Note that extensions to this methodology are warmly welcomed with our community site being a good first point of contact.

## Decomposition into expected final notifications and report delay components

We are concerned with outcomes that occur at a time of reference (e.g., date of symptom onset or test for a disease) that are reported only with a delay, at the time of report (e.g. the date onsets are entered into a central database and so become available for analysis). We assume that these times are measured in discrete time steps, usually of a day (in which case the times are dates).

We follow the approach of Höhle and Heiden[9] and consider the distribution of notifications ($$n_{g,t,d}$$) by time of reference ($$t$$) and reporting delay ($$d$$) conditional on the final observed count $$N_{g,t}$$ for each dataset ($$g$$) such that,

$$$N_{g,t} = \sum_{d=0}^{D} n_{g,t,d}$$$

where $$D$$ represents the maximum delay between time of reference and time of report which in theory could be infinite but in practice we set to a finite value in order to make the model identifiable and computationally feasible. For each $$t$$ and $$g$$ these notifications are assumed to be drawn from a multinomial distribution with $$N_{g,t}$$ trials and a probability vector ($$p_{g,t,d}$$) of length $$D$$. The aim of nowcasting is to predict the final observed counts $$N_{g,t}$$ given information available up to time $$t$$. We do this by estimating the components of this probability vector jointly with the expected number of final notifications ($$\lambda_{g,t} = \mathbb{E}[N_{g,t}]$$) in dataset $$g$$ at time $$t$$.

An alternative approach would be to consider each $$n_{g,t,d}$$ independently at which point the model can be defined as a regression that can be fit using standard software with the appropriate observation model and adjustment for reporting delay (i.e it becomes a Poisson or Negative Binomial regression). An implementation of this approach is available in Bastos et al.[14]. A downside of this simplified approach is that reporting is not conditionally dependent which may make specifying models for complex reporting distributions difficult.

### Default model

Here we follow the approach of Günther et al.[8] and specify the model for expected final notifications as a first order random walk. This simple model is highly flexible and so a good fit for nowcasting problems where the data is highly informative.

\begin{align} \log (\lambda_{g,t}) &\sim \text{Normal}\left(\log (\lambda_{g,t-1}) , \sigma^{\lambda}_{g} \right) \\ \log (\lambda_{g,0}) &\sim \text{Normal}\left(\log (N_{g,0} + 1), 1 \right) \\ \sigma^{\lambda}_{g} &\sim \text{Half-Normal}\left(0, 1\right) \end{align}

where $$N_{g0}$$, the first time point for expected observations in dataset $$d$$, is assumed to have been completely observed.

### Generalised model

In settings where data is sparse or where users want to understand the underlying generative process our flexible default model is likely not a good choice. In these settings our generic model offers a range of options that are context specific. Our generic model is currently based on a renewal process[3,4] with additional latent reporting delays[1,5,6]. As previously noted[7], this corresponds to the commonly used Susceptible-Exposed-Infected-Recovered (SEIR) model when appropriate generation time is specified[7]

### Instantaneous reproduction number/growth rate

We model the instantaneous reproduction number ($$R_t$$) on the log scale (though support for other link functions is planned). When the generation time is fixed to be a day this can be interpreted as the instantaneous growth rate ($$r_t$$) defined as the difference in the log of the expected number of final notifications between time $$t$$ and $$t-1$$.

$$$\text{log} (R_{g,t}) = r_0 + \beta_{f,r} X_{r} + \beta_{r,r} Z_{r}$$$

where $$r_0$$ is the optional intercept, $$X_{r}$$ is the design matrix for fixed effects ($$\beta_{f,r}$$), and $$Z_{r}$$ is the design matrix for random effects ($$\beta_{r,r}$$. Within this specification the default model can be specified as a random effect on the day with no intercept. Alternative specifications that may be of interest include a weekly random walk (specifed as ~ 1 + rw(week)), a piecewise linear model (specified as ~ 1 + day:week), and a group specific random effect (specified as ~ 1 + (1 | .group)).

We model the expected number of infections/latent notifications ($$\lambda^l$$) using a renewal process[3,4]. This model is a generalisation of the default model and can be used to model the expected number of latent notifications in a setting where the generation time is not fixed to be a day. It implies that current infections/notifications are dependent on past infections/notifications based on a kernal (usually interpreted as the generation time or serial interval). An instantaneous daily growth rate model can be recovered by setting the generation time to be fixed at 1 day. The model is defined as follows,

\begin{align} \lambda^l_{g,t} &\sim \text{LogNormal}\left(\mu^{l}_{g,t}, \sigma^{l}_{g,t} \right),\ t \leq P \\ \lambda^l_{g,t} &= R_{g,t} \sum_{p = 1}^{P} G_{g}\left(p, t - p \right) \lambda^l_{g, t-p},\ t \gt P \end{align}

### Latent reporting delay and ascertainment

In some settings there may be additional reporting delays on top of those that are directly observed in the data, and therefore “nowcastable”, a common example is the delay from exposure to symptom onset. For these settings we support modelling “latent” reporting delays as a convolution of the underlying expected counts with the potential for these delays to vary over time and by group. This implementation is similar to that implemented in EpiNow2 and epidemia as well as other similar models[1,5,6,11]. In addition to this we support modelling ascertainment through the use of improper probability mass functions (i.e. by not enforcing a sum to 1 constraint) and inferring ascertainment where possible (for example day of the week reporting patterns).

\begin{align} \lambda_{g,t} &= \nu_{g,t} \sum_{\tau = 0}^{L - 1} F_{g}\left(\tau + 1, t - \tau \right) \lambda^l_{g, t - \tau} \\ \nu_{g,t} &= \nu_0 + \beta_{f,\nu} X_{\nu} + \beta_{r,\nu} Z_{\nu} \end{align}

Where $$\nu_{g,t}$$ is the inferred ascertainment and is modelled flexibly using an optional intercept ($$\nu_0$$), a design matrix ($$X_{\nu}$$) for fixed effects ($$\beta_{f,\nu}$$), and a design matrix ($$Z_{\nu}$$) for random effects ($$\beta_{r,\nu}$$.

## Delay distribution

Where data is available on the report date for a given individual we can estimate the delay distribution directly and jointly rather than relying on estimates from other sources (though we may wish to augment our priors with these estimates). In the following section we describe our default parametric delay distribution model as well as our highly flexible discrete time to event based generic model.

### Default model

In our default model we consider the delay distribution to follow a discretised log-normal as follows,

\begin{align} p_{g,t,d} &\sim \text{LogNormal} \left(\mu^d, \sigma^d \right) \\ \mu^d &\sim \text{Normal} \left(0, 1 \right) \\ \sigma^d &\sim \text{Half-Normal} \left(0, 1 \right) \end{align}

### Generalised model

We generalise this model in order to support a range of delay distributions and reporting processes following the approach of Günther et al.[8] we estimate the delay distribution ($$p_{g,t,d}$$) using a discrete-time logistic hazard model $h_{g,t,d} =\text{P} \left(\text{delay}=d|\text{delay} \geq d, W_{g,t,d}\right)$ but we extend this model to decompose $$W_{g,t,d}$$ into 3 components: hazard derived from a parametric delay distribution ($$\gamma_{g,t,d}$$) dependent on covariates at the time of reference, hazard not derived from a parametric distribution ($$\delta_{g,t,d}$$) dependent on covariates at the time of reference, and hazard dependent on covariates referenced to the time of report ($$\epsilon_{g,t,d}$$).

The first component ($$\gamma_{g,t,d}$$) we estimate what would be the probability of reporting $$p^{\prime}_{g,t,d}$$ at a given time if it followed a parametric distribution, here implemented using a discretised log normal (a range of other distributions are supported by the package) with the log mean and log standard deviation defined using an intercept and arbitrary shared, reference time indexed, covariates with fixed ($$\beta_{f,i}$$) and random ($$\beta_{r,i}$$) coefficients (note these can include auto-regressive terms),

\begin{align} p^{\prime}_{g,t,d} &\sim \text{LogNormal} \left(\mu_{g,t}, \upsilon_{g,t} \right) \\ \mu_{g,t} &= \mu_0 + \beta_{f,\mu} X_{\gamma} + \beta_{r,\mu} Z_{\gamma} \\ \text{log} (\upsilon_{g,t}) &= \upsilon_0 + \beta_{f,\upsilon} X_{\gamma} + \beta_{r,\upsilon} Z_{\gamma} \end{align}

Note we normalise this distribution so that it sums to 1. The parametric logit hazard (i.e. the probability of report at a given time conditional on not already having reported) for this component of the model is then,

$$$\gamma_{g,t,d} = \text{logit} \left(\frac{p^{\prime}_{g,t,d}}{\left(1 -\sum^{d-1}_{d^{\prime}=0} p^{\prime}_{g,t,d^{\prime}} \right)} \right)$$$

In addition to parametric reporting effects there may also be non-parametric effects referenced by both reference and report dates. These are represented by the non-distributional logit hazard components for the time of reference and report, defined using an intercept ($$\delta_0$$) and arbitrary shared covariates with fixed ($$\beta_{f,i}$$) and random ($$\beta_{r,i}$$) coefficients (note these can include auto-regressive terms).

\begin{align} \delta_{g,t,d} &= \delta_0 + \beta_{f,\delta} X_{\delta} + \beta_{r,\delta} Z_{\delta} \\ \epsilon_{g,t,d} &= \beta_{f,\epsilon} X_{\epsilon} + \beta_{r,\epsilon} Z_{\epsilon} \end{align}

The overall hazard for each group, reference time, and delay is then,

$$$\text{logit} (h_{g,t,d}) = \gamma_{g,t,d} + \delta_{g,t,d} + \epsilon_{g,t,d},\ h_{g,t,D} = 1$$$

where the hazard on the final day has been assumed to be 1 in order to enforce the constraint that all reported observations are reported within the specified maximum delay. The probability of report for a given delay, reference time, and group is then as follows,

$$$p_{g,t,0} = h_{g,t,0},\ p_{g,t,d} = \left(1 -\sum^{d-1}_{d^{\prime}=0} p_{g,t,d^{\prime}} \right) \times h_{g,t,d}$$$

All ($$\beta_{f,i}$$) and random ($$\beta_{r,i}$$) coefficients have standard normal priors by default with standard half-normal priors for pooled standard deviations. For further implementation details see enw_reference() for delays linked to the date of reference, enw_report() for delays linked to the date of report.

## Observation model and nowcast

Expected notifications by time of reference ($$t$$) and reporting delay can now be found by multiplying expected final notifications for each $$t$$ with the probability of reporting for each day of delay ($$p_{g,t,d}$$). We assume a negative binomial observation model, by default, with a joint overdispersion parameter (with a standard half normal prior on 1 over square root of the overdispersion[15]) and produce a nowcast of final observed notifications at each reference time by summing posterior estimates for unobserved notification and observed notifications for that reference time.

\begin{align} n_{g,t,d} \mid \lambda_{g,t},p_{g,t,d} &\sim \text{NB} \left((1 - \alpha_{g,t})\lambda_{g,t} \times p_{g,t,d}, \phi \right),\ t=1,...,T. \\ \frac{1}{\sqrt{\phi}} &\sim \text{Half-Normal}(0, 1) \\ N_{g,t} &= \sum_{d=0}^{D} n_{g,t,d} \end{align}

Where $$\alpha_{g,t}$$ is the proportion of cases by reference date that will not report their reference date. By default this is not modelled and is set to zero , see the accounting for reported cases with a missing reference date section for further defaults. Other observation models such as the Poisson distribution are also supported. See the documentation enw_obs() for details.

In order to make best use of observed data when nowcasting we use observations where available and where they have not been reported for a given report and reference date we use the posterior prediction from the observation model above. This means that as nowcast dates become increasingly truncated they depend more on modelled estimates whereas when they are more complete the majority of the final count is known. Depending on your use case the posterior predictions alone may also be of interest.

## Accounting for reported cases with a missing reference date

In real-world settings observations may be reported without a linked reference date. A common example of this is cases by date of symptom onset where report date is often known but onset date may not be. To account for this we support modelling this missing process by assuming that cases with a missing reference date have the same reporting delay distribution as cases with a known reference date and that processes that drive the probability of having a missing reference date (\$_{g,t}) are linked to the unknown date of reference rather than the date of report based on Lison et al.[11]. We model this probability flexibly on a logit scale as follows,

$$$\text{logit} (\alpha_{g,t}) = \alpha_0 + \beta_{f,\alpha} X_{\alpha} + \beta_{r,\alpha} Z_{\alpha}$$$

Where $$\alpha_0$$ represents the intercept, $$\beta_{f,\alpha}$$ fixed effects, and $$\beta_{r,\alpha}$$ random effects. To link with observations by date of report with a missing reference date ($$M_{g,t}$$) we convolve expected notifications with the probability of having a missing reference date and the probability of reporting on a given day as follows,

$$$M_{g,t} \mid \lambda_{g,t},p_{g,t,d}, \alpha_{g,t} \sim \text{NB} \left( \sum^D_{d=0} \alpha_{g,t-d} \lambda_{g,t-d} p_{g,t-d,d}, \phi \right),\ t=1,...,T.$$$

As for cases with known reference dates other observation models are supported. For further implementation details see enw_missing().

## Implementation

The model is implemented in the probalistic programming language stan and we use cmdstanr to interact with the model[12,13]. Optional within chain parallelisation is available across times of reference to reduce runtimes. Sparse design matrices have been used for all covariates to limit the number of probability mass functions that need to be calculated. epinowcast incorporates additional functionality written in R[16] to enable plotting nowcasts and posterior predictions, summarising nowcasts, and scoring them using scoringutils[17]. A flexible formula interface is provided to enable easier implementation of complex user specified models without interacting with the underlying code base. All functionality is modular allowing users to extend and alter the underlying model whilst continuing to use the package framework.

## References

1. Abbott, S., Hellewell, J., Sherratt, K., Gostic, K., Hickson, J., Badr, H. S., DeWitt, M., Thompson, R., EpiForecasts, & Funk, S. (2020). EpiNow2: Estimate real-time case counts and time-varying epidemiological parameters. https://doi.org/10.5281/zenodo.3957489
2. Scott, J. A., Gandy, A., Mishra, S., Unwin, J., Flaxman, S., & Bhatt, S. (2020). Epidemia: Modeling of epidemics using hierarchical bayesian models. https://imperialcollegelondon.github.io/epidemia/
3. Fraser, C. (2007). Estimating individual and household reproduction numbers in an emerging epidemic. PLoS One, 2(8), e758. https://doi.org/10.1371/journal.pone.0000758
4. Cori, A., Ferguson, N. M., Fraser, C., & Cauchemez, S. (2013). A new framework and software to estimate time-varying reproduction numbers during epidemics. Am. J. Epidemiol., 178(9), 1505–1512. https://doi.org/10.1093/aje/kwt133
5. Abbott, S., Hellewell, J., Thompson, R. N., Sherratt, K., Gibbs, H. P., Bosse, N. I., Munday, J. D., Meakin, S., Doughty, E. L., Chun, J. Y., Chan, Y.-W. D., Finger, F., Campbell, P., Endo, A., Pearson, C. A. B., Gimma, A., Russell, T., Flasche, S., Kucharski, A. J., … CMMID COVID modelling group. (2020). Estimating the time-varying reproduction number of SARS-CoV-2 using national and subnational case counts. Wellcome Open Res., 5, 112. https://doi.org/10.12688/wellcomeopenres.16006.2
6. Bhatt, S., Ferguson, N., Flaxman, S., Gandy, A., Mishra, S., & Scott, J. A. (2020). Semi-Mechanistic bayesian modeling of COVID-19 with renewal processes. https://arxiv.org/abs/2012.00394
7. Champredon, D., Dushoff, J., & Earn, D. J. D. (2018). Equivalence of the Erlang-Distributed SEIR epidemic model and the renewal equation. SIAM J. Appl. Math., 78(6), 3258–3278. https://doi.org/10.1137/18M1186411
8. Günther, F., Bender, A., Katz, K., Küchenhoff, H., & Höhle, M. (2021). Nowcasting the COVID-19 pandemic in Bavaria. Biometrical Journal, 63(3), 490–502. https://doi.org/10.1002/bimj.202000112
9. Höhle, M., & Heiden, M. an der. (2014). Bayesian nowcasting during the STEC O104:H4 outbreak in Germany, 2011. Biometrics, 70(4), 993–1002. https://doi.org/10.1111/biom.12194
10. Meyer, S., Held, L., & Höhle, M. (2017). Spatio-temporal analysis of epidemic phenomena using the R package surveillance. Journal of Statistical Software, 77(11), 1–55. https://doi.org/10.18637/jss.v077.i11
11. Lison, A. (n.d.). Nowcast-transmission. Github.
12. Team, S. D. (2021). Stan modeling language users guide and reference manual, 2.28.1.
13. Gabry, J., & Češnovar, R. (2021). Cmdstanr: R interface to ’CmdStan’.
14. Bastos, L. S., Economou, T., Gomes, M. F. C., Villela, D. A. M., Coelho, F. C., Cruz, O. G., Stoner, O., Bailey, T., & Codeço, C. T. (2019). A modelling approach for correcting reporting delays in disease surveillance data. Statistics in Medicine, 38(22), 4363–4377. https://doi.org/10.1002/sim.8303
15. Team, S. D. (2020). Prior choice recommendations.
16. R Core Team. (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
17. Bosse, N. (2020). Scoringutils: A collection of proper scoring rules and metrics to assess predictions. https://github.com/epiforecasts/scoringutils