This function preprocesses raw observations under the assumption that they
are reported as cumulative counts by reference date and report date. It also
assigns groups and constructs the data objects used by the visualisation and
modelling functions. These include the observed empirical probability of a
report on a given day, the cumulative probability of report, the latest
available observations, the incidence of observations, and metadata about
the dates of reference and report (used to construct models). This function
wraps other preprocessing functions, which may instead be used individually
if required. Note that, internally, reports beyond the user-specified maximum
delay are dropped for modelling purposes, with the cum_prop_reported and
max_confirm variables allowing the user to check the impact this may have
(if cum_prop_reported is significantly below 1, a larger max_delay may be
appropriate). Also note that if missing reference
or report dates are suspected to occur in your data then these need to be
enw_preprocess_data(
  obs,
  by = c(),
  max_delay = 20,
  set_negatives_to_zero = TRUE,
  ...
)
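The truncation check mentioned in the description can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy counts, the deliberately small max_delay of 2, and the prop_retained helper are assumptions made for illustration only.

```r
# Toy cumulative counts by reference and report date (values are invented).
obs <- data.frame(
  reference_date = as.Date("2021-10-01") + c(0, 0, 0, 1, 1, 2),
  report_date = as.Date("2021-10-01") + c(0, 1, 2, 1, 2, 2),
  confirm = c(10, 15, 20, 8, 12, 5)
)

# Reporting delay in days (0 means reported on the reference date).
obs$delay <- as.integer(obs$report_date - obs$reference_date)

# Hypothetical small maximum delay: keeps delays 0 and 1 only.
max_delay <- 2
kept <- obs[obs$delay < max_delay, ]

# Largest cumulative count per reference date, with and without truncation.
ref_all <- as.character(obs$reference_date)
ref_kept <- as.character(kept$reference_date)
final_count <- tapply(obs$confirm, ref_all, max)
kept_count <- tapply(kept$confirm, ref_kept, max)

# Assumed check: share of the final count retained within the delay window.
prop_retained <- kept_count / final_count[names(kept_count)]
print(prop_retained)
```

A share well below 1 for a reference date means that a noticeable fraction of its reports arrive after the chosen window, which is the situation in which a larger max_delay may be worth considering.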
obs: A data frame containing at least the following variables:
reference_date (index date of interest),
report_date (report date for observations),
confirm (cumulative observations by reference and report date).
by: A character vector describing the stratification of observations. This defaults to no grouping. This should be used when modelling multiple time series in order to identify them for downstream modelling.
max_delay: Numeric, defaults to 20. The maximum number of days to include in the delay distribution. Computation scales non-linearly with this setting, so consider carefully what maximum makes sense for your data. Note that this is zero indexed and so includes the reference date and max_delay - 1 other days.
set_negatives_to_zero: Logical, defaults to TRUE. Should negative counts (for calculated incidence of observations) be set to zero? Downstream modelling does not currently support negative counts, so this must be TRUE if the output is intended for downstream modelling.
...: Other arguments to
holidays, which sets commonly used metadata (e.g. day of week, days since start of time series)
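Two of the arguments above, max_delay (zero indexed) and set_negatives_to_zero, can be made concrete with a short base R sketch. The counts are invented, and the clipping step only mirrors the behaviour described for the argument. It is not taken from the package source.

```r
# Zero-indexed delays: max_delay = 20 covers delays 0 to 19, that is the
# reference date itself plus max_delay - 1 further days.
max_delay <- 20
delays <- seq_len(max_delay) - 1

# Invented cumulative counts for a single reference date, including a
# downward revision between the second and third report.
confirm <- c(10, 15, 14, 18)

# Incidence is the difference between successive cumulative counts.
new_confirm <- diff(c(0, confirm)) # 10, 5, -1, 4

# Assumed equivalent of set_negatives_to_zero = TRUE: clip negatives at zero.
new_confirm_clipped <- pmax(new_confirm, 0) # 10, 5, 0, 4
```

Clipping in this way means the incident counts no longer sum to the latest cumulative count when revisions occur, which is worth keeping in mind when comparing the two.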
A data.table containing processed observations as a series of nested data frames as well as variables containing metadata. These are:
obs: Observations with the addition of empirical reporting proportions, restricted to the specified maximum delay.
new_confirm: Incidence of notifications by reference and report date. Empirical reporting distributions are also added.
latest: The latest available observations.
missing_reference: Observations missing reference dates.
reporting_triangle: Incident observations by report and reference date in the standard reporting triangle matrix format.
metareference: Metadata for reference dates derived from observations.
metareport: Metadata for report dates.
metadelay: Metadata for reporting delays produced using
time: Numeric, number of timepoints in the data.
snapshots: Numeric, number of available data snapshots to use for nowcasting.
groups: Numeric, number of groups/strata in the supplied observations (set using by).
max_delay: Numeric, the maximum delay in the processed data.
max_date: The maximum available report date.
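For readers unfamiliar with the reporting triangle layout referred to above, the following base R sketch arranges incident counts with one row per reference date and one column per delay. The data are invented, and the object returned by the function may differ in class and column naming. Only the general layout is illustrated.

```r
# Invented incident counts by reference date and reporting delay (in days).
incidence <- data.frame(
  reference_date = as.Date("2021-10-01") + c(0, 0, 0, 1, 1, 2),
  delay = c(0, 1, 2, 0, 1, 0),
  new_confirm = c(10, 5, 5, 8, 4, 5)
)

# One row per reference date and one column per delay. Cells that have not
# been observed yet (recent reference dates at long delays) are left as NA.
triangle <- tapply(
  incidence$new_confirm,
  list(
    reference_date = as.character(incidence$reference_date),
    delay = incidence$delay
  ),
  sum
)
print(triangle)
```

The missing cells in the lower right corner are what give the triangle its name: the most recent reference dates have only been observed at short delays.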
library(data.table)

# Filter example hospitalisation data to be national and over all ages
nat_germany_hosp <- germany_covid19_hosp[location == "DE"]
nat_germany_hosp <- nat_germany_hosp[age_group %in% "00+"]

# Preprocess with default settings
pobs <- enw_preprocess_data(nat_germany_hosp)
pobs
#>                     obs           new_confirm               latest
#> 1: <data.table[3770x9]> <data.table[3770x11]> <data.table[198x10]>
#>    missing_reference   reporting_triangle       metareference
#> 1: <data.table[0x6]> <data.table[198x22]> <data.table[198x9]>
#>              metareport          metadelay time snapshots by groups max_delay
#> 1: <data.table[217x12]> <data.table[20x4]>  198       198         1        20
#>      max_date
#> 1: 2021-10-20