This function preprocesses raw observations, under the assumption that they
are reported as cumulative counts by reference and report date, and assigns
groups. It also constructs the data objects used by visualisation and
modelling functions, including the observed empirical probability of a report
on a given day, the cumulative probability of report, the latest available
observations, the incidence of observations, and metadata about the date of
reference and report (used to construct models). This function wraps other
preprocessing functions that may instead be used individually if required.
Note that internally reports beyond the user-specified delay are dropped for
modelling purposes, with the cum_prop_reported and max_confirm variables
allowing the user to check the impact this may have (if cum_prop_reported is
significantly below 1, a longer max_delay may be appropriate). Also note that
if missing reference or report dates are suspected to occur in your data then
these need to be completed with enw_complete_dates().
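The quantities described above can be illustrated on a toy dataset. The following is a minimal sketch using data.table only, not the package's internal implementation. The data are hypothetical, and it assumes that max_confirm is the largest cumulative count seen for each reference date and that cum_prop_reported is confirm divided by max_confirm.

```r
library(data.table)

# Hypothetical toy data: cumulative counts by reference and report date
toy <- data.table(
  reference_date = as.Date(c(
    "2021-10-01", "2021-10-01", "2021-10-01", "2021-10-02", "2021-10-02"
  )),
  report_date = as.Date(c(
    "2021-10-01", "2021-10-02", "2021-10-03", "2021-10-02", "2021-10-03"
  )),
  confirm = c(5, 8, 10, 3, 6)
)
setorder(toy, reference_date, report_date)

# Incidence: difference of successive cumulative counts per reference date
toy[, new_confirm := confirm - shift(confirm, fill = 0), by = reference_date]

# Assumed definitions (see the note above): final count and share reported
toy[, max_confirm := max(confirm), by = reference_date]
toy[, cum_prop_reported := confirm / max_confirm]

# Reporting delay in days (0 means reported on the reference date)
toy[, delay := as.integer(report_date - reference_date)]
toy
```

If cum_prop_reported at the largest retained delay is well below 1 for many reference dates in real data, that suggests a sizeable share of reports arrive after the cut-off.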
Usage
enw_preprocess_data(
obs,
by = NULL,
max_delay = 20,
set_negatives_to_zero = TRUE,
...,
copy = TRUE
)
Arguments
- obs
A data.frame containing at least the following variables: reference_date (index date of interest), report_date (report date for observations), confirm (cumulative observations by reference and report date).
- by
A character vector describing the stratification of observations. This defaults to no grouping. This should be used when modelling multiple time series in order to identify them for downstream modelling.
- max_delay
Numeric, defaults to 20. Must be an integer greater than or equal to 1 (internally it is coerced to an integer using as.integer()). The maximum number of days to include in the delay distribution. Computation scales non-linearly with this setting, so consider carefully what maximum makes sense for your data. Note that this is zero indexed and so includes the reference date and max_delay - 1 other days (i.e. a max_delay of 1 corresponds to no delay).
- set_negatives_to_zero
Logical, defaults to TRUE. Should negative counts (for calculated incidence of observations) be set to zero? Currently downstream modelling does not support negative counts, so this must be TRUE if you intend to use epinowcast().
- ...
Other arguments to enw_add_metaobs_features(), e.g. holidays, which sets commonly used metadata (e.g. day of week, days since start of time series).
- copy
A logical; if TRUE (the default) creates a copy; otherwise, modifies obs in place.
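The zero indexing of max_delay is easy to misread, so here is a minimal sketch of what it implies, using data.table only. The data are hypothetical and the filter is an illustration of the rule stated above, not the package's own filtering code.

```r
library(data.table)

# Hypothetical toy data: one reference date reported over five days
toy <- data.table(
  reference_date = as.Date("2021-10-01"),
  report_date = as.Date("2021-10-01") + 0:4,
  confirm = c(2, 5, 7, 8, 9)
)
toy[, delay := as.integer(report_date - reference_date)]

# max_delay is zero indexed: a value of 3 keeps delays 0, 1 and 2,
# i.e. the reference date plus max_delay - 1 other days
max_delay <- 3
kept <- toy[delay < max_delay]
kept$delay
```

With max_delay set to 1 the same filter would keep only delay 0, which matches the statement that a max_delay of 1 corresponds to no delay.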
Value
A data.table containing processed observations as a series of nested data.frames as well as variables containing metadata. These are:
- obs: Observations with the addition of empirical reporting proportions, restricted to the specified maximum delay.
- new_confirm: Incidence of notifications by reference and report date. Empirical reporting distributions are also added.
- latest: The latest available observations.
- missing_reference: Observations missing reference dates.
- reporting_triangle: Incident observations by report and reference date in the standard reporting triangle matrix format.
- metareference: Metadata for reference dates derived from observations.
- metareport: Metadata for report dates.
- metadelay: Metadata for reporting delays produced using enw_delay_metadata().
- time: Numeric, number of timepoints in the data.
- snapshots: Numeric, number of available data snapshots to use for nowcasting.
- groups: Numeric, number of groups/strata in the supplied observations (set using by).
- max_delay: Numeric, the maximum delay in the processed data.
- max_date: The maximum available report date.
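Because the nested tables are stored as list columns in a single-row data.table, they are extracted with [[1]] rather than used directly. The sketch below builds a hypothetical stand-in object (pobs_like, with made-up values) to show the access pattern without running the preprocessing itself. Only the column names latest, time and max_delay are taken from the list above.

```r
library(data.table)

# Hypothetical nested table standing in for the latest observations
latest <- data.table(
  reference_date = as.Date("2021-10-01") + 0:2,
  confirm = c(10, 7, 3)
)

# Hypothetical stand-in for the returned object: one row, with a list
# column holding the nested table and plain columns holding metadata
pobs_like <- data.table(
  latest = list(latest),
  time = nrow(latest),
  max_delay = 20
)

# Extract a nested table with [[1]], then use it like any data.table
latest_obs <- pobs_like$latest[[1]]
latest_obs[confirm > 5]

# Scalar metadata is an ordinary column
pobs_like$max_delay
```

The same pattern should apply to the other nested entries (for example metareference or reporting_triangle) in an object returned by enw_preprocess_data().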
See also
Preprocessing functions: enw_add_delay(), enw_add_max_reported(), enw_add_metaobs_features(), enw_assign_group(), enw_complete_dates(), enw_construct_data(), enw_delay_filter(), enw_delay_metadata(), enw_extend_date(), enw_filter_reference_dates(), enw_filter_report_dates(), enw_latest_data(), enw_metadata(), enw_missing_reference(), enw_reporting_triangle_to_long(), enw_reporting_triangle()
Examples
library(epinowcast)
library(data.table)
# Filter example hospitalisation data to be national and over all ages
nat_germany_hosp <- germany_covid19_hosp[location == "DE"]
nat_germany_hosp <- nat_germany_hosp[age_group %in% "00+"]
# Preprocess with default settings
pobs <- enw_preprocess_data(nat_germany_hosp)
pobs
#> obs new_confirm latest
#> 1: <data.table[3770x9]> <data.table[3770x11]> <data.table[198x10]>
#> missing_reference reporting_triangle metareference
#> 1: <data.table[0x6]> <data.table[198x22]> <data.table[198x9]>
#> metareport metadelay time snapshots by groups max_delay
#> 1: <data.table[217x12]> <data.table[20x4]> 198 198 1 20
#> max_date
#> 1: 2021-10-20