This function preprocesses raw observations, assuming they are reported as cumulative counts by reference date and report date, and assigns groups. It also constructs the data objects used by the visualisation and modelling functions. These include the observed empirical probability of a report on a given day, the cumulative probability of report, the latest available observations, the incidence of observations, and metadata about the dates of reference and report (used to construct models). This function wraps other preprocessing functions, which can be used individually instead if required.

Note that reports beyond the user-specified maximum delay are dropped internally for modelling purposes. The cum_prop_reported and max_confirm variables allow you to check the impact this may have: if cum_prop_reported is significantly below 1, a longer max_delay may be appropriate. Also note that if you suspect reference or report dates are missing from your data, these need to be completed with enw_complete_dates().
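The check on cum_prop_reported described above can be sketched in base R. This is an illustration of the idea only, not the package's internal code: the toy counts are invented, and the calculation assumes cum_prop_reported is the cumulative count divided by the largest count seen for that reference date (max_confirm).

```r
# Toy cumulative counts for a single reference date (illustrative values only).
obs <- data.frame(
  reference_date = as.Date("2021-10-01"),
  report_date = as.Date("2021-10-01") + 0:5,
  confirm = c(10, 18, 24, 27, 29, 30)
)
obs$delay <- as.integer(obs$report_date - obs$reference_date)

# Assumed definitions: final count for the reference date, and the share
# of that final count reported by each delay.
obs$max_confirm <- max(obs$confirm)
obs$cum_prop_reported <- obs$confirm / obs$max_confirm

# Keep delays 0 to max_delay - 1, mirroring the zero-indexed max_delay.
max_delay <- 4
kept <- obs[obs$delay < max_delay, ]

# Share of the final count still visible after truncation.
prop_at_cutoff <- kept$cum_prop_reported[nrow(kept)]
print(prop_at_cutoff)
```

Here the truncated data retain 27 of 30 reports, so a value this far below 1 would suggest trying a longer max_delay.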

## Usage

enw_preprocess_data(
  obs,
  by = c(),
  max_delay = 20,
  set_negatives_to_zero = TRUE,
  ...
)

## Arguments

obs

A data frame containing at least the following variables: reference_date (index date of interest), report_date (report date for observations), and confirm (cumulative observations by reference and report date).
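As a rough illustration of the expected input shape, the sketch below builds a small data frame with the three required variables in base R. The dates and counts are made up; only the column names come from the description above.

```r
# One row per reference date and report date pair, with cumulative counts.
obs <- data.frame(
  reference_date = as.Date(c(
    "2021-10-01", "2021-10-01", "2021-10-01",
    "2021-10-02", "2021-10-02"
  )),
  report_date = as.Date(c(
    "2021-10-01", "2021-10-02", "2021-10-03",
    "2021-10-02", "2021-10-03"
  )),
  confirm = c(10, 18, 24, 12, 19)
)

# Check that the required variables are present before preprocessing.
required <- c("reference_date", "report_date", "confirm")
has_required <- all(required %in% names(obs))
print(has_required)
```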

by

A character vector describing the stratification of observations. This defaults to no grouping. Use it when modelling multiple time series so that each series can be identified in downstream modelling.
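To show what stratifying by a set of variables means in practice, here is a base R sketch that gives each unique combination of the by variables its own identifier. It is a conceptual illustration only: the column name group_id is hypothetical, and the package's own grouping helper may label groups differently.

```r
# Toy observations with two stratification variables (values are illustrative).
obs <- data.frame(
  location = c("DE", "DE", "DE-BY", "DE-BY"),
  age_group = c("00+", "80+", "00+", "00+"),
  stringsAsFactors = FALSE
)
by <- c("location", "age_group")

# One row per unique combination of the by variables.
groups <- unique(obs[, by])
groups$group_id <- seq_len(nrow(groups))  # hypothetical identifier column

# Attach the identifier to every observation.
obs <- merge(obs, groups, by = by)
n_groups <- nrow(groups)
print(n_groups)
```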

max_delay

Numeric, defaults to 20. The maximum number of days to include in the delay distribution. Computation scales non-linearly with this setting, so consider carefully what maximum makes sense for your data. Note that this is zero-indexed, so it includes the reference date and max_delay - 1 further days.
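The zero-indexing can be made concrete with a short base R sketch. The variable name included_delays is illustrative and not part of the package.

```r
max_delay <- 20

# Delay 0 is the reference date itself, followed by max_delay - 1 further days.
included_delays <- seq(0, max_delay - 1)

print(range(included_delays))
print(length(included_delays))
```

With the default setting, the delays retained run from 0 to 19, so a report arriving 20 days after the reference date falls outside the modelled window.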

set_negatives_to_zero

Logical, defaults to TRUE. Should negative counts (in the calculated incidence of observations) be set to zero? Downstream modelling does not currently support negative counts, so this must be TRUE if you intend to use epinowcast().
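Negative incidence typically arises when a cumulative count is revised downwards between reports. The base R sketch below shows the idea with invented numbers; it simply truncates negative increments at zero and is not necessarily how the package handles them internally.

```r
# Cumulative counts for one reference date; the third report is a downward revision.
confirm <- c(10, 18, 16, 21)

# Incidence is the difference between successive cumulative counts.
new_confirm <- diff(c(0, confirm))

# Replace negative increments with zero.
new_confirm_nonneg <- pmax(new_confirm, 0)

print(new_confirm)
print(new_confirm_nonneg)
```

Note that after truncation the increments no longer sum to the final cumulative count, which is worth keeping in mind when interpreting the processed data.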

...

Other arguments passed to enw_add_metaobs_features(), e.g. holidays. That function sets commonly used metadata (e.g. day of week, days since start of time series).

## Value

A data.table containing processed observations as a series of nested data frames as well as variables containing metadata. These are:

• obs: Observations with the addition of empirical reporting proportions, restricted to the specified maximum delay.

• new_confirm: Incidence of notifications by reference and report date. Empirical reporting distributions are also added.

• latest: The latest available observations.

• missing_reference: Observations missing reference dates.

• reporting_triangle: Incident observations by report and reference date in the standard reporting triangle matrix format.

• metareference: Metadata for reference dates derived from observations.

• metareport: Metadata for report dates.

• metadelay: Metadata for reporting delays produced using enw_delay_metadata().

• time: Numeric, number of timepoints in the data.

• snapshots: Numeric, number of available data snapshots to use for nowcasting.

• groups: Numeric, number of groups/strata in the supplied observations (set using by).

• max_delay: Numeric, the maximum delay in the processed data.

• max_date: The maximum available report date.
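The reporting triangle format mentioned in the list above can be illustrated with a small base R sketch: one row per reference date, one column per delay, and missing values where a report has not yet been observed. The data are invented, and the object returned by the package carries additional columns, so treat this as a picture of the layout rather than a reproduction of the output.

```r
# Incident counts in long format (illustrative values only).
inc <- data.frame(
  reference_date = as.Date(c(
    "2021-10-01", "2021-10-01", "2021-10-01",
    "2021-10-02", "2021-10-02",
    "2021-10-03"
  )),
  delay = c(0, 1, 2, 0, 1, 0),
  new_confirm = c(10, 8, 6, 12, 7, 9)
)

ref_dates <- sort(unique(inc$reference_date))
delays <- sort(unique(inc$delay))

# Rows are reference dates, columns are delays; unobserved cells stay NA.
tri <- matrix(
  NA_real_,
  nrow = length(ref_dates), ncol = length(delays),
  dimnames = list(as.character(ref_dates), as.character(delays))
)
tri[cbind(
  match(inc$reference_date, ref_dates),
  match(inc$delay, delays)
)] <- inc$new_confirm

print(tri)
```

The most recent reference dates have the fewest observed delays, which gives the matrix its triangular shape.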

## See also

Preprocessing functions: enw_add_delay(), enw_add_max_reported(), enw_add_metaobs_features(), enw_assign_group(), enw_complete_dates(), enw_construct_data(), enw_cumulative_to_incidence(), enw_delay_filter(), enw_delay_metadata(), enw_extend_date(), enw_filter_reference_dates(), enw_filter_report_dates(), enw_incidence_to_cumulative(), enw_latest_data(), enw_metadata(), enw_missing_reference(), enw_reporting_triangle_to_long(), enw_reporting_triangle()

## Examples

library(data.table)

# Filter example hospitalisation data to be national and over all ages
nat_germany_hosp <- germany_covid19_hosp[location == "DE"]
nat_germany_hosp <- nat_germany_hosp[age_group %in% "00+"]

# Preprocess with default settings
pobs <- enw_preprocess_data(nat_germany_hosp)
pobs
#>                     obs           new_confirm               latest
#> 1: <data.table[3770x9]> <data.table[3770x11]> <data.table[198x10]>
#>    missing_reference   reporting_triangle       metareference
#> 1: <data.table[0x6]> <data.table[198x22]> <data.table[198x9]>
#>              metareport          metadelay time snapshots by groups max_delay
#> 1: <data.table[217x12]> <data.table[20x4]>  198       198         1        20
#>      max_date
#> 1: 2021-10-20