Simulation Data Sets for Comparison of Aberration Detection Methods
Introduction
An accurate comparison of sensitivity, specificity, and time to detection for aberration detection methods requires data with clearly defined outbreaks. Natural data are poorly suited to such comparisons: it is often impossible or impractical to establish the exact starting date of a large outbreak, or to determine whether each small upswing is an outbreak. Investigating enough data sources in that detail to support a statistically valid comparison of detection methods would be infeasible. In this context, artificial data are the natural solution.
Data Sets
For this study, 56 simulated data sets were generated, each containing 1,000 iterations of 6 years of daily data (1994-1999), drawn from a negative binomial distribution with superimposed outbreaks. Means and standard deviations were based on observed values from national and local public health systems and syndromic surveillance systems. Adjustments were made for days of the week, holidays, post-holiday periods, seasonality, and trend.
The data streams are designed to simulate a variety of natural sources and to pose challenges to aberration detection methods. They provide a consistent and well-defined testbed for the objective comparison of detection algorithms.
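As a rough illustration of how a mean and standard deviation can be turned into negative binomial parameters, the sketch below uses the common parameterization with mean r(1-p)/p and variance r(1-p)/p^2. The function name `negbin_params` is hypothetical, and the page does not say which parameterization the generator used or how the day-of-week, holiday, seasonal, and trend adjustments were applied, so this is only an assumption-laden starting point.

```python
def negbin_params(mean, sd):
    """Convert a mean and standard deviation to negative binomial (r, p).

    Assumes the parameterization mean = r(1-p)/p and variance = r(1-p)/p**2,
    which is only defined when the variance exceeds the mean.
    """
    var = sd ** 2
    if var <= mean:
        raise ValueError("negative binomial requires variance > mean")
    p = mean / var                  # success probability
    r = mean ** 2 / (var - mean)    # dispersion (number of successes)
    return r, p


# Mean and standard deviation listed for set s01 in the download table.
r, p = negbin_params(90.2, 33.3)
print(f"r = {r:.3f}, p = {p:.4f}")
```

The returned pair can be passed to any negative binomial sampler that accepts a real-valued r, after checking that the sampler uses the same convention for p.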
Outbreak Types
Ten types of outbreaks were randomly placed throughout the data streams. Start days for outbreaks were selected at random using a binomial distribution: one (1) indicated the start of an outbreak and zero (0) indicated no start. The type of outbreak that started on a given day was the remainder obtained by dividing the Julian date (1 - 365) by 10, giving types zero through nine. Outbreaks did not overlap; the smallest allowed interval from the end of one outbreak to the start of the next was 5 days.
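The type-selection rule and the spacing rule can be sketched as follows. The names `outbreak_type`, `far_enough`, and `MIN_GAP_DAYS` are hypothetical, and reading the 5-day rule as "start date minus previous end date is at least 5" is an assumption. The text gives the Julian date range as 1 - 365; in a leap year such as 1996 the day of year reaches 366, which this sketch would map to type 6.

```python
from datetime import date

MIN_GAP_DAYS = 5  # smallest allowed interval between outbreaks, per the text


def outbreak_type(start_day):
    """Outbreak type (0-9): day of year modulo 10."""
    return start_day.timetuple().tm_yday % 10


def far_enough(previous_end, next_start):
    """True if the next outbreak starts at least MIN_GAP_DAYS after the last one ended."""
    return (next_start - previous_end).days >= MIN_GAP_DAYS


print(outbreak_type(date(1994, 2, 6)))                   # day of year 37
print(far_enough(date(1994, 1, 1), date(1994, 1, 6)))
```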
Outbreak types one and two were 1-day spikes with magnitudes of two and three times the standard deviation of the detrended data, respectively. Outbreak types seven through nine and zero were based on a lognormal distribution using two means, each at two and three times the standard deviation of the detrended data. These two means were chosen to represent explosive and more gradual outbreaks, in which 95% of the cases appear from 1-4 days and from 7-13 days after exposure, respectively. In outbreak types three through six, the lognormal distribution of attributable symptomatic cases was reversed. See the Outbreak Types file under Downloads for a complete listing of outbreak types.
The lognormal distribution was selected to model the majority of the outbreaks following work by Sartwell in 1949, which has been widely used since. The scenario underlying the use of this distribution is a single-source, common-vehicle outbreak, presumably resulting from a bioterrorist attack. The single-day spike signals and the reversed lognormal allow for alternative outbreak scenarios.
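One way to picture the explosive and gradual shapes is to fit a lognormal distribution whose central 95% interval matches the stated ranges of 1-4 and 7-13 days. The sketch below does that arithmetic. Treating those ranges as central 95% intervals is an assumption, the function name `lognormal_from_interval` is hypothetical, and the page does not describe how the original parameters were actually chosen.

```python
from math import log
from statistics import NormalDist


def lognormal_from_interval(lo, hi, coverage=0.95):
    """Return (mu, sigma) of log-time so that `coverage` of cases fall in [lo, hi].

    Assumes the interval is central: equal tail mass below lo and above hi.
    """
    z = NormalDist().inv_cdf(0.5 + coverage / 2)   # about 1.96 for 95%
    mu = (log(lo) + log(hi)) / 2
    sigma = (log(hi) - log(lo)) / (2 * z)
    return mu, sigma


explosive = lognormal_from_interval(1, 4)    # 95% of cases 1-4 days after exposure
gradual = lognormal_from_interval(7, 13)     # 95% of cases 7-13 days after exposure
print(explosive, gradual)
```

Reversing the resulting daily case curve in time would give the shape used for the reversed lognormal outbreak types.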
Dataset Format
In the datasets themselves, the baseline data generated by the negative binomial distribution is stored as a variable called NEGBIN. The superimposed outbreaks are stored in the OUTCOUNT variable. The two variables are summed in the TOTALCOUNT variable. There are two versions of the datasets available: The "Count Only" version has only the date and TOTALCOUNT values; the "Solution" version contains date and TOTALCOUNT variables in addition to separate variables for the baseline and outbreak data. A description of the variables in each version of the datasets follows:
Variable | Description | Count Only Set | Solution Set |
---|---|---|---|
Date | The date for the count values, expressed as a SAS date value (not a date-time value). The date range for each iteration of the dataset is 1/1/94 through 12/31/99. | X | X |
TotalCount | The sum of the baseline data (NEGBIN) and the outbreak data (OUTCOUNT). This value represents the number of observed cases in natural data and should be used as input for detection methods. | X | X |
Iteration | Each dataset contains 1,000 iterations of six years' worth of data. The iteration number (1-1000) distinguishes the separate iterations. | X | X |
NegBin | This value represents the baseline, or noise, data. | | X |
OutCount | This value represents outbreak cases that occurred on top of the NegBin value. OutCount represents the signal value. | | X |
OBType | The value identifies the type of outbreak that is occurring on a given day. A missing value corresponds to no outbreak. See Outbreak Types. | | X |
The datasets are available for download at the bottom of this page.
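Because the Date variable is a SAS date value, which counts days from 1 January 1960, it needs converting before use outside SAS. A minimal sketch of the conversion follows; the function names are hypothetical, and how the values are read from the downloaded files is left open because the file layout is not described here.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date value 0


def sas_date_to_date(value):
    """Convert a SAS date value (days since 1960-01-01) to a calendar date."""
    return SAS_EPOCH + timedelta(days=int(value))


def date_to_sas_date(day):
    """Convert a calendar date to a SAS date value."""
    return (day - SAS_EPOCH).days


print(sas_date_to_date(12419))                  # first day of each iteration, 1/1/94
print(date_to_sas_date(date(1999, 12, 31)))     # last day of each iteration
```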
Term Definitions
In comparing detection methods, terms are defined as follows:
Term | Definition |
---|---|
Outbreak | A set of 1 or more consecutive days where the OUTCOUNT variable is greater than zero (0). An outbreak must be bordered on both ends by days where OUTCOUNT does equal zero (0), but only days where OUTCOUNT is greater than zero (0) are part of the outbreak. |
Outbreak Day | A single day where the OUTCOUNT variable is greater than zero (0). |
Non-outbreak Day | A single day where the OUTCOUNT variable is equal to zero (0). |
Sensitivity | The sensitivity of a method for a given dataset is defined as the total number of outbreaks (see above definition) during which the method flagged at least 1 day divided by the total number of outbreak periods in that dataset. |
Specificity | The specificity of a method for a given dataset is defined as the total number of non-outbreak days (see above definition) on which the method did not flag divided by the total number of non-outbreak days in that dataset. |
Time to Detection | The time to detection for a method on a given dataset is defined as the average number of days from the first day of an outbreak until it is flagged by the method. If an outbreak is flagged on the first day, then the time to detection would be zero (0). If an outbreak is flagged on the second day, then the time to detection would be one (1), and so on. If the method fails to flag an outbreak, then the time to detection value should be missing. |
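
The definitions above can be sketched as a small scoring routine for one iteration of a Solution dataset. The function names and the list-based inputs are hypothetical. Two points are assumptions: only flags raised on outbreak days count toward detecting that outbreak, and undetected outbreaks are left out of the time-to-detection average (their value is missing, shown as `None`).

```python
from statistics import mean


def outbreak_periods(outcount):
    """Return (start, end) index pairs for runs of consecutive days with OUTCOUNT > 0."""
    periods, start = [], None
    for i, count in enumerate(outcount):
        if count > 0 and start is None:
            start = i
        elif count <= 0 and start is not None:
            periods.append((start, i - 1))
            start = None
    if start is not None:  # series ends during an outbreak
        periods.append((start, len(outcount) - 1))
    return periods


def evaluate(outcount, flags):
    """Score a method's daily flags against the known outbreak signal."""
    periods = outbreak_periods(outcount)
    delays = []
    for start, end in periods:
        flagged = [d for d in range(start, end + 1) if flags[d]]
        delays.append(flagged[0] - start if flagged else None)  # None = missing
    detected = [d for d in delays if d is not None]
    quiet_days = [i for i, count in enumerate(outcount) if count == 0]
    unflagged = sum(1 for i in quiet_days if not flags[i])
    return {
        "sensitivity": len(detected) / len(periods) if periods else None,
        "specificity": unflagged / len(quiet_days) if quiet_days else None,
        "time_to_detection": mean(detected) if detected else None,
    }


outcount = [0, 0, 3, 5, 2, 0, 0, 0, 4, 0]   # two outbreaks: days 2-4 and day 8
flags = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]      # one false alarm, one detection
print(evaluate(outcount, flags))
```

In this toy series the first outbreak is flagged on its second day and the second is missed, so sensitivity is 0.5, specificity is 5/6, and time to detection is 1.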
Downloads
Title | Download | Last Updated |
---|---|---|
Data Set Parameters | Excel (36 KB) | Feb 24, 2004 |
Outbreak Types | Excel (14 KB) | Feb 24, 2004 |
PLEASE NOTE: Because of their size, the files below may take considerable time to download or display. To facilitate use of datasets, right-click on the file name and select “Save Target As...” to download the file onto your computer. You can then open the file from the saved location and view it locally.
Set | Mean | Standard Deviation | Trend | Seasonality | Download Size | Estimated Download Time | File |
---|---|---|---|---|---|---|---|
s01 | 90.2 | 33.3 | Yes | Mild-None | 11.87 MB | 56K: 30 min; Cable/DSL: 7 min | s01 (ZIP) |
s02 | 29.9 | 5.6 | No | Medium | 10.02 MB | 56K: 25 min; Cable/DSL: 6 min | s02 (ZIP) |
s03 | 1.19 | 5.75 | No | Mild-None | 7.92 MB | 56K: 20 min; Cable/DSL: 4 min | s03 (ZIP) |
s04 | 6 | 4.3 | Yes | Very | 10.47 MB | 56K: 26 min; Cable/DSL: 6 min | s04 (ZIP) |
s05 | 37.5 | 13.35 | No | Very | 11.12 MB | 56K: 28 min; Cable/DSL: 6 min | s05 (ZIP) |
s06 | 1.19 | 5.75 | Yes | Very | 11.76 MB | 56K: 29 min; Cable/DSL: 7 min | s06 (ZIP) |
s07 | 150 | 26.635 | No | Mild-None | 12.30 MB | 56K: 31 min; Cable/DSL: 7 min | s07 (ZIP) |
s08 | 90.2 | 33.3 | No | Very | 12.63 MB | 56K: 32 min; Cable/DSL: 7 min | s08 (ZIP) |
s09 | 29.9 | 5.6 | Yes | Mild-None | 10.12 MB | 56K: 25 min; Cable/DSL: 6 min | s09 (ZIP) |
s10 | 29.9 | 5.6 | No | Mild-None | 9.83 MB | 56K: 25 min; Cable/DSL: 5 min | s10 (ZIP) |
s11 | 301.1 | 78.8 | Yes | Medium | 13.63 MB | 56K: 34 min; Cable/DSL: 8 min | s11 (ZIP) |
s12 | 6 | 4.3 | Yes | Medium | 9.85 MB | 56K: 25 min; Cable/DSL: 6 min | s12 (ZIP) |
s13 | 150 | 26.635 | No | Very | 12.42 MB | 56K: 31 min; Cable/DSL: 7 min | s13 (ZIP) |
s14 | 37.5 | 13.35 | No | Medium | 10.96 MB | 56K: 27 min; Cable/DSL: 6 min | s14 (ZIP) |
s15 | 6 | 4.3 | Yes | Mild-None | 9.14 MB | 56K: 23 min; Cable/DSL: 5 min | s15 (ZIP) |
s16 | 6 | 4.3 | No | Mild-None | 9.32 MB | 56K: 23 min; Cable/DSL: 5 min | s16 (ZIP) |
s17 | 90.2 | 33.3 | Yes | Very | 12.99 MB | 56K: 32 min; Cable/DSL: 7 min | s17 (ZIP) |
s18 | 74.5 | 20.9 | Yes | Mild-None | 11.26 MB | 56K: 28 min; Cable/DSL: 6 min | s18 (ZIP) |
s19 | 90.2 | 33.3 | Yes | Very | 12.67 MB | 56K: 32 min; Cable/DSL: 7 min | s19 (ZIP) |
s20 | 1.19 | 5.75 | No | Medium | 10.03 MB | 56K: 25 min; Cable/DSL: 6 min | s20 (ZIP) |
s21 | 74.5 | 20.9 | Yes | Very | 11.87 MB | 56K: 30 min; Cable/DSL: 7 min | s21 (ZIP) |
s22 | 37.5 | 13.35 | No | Mild-None | 10.88 MB | 56K: 27 min; Cable/DSL: 6 min | s22 (ZIP) |
s23 | 37.5 | 13.35 | Yes | Medium | 11.14 MB | 56K: 28 min; Cable/DSL: 6 min | s23 (ZIP) |
s24 | 3 | 2.451 | No | Very | 9.54 MB | 56K: 24 min; Cable/DSL: 5 min | s24 (ZIP) |
s25 | 90.2 | 33.3 | No | Mild-None | 11.86 MB | 56K: 30 min; Cable/DSL: 7 min | s25 (ZIP) |
s26 | 74.5 | 20.9 | Yes | Mild-None | 12.12 MB | 56K: 30 min; Cable/DSL: 7 min | s26 (ZIP) |
s27 | 150 | 26.635 | Yes | Very | 12.63 MB | 56K: 32 min; Cable/DSL: 7 min | s27 (ZIP) |
s28 | 74.5 | 20.9 | No | Mild-None | 11.26 MB | 56K: 28 min; Cable/DSL: 6 min | s28 (ZIP) |
s29 | 150 | 26.635 | No | Medium | 12.35 MB | 56K: 31 min; Cable/DSL: 7 min | s29 (ZIP) |
s30 | 150 | 26.6 | No | Medium | 12.36 MB | 56K: 31 min; Cable/DSL: 7 min | s30 (ZIP) |
s31 | 150 | 26.635 | Yes | Mild-None | 12.19 MB | 56K: 31 min; Cable/DSL: 7 min | s31 (ZIP) |
s32 | 301.1 | 78.8 | Yes | Mild-None | 13.07 MB | 56K: 33 min; Cable/DSL: 7 min | s32 (ZIP) |
s33 | 29.9 | 5.6 | Yes | Medium | 10.15 MB | 56K: 25 min; Cable/DSL: 6 min | s33 (ZIP) |
s34 | 3 | 2.451 | Yes | Mild-None | 8.50 MB | 56K: 21 min; Cable/DSL: 5 min | s34 (ZIP) |
s35 | 6 | 4.3 | No | Very | 9.82 MB | 56K: 25 min; Cable/DSL: 5 min | s35 (ZIP) |
s36 | 150 | 26.635 | Yes | Medium | 12.31 MB | 56K: 31 min; Cable/DSL: 7 min | s36 (ZIP) |
s37 | 3 | 2.451 | Yes | Mild-None | 8.61 MB | 56K: 21 min; Cable/DSL: 5 min | s37 (ZIP) |
s38 | 29.9 | 5.6 | Yes | Very | 10.68 MB | 56K: 27 min; Cable/DSL: 6 min | s38 (ZIP) |
s39 | 3 | 2.451 | Yes | Medium | 9.04 MB | 56K: 23 min; Cable/DSL: 5 min | s39 (ZIP) |
s40 | 3 | 2.451 | No | Very | 9.34 MB | 56K: 23 min; Cable/DSL: 5 min | s40 (ZIP) |
s41 | 3 | 2.451 | No | Medium | 8.97 MB | 56K: 22 min; Cable/DSL: 5 min | s41 (ZIP) |
s42 | 74.5 | 20.9 | No | Medium | 11.46 MB | 56K: 29 min; Cable/DSL: 6 min | s42 (ZIP) |
s43 | 74.5 | 20.9 | No | Very | 11.60 MB | 56K: 29 min; Cable/DSL: 6 min | s43 (ZIP) |
s44 | 301.1 | 78.8 | No | Very | 13.17 MB | 56K: 33 min; Cable/DSL: 7 min | s44 (ZIP) |
s45 | 37.5 | 13.35 | No | Medium | 10.93 MB | 56K: 27 min; Cable/DSL: 6 min | s45 (ZIP) |
s46 | 3 | 2.451 | No | Mild-None | 8.51 MB | 56K: 21 min; Cable/DSL: 5 min | s46 (ZIP) |
s47 | 74.5 | 20.9 | Yes | Medium | 11.38 MB | 56K: 28 min; Cable/DSL: 6 min | s47 (ZIP) |
s48 | 301.1 | 78.8 | Yes | Very | 13.81 MB | 56K: 34 min; Cable/DSL: 8 min | s48 (ZIP) |
s49 | 1.19 | 5.75 | Yes | Medium | 10.57 MB | 56K: 26 min; Cable/DSL: 6 min | s49 (ZIP) |
s50 | 301.1 | 78.8 | No | Mild-None | 12.95 MB | 56K: 32 min; Cable/DSL: 7 min | s50 (ZIP) |
s51 | 90.2 | 33.3 | No | Mild-None | 8.63 MB | 56K: 22 min; Cable/DSL: 5 min | s51 (ZIP) |
s52 | 90.2 | 33.3 | No | Medium | 8.67 MB | 56K: 22 min; Cable/DSL: 5 min | s52 (ZIP) |
s53 | 37.5 | 13.35 | No | Very | 8.32 MB | 56K: 21 min; Cable/DSL: 5 min | s53 (ZIP) |
s54 | 150 | 26.635 | Yes | Mild-None | 8.58 MB | 56K: 21 min; Cable/DSL: 5 min | s54 (ZIP) |
s55 | 301.1 | 78.8 | Yes | Medium | 9.13 MB | 56K: 23 min; Cable/DSL: 5 min | s55 (ZIP) |
s56 | 3 | 2.451 | Yes | Very | 7.21 MB | 56K: 18 min; Cable/DSL: 4 min | s56 (ZIP) |
- Page last updated April 16, 2004
- Content source: CDC Emergency Communication System (ECS), Division of Health Communication and Marketing (DHCM), National Center for Health Marketing (NCHM)