
Simulation Data Sets for Comparison of Aberration Detection Methods

Introduction

An accurate comparison of sensitivity, specificity, and time to detection for aberration detection methods requires data with clearly defined outbreaks. Natural data are poorly suited to such comparisons. It is often impossible or impractical to establish the exact starting date of a large outbreak, or to determine whether each small upswing is an outbreak. It would be infeasible to conduct such detailed investigations on enough data sources to form the basis for a statistically valid comparison of detection methods. Artificial data, in which the outbreaks are known by construction, are therefore the natural solution.

Data Sets

For this study, 56 simulated data sets were generated, each containing 1,000 iterations of 6 years of daily data (1994-1999), using a negative binomial distribution with superimposed outbreaks. Means and standard deviations were based on observed values from national and local public health systems and syndromic surveillance systems. Adjustments were made for days of the week, holidays, post-holiday periods, seasonality, and trend.
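As a rough illustration of how a mean and standard deviation pair can define a negative binomial baseline, the sketch below converts the two values to the distribution's size and probability parameters by the method of moments. The function name and the choice of parameterization are assumptions; the source does not say how the generator was parameterized, and the published means and standard deviations may already reflect day-of-week, seasonal, and trend adjustments.

```python
def negbin_params(mean, sd):
    """Method-of-moments (r, p) for a negative binomial with the given mean and SD.

    Assumed parameterization: mean = r * (1 - p) / p, variance = r * (1 - p) / p**2.
    """
    variance = sd ** 2
    if variance <= mean:
        # A negative binomial is overdispersed: its variance exceeds its mean.
        raise ValueError("negative binomial requires variance > mean")
    p = mean / variance
    r = mean ** 2 / (variance - mean)
    return r, p


# Mean and standard deviation listed for data set s01 in the download table.
r, p = negbin_params(90.2, 33.3)
```

Every mean and standard deviation pair in the download table satisfies the variance > mean condition, so the conversion is defined for all 56 sets.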

The data streams are designed to simulate a variety of natural sources and to pose challenges to aberration detection methods. They provide a consistent and well-defined testbed for the objective comparison of detection algorithms.

Outbreak Types

Ten types of outbreaks were randomly placed throughout the data streams. Days for the start of outbreaks were randomly selected using a binomial distribution; one (1) indicated the start of an outbreak and zero (0) indicated no start of an outbreak. The type of outbreak that started on a given day was selected using the remainder obtained by dividing the Julian date (day of the year, 1 - 365) by 10, so that remainders 0 through 9 correspond to outbreak types 0 through 9. Outbreaks did not overlap; the smallest allowed time interval from the end of one outbreak to the start of the next was 5 days.
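The type-selection rule can be sketched as follows. The function names are hypothetical, and the reading of the 5-day rule as a difference of at least 5 calendar days is an assumption; the source does not give the exact convention or the probability used to pick start days.

```python
from datetime import date


def outbreak_type_for(start_day):
    """Outbreak type (0-9): remainder of the day of the year divided by 10."""
    # tm_yday runs 1-365, or up to 366 in a leap year such as 1996.
    day_of_year = start_day.timetuple().tm_yday
    return day_of_year % 10


def far_enough(prev_end, next_start, min_gap_days=5):
    """True if next_start falls at least min_gap_days after prev_end (assumed convention)."""
    return (next_start - prev_end).days >= min_gap_days


print(outbreak_type_for(date(1994, 1, 1)))   # day 1 of the year  -> 1
print(outbreak_type_for(date(1994, 1, 10)))  # day 10 of the year -> 0
```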

Outbreak types one and two were 1-day spikes with magnitudes of three and two times the standard deviation of the detrended data, respectively. Outbreak types seven through nine and zero were based on a lognormal distribution using two means, each with peak sizes of two and three times the detrended standard deviation. These two means were chosen to represent explosive and more gradual outbreaks in which 95% of the cases appear from 1-4 days and from 7-13 days after exposure, respectively. In outbreak types three through six, the lognormal distribution of attributable symptomatic cases was reversed. See the table below for a complete listing of outbreak types.

The lognormal distribution was selected to model the majority of the outbreaks based on work by Sartwell in 1949, which has been widely used since. The scenario underlying the use of this distribution is a single-source, common-vehicle outbreak, presumably resulting from a bioterrorist attack. The single-day spike signals and the reversed lognormal allow for alternative outbreak scenarios.

Table. Outbreak Types
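Outbreak Type | Distribution | Incubation Time (days) | ZETA | SIGMA | Peak Size (X * Std Dev)
1 | Spike | 1 | NA | NA | 3
2 | Spike | 1 | NA | NA | 2
3 | Invert Log Normal | Less than 7 | 1.3 | 0.4 | 3
4 | Invert Log Normal | Less than 7 | 1.3 | 0.4 | 2
5 | Invert Log Normal | 7-14 | 2.4 | 0.3 | 3
6 | Invert Log Normal | 7-14 | 2.4 | 0.3 | 2
7 | Log Normal | Less than 7 | 1.3 | 0.4 | 3
8 | Log Normal | Less than 7 | 1.3 | 0.4 | 2
9 | Log Normal | 7-14 | 2.4 | 0.3 | 3
0 | Log Normal | 7-14 | 2.4 | 0.3 | 2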
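
To make the outbreak shapes concrete, the sketch below builds a daily outbreak curve from a lognormal density and scales it so that its largest day equals a chosen peak size. It assumes that ZETA and SIGMA are the mean and standard deviation of the logarithm of the incubation time, that the curve is evaluated at whole days, and that "reversed" means the same daily values in the opposite order; none of these details is confirmed by the source. The function names, the `n_days` cutoff, and the use of the s01 standard deviation as a stand-in for the detrended standard deviation are illustrative only.

```python
import math


def lognormal_pdf(x, zeta, sigma):
    """Density of a lognormal whose logarithm has mean zeta and SD sigma."""
    exponent = -((math.log(x) - zeta) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (x * sigma * math.sqrt(2 * math.pi))


def outbreak_curve(zeta, sigma, peak_size, n_days, reverse=False):
    """Daily counts shaped like a lognormal density, scaled so the top day equals peak_size."""
    density = [lognormal_pdf(day, zeta, sigma) for day in range(1, n_days + 1)]
    scale = peak_size / max(density)
    curve = [round(d * scale) for d in density]
    # Assumption: the inverted lognormal is the same curve in reverse order.
    return curve[::-1] if reverse else curve


stand_in_sd = 33.3  # hypothetical stand-in for the detrended standard deviation
curve = outbreak_curve(1.3, 0.4, peak_size=3 * stand_in_sd, n_days=10)
reversed_curve = outbreak_curve(1.3, 0.4, peak_size=3 * stand_in_sd, n_days=10, reverse=True)
```

Under these assumptions the forward curve rises quickly and decays slowly, while the reversed curve builds slowly and ends abruptly.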

Dataset Format

In the datasets themselves, the baseline data generated by the negative binomial distribution is stored as a variable called NEGBIN. The superimposed outbreaks are stored in the OUTCOUNT variable. The two variables are summed in the TOTALCOUNT variable. There are two versions of the datasets available: the "Count Only" version has only the date, iteration, and TOTALCOUNT values; the "Solution" version contains the same variables plus separate variables for the baseline data, the outbreak data, and the outbreak type. A description of the variables in each version of the datasets follows:

Table. Variables in Each Version of the Datasets
Variable | Description | Count Only Set | Solution Set
Date | This is the date for the count values, expressed as a SAS date value (not date-time). The date range for each iteration of the dataset is 1/1/94 through 12/31/99. | X | X
TotalCount | The variable contains the sum of the baseline data (NEGBIN) and the outbreak data (OUTCOUNT). This value represents the number of observed cases in natural data and should be used as input for detection methods. | X | X
Iteration | Each dataset contains 1,000 iterations of six years' worth of data. The iteration number (1-1000) distinguishes the separate iterations. | X | X
NegBin | This value represents the baseline, or noise, data. |  | X
OutCount | This value represents outbreak cases that occurred on top of the NegBin value. OutCount represents the signal value. |  | X
OBType | The value identifies the type of outbreak that is occurring on a given day. A missing value corresponds to no outbreak. See outbreak types. |  | X
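
Because the Date variable is a SAS date value, which counts days from January 1, 1960, it may need converting before use outside SAS. The sketch below shows one way to do that, along with a consistency check for Solution-set rows. The function names are hypothetical, and how the rows are read from the downloaded files is left open because the source does not describe the file layout.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this date


def sas_date_to_date(sas_value):
    """Convert a SAS date value (days since 1960-01-01) to a Python date."""
    return SAS_EPOCH + timedelta(days=int(sas_value))


def row_is_consistent(total_count, neg_bin, out_count):
    """True if a Solution-set row satisfies TotalCount = NegBin + OutCount."""
    return total_count == neg_bin + out_count


print(sas_date_to_date(12419))  # 1994-01-01, the first day of each iteration
print(sas_date_to_date(14609))  # 1999-12-31, the last day of each iteration
```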

The datasets are available for download at the bottom of this page.

Term Definitions

In comparing detection methods, terms are defined as follows:

Table. Term Definitions
Term | Definition
Outbreak | A set of 1 or more consecutive days where the OUTCOUNT variable is greater than zero (0). An outbreak must be bordered on both ends by days where OUTCOUNT does equal zero (0), but only days where OUTCOUNT is greater than zero (0) are part of the outbreak.
Outbreak Day | A single day where the OUTCOUNT variable is greater than zero (0).
Non-outbreak Day | A single day where the OUTCOUNT variable is equal to zero (0).
Sensitivity | The sensitivity of a method for a given dataset is defined as the total number of outbreaks (see above definition) during which the method flagged at least 1 day divided by the total number of outbreak periods in that dataset.
Specificity | The specificity of a method for a given dataset is defined as the total number of non-outbreak days (see above definition) on which the method did not flag divided by the total number of non-outbreak days in that dataset.
Time to Detection | The time to detection for a method on a given dataset is defined as the average number of days from the first day of an outbreak until it is flagged by the method. If an outbreak is flagged on the first day, then the time to detection would be zero (0). If an outbreak is flagged on the second day, then the time to detection would be one (1), and so on. If the method fails to flag an outbreak, then the time to detection value should be missing.
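
The definitions above can be turned into a small scoring routine. The sketch below is one possible reading, not a reference implementation: it takes the OUTCOUNT series and a method's daily flags for a single iteration, treats the start and end of the series as outbreak borders (an assumption), and averages time to detection over detected outbreaks only, leaving it as `None` when nothing is detected. All function names are hypothetical.

```python
def find_outbreaks(outcount):
    """Return (start, end) index pairs, inclusive, for runs of days with OUTCOUNT > 0."""
    outbreaks, start = [], None
    for i, count in enumerate(outcount):
        if count > 0 and start is None:
            start = i
        elif count <= 0 and start is not None:
            outbreaks.append((start, i - 1))
            start = None
    if start is not None:  # run reaches the end of the series
        outbreaks.append((start, len(outcount) - 1))
    return outbreaks


def evaluate(outcount, flags):
    """Return (sensitivity, specificity, mean time to detection) for one iteration."""
    outbreaks = find_outbreaks(outcount)
    delays = []
    for start, end in outbreaks:
        flagged_days = [i for i in range(start, end + 1) if flags[i]]
        if flagged_days:
            delays.append(flagged_days[0] - start)  # 0 if flagged on the first day
    quiet_days = [i for i, count in enumerate(outcount) if count == 0]
    unflagged = sum(1 for i in quiet_days if not flags[i])
    sensitivity = len(delays) / len(outbreaks) if outbreaks else None
    specificity = unflagged / len(quiet_days) if quiet_days else None
    time_to_detection = sum(delays) / len(delays) if delays else None
    return sensitivity, specificity, time_to_detection


# Two outbreaks (days 2-3 and day 6); the method flags days 1 and 3.
outcount = [0, 0, 3, 5, 0, 0, 2, 0, 0, 0]
flags = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
sensitivity, specificity, ttd = evaluate(outcount, flags)
```

In this toy series one of the two outbreaks is caught on its second day, so sensitivity is 0.5 and time to detection is 1.0; one of the seven non-outbreak days is flagged, so specificity is 6/7.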

Downloads

Download Documentation
Title | Download | Last Updated
Data Set Parameters | Excel (36 KB) | Feb 24, 2004
Outbreak Types | Excel (14 KB) | Feb 24, 2004

PLEASE NOTE: Because of their size, the files below may take considerable time to download or display. To facilitate use of datasets, right-click on the file name and select “Save Target As...” to download the file onto your computer. You can then open the file from the saved location and view it locally.

Download Datasets
Set | Mean | Standard Deviation | Trend | Seasonality | Download Size | Estimated Download Time | File
s01 | 90.2 | 33.3 | Yes | Mild-None | 11.87 MB | 56K: 30 min; Cable/DSL: 7 min | s01 (ZIP)
s02 | 29.9 | 5.6 | No | Medium | 10.02 MB | 56K: 25 min; Cable/DSL: 6 min | s02 (ZIP)
s03 | 1.19 | 5.75 | No | Mild-None | 7.92 MB | 56K: 20 min; Cable/DSL: 4 min | s03 (ZIP)
s04 | 6 | 4.3 | Yes | Very | 10.47 MB | 56K: 26 min; Cable/DSL: 6 min | s04 (ZIP)
s05 | 37.5 | 13.35 | No | Very | 11.12 MB | 56K: 28 min; Cable/DSL: 6 min | s05 (ZIP)
s06 | 1.19 | 5.75 | Yes | Very | 11.76 MB | 56K: 29 min; Cable/DSL: 7 min | s06 (ZIP)
s07 | 150 | 26.635 | No | Mild-None | 12.30 MB | 56K: 31 min; Cable/DSL: 7 min | s07 (ZIP)
s08 | 90.2 | 33.3 | No | Very | 12.63 MB | 56K: 32 min; Cable/DSL: 7 min | s08 (ZIP)
s09 | 29.9 | 5.6 | Yes | Mild-None | 10.12 MB | 56K: 25 min; Cable/DSL: 6 min | s09 (ZIP)
s10 | 29.9 | 5.6 | No | Mild-None | 9.83 MB | 56K: 25 min; Cable/DSL: 5 min | s10 (ZIP)
s11 | 301.1 | 78.8 | Yes | Medium | 13.63 MB | 56K: 34 min; Cable/DSL: 8 min | s11 (ZIP)
s12 | 6 | 4.3 | Yes | Medium | 9.85 MB | 56K: 25 min; Cable/DSL: 6 min | s12 (ZIP)
s13 | 150 | 26.635 | No | Very | 12.42 MB | 56K: 31 min; Cable/DSL: 7 min | s13 (ZIP)
s14 | 37.5 | 13.35 | No | Medium | 10.96 MB | 56K: 27 min; Cable/DSL: 6 min | s14 (ZIP)
s15 | 6 | 4.3 | Yes | Mild-None | 9.14 MB | 56K: 23 min; Cable/DSL: 5 min | s15 (ZIP)
s16 | 6 | 4.3 | No | Mild-None | 9.32 MB | 56K: 23 min; Cable/DSL: 5 min | s16 (ZIP)
s17 | 90.2 | 33.3 | Yes | Very | 12.99 MB | 56K: 32 min; Cable/DSL: 7 min | s17 (ZIP)
s18 | 74.5 | 20.9 | Yes | Mild-None | 11.26 MB | 56K: 28 min; Cable/DSL: 6 min | s18 (ZIP)
s19 | 90.2 | 33.3 | Yes | Very | 12.67 MB | 56K: 32 min; Cable/DSL: 7 min | s19 (ZIP)
s20 | 1.19 | 5.75 | No | Medium | 10.03 MB | 56K: 25 min; Cable/DSL: 6 min | s20 (ZIP)
s21 | 74.5 | 20.9 | Yes | Very | 11.87 MB | 56K: 30 min; Cable/DSL: 7 min | s21 (ZIP)
s22 | 37.5 | 13.35 | No | Mild-None | 10.88 MB | 56K: 27 min; Cable/DSL: 6 min | s22 (ZIP)
s23 | 37.5 | 13.35 | Yes | Medium | 11.14 MB | 56K: 28 min; Cable/DSL: 6 min | s23 (ZIP)
s24 | 3 | 2.451 | No | Very | 9.54 MB | 56K: 24 min; Cable/DSL: 5 min | s24 (ZIP)
s25 | 90.2 | 33.3 | No | Mild-None | 11.86 MB | 56K: 30 min; Cable/DSL: 7 min | s25 (ZIP)
s26 | 74.5 | 20.9 | Yes | Mild-None | 12.12 MB | 56K: 30 min; Cable/DSL: 7 min | s26 (ZIP)
s27 | 150 | 26.635 | Yes | Very | 12.63 MB | 56K: 32 min; Cable/DSL: 7 min | s27 (ZIP)
s28 | 74.5 | 20.9 | No | Mild-None | 11.26 MB | 56K: 28 min; Cable/DSL: 6 min | s28 (ZIP)
s29 | 150 | 26.635 | No | Medium | 12.35 MB | 56K: 31 min; Cable/DSL: 7 min | s29 (ZIP)
s30 | 150 | 26.6 | No | Medium | 12.36 MB | 56K: 31 min; Cable/DSL: 7 min | s30 (ZIP)
s31 | 150 | 26.635 | Yes | Mild-None | 12.19 MB | 56K: 31 min; Cable/DSL: 7 min | s31 (ZIP)
s32 | 301.1 | 78.8 | Yes | Mild-None | 13.07 MB | 56K: 33 min; Cable/DSL: 7 min | s32 (ZIP)
s33 | 29.9 | 5.6 | Yes | Medium | 10.15 MB | 56K: 25 min; Cable/DSL: 6 min | s33 (ZIP)
s34 | 3 | 2.451 | Yes | Mild-None | 8.50 MB | 56K: 21 min; Cable/DSL: 5 min | s34 (ZIP)
s35 | 6 | 4.3 | No | Very | 9.82 MB | 56K: 25 min; Cable/DSL: 5 min | s35 (ZIP)
s36 | 150 | 26.635 | Yes | Medium | 12.31 MB | 56K: 31 min; Cable/DSL: 7 min | s36 (ZIP)
s37 | 3 | 2.451 | Yes | Mild-None | 8.61 MB | 56K: 21 min; Cable/DSL: 5 min | s37 (ZIP)
s38 | 29.9 | 5.6 | Yes | Very | 10.68 MB | 56K: 27 min; Cable/DSL: 6 min | s38 (ZIP)
s39 | 3 | 2.451 | Yes | Medium | 9.04 MB | 56K: 23 min; Cable/DSL: 5 min | s39 (ZIP)
s40 | 3 | 2.451 | No | Very | 9.34 MB | 56K: 23 min; Cable/DSL: 5 min | s40 (ZIP)
s41 | 3 | 2.451 | No | Medium | 8.97 MB | 56K: 22 min; Cable/DSL: 5 min | s41 (ZIP)
s42 | 74.5 | 20.9 | No | Medium | 11.46 MB | 56K: 29 min; Cable/DSL: 6 min | s42 (ZIP)
s43 | 74.5 | 20.9 | No | Very | 11.60 MB | 56K: 29 min; Cable/DSL: 6 min | s43 (ZIP)
s44 | 301.1 | 78.8 | No | Very | 13.17 MB | 56K: 33 min; Cable/DSL: 7 min | s44 (ZIP)
s45 | 37.5 | 13.35 | No | Medium | 10.93 MB | 56K: 27 min; Cable/DSL: 6 min | s45 (ZIP)
s46 | 3 | 2.451 | No | Mild-None | 8.51 MB | 56K: 21 min; Cable/DSL: 5 min | s46 (ZIP)
s47 | 74.5 | 20.9 | Yes | Medium | 11.38 MB | 56K: 28 min; Cable/DSL: 6 min | s47 (ZIP)
s48 | 301.1 | 78.8 | Yes | Very | 13.81 MB | 56K: 34 min; Cable/DSL: 8 min | s48 (ZIP)
s49 | 1.19 | 5.75 | Yes | Medium | 10.57 MB | 56K: 26 min; Cable/DSL: 6 min | s49 (ZIP)
s50 | 301.1 | 78.8 | No | Mild-None | 12.95 MB | 56K: 32 min; Cable/DSL: 7 min | s50 (ZIP)
s51 | 90.2 | 33.3 | No | Mild-None | 8.63 MB | 56K: 22 min; Cable/DSL: 5 min | s51 (ZIP)
s52 | 90.2 | 33.3 | No | Medium | 8.67 MB | 56K: 22 min; Cable/DSL: 5 min | s52 (ZIP)
s53 | 37.5 | 13.35 | No | Very | 8.32 MB | 56K: 21 min; Cable/DSL: 5 min | s53 (ZIP)
s54 | 150 | 26.635 | Yes | Mild-None | 8.58 MB | 56K: 21 min; Cable/DSL: 5 min | s54 (ZIP)
s55 | 301.1 | 78.8 | Yes | Medium | 9.13 MB | 56K: 23 min; Cable/DSL: 5 min | s55 (ZIP)
s56 | 3 | 2.451 | Yes | Very | 7.21 MB | 56K: 18 min; Cable/DSL: 4 min | s56 (ZIP)
Contact Us:
  • Centers for Disease Control and Prevention
    1600 Clifton Rd
    Atlanta, GA 30333
  • 800-CDC-INFO
    (800-232-4636)
    TTY: (888) 232-6348
    24 Hours/Every Day
  • cdcinfo@cdc.gov