Understanding Descriptive Statistics

Introduction
Descriptive Statistics
Graphical Summaries: Dispersion Graphs
Graphical Summaries: Histograms
Numerical Summaries: Measures of Central Tendency
Numerical Summaries: Measures of Dispersion
Conclusion

Introduction

The term statistics can have several meanings. In one sense, statistics refers to data. For instance, the number of people per county within a State is an example of numerical data. The type of vegetative cover found across a State is an example of non-numerical data. Statistics can also refer to specific mathematical operations performed on data.

maps showing numerical and non-numerical data

When talking about statistics as mathematical operations, there are two basic divisions within the field: descriptive and inferential. Descriptive statistics use graphical and numerical summaries to give a 'picture' of a data set. Inferential statistics use mathematical probabilities to make generalizations about a large group based on data collected from a small sample of that group. This article focuses on descriptive statistics and their use.

Descriptive Statistics

To help explain descriptive statistics, we will use the total number of tornadoes recorded in each State (and the District of Columbia) during 2000, as shown in Table 1.

Descriptive statistics can include graphical summaries that show the spread of the data, and numerical summaries that either measure the central tendency (a 'typical' data value) of a data set or that describe the spread of the data.

State	Number of Tornadoes	State	Number of Tornadoes
Alabama	44	Montana	10
Alaska	0	Nebraska	60
Arizona	0	Nevada	2
Arkansas	37	New Hampshire	0
California	9	New Jersey	0
Colorado	60	New Mexico	5
Connecticut	1	New York	5
Delaware	0	North Carolina	23
District of Columbia	0	North Dakota	28
Florida	77	Ohio	25
Georgia	28	Oklahoma	44
Hawaii	0	Oregon	3
Idaho	13	Pennsylvania	5
Illinois	55	Rhode Island	1
Indiana	13	South Carolina	20
Iowa	45	South Dakota	18
Kansas	59	Tennessee	27
Kentucky	23	Texas	147
Louisiana	43	Utah	3
Maine	2	Vermont	0
Maryland	8	Virginia	11
Massachusetts	1	Washington	3
Michigan	4	West Virginia	4
Minnesota	32	Wisconsin	18
Mississippi	27	Wyoming	5
Missouri	28
Table 1. The total number of recorded tornadoes in 2000, arranged alphabetically by State and including the District of Columbia. Source: National Oceanic and Atmospheric Administration's National Climatic Data Center

Graphical Summaries: Dispersion Graphs

Dispersion graphs (also called dot plots) are one kind of graphical summary. Researchers use dispersion graphs to identify patterns in data such as concentrations, data 'gaps', or atypical data (i.e., observations that do not fit the general character of the data). A dispersion graph places individual data values along a number line, thereby representing the position of each data value in relation to all the other data values. Figure 1 shows a dispersion graph of the tornado data from Table 1. We can see that most of the data are concentrated at the lower end of the graph, indicating that most States had 20 or fewer tornadoes, and only a few States had more than 50 tornadoes. Texas had 147 tornadoes during 2000, and this data value is positioned on the far right-hand side of the graph. There are also several gaps in the data where there are no values; these are shown by the green rectangles on the graph. Since the data are concentrated toward the lower end of the dispersion graph, we can say that the number of tornadoes for Texas (147) is atypical of this particular data set. Atypical data values are also referred to as outliers.

Figure 1. A dispersion graph of the 2000 tornado data from Table 1.

Graphical Summaries: Histograms

Another kind of graphical summary is the histogram, which combines data into groups or classes as a way to generalize the details of a data set while at the same time illustrating the data's overall pattern. On a histogram, the x-axis represents the data values arranged into classes while the y-axis shows the number of occurrences in each class.

Figure 2. A histogram showing the 2000 tornado data.

In the Figure 2 histogram we see that the first class contains all the States that experienced between zero and nineteen tornadoes during 2000. Notice that each class has the same width along the x-axis. The decision to set the width of each class at twenty (the first class covers the counts 0 through 19) is arbitrary. A different width could easily be used and would likely change the overall appearance of the histogram. As with dispersion graphs, histograms can show gaps where no data values exist. In Figure 2, there are three empty classes: 80-99, 100-119, and 120-139.
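The grouping into classes can be sketched in a few lines of Python. This is only an illustration, not the method used to draw Figure 2: the list `counts` is typed in from Table 1, and the names `CLASS_WIDTH` and `class_counts` are hypothetical helpers introduced here.

```python
from collections import Counter

# Tornado counts by State for 2000, typed in from Table 1 (51 values).
counts = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

CLASS_WIDTH = 20  # classes 0-19, 20-39, ...; an arbitrary choice, as in the article


def class_counts(values, width):
    """Return (low, high, frequency) for every class up to the largest value."""
    tally = Counter(v // width for v in values)
    n_classes = max(tally) + 1
    return [(i * width, i * width + width - 1, tally.get(i, 0))
            for i in range(n_classes)]


classes = class_counts(counts, CLASS_WIDTH)
for low, high, freq in classes:
    # One asterisk per State gives a rough text histogram.
    print(f"{low:>3}-{high:<3} {'*' * freq} ({freq})")
```

Changing `CLASS_WIDTH` to another value, such as 10 or 25, shows how the choice of class width changes the apparent shape of the distribution.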

When histogram data cluster to one side or the other, the shape of the histogram is described as 'skewed'. In Figure 2, the tornado data are clustered on the lower or left-hand side, which is known as positive skew. Due to the single outlier, the data is said to 'tail' to the positive side.

Figure 3 illustrates the different degrees of skew that are typical of data sets. Data sets that have a greater number of high values, with outliers on the low end of the data scale (data that 'tail' to the negative side), are said to have negative skew. Histogram B in Figure 3 is an example of data having a negative skew. Histogram C in Figure 3 is an example of a normal data set which is without a skew due to the absence of outliers concentrated on one particular side of the distribution.


Figure 3. Histograms displaying examples of different degrees of skew.

Numerical Summaries: Measures of Central Tendency

Measures of central tendency are numerical summaries used to summarize a data set with a single 'typical' number. Three commonly reported measures of central tendency are the mean, median, and mode. With large data sets, the calculation of a measure of central tendency is best handled through a computer software package that will minimize the chance of errors.

Mean
The mean, commonly called the average, is a mathematically computed value which represents a central value of a given data set. The mean is computed by adding all the data values together and dividing by n, where n represents the total number of data values. For our tornado data, adding all the data values together results in 1,076—the total number of tornadoes recorded in all States during 2000. Dividing this total by 51 gives us a mean of 21.1. If we examine the mean in relation to all data values (Figure 4), we can see that the mean lies toward the lower end of the dispersion graph, which makes sense because this is where the majority of the data values are concentrated.

The mean represents a generalization of the data and therefore, interpretation of its value must be done with care or else the value can be misleading. The mean suggests that for any given State there were, on average, 21.1 tornadoes during 2000. A quick glance at Table 2 shows that no State had exactly 21.1 tornadoes—each State had either more or fewer tornadoes than the mean value. However, note that there are a few States with a number of tornadoes close to 21.1. Also note that the mean is influenced by extremes in the data. In other words, in a data set having extremely high or low data values, the mean tends to be 'pulled' in the direction of those outliers and therefore can misrepresent the data's central tendency. Thus, it should not be surprising that the mean for our tornado data is pulled to the right by the value for Texas (147).

Figure 4. A dispersion graph showing the position of the mean number of tornadoes by State for 2000.
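As a check on the arithmetic, the mean can be computed directly from the Table 1 values. The sketch below is illustrative only; the second calculation, which drops the Texas value, is an assumption added here to show how far a single outlier pulls the mean.

```python
# Tornado counts by State for 2000, typed in from Table 1 (51 values).
counts = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

total = sum(counts)   # sum of all data values
n = len(counts)       # number of data values
mean = total / n
print(total, n, round(mean, 1))

# Illustration only: recompute the mean without the Texas outlier (147).
without_texas = [c for c in counts if c != 147]
mean_without_texas = sum(without_texas) / len(without_texas)
print(round(mean_without_texas, 1))
```

Removing the single Texas value lowers the mean from about 21.1 to about 18.6, which is the 'pull' described above.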

Median
If we divide the data into two equal halves where each half contains 50% of the data, the numerical value where the data are divided is called the median. You can also think of the median as the 50th percentile: the point with as many data values below it as above it. To compute the median, three steps are required. First, the data are ordered by rank, as has been done in Table 2. Second, the data position is calculated. This requires examining the data to determine whether there is an even or odd number of data values. The tornado data set has 51 data values, which is an odd number. In this case, where there is an odd number of data values, the following equation is used:

(n + 1)/2 = Rp

where Rp is the rank-position of the median in the rank-ordered data and n represents the number of data values.

Using this equation, we can insert the appropriate values for our data set:

(51 + 1)/2 = 26

which gives the data position of the median in the ranked-order tornado data set, not the median value. Third, to find the median value, look at data position 26 in the rank-ordered data set, which is Virginia. The data value associated with the rank of 26 is 11, which is the median for the tornado data set. The median in this case equally divides the data into two halves, so that there are exactly 25 data values above and 25 data values below the median value of 11.

Rank	State	Number of Tornadoes	Rank	State	Number of Tornadoes
1	Alaska	0	27	Idaho	13
2	Arizona	0	28	Indiana	13
3	District of Columbia	0	29	South Dakota	18
4	Delaware	0	30	Wisconsin	18
5	Hawaii	0	31	South Carolina	20
6	New Hampshire	0	32	Kentucky	23
7	New Jersey	0	33	North Carolina	23
8	Vermont	0	34	Ohio	25
9	Connecticut	1	35	Mississippi	27
10	Massachusetts	1	36	Tennessee	27
11	Rhode Island	1	37	Georgia	28
12	Maine	2	38	Missouri	28
13	Nevada	2	39	North Dakota	28
14	Oregon	3	40	Minnesota	32
15	Utah	3	41	Arkansas	37
16	Washington	3	42	Louisiana	43
17	Michigan	4	43	Alabama	44
18	West Virginia	4	44	Oklahoma	44
19	New Mexico	5	45	Iowa	45
20	New York	5	46	Illinois	55
21	Pennsylvania	5	47	Kansas	59
22	Wyoming	5	48	Colorado	60
23	Maryland	8	49	Nebraska	60
24	California	9	50	Florida	77
25	Montana	10	51	Texas	147
26	Virginia	11
Table 2. The 2000 tornado data from Table 1 ranked in ascending order.

If we look at Figure 5, we see the tornado data median value of 11 on the dispersion graph. Note that for this data set, the median is positioned closer to the lower end of the data values than the mean. This shows that the median is not influenced by outliers as was the mean, but by the number of data values. When a data set has outliers, reporting the median as the central tendency of the data often gives a better 'typical' data value than the mean.

Figure 5. A dispersion graph comparing the median and mean values for the number of tornadoes by State for 2000.

How would you compute a median if there were an even number of data values? In the case where there is an even number of data values, the following equation is used:

Average [(n/2) + ((n/2) +1)] = Rp

where Rp is the rank-position of the median in the rank-ordered data and n represents the number of data values.

Unlike the first equation, the equation for an even number of data values identifies two rank positions, n/2 and (n/2) + 1, and the median is the average of the data values found at those two positions. To illustrate, assume that we removed the data value for Texas, leaving us with only 50 data values. Next, begin with the data ranked in order as in Table 2. Substituting the appropriate values into the equation gives us the following rank positions: (50/2) = 25 and ((50/2) + 1) = 26. In Table 2, Montana is ranked 25th with 10 tornadoes and Virginia is ranked 26th with 11 tornadoes. If we average the data values corresponding to the 25th and 26th ranks (10 and 11, respectively), we have a median value of 10.5. It is important to remember that, regardless of which equation is used, the rank position is not the median value itself, but the rank that is then used to find the median value.
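The two cases can be sketched together in Python. The function name `median_by_rank` is a hypothetical helper introduced for this illustration; it follows the rank-position steps described in this section rather than any particular software package.

```python
# Tornado counts by State for 2000, typed in from Table 1 (51 values).
counts = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]


def median_by_rank(values):
    """Median found via rank positions, as described in the text."""
    ranked = sorted(values)              # step 1: order the data by rank
    n = len(ranked)
    if n % 2 == 1:
        rp = (n + 1) // 2                # odd n: a single rank position
        return ranked[rp - 1]            # ranks start at 1, list indexes at 0
    lower, upper = n // 2, n // 2 + 1    # even n: the two middle rank positions
    return (ranked[lower - 1] + ranked[upper - 1]) / 2


median_all = median_by_rank(counts)                              # 51 values
median_no_texas = median_by_rank([c for c in counts if c != 147])  # 50 values
print(median_all, median_no_texas)
```

Note that removing Texas moves the median only from 11 to 10.5, a much smaller change than the same removal makes to the mean.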

Mode
The mode is the data value that occurs the most frequently in a data set. Although not used as often as the mean and the median, by identifying the most commonly occurring data value the mode may suggest the central tendency of the data. For the tornado data, the mode is 0. There are eight States that did not experience any tornadoes in 2000. However, it would be misleading to suggest that the central tendency of this data set is 0, since it is obvious from the data values that the value of 0 is not 'central' to the range of values.
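Finding the mode amounts to counting how often each value occurs. A minimal sketch using the `Counter` class from Python's standard library follows; the variable names are illustrative.

```python
from collections import Counter

# Tornado counts by State for 2000, typed in from Table 1 (51 values).
counts = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

# most_common(1) returns the single most frequent value with its frequency.
mode_value, mode_frequency = Counter(counts).most_common(1)[0]
print(mode_value, mode_frequency)
```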

Numerical Summaries: Measures of Dispersion

While measures of central tendency summarize a data set with a single 'typical' number, it is also useful to describe the 'spread' of the data with a single number. Describing how a data set is distributed can be accomplished through one of the measures of dispersion: variance, standard deviation, or interquartile range.

Examine once again the dispersion graph in Figure 1. As mentioned earlier, a dispersion graph shows the distribution of the data along the number line. We described the tornado data as concentrated toward the lower end of the number line. However, the data ranges from a low of 0 to a high of 147, which may be considered to be quite a large range. Describing this spread with a single number rather than using words can be more convenient and is the basis of measures of dispersion.

Variance
One measure of dispersion is the variance. Suppose we subtracted the mean (21.1) from each State's tornado data value. The resulting value is called a deviation score and tells us the numerical distance between the data value and the data's 'typical' value. Notice in Table 3 that the sum of all the deviation scores equals zero. This results because the data values above and below the mean have positive and negative deviation scores, respectively. In other words, the positive and negative deviation scores cancel each other out. To remove the negative values we can square the deviation scores, and the sum of the squared deviation scores (36,096.5) is called the sum of squares. If we divide the sum of squares by the number of data values (51), the result is the variance (707.8). The variance, then, is the average of the squared deviation scores. By itself, the variance is rarely reported, but it is necessary to compute the standard deviation, which is a more meaningful measure of dispersion. Table 3 lists the deviation scores and squared deviation scores for our tornado data.

State	Number of Tornadoes	Deviation Scores	Squared Deviation Scores
Alaska	0	-21.1	445.21
Arizona	0	-21.1	445.21
District of Columbia	0	-21.1	445.21
Delaware	0	-21.1	445.21
Hawaii	0	-21.1	445.21
New Hampshire	0	-21.1	445.21
New Jersey	0	-21.1	445.21
Vermont	0	-21.1	445.21
Connecticut	1	-20.1	404.01
Massachusetts	1	-20.1	404.01
Rhode Island	1	-20.1	404.01
Maine	2	-19.1	364.81
Nevada	2	-19.1	364.81
Oregon	3	-18.1	327.61
Utah	3	-18.1	327.61
Washington	3	-18.1	327.61
Michigan	4	-17.1	292.41
West Virginia	4	-17.1	292.41
New Mexico	5	-16.1	259.21
New York	5	-16.1	259.21
Pennsylvania	5	-16.1	259.21
Wyoming	5	-16.1	259.21
Maryland	8	-13.1	171.61
California	9	-12.1	146.41
Montana	10	-11.1	123.21
Virginia	11	-10.1	102.01
Idaho	13	-8.1	65.61
Indiana	13	-8.1	65.61
South Dakota	18	-3.1	9.61
Wisconsin	18	-3.1	9.61
South Carolina	20	-1.1	1.21
Kentucky	23	1.9	3.61
North Carolina	23	1.9	3.61
Ohio	25	3.9	15.21
Mississippi	27	5.9	34.81
Tennessee	27	5.9	34.81
Georgia	28	6.9	47.61
Missouri	28	6.9	47.61
North Dakota	28	6.9	47.61
Minnesota	32	10.9	118.81
Arkansas	37	15.9	252.81
Louisiana	43	21.9	479.61
Alabama	44	22.9	524.41
Oklahoma	44	22.9	524.41
Iowa	45	23.9	571.21
Illinois	55	33.9	1149.21
Kansas	59	37.9	1436.41
Colorado	60	38.9	1513.21
Nebraska	60	38.9	1513.21
Florida	77	55.9	3124.81
Texas	147	125.9	15850.81
		Sum=0.0	Sum=36096.5
Table 3. The 2000 tornado data's deviation scores, squared deviation scores, and their sums which are used to compute the variance and standard deviation.
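The columns of Table 3 can be reproduced with a short Python sketch. It is an illustration under one stated assumption: the code uses the unrounded mean, whereas Table 3 uses the rounded value 21.1, so individual scores differ slightly in the later decimal places while the totals agree to one decimal place.

```python
# Tornado counts by State for 2000, typed in from Table 1 (51 values).
counts = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

n = len(counts)
mean = sum(counts) / n

# Deviation score: data value minus the mean (negative below the mean).
deviations = [x - mean for x in counts]

# Squaring removes the signs; the total is the sum of squares.
sum_of_squares = sum(d ** 2 for d in deviations)

# Dividing by n, as the article does, gives the variance.
variance = sum_of_squares / n

print(round(sum(deviations), 6), round(sum_of_squares, 1), round(variance, 1))
```

The article divides by n. Some software divides by n - 1 by default (for example, `statistics.variance` in Python's standard library, whereas `statistics.pvariance` divides by n), so results from other tools may differ slightly.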

Standard Deviation
If we take the square root of the variance, the resulting number is called the standard deviation (26.6). The standard deviation is a measure of dispersion and gives us a way to describe where any given data value is located with respect to the mean. Using the standard deviation of 26.6 for the tornado data, we can create bounds around the mean that describe data positions that are ±1, ±2, or ±3 standard deviations from it. Figure 6 shows the standard deviation bounds around the mean of the tornado data. For example, if we add one standard deviation to and subtract one standard deviation from the mean, we arrive at 47.7 and -5.5, respectively. From Figure 6, we can see that most of the data fall within ±1 standard deviation of the mean, which suggests that the data are concentrated about the mean. Notice that as the number of standard deviations increases, fewer data values are found. In fact, only six data values are found beyond ±1 standard deviation from the mean. It is interesting to note that one data value is beyond ±3 standard deviations. When interpreting any standard deviation value, it is important to keep in mind that the greater the value of the standard deviation, the more spread out or dispersed a data set is likely to be.

Figure 6. A dispersion graph showing ±1, ±2, and ±3 standard deviations about the mean for the 2000 tornado data.
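The bounds in Figure 6 can be checked with a short Python sketch. The helper name `count_beyond` is hypothetical, and the code uses the unrounded mean and standard deviation, so the printed bounds may differ from the rounded figures in the text by a small amount.

```python
import math

# Tornado counts by State for 2000, typed in from Table 1 (51 values).
counts = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

n = len(counts)
mean = sum(counts) / n
variance = sum((x - mean) ** 2 for x in counts) / n  # divide by n, as in the article
sd = math.sqrt(variance)


def count_beyond(values, center, spread, k):
    """Number of values lying more than k spreads away from the center."""
    lower, upper = center - k * spread, center + k * spread
    return sum(1 for x in values if x < lower or x > upper)


for k in (1, 2, 3):
    lower, upper = mean - k * sd, mean + k * sd
    print(f"±{k} SD: {lower:.1f} to {upper:.1f}, "
          f"{count_beyond(counts, mean, sd, k)} value(s) beyond")
```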

Interquartile Range
Another measure of dispersion is known as the interquartile range. To calculate the interquartile range, we first need to be familiar with the concept of a quartile. Quartiles are the values that divide an ordered data set into four equally-sized groups. This article labels them by the percentage of data falling below them: the 25th, 50th, and 75th quartiles (elsewhere commonly called the first, second, and third quartiles, or the 25th, 50th, and 75th percentiles). You are already familiar with the 50th quartile, which is the median value and divides the data into two equal halves. The 25th quartile has 25% of the data falling below it and the 75th quartile has 75% of the data falling below it. The interquartile range describes the middle one-half (or 50%) of an ordered data set, so it represents the range between the data value of the 25th quartile and the data value of the 75th quartile.

In calculating the interquartile range, the first step is to compute the 25th and 75th quartiles and then find the difference between these two quartile values. It is important to realize that when computing a quartile, like the median, the calculation results in a data position in a rank-ordered data set and is not the data value itself.

A quartile is found using the following equation:

(Qp/100) · (n+1)

where Qp is the quartile expressed as a percentage (25 or 75) and n is the number of data values. The result is the rank position of that quartile in the rank-ordered data.

For example, using the tornado data, the 25th quartile position is 0.25(51+1) = 13 and the 75th quartile position is 0.75(51+1) = 39. Returning to our ranked data in Table 2, we find that the 13th data position is Nevada (2 tornadoes) and the 39th position is North Dakota (28 tornadoes). Having located the 25th and 75th quartiles, we can now compute the interquartile range, which is simply the difference between the 75th and 25th quartiles. For the tornado data, this difference is (28-2) = 26. Figure 7 illustrates the bounds of the interquartile range for the tornado data.

Figure 7. The interquartile range for the 2000 tornado data.
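The quartile calculation can be sketched in Python as follows. The helper names `quartile_rank` and `value_at_rank` are hypothetical. The article does not say what to do when the rank position is not a whole number, so this sketch simply refuses that case rather than guessing at an interpolation rule.

```python
# Tornado counts by State for 2000, typed in from Table 1 (51 values).
counts = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]


def quartile_rank(percent, n):
    """Rank position from the article's equation: (Qp / 100) * (n + 1)."""
    return (percent / 100) * (n + 1)


def value_at_rank(ranked, position):
    """Data value at a whole-number rank position (ranks start at 1)."""
    if position != int(position):
        # Assumption: fractional positions are outside the article's example.
        raise ValueError("rank position is not a whole number")
    return ranked[int(position) - 1]


ranked = sorted(counts)
n = len(ranked)

q25 = value_at_rank(ranked, quartile_rank(25, n))  # rank 13
q75 = value_at_rank(ranked, quartile_rank(75, n))  # rank 39
iqr = q75 - q25
print(q25, q75, iqr)
```

Statistical packages use several different conventions for locating quartiles, so their results may not match this hand calculation exactly for other data sets.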

A useful illustration of many of the concepts we have discussed in this section is shown in Figure 8, which is a box-and-whisker plot. The green-shaded box represents the interquartile range bounded by the data values that correspond to the 25th and 75th quartiles. Fifty percent of the data values fall within this box, and its length represents the interquartile range. The white line running through the green box is the median. The whiskers are the largest and smallest data values that are not outliers, where an outlier can be considered an atypical data value. Data values that are between 1.5 and 3 interquartile ranges below the 25th quartile or above the 75th quartile are considered outliers and are represented with an open circle. Data values that are more than 3 interquartile ranges below the 25th quartile or above the 75th quartile are called extreme values and are represented with an asterisk.

Using the box-and-whisker plot, you can see the position of the central tendency with respect to the interquartile range. In our case, the median is positioned toward the lower end of the data, which suggests that the data is positively skewed. You can also see the length of the interquartile range compared to the entire data set, and identify atypical data values and the degree to which those values are atypical. The numbers on top of the circle and asterisk indicate the rank of the value, and allow you to locate the specific data value in Table 2.

Figure 8. A box-and-whisker plot of the 2000 tornado data set.
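The outlier rules behind the box-and-whisker plot can be expressed as a short Python sketch. The function name `classify` and the labels it returns are illustrative. The quartile values 2 and 28 are taken from the calculation above, and the handling of values that fall exactly on a boundary is an assumption, since the article does not specify it.

```python
# Tornado counts by State for 2000, typed in from Table 1 (51 values).
counts = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

Q25, Q75 = 2, 28  # 25th and 75th quartile values found in the article


def classify(value, q25, q75):
    """Label a value as 'typical', 'outlier', or 'extreme'."""
    iqr = q75 - q25
    # Distance outside the box; zero for values inside it.
    distance = max(q25 - value, value - q75, 0)
    if distance > 3 * iqr:
        return "extreme"   # drawn as an asterisk
    if distance > 1.5 * iqr:
        return "outlier"   # drawn as an open circle
    return "typical"


labels = {x: classify(x, Q25, Q75) for x in set(counts)}
typical = [x for x in counts if labels[x] == "typical"]

# The whiskers end at the smallest and largest values that are not outliers.
whisker_low, whisker_high = min(typical), max(typical)
flagged = {x: label for x, label in labels.items() if label != "typical"}
print(whisker_low, whisker_high, flagged)
```

For the tornado data this flags 77 (Florida) as an outlier and 147 (Texas) as an extreme value, with the whiskers ending at 0 and 60.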

Conclusion

This article presented an overview of descriptive statistics. Using the number of tornadoes by State for 2000 as a sample data set, this article discussed numerical and graphical summaries. These summaries provide generalizations of the data, which are often easier to comprehend than a tabular listing of numbers. For example, we calculated that the mean number of tornadoes per State in 2000 was 21.1 and that the standard deviation was 26.6. Both of these measures are commonly reported numerical summaries. In addition, a histogram (a common graphical summary) illustrated that the 2000 tornado data had a positive skew: most States had few tornadoes while a few States had a high number of tornadoes. By themselves, the numerical and graphical summaries provided a quick summary of the larger data set without needing to see all 51 data values.

In a broader sense, the importance of descriptive statistics rests in their utility as tools for interpreting and analyzing data. As an example, measures of central tendency from different years can be directly compared to one another. The mean number of tornadoes in 2000 can be compared to the mean number of tornadoes in 1999 to see which year had a higher mean. In a similar light, means from several years may be compared to one another in order to learn how the mean number of tornadoes has changed over the past 50 years. If the mean has increased, this change may be linked to significant alterations in the Earth's global climate. Measures of dispersion can also be useful in learning which States have abnormally high numbers of tornadoes across different years. Similarly, graphical summaries provide a visual 'feel' for the data and prompt further inquiry into the data. For example, looking at separate histograms of tornado data from 1985 to 2005, we might find that the data were positively skewed in most years but appeared to have a negative skew in 1988.

Special thanks to Dr. Fritz C. Kessler, of the Department of Geography, Frostburg State University, in Frostburg, Maryland, for his contribution of this article to the National Atlas of the United States®.


National Atlas of the United States® and The National Atlas of the United States of America® are registered trademarks of the United States Department of the Interior. Source: http://nationalatlas.gov/articles/mapping/a_statistics.html (last modified April 29, 2008).
