Understanding Descriptive Statistics |
|
Introduction
Descriptive Statistics
Graphical Summaries: Dispersion Graphs
Graphical Summaries: Histograms
Numerical Summaries: Measures of Central Tendency
Numerical Summaries: Measures of Dispersion
Conclusion
|
|
Introduction |
|
The term statistics can have several meanings. In one sense, statistics
refers to data. For instance, the number of people per county within
a State is an example of numerical data. The type of vegetative
cover found across a State is an example of non-numerical data.
Statistics can also refer to specific mathematical operations performed
on data.
|
|
|
|
When talking about
statistics as mathematical operations, there are two basic divisions
within the field: descriptive and inferential. Descriptive statistics
use graphical and numerical summaries to give a 'picture' of a data
set. Inferential statistics, which rely on mathematical probability,
make generalizations about a large group based on data collected from
a small sample of that group. This article focuses on descriptive
statistics and their use. |
|
|
Descriptive Statistics |
|
To help explain descriptive statistics, we will use the total
number of tornadoes recorded in each State (including the District of
Columbia) during 2000, as shown in Table 1.
Descriptive statistics can include graphical summaries that show
the spread of the data, and numerical summaries that either measure
the central tendency (a 'typical' data value) of a data set or
describe the spread of the data.
|
|
State | Tornadoes | State | Tornadoes
Alabama | 44 | Montana | 10
Alaska | 0 | Nebraska | 60
Arizona | 0 | Nevada | 2
Arkansas | 37 | New Hampshire | 0
California | 9 | New Jersey | 0
Colorado | 60 | New Mexico | 5
Connecticut | 1 | New York | 5
Delaware | 0 | North Carolina | 23
District of Columbia | 0 | North Dakota | 28
Florida | 77 | Ohio | 25
Georgia | 28 | Oklahoma | 44
Hawaii | 0 | Oregon | 3
Idaho | 13 | Pennsylvania | 5
Illinois | 55 | Rhode Island | 1
Indiana | 13 | South Carolina | 20
Iowa | 45 | South Dakota | 18
Kansas | 59 | Tennessee | 27
Kentucky | 23 | Texas | 147
Louisiana | 43 | Utah | 3
Maine | 2 | Vermont | 0
Maryland | 8 | Virginia | 11
Massachusetts | 1 | Washington | 3
Michigan | 4 | West Virginia | 4
Minnesota | 32 | Wisconsin | 18
Mississippi | 27 | Wyoming | 5
Missouri | 28 | |
|
Table 1. The total number of recorded tornadoes
in 2000, arranged alphabetically by State and including the
District of Columbia.
Source: National Oceanic and Atmospheric Administration's National
Climatic Data Center |
|
|
|
Graphical Summaries: Dispersion Graphs |
|
Dispersion graphs (also called dot plots) are one kind of graphical
summary. Researchers use dispersion graphs to identify patterns in
data such as concentrations, 'gaps', or atypical data (i.e.,
observations that do not fit the general character of the data). A
dispersion graph places individual data values along a number line,
thereby representing the position of each data value in relation to
all the other data values. Figure 1 shows a dispersion graph of the
tornado data from Table 1. We can see that most of the data are
concentrated at the lower end of the graph, indicating that most
States had 20 or fewer tornadoes, and only a few States had more than
50 tornadoes. Texas had 147 tornadoes during 2000, and this data value
is positioned on the far right-hand side of the graph. There are also
several gaps in the data where there are no values; these are shown by
the green rectangles on the graph. Since the data are concentrated
toward the lower end of the dispersion graph, we can say that the
number of tornadoes for Texas (147) is atypical of this particular
data set. Atypical data values are also referred to as outliers.
|
|
Figure 1. A dispersion graph
of the 2000 tornado data from Table 1.
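As a rough illustration of what a dispersion graph conveys, the sketch below prints a text-only dot plot of the Table 1 values and lists the stretches of the number line where no values occur. The variable names are illustrative, and the gaps it reports are simply runs of unobserved whole numbers; they are not necessarily the same gaps highlighted in Figure 1.

```python
from collections import Counter

# Tornado counts for 2000 from Table 1 (50 States plus the District of Columbia).
tornadoes = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

counts = Counter(tornadoes)
distinct = sorted(counts)

# One row per observed value; each dot stands for one State (or DC).
for value in distinct:
    print(f"{value:3d} {'.' * counts[value]}")

# A gap is any pair of neighbouring observed values with unobserved
# whole numbers between them.
gaps = [(low, high) for low, high in zip(distinct, distinct[1:]) if high - low > 1]
widest_gap = max(gaps, key=lambda gap: gap[1] - gap[0])
print("Widest gap lies between", widest_gap[0], "and", widest_gap[1])
```

The long row of dots at 0 and the wide empty stretch before 147 correspond to the concentration at the low end and the atypical Texas value described above.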
|
|
|
Graphical Summaries: Histograms |
|
Another kind of graphical summary is the histogram, which combines
data into groups or classes as a way to generalize the details of
a data set while at the same time illustrating the data's overall
pattern. On a histogram, the x-axis represents the data values arranged
into classes while the y-axis shows the number of occurrences in
each class.
|
|
Figure 2. A histogram showing
the 2000 tornado data.
|
|
In the Figure 2 histogram we see that the first class contains
all the States that experienced between zero and nineteen tornadoes
during 2000. Notice that each class has the same width along the
x-axis. The decision to set the width of each class at twenty
(0-19, 20-39, and so on) is arbitrary. A different width could easily
be used and would likely change the overall appearance of the
histogram. As with dispersion graphs, histograms can show gaps where
no data values exist. In Figure 2, there are three empty
classes: 80-99, 100-119, and 120-139.
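The class counts behind a histogram like this can be sketched in a few lines. The example below assumes classes that each cover twenty whole-number values (0-19, 20-39, and so on), which is consistent with the empty classes named above; the constant and variable names are illustrative and do not come from the article.

```python
from collections import Counter

# Tornado counts for 2000 from Table 1 (50 States plus the District of Columbia).
tornadoes = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

CLASS_WIDTH = 20  # assumed: each class spans twenty values (0-19, 20-39, ...)

# Key each value by the lower bound of its class, then count per class.
class_counts = Counter((value // CLASS_WIDTH) * CLASS_WIDTH for value in tornadoes)

# Print every class up to the maximum, including the empty ones.
for lower in range(0, max(tornadoes) + 1, CLASS_WIDTH):
    upper = lower + CLASS_WIDTH - 1
    print(f"{lower:3d}-{upper:3d}: {'*' * class_counts[lower]}")

empty_classes = [
    lower for lower in range(0, max(tornadoes) + 1, CLASS_WIDTH)
    if class_counts[lower] == 0
]
```

Changing `CLASS_WIDTH` to, say, 10 or 25 is a quick way to see how much the apparent shape of the distribution depends on this arbitrary choice.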
When histogram data cluster to one side or the other, the shape
of the histogram is described as 'skewed'. In Figure
2, the tornado data are clustered on the lower or left-hand
side, which is known as positive skew. Due to the single outlier,
the data are said to 'tail' to the positive side.
Figure 3 illustrates the different degrees of
skew that are typical of data sets. Data sets that have a greater
number of high values, with outliers on the low end of the data
scale (data that 'tail' to the negative side), are said to have
negative skew. Histogram B in Figure 3 is an example
of data having a negative skew. Histogram C in Figure
3 is an example of a normally distributed data set, which has no
skew because outliers are not concentrated on one particular side
of the distribution.
|
|
Figure 3. Histograms displaying
examples of different degrees of skew. |
|
|
Numerical Summaries: Measures of Central Tendency |
|
Measures of central tendency are numerical summaries used to summarize
a data set with a single 'typical' number. Three commonly reported
measures of central tendency are the mean, median, and mode. With
large data sets, the calculation of a measure of central tendency
is best handled through a computer software package that will minimize
the chance of errors.
Mean
The mean, commonly called the average, is a mathematically computed
value which represents a central value of a given data set. The
mean is computed by adding all the data values together and dividing
by n, where n represents the total number of data values. For our
tornado data, adding all the data values together results in 1,076—the
total number of tornadoes recorded in all States during 2000. Dividing
this total by 51 gives us a mean of 21.1. If we examine the mean
in relation to all data values (Figure 4), we
can see that the mean lies toward the lower end of the dispersion
graph, which makes sense because this is where the majority of the
data values are concentrated.
Because the mean is a generalization of the data, its value must
be interpreted with care or it can be misleading. The mean suggests
that for any given State there
were, on average, 21.1 tornadoes during 2000. A quick glance at
Table 2 shows that no State had exactly 21.1
tornadoes—each State had either more or fewer tornadoes than
the mean value. However, note that there are a few States with a
number of tornadoes close to 21.1. Also note that the mean is influenced
by extremes in the data. In other words, in a data set having extremely
high or low data values, the mean tends to be 'pulled' in the direction
of those outliers and therefore can misrepresent the data's central
tendency. Thus, it should not be surprising that the mean for our
tornado data is pulled to the right by the value for Texas (147).
Figure 4. A dispersion graph
showing the position of the mean number of tornadoes by State for
2000.
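The arithmetic for the mean is short enough to check directly. This sketch sums the Table 1 values and divides by n; the variable names are illustrative.

```python
# Tornado counts for 2000 from Table 1 (50 States plus the District of Columbia).
tornadoes = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

total = sum(tornadoes)  # all tornadoes recorded in 2000
n = len(tornadoes)      # number of data values
mean = total / n

# Removing the Texas value shows how strongly one outlier pulls the mean.
mean_without_texas = (total - 147) / (n - 1)

print(f"total={total}, n={n}, mean={mean:.1f}")
print(f"mean without Texas={mean_without_texas:.1f}")
```

Comparing the two printed means gives a concrete sense of how far the single Texas value shifts the result.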
|
|
Median
If we divide the data into two equal halves where each half contains
50% of the data, the numerical value where the data are divided
is called the median. You can also think of the median as the 50th
percentile: the point with as many data values below it as above
it. To compute the median,
three steps are required. First, the data are ordered by rank, as
has been done in Table 2. Second, the data
position is calculated. This requires examining the data to determine
whether there is an even or odd number of data values. The tornado data
set has 51 data values, which is an odd number. In this case, where
there is an odd number of data values, the following equation is
used:
(n + 1)/2 = Rp
where Rp is the rank-position of the median in the rank-ordered data
and n represents the number of data values.
Using this equation, we can insert the appropriate values for our
data set:
(51 + 1)/2 = 26
which gives the data position of the median in the ranked-order tornado
data set, not the median value. Third, to find the median value, look
at data position 26 in the rank-ordered data set, which is Virginia.
The data value associated with the rank of 26 is 11, which is the
median for the tornado data set. The median in this case equally divides
the data into two halves, so that there are exactly 25 data values
above and 25 data values below the median value of 11. |
|
Rank | State | Tornadoes | Rank | State | Tornadoes
1 | Alaska | 0 | 27 | Idaho | 13
2 | Arizona | 0 | 28 | Indiana | 13
3 | District of Columbia | 0 | 29 | South Dakota | 18
4 | Delaware | 0 | 30 | Wisconsin | 18
5 | Hawaii | 0 | 31 | South Carolina | 20
6 | New Hampshire | 0 | 32 | Kentucky | 23
7 | New Jersey | 0 | 33 | North Carolina | 23
8 | Vermont | 0 | 34 | Ohio | 25
9 | Connecticut | 1 | 35 | Mississippi | 27
10 | Massachusetts | 1 | 36 | Tennessee | 27
11 | Rhode Island | 1 | 37 | Georgia | 28
12 | Maine | 2 | 38 | Missouri | 28
13 | Nevada | 2 | 39 | North Dakota | 28
14 | Oregon | 3 | 40 | Minnesota | 32
15 | Utah | 3 | 41 | Arkansas | 37
16 | Washington | 3 | 42 | Louisiana | 43
17 | Michigan | 4 | 43 | Alabama | 44
18 | West Virginia | 4 | 44 | Oklahoma | 44
19 | New Mexico | 5 | 45 | Iowa | 45
20 | New York | 5 | 46 | Illinois | 55
21 | Pennsylvania | 5 | 47 | Kansas | 59
22 | Wyoming | 5 | 48 | Colorado | 60
23 | Maryland | 8 | 49 | Nebraska | 60
24 | California | 9 | 50 | Florida | 77
25 | Montana | 10 | 51 | Texas | 147
26 | Virginia | 11 | | |
|
Table 2. The
2000 tornado data from Table 1 ranked in ascending order. |
|
|
If we look at Figure 5, we see the tornado data
median value of 11 on the dispersion graph. Note that for this data
set, the median is positioned closer to the lower end of the data
values than the mean. This is because the median depends on the rank
order of the data values rather than on their magnitudes, so it is
not pulled toward outliers as the mean is. When
a data set has outliers, reporting the median as the central tendency
of the data often gives a better 'typical' data value than the mean.
Figure 5. A dispersion graph
comparing the median and mean values for the number of tornadoes
by State for 2000.
How would you compute a median if there were an even number of
data values? In that case, the following equation is used:
Rp = (n/2) and ((n/2) + 1)
where Rp is the rank-position of the median in the rank-ordered data
and n represents the number of data values.
Unlike the first equation, this one yields two rank positions, and
the median is the average of the two data values found at those
positions. To illustrate, assume that we
removed the data value for Texas, leaving us with only 50 data values.
Next, begin with the data ranked in order as in Table
2. Substituting the appropriate values into the equation gives
us the following rank positions: (50/2) = 25 and ((50/2) + 1) = 26.
In Table 2, Montana is ranked 25th with 10
tornadoes and Virginia is ranked 26th with 11 tornadoes. If we average
the data values corresponding to the 25th and 26th ranks (10 and
11, respectively), we have a median value of 10.5. It is important
to remember that, regardless of which equation is used, the resulting
Rp number is not the median value, but the rank which can then be
used to find the median value.
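The two rank-position rules can be sketched as a single function. The function name `median_by_rank` is illustrative; the final check compares the result with `statistics.median` from the Python standard library.

```python
import statistics

# Tornado counts for 2000 from Table 1 (50 States plus the District of Columbia).
tornadoes = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]


def median_by_rank(values):
    """Find the median from rank positions in the rank-ordered data."""
    ordered = sorted(values)           # step 1: order the data by rank
    n = len(ordered)
    if n % 2 == 1:
        rank = (n + 1) // 2            # step 2 (odd n): a single rank position
        return ordered[rank - 1]       # step 3: ranks are 1-based, lists 0-based
    lower_rank = n // 2                # step 2 (even n): two middle rank positions
    upper_rank = lower_rank + 1
    return (ordered[lower_rank - 1] + ordered[upper_rank - 1]) / 2


median_all = median_by_rank(tornadoes)  # 51 values, rank 26
without_texas = [value for value in tornadoes if value != 147]
median_without_texas = median_by_rank(without_texas)  # 50 values, ranks 25 and 26

print(median_all, median_without_texas)
print(statistics.median(tornadoes) == median_all)
```

Dropping the largest value moves the median only from 11 to 10.5, which illustrates how little the median responds to an outlier.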
Mode
The mode is the data value that occurs the most frequently in a
data set. Although not used as often as the mean and the median,
by identifying the most commonly occurring data value the mode may
suggest the central tendency of the data. For the tornado data,
the mode is 0. Eight States (counting the District of Columbia) did
not experience any tornadoes in 2000. However, it would be misleading
to suggest that the central tendency of this data set is 0, since it
is obvious from the data values that the value of 0 is not 'central'
to the range of values.
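Finding the mode amounts to counting how often each value occurs. The sketch below does this with `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

# Tornado counts for 2000 from Table 1 (50 States plus the District of Columbia).
tornadoes = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

frequencies = Counter(tornadoes)

# most_common(1) returns a one-item list holding the (value, count) pair
# with the highest count.
mode, occurrences = frequencies.most_common(1)[0]

print(f"mode={mode}, occurring {occurrences} times")
```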
|
|
|
Numerical Summaries: Measures of Dispersion |
|
While measures of central tendency summarize a data set with a
single 'typical' number, it is also useful to describe
the 'spread' of the data with a single number. Describing
how a data set is distributed can be accomplished through one of
the measures of dispersion: variance, standard deviation, or interquartile
range.
Examine once again the dispersion graph in Figure
1. As mentioned earlier, a dispersion graph shows the distribution
of the data along the number line. We described the tornado data
as concentrated toward the lower end of the number line. However,
the data range from a low of 0 to a high of 147, which is quite
a large range. Describing this spread
with a single number rather than in words can be more convenient,
and this is the basis of measures of dispersion.
Variance
One measure of dispersion is the variance. Suppose we subtracted
the mean (21.1) from each State's tornado data value. The
resulting value is called a deviation score and tells us the numerical
distance between the data value and the data's 'typical' value.
Notice in Table 3 that the sum of all the deviation
scores equals zero. This results because the data values above
and below the mean have positive and negative deviation scores,
respectively. In other words, the positive and negative deviation
scores cancel each other out. To remove the negative values we
can square the deviation scores, and the sum of the squared deviation
scores (36,096.5) is called the sum of squares. If we divide the
sum of squares by the number of data values (51), the result
is the variance (707.8). The variance, then, is the
average of the squared deviation scores. By itself, the
variance is rarely reported, but it is necessary to compute the standard
deviation, which is a more meaningful measure of dispersion. Table
3 lists the deviation scores and squared deviation scores for
our tornado data.
|
|
State | Tornadoes | Deviation score | Squared deviation score
Alaska | 0 | -21.1 | 445.21
Arizona | 0 | -21.1 | 445.21
District of Columbia | 0 | -21.1 | 445.21
Delaware | 0 | -21.1 | 445.21
Hawaii | 0 | -21.1 | 445.21
New Hampshire | 0 | -21.1 | 445.21
New Jersey | 0 | -21.1 | 445.21
Vermont | 0 | -21.1 | 445.21
Connecticut | 1 | -20.1 | 404.01
Massachusetts | 1 | -20.1 | 404.01
Rhode Island | 1 | -20.1 | 404.01
Maine | 2 | -19.1 | 364.81
Nevada | 2 | -19.1 | 364.81
Oregon | 3 | -18.1 | 327.61
Utah | 3 | -18.1 | 327.61
Washington | 3 | -18.1 | 327.61
Michigan | 4 | -17.1 | 292.41
West Virginia | 4 | -17.1 | 292.41
New Mexico | 5 | -16.1 | 259.21
New York | 5 | -16.1 | 259.21
Pennsylvania | 5 | -16.1 | 259.21
Wyoming | 5 | -16.1 | 259.21
Maryland | 8 | -13.1 | 171.61
California | 9 | -12.1 | 146.41
Montana | 10 | -11.1 | 123.21
Virginia | 11 | -10.1 | 102.01
Idaho | 13 | -8.1 | 65.61
Indiana | 13 | -8.1 | 65.61
South Dakota | 18 | -3.1 | 9.61
Wisconsin | 18 | -3.1 | 9.61
South Carolina | 20 | -1.1 | 1.21
Kentucky | 23 | 1.9 | 3.61
North Carolina | 23 | 1.9 | 3.61
Ohio | 25 | 3.9 | 15.21
Mississippi | 27 | 5.9 | 34.81
Tennessee | 27 | 5.9 | 34.81
Georgia | 28 | 6.9 | 47.61
Missouri | 28 | 6.9 | 47.61
North Dakota | 28 | 6.9 | 47.61
Minnesota | 32 | 10.9 | 118.81
Arkansas | 37 | 15.9 | 252.81
Louisiana | 43 | 21.9 | 479.61
Alabama | 44 | 22.9 | 524.41
Oklahoma | 44 | 22.9 | 524.41
Iowa | 45 | 23.9 | 571.21
Illinois | 55 | 33.9 | 1149.21
Kansas | 59 | 37.9 | 1436.41
Colorado | 60 | 38.9 | 1513.21
Nebraska | 60 | 38.9 | 1513.21
Florida | 77 | 55.9 | 3124.81
Texas | 147 | 125.9 | 15850.81
Sum | | 0.0 | 36096.5
Table 3. The
2000 tornado data's deviation scores, squared deviation scores,
and their sums which are used to compute the variance and standard
deviation. |
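The columns of Table 3 can be reproduced with a short sketch. Note two assumptions: the code uses the unrounded mean rather than 21.1, so individual deviation scores differ slightly from the table, and it divides by n (a population variance), as the article does. Variable names are illustrative.

```python
import statistics

# Tornado counts for 2000 from Table 1 (50 States plus the District of Columbia).
tornadoes = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

n = len(tornadoes)
mean = sum(tornadoes) / n  # unrounded mean (about 21.1)

# Deviation score: data value minus the mean.
deviations = [value - mean for value in tornadoes]

# Squaring removes the signs so the scores no longer cancel out.
squared_deviations = [deviation ** 2 for deviation in deviations]

sum_of_squares = sum(squared_deviations)
variance = sum_of_squares / n  # divide by n, as in the article

print(f"sum of deviations  = {sum(deviations):.1f}")
print(f"sum of squares     = {sum_of_squares:.1f}")
print(f"variance           = {variance:.1f}")
print(f"library pvariance  = {statistics.pvariance(tornadoes):.1f}")
```

Many software packages divide by n - 1 instead (the sample variance, `statistics.variance` in Python), which would give a slightly larger value than the one reported here.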
|
|
Standard Deviation
If we take the square root of the variance, the resulting number
is called the standard deviation (26.6). The standard deviation
is a measure of dispersion and gives us a way to describe where
any given data value is located with respect to the mean. Using
the standard deviation of 26.6 for the tornado data, we can create
bounds around the mean that describe data positions that are ±1,
±2, or ±3 standard deviations. Figure
6 shows the standard deviation bounds around the mean of the
tornado data. For example, if we add one standard deviation to and
subtract one standard deviation from the mean (21.1), we arrive at 47.7
and -5.5, respectively. From Figure 6, we can
see that most of the data fall within ±1 standard deviation
of the mean, which suggests that the data are concentrated about
the mean. Notice that fewer data values are found as the distance
from the mean, measured in standard deviations, increases. In fact,
only six data values are found beyond ±1 standard deviation from the
mean, and just one data value (Texas) lies beyond ±3 standard
deviations.
When interpreting any standard deviation value it is important to
keep in mind that the greater the value of the standard deviation,
the more spread out or dispersed a data set is likely to be.
Figure 6. A dispersion graph
showing ±1, ±2, and ±3 standard deviations about the mean for the
2000 tornado data.
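The bounds shown in Figure 6 can be checked with a short sketch that counts how many data values fall outside ±1, ±2, and ±3 standard deviations of the mean. It assumes the population standard deviation (`statistics.pstdev`, which divides by n), matching the variance calculation above; variable names are illustrative.

```python
import statistics

# Tornado counts for 2000 from Table 1 (50 States plus the District of Columbia).
tornadoes = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

mean = sum(tornadoes) / len(tornadoes)
std_dev = statistics.pstdev(tornadoes)  # square root of the population variance

# For each multiple k, count the values lying outside mean ± k standard deviations.
values_beyond = {}
for k in (1, 2, 3):
    lower_bound = mean - k * std_dev
    upper_bound = mean + k * std_dev
    values_beyond[k] = sum(
        1 for value in tornadoes if value < lower_bound or value > upper_bound
    )
    print(f"±{k} SD: {lower_bound:6.1f} to {upper_bound:6.1f}, "
          f"{values_beyond[k]} value(s) outside")
```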
Interquartile Range
Another measure of dispersion is known as the interquartile range.
To calculate the interquartile range, we first need to be familiar
with the concept of a quartile. Quartiles are the values that divide
an ordered data set into four equally sized groups. You are already
familiar with the second quartile (the 50th percentile), which is the
median value and divides the data into two equal halves. The first
quartile (the 25th percentile) has 25% of the data falling below it,
and the third quartile (the 75th percentile) has 75% of the data
falling below it. The interquartile range describes the middle
one-half (or 50%) of an ordered data set, so it represents the range
between the data value of the first quartile and the data value of
the third quartile.
In calculating the interquartile range, the first step is to compute
the first and third quartiles, and the second is to find the
difference between these two quartile values. It is important to
realize that, as with the median, the quartile calculation results in
a data position in a rank-ordered data set, not the data value
itself.
A quartile position is found using the following equation:
(Qp/100) · (n+1)
where Qp is the percentile of the quartile (25 for the first quartile,
75 for the third) and n is the number of data values.
For example, using the tornado data, the first quartile position
is 0.25(51+1) = 13 and the third quartile position is 0.75(51+1)
= 39. Returning to our ranked data in Table 2,
we find that the 13th data position is Nevada (2 tornadoes) and
the 39th position is North Dakota (28 tornadoes). Having located
the first and third quartiles, we can now compute the interquartile
range, which is simply the difference between
the third and first quartiles. For the tornado data, that difference
is (28-2) = 26. Figure
7 illustrates the bounds of the interquartile range for the
tornado data.
Figure 7. The interquartile
range for the 2000 tornado data.
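The position formula (Qp/100) · (n+1) can be sketched as a small function. The function name is illustrative. For the tornado data both positions are whole numbers; the linear interpolation used when a position falls between two ranks is an assumption added here, since the article does not cover that case.

```python
# Tornado counts for 2000 from Table 1 (50 States plus the District of Columbia).
tornadoes = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]


def value_at_percentile(values, percentile):
    """Return the data value at rank position (percentile/100) * (n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = percentile * (n + 1) / 100  # 1-based rank position
    if position < 1 or position > n:
        raise ValueError("rank position falls outside the data set")
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]          # exact rank, as in the article
    # Assumption (not from the article): interpolate between neighbouring ranks.
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)


first_quartile = value_at_percentile(tornadoes, 25)  # rank position 13
third_quartile = value_at_percentile(tornadoes, 75)  # rank position 39
interquartile_range = third_quartile - first_quartile

print(first_quartile, third_quartile, interquartile_range)
```

Software packages use several different quartile conventions, so library functions may return slightly different values from this positional rule on other data sets.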
A useful illustration of many of the concepts we have discussed
in this section is shown in Figure 8, which is
a box-and-whisker plot. The green-shaded box represents the interquartile
range bounded by the data values that correspond to the 25th and
75th percentiles (the first and third quartiles). Fifty percent of
the data values fall within this
box, and its length represents the interquartile range. The white
line running through the green box is the median. The whiskers are
the largest and smallest data values that are not outliers, where
an outlier can be considered an atypical data value. Data values
that lie between 1.5 and 3 interquartile ranges below the first
quartile or above the third quartile are considered outliers and are
represented with an open circle. Data values that lie more than 3
interquartile ranges below the first quartile or above the third
quartile are called extreme
values and are represented with an asterisk.
Using the box-and-whisker plot, you can see the position of the
central tendency with respect to the interquartile range. In our
case, the median is positioned toward the lower end of the data,
which suggests that the data are positively skewed. You can also
see the length of the interquartile range compared to the entire
data set, and identify atypical data values and the degree to which
those values are atypical. The numbers on top of the circle and
asterisk indicate the rank of the value, and allow you to locate
the specific data value in Table 2.
Figure 8. A box-and-whisker
plot of the 2000 tornado data set.
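The outlier and extreme-value rules described for the box-and-whisker plot can be sketched as follows. The quartile values 2 and 28 are taken from the interquartile range calculation above; the function name `classify` and its labels are illustrative.

```python
# Tornado counts for 2000 from Table 1 (50 States plus the District of Columbia).
tornadoes = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4,
    4, 5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60, 77, 147,
]

LOWER_QUARTILE = 2   # data value at rank position 13 (from the article)
UPPER_QUARTILE = 28  # data value at rank position 39 (from the article)
IQR = UPPER_QUARTILE - LOWER_QUARTILE


def classify(value):
    """Label a value by how far it lies outside the box, in IQR units."""
    distance = max(LOWER_QUARTILE - value, value - UPPER_QUARTILE, 0)
    if distance > 3 * IQR:
        return "extreme"   # drawn as an asterisk
    if distance > 1.5 * IQR:
        return "outlier"   # drawn as an open circle
    return "typical"


outliers = sorted({v for v in tornadoes if classify(v) == "outlier"})
extremes = sorted({v for v in tornadoes if classify(v) == "extreme"})

# The whiskers end at the smallest and largest values that are not flagged.
typical = [v for v in tornadoes if classify(v) == "typical"]
whiskers = (min(typical), max(typical))

print("outliers:", outliers)
print("extreme values:", extremes)
print("whiskers:", whiskers)
```

In Table 2 the flagged values 77 and 147 are Florida (rank 50) and Texas (rank 51).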
|
|
|
Conclusion |
|
This article presented an overview of descriptive statistics.
Using the number of tornadoes by State for 2000 as a sample data
set, this article discussed numerical and graphical summaries.
These summaries provide generalizations of the data, which are
often easier to comprehend than a tabular listing of numbers. For
example, we calculated that the mean number of tornadoes per State
in 2000 was 21.1 and that the standard deviation was 26.6.
Both of these measures are commonly reported numerical summaries.
In addition, a histogram (a common graphical summary) illustrated
that the 2000 tornado data had a positive skew: most States had
few tornadoes while a few States had a high number of tornadoes.
By themselves, the numerical and graphical summaries provided a
quick summary of the larger data set without needing to see all
51 data values.
In a broader sense, the importance of descriptive statistics rests
in their utility as tools for interpreting and analyzing data.
As an example, measures of central tendency from different years
can be directly compared to one another. The mean number of tornadoes
in 2000 can be compared to the mean number of tornadoes in 1999
to see which year had a higher mean. In a similar vein, means
from several years may be compared to one another in order to learn
how the mean number of tornadoes has changed over the past 50 years.
If the mean has increased, this change may be linked to significant
alterations in the Earth's global climate. Measures of dispersion
can also be useful in learning which States have an abnormally high
number of tornadoes across different years. Similarly, graphical
summaries provide a visual 'feel' for the data and
prompt further inquiry into the data. For example, looking at separate
histograms of tornado data from 1985 to 2005, we may learn that
in most years the data were positively skewed, but that the distribution
appeared to have a negative skew in 1988.
|
|
|
|
Special thanks to Dr.
Fritz C. Kessler, of the Department
of Geography, Frostburg
State University, in Frostburg, Maryland, for his contribution
of this article to the National Atlas of the United States®.
|
|
|