National Park Service.  

USGS Status and Trends of Biological Resources   -   NPS Inventory and Monitoring

Learn R

R is a free software environment for statistical computing and graphics.
http://www.r-project.org/
For more information, please contact Paul Geissler (Paul_Geissler@usgs.gov).

Topic 3: Exploratory Data Analysis

This session assumes that you have installed R and the Rcmdr and DAAG packages as described in topic 1. If you have edited Rprofile.site, R Commander should start when you open R.

I will use R Commander as the R interface. See http://socserv.mcmaster.ca/jfox/Getting-Started-with-the-Rcmdr.pdf and http://socserv.socsci.mcmaster.ca/jfox/Misc/Rcmdr/

Wikipedia describes exploratory data analysis (EDA) as that part of statistical practice concerned with reviewing, communicating and using data where there is a low level of knowledge about its cause system. It was so named by John Tukey. Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.

Tukey held that too much emphasis in statistics was placed on evaluating and testing given hypotheses (confirmatory data analysis) and that the balance was in need of redressing in favor of using data to suggest hypotheses to test. In particular, confusing the two types of analysis and employing them on the same set of data can lead to bias owing to the issues endemic in testing hypotheses suggested by the data.

The objectives of EDA are to:

See http://www.itl.nist.gov/div898/handbook/eda/eda.htm for more information on EDA.

We will follow Maindonald and Braun (2003, Data analysis and graphics using R - an example-based approach, Cambridge University Press) Chapter 2 and Kuhnert and Venables (2005). The latter provides a good description of graphics functions.

You should now have the possum data loaded (see above).

2.1.1 Views of a single sample

Index plot
Use an index plot for a first look at the data to see whether anything is very unusual. It simply plots the values in the order in which they appear in the dataset. Load the possum data from the DAAG package. From the R Commander menu, select "Graphs" and then "Index plot".




Use the mouse to identify points with the row labels.

Right-click to stop selecting points. You can then use the File menu to save the plot to disk or copy it to the Windows clipboard.
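
The same first look can be had from the script window. The following is a minimal sketch, assuming the DAAG package is installed as described in topic 1; the choice of totlngth as the variable to plot is an assumption made for illustration.

```r
library(DAAG)    # provides the possum data frame

# Index plot: each value is plotted against its row position in the dataset
plot(possum$totlngth, xlab = "Index (row order)", ylab = "Total length (cm)")

# Row positions of the smallest and largest values, candidates for a closer look
extremes <- c(which.min(possum$totlngth), which.max(possum$totlngth))
print(possum[extremes, c("sex", "totlngth")])
```

Printing the extreme rows is a non-interactive stand-in for clicking on points with the mouse.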

Histogram

Maindonald and Braun (2003, page 31) look at female possums by using the subsetting statement
      fossum = possum[possum$sex=="f", ]
The square brackets select [rows, columns] of the data frame. Leaving the position after the comma blank selects all columns (variables). We will cover these data manipulation statements in more detail later. Enter or paste this command in the R Commander script window and, with the cursor on that command, click "Submit". Scripts from the book are available at http://wwwmaths.anu.edu.au/~johnm/r-book.html. Choose "R scripts" and then "eda.R". You can then copy the script to the R Commander script window.
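
To see how [rows, columns] indexing behaves before applying it to the possum data, here is a small sketch using only base R. The data frame animals and its values are hypothetical and exist only for this illustration.

```r
# Hypothetical toy data frame, used only to illustrate [rows, columns] indexing
animals <- data.frame(sex    = c("f", "m", "f", "m"),
                      length = c(89.0, 91.5, 85.5, 93.0))

females        <- animals[animals$sex == "f", ]          # matching rows, all columns
lengths        <- animals[ , "length"]                   # all rows, one column (a plain vector)
female_lengths <- animals[animals$sex == "f", "length"]  # both selections at once

print(females)
print(female_lengths)
```

The logical condition before the comma keeps only the rows where it is TRUE, which is exactly what the fossum statement does with possum$sex=="f".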

To make fossum the active dataset, click on the name of the current active data set (shown in blue at the top left of the R Commander window) and select fossum.


From the R Commander menu, select "Graphs" then "Histogram" and finally "totlngth" and "frequency counts".

If one selects 10 bins, the histogram is no longer symmetrical and outliers are shown on the left. Histograms are a crude display of the density, depending on the number of bins and the cut points.
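
The dependence on cut points can be checked from the script window. The sketch below uses the 43 female total lengths listed in the sort(fossum$totlngth) output in the stem-and-leaf section of this topic, typed in directly so that it runs on its own. The two sets of break points are assumptions chosen for illustration, not the settings used in the menu example.

```r
# Female possum total lengths (cm), copied from the sort(fossum$totlngth) output
totlngth <- c(75.0, 81.0, 81.5, 82.0, 82.5, 83.0, 83.0, 84.0, 84.5, 85.0, 85.0,
              85.5, 86.0, 86.5, 86.5, 87.0, 87.5, 88.0, 88.0, 88.0, 88.5, 88.5,
              89.0, 89.0, 89.0, 89.0, 89.5, 89.5, 89.5, 89.5, 90.5, 90.5, 90.5,
              91.0, 91.0, 91.5, 91.5, 92.0, 92.0, 93.0, 94.0, 95.5, 96.5)

# Same bin width (5 cm), two different sets of cut points
breaks_a <- seq(72.5, 97.5, by = 5)
breaks_b <- seq(75, 100, by = 5)

# Bin counts only, without drawing
counts_a <- hist(totlngth, breaks = breaks_a, plot = FALSE)$counts
counts_b <- hist(totlngth, breaks = breaks_b, plot = FALSE)$counts
print(counts_a)
print(counts_b)

# Draw the two histograms side by side for comparison
par(mfrow = c(1, 2))
hist(totlngth, breaks = breaks_a, main = "Breaks at 72.5, 77.5, ...")
hist(totlngth, breaks = breaks_b, main = "Breaks at 75, 80, ...")
par(mfrow = c(1, 1))
```

Shifting the cut points by 2.5 cm changes the bin counts and therefore the apparent shape, even though the data and the bin width are unchanged.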

A density curve is a better representation of density. Overlaying a density curve on the histogram requires a script. Enter the following commands into the R Commander script window, highlight them, and click "Submit". In general, many more options are available using scripts, but the menus are much easier to use. The book provides more detailed scripts.
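attach(fossum)                 # make the columns of fossum visible by name
dens = density(totlngth)       # kernel density estimate of total length
xlim = range(dens$x)           # axis limits taken from the density curve
ylim = range(dens$y)
hist(totlngth, probability = TRUE, xlim=xlim, ylim=ylim)   # density scale, not counts
lines(dens)                    # overlay the density curve
detach(fossum)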

You can get just the density plot with the command
plot(dens)

Stem-and-leaf display
A stem-and-leaf display is a fine-grained alternative to a histogram that shows each individual data point. From the R Commander menu, select "Graphs" and then "Stem-and-leaf display". Select the options that give the following command.

> stem.leaf(fossum$totlngth, style="bare", trim.outliers=FALSE)

1 | 2: represents 12
leaf unit: 1
n: 43

[In the display below, the number on the left is the cumulative number of observations counting from the tail. The number in parentheses is the number of observations in that middle row.]

   1    7 | 5
        7 |
        7 |
   3    8 | 11
   7    8 | 2233
  12    8 | 44555
  17    8 | 66677
 (13)   8 | 8888899999999
  13    9 | 0001111
   6    9 | 223
   3    9 | 45
   1    9 | 6

> sort(fossum$totlngth)
[1] 75.0 81.0 81.5 82.0 82.5 83.0 83.0 84.0 84.5 85.0 85.0 85.5 86.0 86.5 86.5 87.0 87.5 88.0 88.0 88.0 88.5 88.5 89.0
[24] 89.0 89.0 89.0 89.5 89.5 89.5 89.5 90.5 90.5 90.5 91.0 91.0 91.5 91.5 92.0 92.0 93.0 94.0 95.5 96.5

Boxplot
Boxplots provide a compact graphical summary of a distribution. Set the possum data to be active: from the R Commander menu, select "Data", then "Active data set", then "Select active data set", and then "possum". Then select "Graphs" and then "Boxplot". In the pop-up box, select the variable totlngth, check the box "identify outliers with mouse", and click the button "Plot by groups". Then select "sex" as the group.

The center line shows the median, and the box extends from the lower to the upper quartile (enclosing 50% of the observations). The upper and lower horizontal lines show the maximum and minimum, excluding outliers, which are shown by circles. By default, R treats a point as an outlier when it lies more than 1.5 times the height of the box beyond the box. Left-clicking with the mouse will identify the outliers.
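
The numbers behind a boxplot can be inspected directly. This is a minimal sketch, assuming the DAAG package is installed; it uses base R functions rather than the commands that R Commander itself generates, so the plot may differ in detail from the menu version.

```r
library(DAAG)    # provides the possum data frame

# Side-by-side boxplots of total length for each sex
boxplot(totlngth ~ sex, data = possum, ylab = "Total length (cm)")

# The values used to draw the box for the females
female <- possum$totlngth[possum$sex == "f"]
bs <- boxplot.stats(female)    # default coef = 1.5 sets the outlier rule
print(bs$stats)                # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)                  # values drawn as individual points (outliers)
```

The hinges reported by boxplot.stats are a close approximation to the lower and upper quartiles.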

2.1.2 Patterns in grouped data

Load the cuckoos data from the DAAG package. From the "Graphs" menu, select "Boxplot" for the variable "length" grouped by "species".

Strip plot
A strip plot provides a more detailed view of the data, but too much data can overwhelm the plot. A strip plot is not available from the menus, so paste the following commands into the script window, highlight them, and click "Submit".
library(lattice)     # stripplot() is in the lattice package
stripplot(species ~ length, data=cuckoos)

2.1.3 Patterns in bivariate data

Scatterplot
Select fruitohms from the DAAG package. From the menu, select "Graphs" and then "Scatterplot" and then fill in the dialog box.
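
A script version of a scatterplot for these data might look like the sketch below. It assumes the DAAG package is installed. Putting juice on the horizontal axis and ohms on the vertical axis, and adding a lowess smooth, are assumptions made for illustration; they are not taken from the dialog box settings.

```r
library(DAAG)    # provides the fruitohms data frame

# Electrical resistance against apparent juice content
plot(ohms ~ juice, data = fruitohms,
     xlab = "Apparent juice content (%)", ylab = "Resistance (ohms)")

# A smooth curve through the points helps show the overall trend
lines(lowess(fruitohms$juice, fruitohms$ohms))
```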


2.1.4 Multiple variables and times

Plot of means
Load the possum data from the DAAG package. From the R Commander menu, select "Graphs" and then "Plot of means". In the box, select "sex" as the factor, "totlngth" as the response variable, and check "confidence interval".
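
To see what a plot of means summarizes, the group means and their confidence limits can be computed by hand. This is a sketch in base R, assuming the DAAG package is installed; the t-based 95% interval is an assumed, conventional choice and may not match the menu's settings exactly.

```r
library(DAAG)    # provides the possum data frame

# Split total length by sex, then summarize each group
by_sex   <- split(possum$totlngth, possum$sex)
grp_mean <- sapply(by_sex, mean)
grp_sd   <- sapply(by_sex, sd)
grp_n    <- sapply(by_sex, length)

# t-based 95% confidence limits for each group mean
half_width <- qt(0.975, df = grp_n - 1) * grp_sd / sqrt(grp_n)
ci <- cbind(lower = grp_mean - half_width,
            mean  = grp_mean,
            upper = grp_mean + half_width)
print(round(ci, 2))

# Means with confidence-interval bars
x <- seq_along(grp_mean)
plot(x, grp_mean, ylim = range(ci), xlim = c(0.5, length(x) + 0.5),
     xaxt = "n", xlab = "sex", ylab = "mean of totlngth")
axis(1, at = x, labels = names(grp_mean))
arrows(x, ci[, "lower"], x, ci[, "upper"], angle = 90, code = 3, length = 0.1)
```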

xyplot
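This lattice plot is not available from the menu, so we will use the following script. Note that when a command is submitted from the R Commander script window, continuation lines must be indented.
library(lattice)     # xyplot() is in the lattice package
xyplot(csoa ~ it | sex*agegp, data=tinting,
     panel=panel.superpose, groups=target, auto.key=TRUE)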
Paste these lines into the script window, highlight them, and click submit.

lattice is a very powerful plotting package, including xyplot and other functions. For documentation and examples, enter the commands:
help(xyplot)
example(xyplot)

Scatterplot matrix
A scatterplot matrix is a good way to look at the relationships between pairs of variables. Select the possum data again from the DAAG package. From the menu, select "Graphs" and then "Scatterplot matrix" and then fill in the dialog box.
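
A scatterplot matrix can also be drawn from the script window with the base R function pairs. The sketch below assumes the DAAG package is installed. The four variables and the colors are an illustrative choice, not the selection made in the dialog box.

```r
library(DAAG)    # provides the possum data frame

# Illustrative choice of measurement variables
vars <- c("hdlngth", "totlngth", "footlgth", "earconch")

# One panel for every pair of variables, with points colored by sex
pairs(possum[, vars], col = ifelse(possum$sex == "f", "red", "blue"))

# Pairwise correlations give a numerical companion to the plot
print(round(cor(possum[, vars], use = "pairwise.complete.obs"), 2))
```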


XY conditioning plot
Still using the possum data, from the menu, select "Graphs" and then "XY conditioning plot". In the box select "age" as an explanatory variable, "footlgth", "hdlngth" and "totlngth" as response variables, and group on "sex".


Line graph
Select the jobs data from the DAAG package. From the menu, select "Graphs" and then "Line graph". In the box, select "Date" as the x variable, "Atlantic" and "Prairies" as the y variables, and check "Plot legend".
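
A comparable line graph can be sketched in base R with matplot, which plots several columns against a common x variable. This assumes the DAAG package is installed; it is not the command that R Commander generates, and the axis label and legend position are assumptions.

```r
library(DAAG)    # provides the jobs data frame

# Two regional series against Date, distinguished by line type
regions <- c("Atlantic", "Prairies")
matplot(jobs$Date, jobs[, regions], type = "l", lty = 1:2, col = 1,
        xlab = "Date", ylab = "Number of jobs")
legend("topleft", legend = regions, lty = 1:2)
```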

Time series plot
We will use the data on monthly deaths from lung diseases in the UK to illustrate time series plots. These plots are not available from the menu, so we will use commands. Paste the following commands into the script window, highlight them, and click "Submit". You can obtain a description of these commands using help(ts.plot) or help(legend).
ts.plot(ldeaths,mdeaths,fdeaths,gpars=list(xlab="year",ylab="deaths",lty=c(1:3)))
legend("topright",c("Overall Deaths","Male Deaths","Female Deaths"),lty=c(1:3))

Symbol plot, identifying points
A symbol plot uses the size of the symbol to indicate the value of a third variable. Here we will plot price vs. highway mpg for 1993 cars, showing engine size by the size of the circle. Paste the following commands into the script window, highlight them, and click "Submit". You can identify points on the plot by clicking the left mouse button. Note that the continuation line of the symbols command must be indented.
library(MASS)     # Cars93 is in the MASS package
attach(Cars93)
symbols(MPG.highway,Price,circles=sqrt(EngineSize),xlab="miles per gallon",ylab="Price",inches=0.25,
     main="Area of Circle Proportional to Engine Size")     # circles= gives radii, so use sqrt for area
identify(MPG.highway,Price,labels=Make,tolerance=5)     # same x and y as the plot
detach(Cars93)

Star plot
Star plots display multivariate data. Paste the following commands into the script window, highlight them and click "Submit". A description of the dataset is available by submitting the command help(mtcars).
stars(mtcars[,1:7],key.loc=c(14,1.8),main="Motor Vehicle Performance",flip.labels=FALSE,draw.segments=TRUE)

Contour plot, overlaying an image
Paste the following commands into the script window, highlight them and click "Submit".
z <- volcano
x <- 10*(1:nrow(z)) # 10m spacing (S to N)
y <- 10*(1:ncol(z)) # 10m spacing (E to W)
image(x,y,z,main="Mt Eden")
contour(x,y,z,add=TRUE)


Exercises

  1. Graphically explore the possum ear conch length (earconch) data from the DAAG package. Are the data bimodal (two peaks), and if so, what factor likely caused this effect?
  2. Import the tomatoes data from ftp://ftpext.usgs.gov/pub/er/md/laurel/BRDScience/learnR2Tomatoes.xls (from ex10.22 in Devore5 package). The data include the tomato yields at four levels of salinity, as measured by electrical conductivity (EC) and a factor (ECf) representing four different levels of EC.
    1. Obtain a scatterplot of yield against EC.
    2. Obtain side by side boxplots of yield for each level of EC.
    3. Comment upon whether the yields are more effectively analyzed using EC as a quantitative or qualitative factor.

Thanks to USGS Fort Collins Science Center for hosting this page for the USGS Biology Science Staff.

URL: http://www.fort.usgs.gov/BRDScience/LearnR3.aspx   Page Contact Information: Paul_Geissler@usgs.gov