Regression Analysis
by Patrick Rasmussen, Victoria Christensen, Xiaodong Jian, and Andrew Ziegler
Contents
Development of Regression Equations to Estimate Constituent Concentrations
Uncertainty Analysis
Calculation of Estimated Constituent Loads
Development of Regression Equations to Estimate Constituent Concentrations
The concentrations of constituents in surface water often are strongly related to other constituent concentrations and to factors such as hydrologic conditions, season, and location. It is possible to express one constituent concentration in terms of one or more other constituents using simple regression models (Helsel and Hirsch, 1992). A statistical relation between constituent measurements does not necessarily mean that the independent variable causes the concentrations of the dependent variable. Linear regression was used for this study because the estimators of the parameters are given by explicit mathematical expressions. The simplest regression equation can be expressed as:
$$y_i = m x_i + b + e_i, \qquad i = 1, 2, \ldots, n \qquad (1)$$
where
$y_i$ is the ith observation of the dependent (or response) variable;
$m$ is the slope;
$x_i$ is the ith observation of the independent (or explanatory) variable;
$b$ is the intercept;
$e_i$ is the random error for the ith observation; and
$n$ is the sample size.
The terms $m$ and $b$ represent the parameters that need to be estimated from the data set. The most common estimation technique is least squares (Helsel and Hirsch, 1992). In least-squares estimation, the error term, $e_i$, is assumed to be normally distributed with a mean equal to zero and constant variance, $\sigma^2$.
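As a minimal sketch of the least-squares estimates for equation 1, the slope and intercept can be computed from the sums of squares and cross products. The function name `fit_line` and the data values are hypothetical and are not taken from the study.

```python
from statistics import fmean


def fit_line(x, y):
    """Return the least-squares slope m and intercept b for y = m*x + b."""
    x_bar, y_bar = fmean(x), fmean(y)
    # Sum of squares of x and sum of xy cross products about the means
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = s_xy / ss_x
    b = y_bar - m * x_bar
    return m, b


# Hypothetical explanatory (x) and response (y) observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
m, b = fit_line(x, y)
print(f"slope = {m:.3f}, intercept = {b:.3f}")
```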
The first step in developing an effective regression model for a specific surface-water site was to plot each possible explanatory variable against the response variable and examine patterns in the data. Some explanatory and response variables (except time) were log transformed. Log transformations of variables can eliminate curvature and simplify analysis of the data (Ott, 1993, p. 454).
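The effect of a log transformation can be illustrated with a short sketch: a curved power-law relation becomes a straight line once both variables are log transformed. The base-10 logarithm, the function name `fit_log_log`, and the data are assumptions made for illustration only.

```python
from math import log10
from statistics import fmean


def fit_log_log(x, y):
    """Fit log10(y) = m*log10(x) + b by least squares and return (m, b)."""
    lx = [log10(v) for v in x]
    ly = [log10(v) for v in y]
    lx_bar, ly_bar = fmean(lx), fmean(ly)
    ss_x = sum((v - lx_bar) ** 2 for v in lx)
    s_xy = sum((u - lx_bar) * (v - ly_bar) for u, v in zip(lx, ly))
    m = s_xy / ss_x
    return m, ly_bar - m * lx_bar


# Hypothetical curved relation y = 3 * x**2, which is linear in log space
x = [1.0, 10.0, 100.0]
y = [3.0, 300.0, 30000.0]
slope, intercept = fit_log_log(x, y)
print(f"log-space slope = {slope:.3f}, intercept = {intercept:.3f}")
```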
Next, to determine which explanatory variable or variables to include in the regression model for each constituent of concern, an overall model-building method (Helsel and Hirsch, 1992, p. 312–314) was used. The possible explanatory variables included each of the cross-section-averaged sensor measurements (specific conductance, pH, water temperature, turbidity, and dissolved oxygen) from the multiparameter monitor, streamflow, stage, and time. All possible regression models were evaluated. Explanatory variables were considered significant if the p-value (probability value) was less than 0.05. If there were several acceptable models (p-value less than 0.05), the one with the lowest PRESS statistic was chosen. Minimizing PRESS (an acronym for "PRediction Error Sum of Squares") means that the equation produces the least error when making new predictions (Helsel and Hirsch, 1992, p. 248). Additionally, explanatory variables were included in a model only if there was a physical basis or explanation for their inclusion.
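The PRESS comparison described above can be sketched for single-variable models by leaving out each observation in turn, refitting, and summing the squared prediction errors. This sketch covers only the PRESS step; the p-value screening and the physical-basis check are not shown. The function names and all data values are hypothetical.

```python
from statistics import fmean


def fit_line(x, y):
    """Least-squares slope and intercept for y = m*x + b."""
    x_bar, y_bar = fmean(x), fmean(y)
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = s_xy / ss_x
    return m, y_bar - m * x_bar


def press(x, y):
    """Leave each point out, refit, and sum the squared prediction errors."""
    total = 0.0
    for i in range(len(x)):
        m, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (m * x[i] + b)) ** 2
    return total


# Hypothetical response and two hypothetical candidate explanatory variables
concentration = [2.0, 4.1, 5.9, 8.2, 9.9]
candidates = {
    "turbidity": [1.0, 2.0, 3.0, 4.0, 5.0],
    "specific_conductance": [5.0, 1.0, 4.0, 2.0, 3.0],
}
press_by_variable = {name: press(x, concentration) for name, x in candidates.items()}
best = min(press_by_variable, key=press_by_variable.get)
print(press_by_variable, "lowest PRESS:", best)
```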
In addition to the PRESS, two common diagnostic statistics were used to evaluate the regression models. These statistics are the mean square error (MSE) and the coefficient of determination (R²). MSE is a dimensional measure and is calculated using equation 2:
$$\mathrm{MSE} = \frac{\sum_{i=1}^{n} \left[ y_i - E(y_i) \right]^2}{n - k - 1} \qquad (2)$$
where
$y_i$ represents the value of y at the ith data point;
$E(y_i)$ is the estimated value of y at the ith data point (where $E(y_i) = m x_i + b$);
$n$ is the number of samples; and
$k$ is the number of explanatory variables in the model.
The MSE is presented for each equation to assess the variance between predicted and observed values. Dimensionless measures often are required in practice for the purpose of comparing constituents or properties with different dimensions (units of measure). A dimensionless measure of fitting y on x is the R², or the fraction of the variance explained by the regression, calculated using equation 3:
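$$R^2 = 1.0 - \frac{SSE}{SS_y} \qquad (3)$$
SSE (error sum of squares) and $SS_y$ (sum of squares of y) are calculated using equations 4 and 5:
$$SSE = \sum_{i=1}^{n} \left[ y_i - E(y_i) \right]^2 \qquad (4)$$
$$SS_y = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 \qquad (5)$$
where
$\bar{y}$ is the mean of y.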
The larger the explained variability is compared to the unexplained variability, the better the equation fits the data, and this should lead to a more precise prediction of y (Ott, 1993). The R² ranges from 0 to 1 and often is called the multiple coefficient of determination in multiple linear regression.
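The two diagnostic statistics can be sketched as follows. The MSE denominator n - k - 1 is an assumption here (the usual residual degrees of freedom for k explanatory variables plus an intercept); the function name and data are hypothetical.

```python
def mse_and_r2(y, y_est, k=1):
    """Return (MSE, R^2) for observed y and regression-estimated y_est.

    k is the number of explanatory variables; n - k - 1 degrees of freedom
    are assumed for the MSE.
    """
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - ei) ** 2 for yi, ei in zip(y, y_est))  # error sum of squares
    ss_y = sum((yi - y_bar) ** 2 for yi in y)              # sum of squares of y
    return sse / (n - k - 1), 1.0 - sse / ss_y


# Hypothetical observations and estimates from y = 1.99*x + 0.05, x = 1..5
observed = [2.1, 3.9, 6.2, 7.8, 10.1]
estimated = [1.99 * xi + 0.05 for xi in (1.0, 2.0, 3.0, 4.0, 5.0)]
mse, r2 = mse_and_r2(observed, estimated)
print(f"MSE = {mse:.4f}, R2 = {r2:.4f}")
```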
Graphical plots were constructed to determine linearity. For certain equations, the independent variable, the dependent variable, or both were transformed so that all equations presented herein are linear. For example, taking the natural log of an independent variable can yield a simple linear equation. Outliers were identified graphically and generally were those points where either the dependent variable or independent variable was more than two standard deviations away from the mean. Each outlier was investigated to determine a possible reason for its occurrence. In all cases, only local outliers (for example, more than two standard deviations away from the dependent variable mean for a fixed range of the independent variable) were removed from the data set. Linear regression equations determined with the least-squares method were developed using the SAS System statistical software (SAS Institute Inc., 1996, release 6.12) or S-Plus 2000 statistical software (Math Soft, Inc., 1999).
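The two-standard-deviation rule of thumb can be sketched numerically, although the study identified outliers graphically and removed only local outliers after investigation. The sketch below applies the rule to one set of values (for example, the dependent variable within a fixed range of the independent variable); the use of the sample standard deviation, the function name, and the data are assumptions.

```python
from statistics import fmean, stdev


def flag_outliers(values, n_sd=2.0):
    """Flag values more than n_sd sample standard deviations from the mean."""
    mean, sd = fmean(values), stdev(values)
    return [abs(v - mean) > n_sd * sd for v in values]


# Hypothetical measurements; the last value is far from the others
values = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0, 10.0, 40.0]
flags = flag_outliers(values)
print([v for v, is_outlier in zip(values, flags) if is_outlier])
```

Flagged points are candidates for investigation only; the rule by itself does not justify removal.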
Uncertainty Analysis
The statistical techniques (regression analysis) used to develop the equations displayed on this page have uncertainty associated with each of the estimates. The water-quality constituent concentrations are estimated statistically from their relation to the explanatory variables (real-time measurements) used in each equation; this relation does not necessarily mean that the explanatory variables cause changes in the estimated concentration. Also, the samples collected from the stream and analyzed at the laboratory may have large analytical errors. Each of the explanatory variables in the equations has measurement errors or uncertainties that carry over into any subsequent estimates made with those measured values.
Uncertainty can be defined in a number of different ways. Relative mean absolute error (RMAE) and prediction intervals were chosen for the description of uncertainty in the real-time water-quality studies described on these web pages. The RMAE, expressed as a percentage, is calculated using equation 6:
(6) |
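Equation 6 is not reproduced in the text above, so the following is only a sketch under an assumed definition: the mean absolute error divided by the mean of the measured values, expressed as a percentage. Other definitions of RMAE exist, and the form used in the study may differ. The function name and data are hypothetical.

```python
def rmae_percent(y, y_est):
    """Relative mean absolute error, in percent (assumed definition).

    Assumes RMAE = 100 * mean(|y_i - E(y_i)|) / mean(y).
    """
    n = len(y)
    mean_abs_error = sum(abs(yi - ei) for yi, ei in zip(y, y_est)) / n
    return 100.0 * mean_abs_error / (sum(y) / n)


# Hypothetical measured and regression-estimated concentrations
measured = [10.0, 20.0, 30.0]
estimated = [12.0, 18.0, 33.0]
rmae = rmae_percent(measured, estimated)
print(f"RMAE = {rmae:.1f} percent")
```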
Each in-stream measurement has uncertainty or error that can be expressed as the difference between the measured concentration and the actual concentration. The following table expresses the measurement uncertainty for each of the in-stream measurements. As shown in table 1, part of the uncertainty for estimated concentrations is caused by the uncertainty of measuring the variables in the stream. Links are provided to the documentation of these errors and the acceptance criteria used by USGS. The vast majority of the time these errors are less than 10 percent.
Table 1. Measurement uncertainty for variables used in the regression analysis
| Variable | Defined calibration criteria | Acceptance criteria (maximum allowable limits for water-quality sensor measurements) |
| --- | --- | --- |
| | ± 3 to 6 percent | -- |
| Water temperature | ± 0.2 degrees Celsius | ± 2.0 degrees Celsius |
| | ± 3 percent | ± 30 percent |
| pH | ± 0.2 pH standard units | ± 2.0 pH standard units |
| Dissolved oxygen | ± 0.3 milligrams per liter or 5 percent, whichever is smaller | ± 2.0 milligrams per liter or 20 percent, whichever is larger |
| | ± 5 percent | ± 30 percent |
| Fluorescence/chlorophyll | ± 5 percent | ± 30 percent |
Prediction intervals also were determined to evaluate the uncertainty of the estimates using the regression model (Helsel and Hirsch, 1992). Prediction intervals define a range of values for the dependent variable for a given level of uncertainty. For this web page, 90-percent prediction intervals were determined for each model. For given values of the independent variable or variables, the 90-percent prediction interval represents the range of values expected for the dependent variable 90 percent of the time. The larger the range of values, the more uncertainty is associated with the regression model. The prediction interval for a single response $y_i$ is:
$$E(y_i) \pm t_{\alpha/2,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_x}} \qquad (7)$$
where
$E(y_i)$ is the regression-estimated value at $x_i$;
$t$ is the value of the Student's t distribution having $n-2$ degrees of freedom with an exceedance probability of $\alpha/2$ (value obtained from t tables in the appendix of most statistics textbooks);
$s$ is the standard error of regression calculated using equation 8;
$n$ is the number of samples;
$x_i$ is a specified value of x;
$\bar{x}$ is the mean (average) of x; and
$SS_x$ is the sum of squares of x calculated using equation 10.
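$$s = \sqrt{\frac{SS_y - b_1 S_{xy}}{n - 2}} \qquad (8)$$
where
$SS_y$ is the sum of squares of y;
$b_1$ is the estimate of the slope ($m$ in equation 1);
$S_{xy}$ is the sum of xy cross products, calculated using equation 9; and
$n$ is the number of samples.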
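$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \qquad (9)$$
where
$x_i$ represents the value of x at the ith data point;
$\bar{x}$ is the mean of x;
$y_i$ represents the value of y at the ith data point; and
$\bar{y}$ is the mean of y.
$$SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (10)$$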
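The prediction-interval calculation can be sketched for a single explanatory variable using the standard textbook form (Helsel and Hirsch, 1992): the estimate plus or minus t times s times the square root of 1 + 1/n + (x - mean of x)²/SSx. The t value is passed in as a parameter because, as noted above, it is read from a t table; 2.353 is the tabulated value for 3 degrees of freedom and an exceedance probability of 0.05. The function name and data are hypothetical.

```python
from math import sqrt


def prediction_interval(x, y, x_new, t_value):
    """Return (lower, upper) prediction limits for a single response at x_new."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / ss_x                         # slope estimate
    b0 = y_bar - b1 * x_bar                  # intercept estimate
    s = sqrt((ss_y - b1 * s_xy) / (n - 2))   # standard error of regression
    y_est = b0 + b1 * x_new
    half_width = t_value * s * sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / ss_x)
    return y_est - half_width, y_est + half_width


# Hypothetical data: n = 5, so n - 2 = 3 degrees of freedom
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
lower, upper = prediction_interval(x, y, x_new=3.0, t_value=2.353)
far_lower, far_upper = prediction_interval(x, y, x_new=6.0, t_value=2.353)
print(f"90-percent prediction interval at x = 3: {lower:.2f} to {upper:.2f}")
```

The interval widens as the specified x moves away from the mean of x, which is why estimates near the edges of the calibration data carry more uncertainty.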
Although prediction intervals are good indicators of uncertainty, a range of values is not very useful for determining the water quality of a stream. Probability of exceedance provides water managers with a single value for decisionmaking. Probabilities of exceeding a designated water-quality criterion were determined for each regression model as follows:
Prob ($E(y_i)$ > Criterion) = 1 – the area below the standard normal curve for a value greater than Z, (11)
where
Z is calculated using equation 12;
$E(y_i)$ is the regression-estimated value at $x_i$; and
Criterion is the standard for the constituent of concern.
(12) |
The area under the standard normal curve can be obtained from any statistics textbook that has a table for upper-tailed areas for the standard normal curve.
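Equation 12 is not reproduced in the text above, so the sketch below assumes Z is the difference between the criterion and the regression estimate, divided by a standard error s, with the criterion, the estimate, and s all in the same (possibly log-transformed) units. Under that assumption the exceedance probability is the upper-tail area of the standard normal curve beyond Z. The function name and numeric values are hypothetical.

```python
from statistics import NormalDist


def prob_exceed(y_est, criterion, s):
    """Probability that the true value exceeds the criterion.

    Assumes Z = (criterion - y_est) / s and normally distributed errors.
    """
    z = (criterion - y_est) / s
    return 1.0 - NormalDist().cdf(z)  # upper-tail area beyond Z


# Hypothetical estimate, criterion, and standard error
p_at_criterion = prob_exceed(y_est=2.0, criterion=2.0, s=0.5)
p_above = prob_exceed(y_est=3.0, criterion=2.0, s=0.5)
print(f"{p_at_criterion:.2f}, {p_above:.3f}")
```

`NormalDist().cdf` takes the place of the textbook table lookup.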
Calculation of Estimated Constituent Loads
Annual mean daily loads (AMDL) of constituents are calculated from the available hourly loads for a given year using equation 13:
$$\mathrm{AMDL} = \frac{\sum_{i=1}^{n} C_i\, Q_i\, CF}{n} \qquad (13)$$
where
$C_i$ is the instantaneous concentration at the ith time;
$Q_i$ is the instantaneous streamflow at the ith time;
$CF$ is the conversion factor (table 2); and
$n$ is the number of available hourly values for a given year (8,760 maximum).
Table 2. Conversion factors
| Multiply | By | By | To obtain |
| --- | --- | --- | --- |
| colonies per 100 milliliters (col/100 mL) | 0.02447 | streamflow, in ft³/s | billion colonies per day |
| micrograms per liter (µg/L) | 0.00539 | streamflow, in ft³/s | pounds per day |
| milligrams per liter (mg/L) | 5.39 | streamflow, in ft³/s | pounds per day |
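The load calculation can be sketched with the conversion factors listed above. The sketch assumes the annual mean daily load is the average of the available hourly values of concentration times streamflow times conversion factor, and that hours with a missing concentration or streamflow are skipped. The function name, dictionary keys, and data are hypothetical.

```python
# Conversion factors from the table above; streamflow in cubic feet per second
CONVERSION_FACTORS = {
    "col/100 mL": 0.02447,  # gives billion colonies per day
    "ug/L": 0.00539,        # gives pounds per day
    "mg/L": 5.39,           # gives pounds per day
}


def annual_mean_daily_load(concentrations, streamflows, units):
    """Average the available hourly loads (C * Q * CF) for one year."""
    cf = CONVERSION_FACTORS[units]
    loads = [
        c * q * cf
        for c, q in zip(concentrations, streamflows)
        if c is not None and q is not None  # skip hours with missing data
    ]
    return sum(loads) / len(loads)


# Hypothetical hourly values; the third hour has no concentration estimate
conc_mg_per_l = [2.0, 4.0, None]
flow_cfs = [100.0, 100.0, 120.0]
amdl = annual_mean_daily_load(conc_mg_per_l, flow_cfs, "mg/L")
print(f"AMDL = {amdl:.0f} pounds per day")
```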
For equations where the response and explanatory variables were log transformed, retransformation of regression-estimated concentrations was necessary. However, retransformation can cause an underestimation of chemical loads when adding individual load estimates over a long period of time. Multiplying the annual load calculation by a bias correction factor (BCF) (Duan, 1983) corrects for this underestimation. Cohn and others (1989), Gilroy and others (1990), and Hirsch and others (1993) provide additional information on interpreting the results of regression-based load estimates. Calculation of the BCF is shown in equation 14:
$$\mathrm{BCF} = \frac{\sum_{i=1}^{n} 10^{e_i}}{n} \qquad (14)$$
where
$e_i$ is the residual or the difference between each measured and estimated bacteria density, in log units; and
$n$ is the number of samples.
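A minimal sketch of a Duan (1983) smearing-type bias correction factor follows. It assumes the residuals are in base-10 log units, so each residual is retransformed as 10 raised to the residual before averaging; for natural-log models the base would be e instead. The function name, the `base` parameter, and the data are hypothetical.

```python
def bias_correction_factor(measured_log, estimated_log, base=10.0):
    """Mean of the retransformed residuals (smearing estimate).

    Residuals are measured minus estimated values in log units; base 10 is
    an assumption and should match the log transformation used in the model.
    """
    residuals = [m - e for m, e in zip(measured_log, estimated_log)]
    return sum(base ** r for r in residuals) / len(residuals)


# Hypothetical log10 values of measured and estimated densities
measured_log = [2.1, 2.9, 3.4, 4.0]
estimated_log = [2.0, 3.0, 3.5, 3.9]
bcf = bias_correction_factor(measured_log, estimated_log)
print(f"BCF = {bcf:.3f}")
```

Because retransformed residuals average to more than 1 whenever the log-space residuals are scattered about zero, the factor raises the retransformed load estimate.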
U.S. Department of the Interior || U.S. Geological Survey
URL: http://ks.water.usgs.gov/rtqw/regression.shtml.
Last modified: May 5, 2003