
Regression Analysis

by Patrick Rasmussen, Victoria Christensen, Xiaodong Jian, and Andrew Ziegler

Contents

Development of Regression Equations to Estimate Constituent Concentrations
Uncertainty Analysis
Calculation of Estimated Constituent Loads

Development of Regression Equations to Estimate Constituent Concentrations

The concentrations of constituents in surface water often are strongly related to other constituent concentrations and to factors such as hydrologic conditions, season, and location. It is possible to express one constituent concentration in terms of another constituent or other constituents using simple regression models (Helsel and Hirsch, 1992). Although constituent measurements may be related statistically, a statistical relation does not necessarily mean that the independent variable causes the concentrations of the dependent variable to occur. Linear regression was used for this study because the estimators of the parameters are obtained from an explicit mathematical expression. The simplest regression equation can be expressed as:

$$y_i = m x_i + b + e_i, \qquad i = 1, 2, \ldots, n \tag{1}$$

where

$y_i$ is the $i$th observation of the dependent (or response) variable;

$m$ is the slope;

$x_i$ is the $i$th observation of the independent (or explanatory) variable;

$b$ is the intercept;

$e_i$ is the random error for the $i$th observation; and

$n$ is the sample size.

The terms m and b represent the parameters that need to be estimated from the data set. The most common estimation technique is least squares (Helsel and Hirsch, 1992). In least-squares estimation, the error term, $e_i$, is assumed to be normally distributed with a mean equal to zero and constant variance, $\sigma^2$.
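As an illustration of least-squares estimation, the following minimal sketch computes the slope and intercept from their closed-form expressions. The function name `fit_line` and the sample values are hypothetical and are not taken from the study data.

```python
def fit_line(x, y):
    """Return (m, b) that minimize the sum of squared residuals of y = m*x + b."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of x and sum of xy cross products about the means.
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = s_xy / ss_x          # least-squares slope
    b = y_bar - m * x_bar    # least-squares intercept
    return m, b


# Hypothetical explanatory (x) and response (y) values, for illustration only.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]
m, b = fit_line(x_values, y_values)
print(f"slope = {m:.2f}, intercept = {b:.2f}")
```

For variables that were log transformed, the same function would be applied to the transformed values.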

The first step in developing an effective regression model for a specific surface-water site was to plot each possible explanatory variable against the response variable and examine patterns in the data. Some explanatory and response variables (except time) were log transformed. Log transformations of variables can eliminate curvature and simplify analysis of the data (Ott, 1993, p. 454).

Next, to determine which explanatory variable or variables to include in the regression model for each constituent of concern, an overall model-building method (Helsel and Hirsch, 1992, p. 312–314) was used. The possible explanatory variables included each of the cross-section-averaged sensor measurements (specific conductance, pH, water temperature, turbidity, and dissolved oxygen) from the multiparameter monitor, streamflow, stage, and time. All possible regression models were evaluated. Explanatory variables were considered significant if the p-value (probability value) was less than 0.05. If there were several acceptable models (p-value less than 0.05), the one with the lowest PRESS statistic was chosen. Minimizing PRESS (acronym for "PRediction Error Sum of Squares") means that the equation produces the least error when making new predictions (Helsel and Hirsch, 1992, p. 248). Additionally, explanatory variables were included in a model only if there was a physical basis or explanation for their inclusion.
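The PRESS statistic can be sketched for a one-variable model as follows. This sketch relies on the standard identity that the leave-one-out prediction residual equals the ordinary residual divided by one minus the leverage, so the model does not have to be refitted for each omitted point. The function name `press_statistic` and the sample values are hypothetical; the study itself used SAS or S-Plus.

```python
def press_statistic(x, y):
    """PRESS for the simple model y = m*x + b fitted by least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = s_xy / ss_x
    b = y_bar - m * x_bar

    press = 0.0
    for xi, yi in zip(x, y):
        residual = yi - (m * xi + b)
        # Leverage of point i in simple linear regression.
        leverage = 1.0 / n + (xi - x_bar) ** 2 / ss_x
        # Prediction residual: error at point i when the model is
        # fitted without point i.
        press += (residual / (1.0 - leverage)) ** 2
    return press


# Hypothetical data, for illustration only.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]
print(press_statistic(x_values, y_values))
```

Among candidate models that all pass the significance test, the one with the smallest returned value would be preferred.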

In addition to the PRESS, two common diagnostic statistics were used to evaluate the regression models. These statistics are the mean square error (MSE) and the coefficient of determination (R²). MSE is a dimensional measure and is calculated using equation 2:
$$\mathrm{MSE} = \frac{\sum_{i=1}^{n} \left( y_i - E(y_i) \right)^2}{n - k - 1} \tag{2}$$

where

$y_i$ represents the value of y at the $i$th data point;

$E(y_i)$ is the estimated value of y at the $i$th data point (where $E(y_i) = m x_i + b$);

$n$ is the number of samples; and

$k$ is the number of explanatory variables in the model.

 

The MSE is presented for each equation to assess the variance between predicted and observed values. Dimensional measures often are required in practice for the purpose of comparing constituents or properties with different dimensions (units of measure). A dimensionless measure of fitting y on x is the R², or the fraction of the variance explained by the regression, calculated using equation 3:

$$R^2 = 1.0 - \frac{SSE}{SS_y} \tag{3}$$

SSE (error sum of squares) and SSy (sums of squares y) are calculated using equations 4 and 5:

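$$SSE = \sum_{i=1}^{n} \left( y_i - E(y_i) \right)^2 \tag{4}$$

$$SS_y = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 \tag{5}$$

where

$\bar{y}$ is the mean of y.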

The larger the explained variability is compared to the unexplained variability, the better the equation fits the data, and this should lead to a more precise prediction of y (Ott, 1993). The R² ranges from 0 to 1 and often is called the multiple coefficient of determination in multiple linear regression.
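The two diagnostic statistics can be computed together, as in the minimal sketch below. It assumes that the MSE divisor is n − k − 1, the residual degrees of freedom for a model with an intercept and k explanatory variables; the function name `fit_diagnostics` and the sample values are hypothetical.

```python
def fit_diagnostics(y_observed, y_estimated, k=1):
    """Return (MSE, R squared) for observed and regression-estimated values.

    k is the number of explanatory variables in the model. The divisor
    n - k - 1 is an assumption of this sketch.
    """
    n = len(y_observed)
    y_bar = sum(y_observed) / n
    # Error sum of squares (unexplained variability).
    sse = sum((yo - ye) ** 2 for yo, ye in zip(y_observed, y_estimated))
    # Total sum of squares of y about its mean.
    ss_y = sum((yo - y_bar) ** 2 for yo in y_observed)
    mse = sse / (n - k - 1)
    r_squared = 1.0 - sse / ss_y
    return mse, r_squared


# Hypothetical observed values and estimates from a fitted line y = 1.99x + 0.05.
observed = [2.1, 3.9, 6.2, 7.8, 10.1]
estimated = [1.99 * x + 0.05 for x in [1.0, 2.0, 3.0, 4.0, 5.0]]
mse, r_squared = fit_diagnostics(observed, estimated, k=1)
print(f"MSE = {mse:.4f}, R2 = {r_squared:.4f}")
```

MSE carries the squared units of the response variable, whereas R² is dimensionless, which is why both are reported.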

Graphical plots were constructed to determine linearity. For certain equations, either the independent variable, dependent variable, or both were transformed to convert all equations presented herein to linear equations. Log transformations of variables can eliminate curvature in the data and simplify the analysis of the data (Ott, 1993, p. 454). For example, by taking the natural log of an independent variable, it is possible to achieve a simple linear equation. Outliers were identified graphically and generally were those points where either the dependent variable or independent variable was more than two standard deviations away from the mean. Each outlier was investigated to determine a possible reason for its occurrence. In all cases, only local outliers (for example, more than two standard deviations away from the dependent variable mean for a fixed range of the independent variable) were removed from the data set. Linear regression equations determined with the least-squares method were developed using the SAS System statistical software (SAS Institute Inc., 1996, release 6.12) or S-Plus 2000 statistical software (MathSoft, Inc., 1999).
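The two-standard-deviation screening rule can be sketched as follows. This only flags candidate points for investigation; it does not reproduce the graphical identification or the local-outlier judgment described above. The use of the sample standard deviation, the function name `flag_candidate_outliers`, and the sample values are assumptions of this sketch.

```python
from statistics import mean, stdev


def flag_candidate_outliers(x, y, n_sd=2.0):
    """Return indices where x or y lies more than n_sd standard deviations
    from its own mean (sample standard deviation assumed)."""

    def far_from_mean(values):
        center = mean(values)
        spread = stdev(values)
        return [abs(v - center) > n_sd * spread for v in values]

    x_far = far_from_mean(x)
    y_far = far_from_mean(y)
    return [i for i, (fx, fy) in enumerate(zip(x_far, y_far)) if fx or fy]


# Hypothetical data: the last response value is far from the rest.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
print(flag_candidate_outliers(x_values, y_values))
```

Flagged points would still need to be examined individually before any decision to remove them.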


Uncertainty Analysis

 

The statistical analysis techniques (regression analysis) used to develop the equations displayed on this page have uncertainty associated with each of the estimates. The water-quality constituent concentrations are estimated statistically from their relation to the explanatory variables (real-time measurements) used in each equation; such a relation does not necessarily mean that the explanatory variables cause changes in the estimated concentration. Also, the samples collected from the stream and analyzed at the laboratory may have large analytical errors. Each of the explanatory variables in the equations has measurement errors or uncertainties that are included in any subsequent estimates made with those measured values.

Uncertainty can be defined in a number of different ways. Relative mean absolute error (RMAE)  and prediction intervals were chosen for the description of uncertainty in the real-time water-quality studies described on these web pages. The RMAE, expressed as a percentage, is calculated using equation 6:

(6)

Each in-stream measurement has uncertainty or error that can be expressed as the difference between the measured concentration and the actual concentration. The following table expresses the measurement uncertainty for each of the in-stream measurements. As shown in table 1, part of the uncertainty for estimated concentrations is caused by the uncertainty of measuring the variables in the stream. Links are provided to the documentation of these errors and the acceptance criteria used by USGS. The vast majority of the time these errors are less than 10 percent.

Table 1. Measurement uncertainty for variables used in the regression analysis

| Variable | Defined calibration criteria | Acceptance criteria (maximum allowable limits for water-quality sensor measurements) |
|---|---|---|
| Streamflow | ± 3 to 6 percent | -- |
| Water temperature | ± 0.2 degrees Celsius | ± 2.0 degrees Celsius |
| Specific conductance | ± 3 percent | ± 30 percent |
| pH | ± 0.2 pH standard units | ± 2.0 pH standard units |
| Dissolved oxygen | ± 0.3 milligrams per liter or 5 percent, whichever is smaller | ± 2.0 milligrams per liter or 20 percent, whichever is larger |
| Turbidity | ± 5 percent | ± 30 percent |
| Fluorescence/chlorophyll | ± 5 percent | ± 30 percent |

Prediction intervals also were determined to evaluate the uncertainty of the estimates using the regression model (Helsel and Hirsch, 1992). A prediction interval defines a range of values for the dependent variable for a given level of uncertainty. For this web page, 90-percent prediction intervals were determined for each model. For given values of the independent variable(s), the 90-percent prediction interval represents the range of values expected for the dependent variable 90 percent of the time. The larger the range of values, the more uncertainty is associated with the regression model. The prediction interval for a single response yi is:

$$E(y_i) - t\,s\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_x}} \;\le\; y_i \;\le\; E(y_i) + t\,s\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_x}}, \tag{7}$$

where

$E(y_i)$ is the regression-estimated value at $x_i$;

$t$ is the value of the Student's t distribution having $n-2$ degrees of freedom with an exceedance probability of $\alpha/2$ (value obtained from t tables in the appendix of most statistics textbooks);

$s$ is the standard error of regression calculated using equation 8;

$n$ is the number of samples;

$x_i$ is a specified value of x;

$\bar{x}$ is the mean (average) of x; and

$SS_x$ is the sum of squares x calculated using equation 10.

$$s = \sqrt{\frac{SS_y - b_1 S_{xy}}{n - 2}} \tag{8}$$

where

$SS_y$ is the sum of squares y;

$b_1$ is the estimate of the slope, m;

$S_{xy}$ is the sum of xy cross products, calculated using equation 9; and

$n$ is the number of samples.

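$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \tag{9}$$

where

$x_i$ represents the value of x at the $i$th data point;

$\bar{x}$ is the mean of x;

$y_i$ represents the value of y at the $i$th data point; and

$\bar{y}$ is the mean of y.

$$SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{10}$$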
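The prediction-interval calculation can be sketched from the quantities defined above: the standard error of regression from SSy, the slope estimate, and Sxy, and the interval half-width from t, s, n, and SSx. The t value is passed in by the caller, as it would be when read from a t table. The function name `prediction_interval` and the sample values are hypothetical.

```python
import math


def prediction_interval(x, y, x_new, t_value):
    """Return (lower, upper) prediction limits for a single response at x_new.

    t_value is the Student's t value for n - 2 degrees of freedom and an
    exceedance probability of alpha/2, supplied by the caller.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = s_xy / ss_x
    intercept = y_bar - slope * x_bar
    # Standard error of regression.
    s = math.sqrt((ss_y - slope * s_xy) / (n - 2))

    y_hat = intercept + slope * x_new
    half_width = t_value * s * math.sqrt(
        1.0 + 1.0 / n + (x_new - x_bar) ** 2 / ss_x
    )
    return y_hat - half_width, y_hat + half_width


# Hypothetical data; with n = 5 there are 3 degrees of freedom, and the
# tabulated t value for a 90-percent interval (alpha/2 = 0.05) is 2.353.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]
lower, upper = prediction_interval(x_values, y_values, x_new=3.0, t_value=2.353)
print(f"90-percent prediction interval: {lower:.2f} to {upper:.2f}")
```

The interval is narrowest at the mean of x and widens as the specified x moves away from it. If the model was fitted to log-transformed values, the limits are in log units.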

Although prediction intervals are good indicators of uncertainty, a range of values is not very useful for determining the water quality of a stream. Probability of exceedance provides water managers with a single value for decisionmaking. Probabilities of exceeding designated water-quality criteria were determined for each regression model as follows:

Prob($E(y_i)$ > Criterion) = 1 − the area below the standard normal curve for a value greater than Z, (11)

where

Z is calculated using equation 12;

$E(y_i)$ is the regression-estimated value at $x_i$; and

Criterion is the standard for the constituent of concern.

(12)

 

The area under the standard normal curve can be obtained from any statistics textbook that has a table for upper-tailed areas for the standard normal curve. 
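As an alternative to a printed table, the upper-tailed area of the standard normal curve can be computed directly. The minimal sketch below takes a Z value that has already been calculated and does not reproduce equation 12; the function name `upper_tail_area` is hypothetical.

```python
from statistics import NormalDist


def upper_tail_area(z):
    """Area under the standard normal curve to the right of z."""
    return 1.0 - NormalDist().cdf(z)


# Hypothetical Z values, for illustration only.
for z in (0.0, 1.0, 1.645):
    print(f"Z = {z:5.3f}  upper-tailed area = {upper_tail_area(z):.4f}")
```

`NormalDist` is part of the Python standard library (version 3.8 and later), so no additional packages are needed.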


Calculation of Estimated Constituent Loads

Annual mean daily loads (AMDL) of constituents are calculated by summation of the available hourly loads for a given year, using equation 13:

$$\mathrm{AMDL} = \frac{1}{n} \sum_{i=1}^{n} C_i \, Q_i \, CF \tag{13}$$

where

$C_i$ is the instantaneous concentration at the $i$th time;

$Q_i$ is the instantaneous streamflow at the $i$th time;

$CF$ is the conversion factor (table 2); and

$n$ is the number of available hourly values for a given year (8,760 maximum).

 

Table 2. Conversion factors used in calculation of measured and estimated loads

[ft³/s, cubic feet per second]

| Multiply | By | By | To obtain |
|---|---|---|---|
| colonies per 100 milliliters (col/100 mL) | 0.02447 | streamflow, in ft³/s | billion colonies per day |
| micrograms per liter (µg/L) | 0.00539 | streamflow, in ft³/s | pounds per day |
| milligrams per liter (mg/L) | 5.39 | streamflow, in ft³/s | pounds per day |
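The conversion factors in table 2 follow from unit arithmetic, and the load calculation can be sketched in a few lines. The sketch assumes that the annual mean daily load is the average of the available hourly instantaneous loads, each expressed as a daily rate, and that missing hourly values are simply skipped. The function names and sample values are hypothetical.

```python
LITERS_PER_CUBIC_FOOT = 28.316846592
SECONDS_PER_DAY = 86400
MILLIGRAMS_PER_POUND = 453592.37


def mg_per_liter_cfs_to_pounds_per_day():
    """Factor converting (mg/L) x (ft3/s) to pounds per day."""
    return LITERS_PER_CUBIC_FOOT * SECONDS_PER_DAY / MILLIGRAMS_PER_POUND


def annual_mean_daily_load(concentrations, streamflows, conversion_factor):
    """Average of instantaneous loads C_i * Q_i * CF over available values.

    Pairs in which either value is None (missing) are skipped.
    """
    loads = [
        c * q * conversion_factor
        for c, q in zip(concentrations, streamflows)
        if c is not None and q is not None
    ]
    if not loads:
        raise ValueError("no hourly values available")
    return sum(loads) / len(loads)


print(round(mg_per_liter_cfs_to_pounds_per_day(), 2))  # matches the 5.39 in table 2

# Hypothetical hourly concentrations (mg/L) and streamflows (ft3/s);
# the second concentration is missing.
conc = [10.0, None, 8.0]
flow = [100.0, 150.0, 90.0]
print(annual_mean_daily_load(conc, flow, 5.39))  # pounds per day
```

The same function applies to the other rows of table 2 by passing the corresponding conversion factor.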

For equations where the response and explanatory variables were log transformed, retransformation of regression-estimated concentrations was necessary. However, retransformation can cause an underestimation of chemical loads when adding individual load estimates over a long period of time. Multiplying the annual load calculation by a bias correction factor (BCF) (Duan, 1983) allows correction for this underestimation. Cohn and others (1989), Gilroy and others (1990), and Hirsch and others (1993) provide additional information on interpreting the results of regression-based load estimates. Calculation of the BCF is shown in equation 14:

(14)

where

$e_i$ is the residual or the difference between each measured and estimated bacteria density, in log units; and

$n$ is the number of samples.
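Duan's (1983) bias correction is commonly computed as the mean of the retransformed residuals, which can be sketched as follows. The base of the logarithm is an assumption here: the text states only that residuals are in log units, so the sketch takes the base as a parameter (10 for common logs, `math.e` for natural logs). The function name `duan_bcf` and the sample residuals are hypothetical.

```python
import math


def duan_bcf(residuals, base=10.0):
    """Mean of the retransformed residuals (Duan's smearing estimator).

    residuals are in log units; base must match the logarithm used in the
    regression (assumed, not stated in the text).
    """
    return sum(base ** e for e in residuals) / len(residuals)


# Hypothetical residuals in log units, for illustration only.
log_residuals = [-0.20, -0.05, 0.00, 0.10, 0.15]
print(duan_bcf(log_residuals, base=10.0))
print(duan_bcf(log_residuals, base=math.e))
```

The factor equals 1 when every residual is zero. Otherwise it is greater than 1 when the residuals average to zero, as least-squares residuals do, so multiplying the retransformed estimates by it raises them to offset the underestimation described above.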


U.S. Department of the Interior || U.S. Geological Survey
URL: http://ks.water.usgs.gov/rtqw/regression.shtml. 
Last modified: May 5, 2003