Data Analysis Perspectives

The Torpedo that Sank Least Squares: The Data Strike Back, Part 1

By Richard Herrington, Ph.D., Data Science Consultant, Data Science and Analytics

Many different parameter estimation methods are available to data analysts, and chief among them are the least-squares-based methods. 
Classical least-squares multiple regression is a favorite among social scientists. Unfortunately, without very large sample sizes (and in
some instances, even with large sample sizes), these methods can run into trouble.

Un-modeled nonlinearities, non-normal residuals, non-uniform selection probabilities of sample cases, outlying values on
both the x-predictor and y-outcome variables, hidden mixtures of subgroups, and the non-fixed nature of the x-predictors
(random effects): these are a few of the real-world characteristics of data samples that researchers unwittingly 
ignore when performing data analysis. 

In this brief note, we will examine a simulation designed to capture a few of these problematic conditions, and we will examine the
impact on our ability to draw inferences back to the data-generating mechanism of the population. In Part 2 of this note, 
we will look at more flexible approaches to dealing with the hidden, problematic origins of the sample data.

Constructing Our Torpedo
The approach we take here is to generate a large (finite) population, and then subsample with replacement from
this larger whole of data.
 
> pop.n<-500000
> samp.n<-100
 
We will want the residuals in the population to have a skewed non-normal shape:
 
> library(sn)
> sn.prob<-psn(seq(-4,1, length=pop.n), alpha=-5)
> sn.draw<-rsn(seq(-4,1, length=pop.n), alpha=-5)
> hist(sn.prob)
> hist(sn.draw)
[Figure: histograms of sn.prob and sn.draw]
Our skew-normal residuals will more often give less disturbance to the population outcome values that are larger in magnitude.  
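A quick numerical check of this skew (a minimal sketch, not part of the original code):
 
> mean(sn.draw)    # with alpha=-5 the distribution is left-skewed,
> median(sn.draw)  # so the mean falls below the median
 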
Next, we create population subgroups with differing location (mean) and spread (standard deviation). 
 
> b.1.pop.bias<-c(sort(rnorm(1:(pop.n/2), .2, .1)), sort(rnorm((pop.n/2+1):pop.n, .8, .8)))
> mean(b.1.pop.bias)
[1] 0.5000733
>  
> b.1.pop.a<-mean(b.1.pop.bias[1:(pop.n/2)])
> b.1.pop.a
[1] 0.1998411
> b.1.pop.b<-mean(b.1.pop.bias[(pop.n/2+1):pop.n])
> b.1.pop.b
[1] 0.8003054
> b1.pop.mix<-(b.1.pop.a+b.1.pop.b)/2
> b1.pop.mix
[1] 0.5000733
 
Group "a" has a mean of .2 with spread .1,  and group "b" has a mean of .8 with spread .8. The resulting combined
mean of both groups is .50 - either by collapsing across the groups to average all observations, or by simple averaging
of the two group means. The two group sub-means are approximately .20 and .80.  Plotting both distributions on the same plot:
 
> par(new=FALSE)
> plot(density(b.1.pop.bias[1:(pop.n/2)]), ylim=c(0,4), xlim=c(-2, 4), 
+      main="Population Density", xlab="")
> par(new=TRUE)
> plot(density(b.1.pop.bias[(pop.n/2+1):pop.n]), ylim=c(0,4), 
+      xlim=c(-2, 4), main="Population Density",xlab="")

[Figure: overlaid density plots of the two population subgroups]
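
A slightly more idiomatic way to overlay the two densities, avoiding the repeated par(new=TRUE) calls (an equivalent sketch, not necessarily the code that produced the original figure):
 
> plot(density(b.1.pop.bias[1:(pop.n/2)]), ylim=c(0,4), xlim=c(-2, 4),
+      main="Population Density", xlab="")
> lines(density(b.1.pop.bias[(pop.n/2+1):pop.n]))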

Next,  we create our population model with skewed residuals:
 
> x.obs.pop<-b.1.pop.bias
> pop.residual<-sort(sn.draw)
> pop.residual.a<-pop.residual[1:(pop.n/2)]
> pop.residual.b<-pop.residual[(pop.n/2+1):pop.n]
> pop.residual.a<-sample(pop.residual.a, pop.n/2, replace=TRUE, prob=sn.prob[1:(pop.n/2)])
> pop.residual.b<-sample(pop.residual.b, pop.n/2, replace=TRUE, prob=sn.prob[(pop.n/2+1):pop.n])
> pop.residual<-c(pop.residual.a, pop.residual.b)
> pop.residual<-pop.residual-mean(pop.residual)
> summary(pop.residual)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.4460 -0.2663 -0.1716  0.0000  0.3021  1.2150 
> hist(pop.residual)

 

The population predictor is set to "b.1.pop.bias".  We sort the skew-normal residuals to create two different sets of residuals.  
Residuals are then randomly sampled from within these two subgroups to produce the residuals for the outcomes of the two subgroups. 
Lastly, we mean-center our population residuals:
[Figure: histogram of pop.residual]
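As a quick check on this construction (a minimal sketch using the objects defined above), we can compare the spread of the residuals assigned to each half of the population:
 
> sd(pop.residual[1:(pop.n/2)])        # residual spread paired with group "a"
> sd(pop.residual[(pop.n/2+1):pop.n])  # residual spread paired with group "b"
 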
Group "a" with location .2 and spread .1 will have less disturbance, and group "b" with location .8 and
spread .8 will have more disturbance.  We generate our population model with intercept 1 and a fixed
regression coefficient of .8.
 
> y.obs.pop<-  1 + .80*x.obs.pop + pop.residual
> hist(y.obs.pop)

[Figure: histogram of y.obs.pop]

We can observe the multiple "bumps" in the histogram.  The hope is that, with a large enough sample size, the intercept
of 1 and the fixed regression coefficient of .8 will be captured when modeling the sample data. We gather up our generated population 
data into a data frame and plot our torpedo:
 
> pop.df<-data.frame(y.obs.pop, x.obs.pop)
> plot(pop.df$x.obs.pop, pop.df$y.obs.pop)

[Figure: scatterplot of the population data (the torpedo)]

What do the population model estimates look like?
 
> mean(1+pop.residual)
[1] 1
> library(lm.beta)
> pop.fit<-lm(pop.df$y.obs.pop ~ pop.df$x.obs.pop)
> summary(lm.beta(pop.fit))
 
Call:
lm(formula = pop.df$y.obs.pop ~ pop.df$x.obs.pop)
Residuals:
    Min      1Q  Median      3Q     Max 
-3.3622 -0.2148 -0.1159  0.2378  1.4958 
Coefficients:
                  Estimate Standardized Std. Error t value            Pr(>|t|)    
(Intercept)      0.8886341    0.0000000  0.0005851    1519 <0.0000000000000002 ***
pop.df$x.obs.pop 1.0226991    0.8958477  0.0007174    1426 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.3268 on 499998 degrees of freedom
Multiple R-squared:  0.8025,    Adjusted R-squared:  0.8025 
F-statistic: 2.032e+06 on 1 and 499998 DF,  p-value: < 0.00000000000000022
 
> coefficients(summary(lm.beta(pop.fit)))
                  Estimate Standardized   Std. Error  t value Pr(>|t|)
(Intercept)      0.8886341    0.0000000 0.0005851081 1518.752        0
pop.df$x.obs.pop 1.0226991    0.8958477 0.0007174070 1425.549        0
 
Analyzing all 500,000 cases in the population produces a near-reversal of the estimates: the intercept should be 1 but is estimated
at about .89, while the population regression coefficient should be close to .80 but is estimated at about 1.02. A residual plot
indicates that "all is not well".
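
The code for the residual plot is not shown above; a minimal way to produce one (an assumption about how the figure was generated):
 
> plot(fitted(pop.fit), residuals(pop.fit),
+      xlab="Fitted values", ylab="Residuals")
> abline(h=0)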

[Figure: residual plot for the population fit]

Of course, one might contend: "We would see these irregularities in the data and compensate by predictor transformation, or
by other means." Let's see what a possible sample looks like when taken n=100 at a time from our large
finite population.
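
The subsampling step itself is not shown above; a minimal sketch that draws one such sample with replacement (the names sim.data, y.obs, and x.obs are chosen to match the regression call below):
 
> samp.rows<-sample(1:pop.n, samp.n, replace=TRUE)
> sim.data<-data.frame(y.obs=y.obs.pop[samp.rows], x.obs=x.obs.pop[samp.rows])
> plot(sim.data$x.obs, sim.data$y.obs)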

[Figure: scatterplot of an n=100 subsample]

The smaller sub-samples obscure the larger deterministic data-generating mechanism, so the sub-samples
don't look so bad on their own.  However, each sub-sample reflects the same overall, highly biased estimates.
 
> unstd.lm.fit<-lm(y.obs~x.obs, data=sim.data)
> summary(unstd.lm.fit)
Call:
lm(formula = y.obs ~ x.obs, data = sim.data)
Residuals:
     Min       1Q   Median       3Q      Max 
-2.03356 -0.15934 -0.07526  0.21984  1.26029 
Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  0.79210    0.05089   15.56 <0.0000000000000002 ***
x.obs        1.17333    0.06524   17.98 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.3783 on 98 degrees of freedom
Multiple R-squared:  0.7675,    Adjusted R-squared:  0.7651 
F-statistic: 323.4 on 1 and 98 DF,  p-value: < 0.00000000000000022
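To confirm that this is not just one unlucky draw, we can repeat the subsampling many times and look at where the slope estimates land (a minimal sketch; n.reps is an arbitrary choice):
 
> n.reps<-1000
> slope.ests<-replicate(n.reps, {
+   rows<-sample(1:pop.n, samp.n, replace=TRUE)
+   coef(lm(y.obs.pop[rows] ~ x.obs.pop[rows]))[2]
+ })
> summary(slope.ests)  # the estimates cluster near the biased population value, not .80
 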
 
Our goal seems clear at this point: how do we "discover" the hidden groupings, take into account the non-normality
of the residuals, and allow for the heterogeneity (unequal spread) across the subgroups? Next time, we'll look at
various approaches to this problem, and I promise: we will be able to recapture the population structure using
least squares. We will not let a torpedo sink this ship.