Many different parameter estimation methods are available to data analysts, and chief among them are the least-squares-based methods.
Classical least-squares multiple regression is a favorite among social scientists. Unfortunately, without very large sample sizes (and in
some instances, even with large sample sizes), these methods can run into trouble.
Un-modeled nonlinearities, non-normal residuals, non-uniform selection probabilities of sample cases, outlying values on
both the predictor (x) and outcome (y) variables, hidden mixtures of subgroups, and the non-fixed nature of the x-predictors
(random effects) – these are a few of the real-world characteristics of data samples that researchers unwittingly
ignore when performing data analysis.
In this brief note, we will examine a simulation designed to capture a few of these problematic conditions, and examine their
impact on our ability to draw inferences back to the data-generating mechanism of the population. In part 2 of this note,
we will look at more flexible approaches to dealing with the hidden, problematic origins of the sample data.
The approach we take here is to generate a large, finite population, and then subsample with replacement from
this larger whole of data.
We will want the residuals in the population to have a skewed, non-normal shape, with the disturbances assigned unevenly:
one subgroup of outcome values will receive systematically smaller disturbances than the other.
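The skew-normal ingredients (`pop.n`, `sn.draw`, `sn.prob`) are used in the listings below but are never defined in this excerpt. Here is one plausible base-R reconstruction, using the classic construction X = δ|Z1| + sqrt(1 − δ²)·Z2 (a negative δ gives left skew) and the skew-normal density as a stand-in for the sampling weights; the seed and parameter values are assumptions, not the author's actual code:

```r
# Assumed setup: pop.n, sn.draw, and sn.prob appear in later listings
# but are not defined in this excerpt; this is a plausible sketch only.
set.seed(1234)                      # hypothetical seed
pop.n <- 500000                     # matches the 500000 cases analyzed later
delta <- -0.95                      # strong left skew
alpha <- delta / sqrt(1 - delta^2)  # skew-normal shape parameter
z1 <- rnorm(pop.n)
z2 <- rnorm(pop.n)
sn.draw <- delta * abs(z1) + sqrt(1 - delta^2) * z2    # skew-normal draws
sn.prob <- 2 * dnorm(sn.draw) * pnorm(alpha * sn.draw) # density as weights
```

Any mechanism that yields a left-skewed residual pool plus a non-negative weight vector of the same length would serve the same purpose here.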
Next, we create population subgroups with differing location (mean) and spread (standard deviation).
> b.1.pop.bias<-c(sort(rnorm(pop.n/2, .2, .1)), sort(rnorm(pop.n/2, .8, .8)))
> mean(b.1.pop.bias)
[1] 0.5000733
>
> b.1.pop.a<-mean(b.1.pop.bias[1:(pop.n/2)])
> b.1.pop.a
[1] 0.1998411
> b.1.pop.b<-mean(b.1.pop.bias[(pop.n/2+1):pop.n])
> b.1.pop.b
[1] 0.8003054
>
> b1.pop.mix<-(b.1.pop.a+b.1.pop.b)/2
> b1.pop.mix
[1] 0.5000733
Group "a" has a mean of .2 with spread .1, and group "b" has a mean of .8 with spread .8. The resulting combined
mean of both groups is .50 – whether computed by collapsing across the groups and averaging all observations, or by simply
averaging the two group means. The two group sub-means are approximately .20 and .80. Plotting both distributions on the same axes:
> par(new=FALSE)
> plot(density(b.1.pop.bias[1:(pop.n/2)]), ylim=c(0,4), xlim=c(-2, 4),
+ main="Population Density", xlab="")
> par(new=TRUE)
> plot(density(b.1.pop.bias[(pop.n/2+1):pop.n]), ylim=c(0,4),
+ xlim=c(-2, 4), main="Population Density",xlab="")
Next, we create our population model with skewed residuals:
> x.obs.pop<-b.1.pop.bias
> pop.residual<-sort(sn.draw)
>
> pop.residual.a<-pop.residual[1:(pop.n/2)]
> pop.residual.b<-pop.residual[(pop.n/2+1):pop.n]
>
> pop.residual.a<-sample(pop.residual.a, pop.n/2, replace=TRUE, prob=sn.prob[1:(pop.n/2)])
> pop.residual.b<-sample(pop.residual.b, pop.n/2, replace=TRUE, prob=sn.prob[(pop.n/2+1):pop.n])
> pop.residual<-c(pop.residual.a, pop.residual.b)
> pop.residual<-pop.residual-mean(pop.residual)
>
> summary(pop.residual)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.4460 -0.2663 -0.1716 0.0000 0.3021 1.2150
> hist(pop.residual)
The population predictor is set to "b.1.pop.bias". We sort the skew-normal draws to create two different sets of residuals,
then randomly sample (with replacement, weighted by "sn.prob") from within each set to produce the residuals for the outcomes of the two subgroups.
Lastly, we mean-center the population residuals.
Group "a" with location .2 and spread .1 will have less disturbance, and group "b" with location .8 and
spread .8 will have more disturbance. We generate our population model with intercept 1 and a fixed
regression coefficient of .8.
> y.obs.pop<- 1 + .80*x.obs.pop + pop.residual
> hist(y.obs.pop)
We can observe the multiple "bumps" in the histogram. The hope is that, with large enough sample sizes, the intercept
of 1 and the fixed slope of .8 will be recovered when modeling the sample data. We gather our generated population
data into a data frame and plot our torpedo:
> pop.df<-data.frame(y.obs.pop, x.obs.pop)
> plot(pop.df$x.obs.pop, pop.df$y.obs.pop)
What do the population model estimates look like?
> mean(1+pop.residual)
[1] 1
>
> library(lm.beta)
> pop.fit<-lm(pop.df$y.obs.pop ~ pop.df$x.obs.pop)
> summary(lm.beta(pop.fit))
Call:
lm(formula = pop.df$y.obs.pop ~ pop.df$x.obs.pop)
Residuals:
Min 1Q Median 3Q Max
-3.3622 -0.2148 -0.1159 0.2378 1.4958
Coefficients:
Estimate Standardized Std. Error t value Pr(>|t|)
(Intercept) 0.8886341 0.0000000 0.0005851 1519 <0.0000000000000002 ***
pop.df$x.obs.pop 1.0226991 0.8958477 0.0007174 1426 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3268 on 499998 degrees of freedom
Multiple R-squared: 0.8025, Adjusted R-squared: 0.8025
F-statistic: 2.032e+06 on 1 and 499998 DF, p-value: < 0.00000000000000022
> coefficients(summary(lm.beta(pop.fit)))
Estimate Standardized Std. Error t value Pr(>|t|)
(Intercept) 0.8886341 0.0000000 0.0005851081 1518.752 0
pop.df$x.obs.pop 1.0226991 0.8958477 0.0007174070 1425.549 0
Analyzing all 500000 cases in the population produces a reversal in the estimates: the intercept should be 1 and the slope
close to .80, yet we obtain an intercept of about 0.89 and a slope of about 1.02. A residual plot indicates that "all is not well".
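The residual plot referred to can be produced with base graphics. A minimal sketch (the helper name is ours, not the author's; a stand-in fit is substituted only when the snippet is run outside the session above):

```r
# Hypothetical helper: residuals vs. fitted values with a zero reference
# line. For the population fit, the scatter shows banded, asymmetric
# structure rather than a patternless horizontal cloud.
resid.check <- function(fit, main = "Residual Plot") {
  plot(fitted(fit), resid(fit),
       xlab = "Fitted values", ylab = "Residuals", main = main)
  abline(h = 0, lty = 2)
  invisible(fit)
}
# Stand-in fit so the snippet runs outside the session above:
if (!exists("pop.fit")) pop.fit <- lm(dist ~ speed, data = cars)
resid.check(pop.fit, main = "Population Residual Plot")
```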
Of course, one might contend: "We would see these irregularities in the data and compensate by transforming the predictor, or
by other means." Let's see what a possible sample looks like when drawn n=100 at a time from our large
finite population:
The smaller sub-samples obscure the population's deterministic data-generating mechanism, so any single sub-sample
doesn't look so bad. Each sub-sample, however, reflects the same highly biased estimates.
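`sim.data` is not constructed in the listings above. One plausible sketch (the seed and column names are assumptions; a stand-in population is substituted only when the snippet is run outside the session above) that draws a single n=100 sample with replacement from `pop.df`:

```r
# Assumed construction of sim.data: one simple random sample, with
# replacement, of n = 100 cases from the finite population pop.df.
# Stand-in population so the snippet runs outside the session above:
if (!exists("pop.df")) {
  x.obs.pop <- runif(1000)
  y.obs.pop <- 1 + .8 * x.obs.pop + rnorm(1000, 0, .3)
  pop.df <- data.frame(y.obs.pop, x.obs.pop)
}
set.seed(42)                                   # hypothetical seed
idx <- sample(seq_len(nrow(pop.df)), 100, replace = TRUE)
sim.data <- data.frame(y.obs = pop.df$y.obs.pop[idx],
                       x.obs = pop.df$x.obs.pop[idx])
plot(sim.data$x.obs, sim.data$y.obs)           # one such sample
```

Repeating this draw many times (e.g., in a loop) would show that each sample scatterplot looks unremarkable while its fitted coefficients remain biased.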
> unstd.lm.fit<-lm(y.obs~x.obs, data=sim.data)
> summary(unstd.lm.fit)
Call:
lm(formula = y.obs ~ x.obs, data = sim.data)
Residuals:
Min 1Q Median 3Q Max
-2.03356 -0.15934 -0.07526 0.21984 1.26029
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.79210 0.05089 15.56 <0.0000000000000002 ***
x.obs 1.17333 0.06524 17.98 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3783 on 98 degrees of freedom
Multiple R-squared: 0.7675, Adjusted R-squared: 0.7651
F-statistic: 323.4 on 1 and 98 DF, p-value: < 0.00000000000000022
Our goal seems clear at this point: how do we "discover" the hidden groupings, account for the non-normality
of the residuals, and handle the heterogeneity (unequal spread) across subgroups? Next time, we'll look at
various approaches to this problem, and I promise: we will be able to recapture the population structure using
least squares – we will not let a torpedo sink this ship.