STATISTICAL METHODS AND SOFTWARE FOR THE ANALYSIS OF OCCUPATIONAL EXPOSURE DATA WITH NON-DETECTABLE VALUES

Edward L. Frome
Computer Science and Mathematics Division
Oak Ridge National Laboratory

Paul F. Wambach
U.S. Department of Energy

Date Published: September 2005
CONTENTS

ABSTRACT
1. INTRODUCTION
2. STATISTICAL ANALYSIS FOR COMPLETE SAMPLES
   2.1 CONFIDENCE LIMITS FOR THE MEAN EXPOSURE LEVEL
   2.2 CONFIDENCE LIMITS FOR THE PTH PERCENTILE
   2.3 CONFIDENCE LIMITS FOR THE EXCEEDANCE FRACTION
3. ANALYSIS OF DATA WITH NON-DETECTS
   3.1 MAXIMUM LIKELIHOOD ESTIMATION FOR LOGNORMAL DATA WITH NON-DETECTS
   3.2 CONFIDENCE LIMITS FOR THE MEAN EXPOSURE LEVEL WITH NON-DETECTS
   3.3 CONFIDENCE LIMITS FOR THE PTH PERCENTILE WITH NON-DETECTS
   3.4 CONFIDENCE LIMITS FOR EXCEEDANCE FRACTIONS WITH NON-DETECTS
   3.5 NON-PARAMETRIC METHODS FOR SAMPLES WITH NON-DETECTS
   3.6 NON-PARAMETRIC UPPER TOLERANCE LIMIT AND EXCEEDANCE FRACTION
4. APPLICATIONS
   4.1 EXAMPLE 1. SURFACE WIPE SAMPLES FROM ELEVATED SMELTER SURFACES
   4.2 EXAMPLE 2. TWA BERYLLIUM EXPOSURE DATA
5. DISCUSSION
6. ACKNOWLEDGEMENTS
REFERENCES
APPENDIX
ABSTRACT
Environmental exposure measurements are, in general, positive and may be subject to left censoring; i.e., the measured value is less than a “detection limit.” In occupational monitoring, strategies for assessing workplace exposures typically focus on the mean exposure level or the probability that any measurement exceeds a limit. Parametric methods used to determine acceptable levels of exposure are often based on a two-parameter lognormal
distribution. The mean exposure level, an upper percentile, and the exceedance
fraction are used to characterize exposure levels, and confidence limits are
used to describe the uncertainty in these estimates. Statistical methods for random samples
(without non-detects) from the lognormal distribution are well known for each of
these situations. In this report,
methods for estimating these quantities based on the maximum likelihood method
for randomly left censored lognormal data are described and graphical methods
are used to evaluate the lognormal assumption. If the lognormal model is in doubt and
an alternative distribution for the exposure profile of a similar exposure group
is not available, then nonparametric methods for left censored data are used.
The mean exposure level, along with the upper confidence limit, is obtained
using the product limit estimate, and the upper confidence limit on an upper
percentile (i.e., the upper tolerance limit) is obtained using a nonparametric
approach.
All of these methods are well known
but computational complexity has limited their use in routine data analysis with
left censored data. The recent
development of the R environment for statistical data analysis and graphics has
greatly enhanced the availability of high-quality nonproprietary (open source)
software that serves as the basis for implementing the methods in this
paper. Numerical examples are provided, and R (2004) functions are available at the Analysis of Occupational Exposure Data (AOED) web site.
1. INTRODUCTION
Regulatory and advisory criteria
for evaluating the adequacy of occupational exposure controls are generally
expressed as limits that are not to be exceeded in a work shift or shorter
time-period if the agent is acutely hazardous. Exposure monitoring results above the
limit require minimal interpretation and should trigger immediate corrective
action. Demonstrating compliance with a limit is more difficult.
The American Industrial Hygiene
Association (AIHA) has published a consensus standard with two basic strategies
for evaluating an exposure profile [Mulhausen and Damiano (1998)]. The first approach is based on the mean
of the exposure distribution, and the second approach considers the “upper tail”
of the exposure profile.
Statistical methods for estimating the mean, an upper percentile of the
distribution, the exceedance fraction, and the uncertainty in each of these
parameters are described. Most of
the AIHA methods are based on the assumptions that the exposure data does not
contain non-detects, and that a lognormal distribution can be used to describe
the data. Exposure monitoring results from a compliant workplace tend to contain
a high percentage of non-detected results when the detection limit is close to
the exposure limit, and in some situations, the lognormal assumption may not be
reasonable. There are parametric
methods for censored lognormal data and non-parametric methods that can be used
with left censored data to calculate all of the statistics recommended by the
AIHA for the complete data case. However, the only practical way to compute
these statistics is with statistical software. The recent availability of free,
high-quality statistical software means that complex calculations are no longer
a barrier to the statistical analysis of almost any occupational exposure data
set. This also eliminates the need
for special tables and graphical methods that are used in the complete data case
for the lognormal distribution.
Statistical methods for the
analysis of right censored data using various parametric and non-parametric
methods are well known and generally referred to as “survival analysis” [Cox and Oakes (1984) or Kalbfleisch and Prentice (1980)]. In this situation, the dependent or response variable (say T) is usually the time to the occurrence of an event; i.e., the
“survival time” (or time to failure) of an observational or experimental unit
(e.g., animal, person, or machine).
T may be referred to as a “lifetime random variable” and is by definition
positive, and may be subject to “censoring.” As a typical example, let Ti represent the survival time of the ith patient in a clinical trial. If the trial ends and the patient is not known to have “failed,” the observed survival time, say ti, is right censored (i.e., it is only known that Ti is greater than ti). This can occur for several reasons. Suppose, for example, that all patients enter the trial at the same time and are followed until a specified end date; then those individuals still at risk have a censored survival time ti that is the same for all surviving patients (Type I censoring). If patients enter the trial at random and the trial ends at a fixed date, then the value of ti is different for each surviving patient (random censoring).
Statistical methods for the analysis of right censored data are widely
used and computer software for survival analyses is available in most general
purpose statistical programs
(e.g., the R survival
library).
In this report, the dependent or
response variable of interest is the amount, say X, of a measured quantity. X is a positive random variable and as
the result of the analytic methods used, the observed value for the ith measurement may be reported as (left) “censored” and referred to as a non-detect, or as being less than a “detection limit” (DL), say DLi (i.e., it is only known that xi is less than DLi). A frequent assumption is that the distribution of X is lognormal. See Aitchison and Brown (1969) and Crow and Shimizu (1988) for a general treatment of the lognormal distribution and its applications. Schmoyer et al.
(1996) considered the lognormal model for contaminant concentrations in
environmental risk assessment for both complete and left censored samples. Akritas et al. (1994) provide a detailed discussion of various methods that have been proposed for parameter estimation with left censored data. Methods are classified into three general areas, as described by Helsel (1990): (1) simple substitution, e.g., replace a censored observation with one-half of the detection limit; (2) parametric methods, e.g., censored-data maximum likelihood (ML); and (3) “robust parametric methods” based on variations of probability plot regression. Simulation studies have been done to compare these methods under various conditions, with the primary focus on bias and mean square error of the location and scale parameters [see Helsel and Cohn (1988), Newman et al. (1989)]. Akritas et al. (1994) have reviewed these and other studies, and note
that ML methods under the lognormal model provide expressions for the variance
of the parameter estimates. This is
important when an upper percentile, say Xp , of the exposure distribution is of
interest. For example, the ML
estimate of log(Xp) is a linear combination of the lognormal
parameters, and the standard error of this quantity can be estimated from the ML
parameter covariance matrix.
Consequently, confidence limits for both Xp and the exceedance
fraction can be obtained using the ML approach. Taylor et al. (2001) have noted
that regarding all non-detected values as censored outcomes from a lognormal
distribution may not always be appropriate. If there is reason to believe that a
non-negligible proportion of the non-detects are “true zero” exposures, then a
censored lognormal mixture model (a zero-inflated lognormal model with
censoring) should be considered. ML methods for estimation and hypothesis testing are described, and the relationship between ML parameter estimates from the mixture model and those based on either a left truncated or censored lognormal model is described. Moulton and Halsey (1995) emphasize that
it is also possible that non-detects may be from a second (possibly lognormal)
distribution rather than a point mass at zero. Fowlkes (1979) has described methods for
studying the mixture of two lognormal distributions, although he did not
consider left censored data. In
situations where the lognormal model (or some other distribution such as the
gamma or Weibull) is not reasonable, a non-parametric approach can be used. All of the methods just described can be
implemented using R, the ML method being the most difficult. It is also possible to develop
procedures for all of these methods in several proprietary statistical programs
that are available commercially.
In the discussion that follows the
generic term “acceptable” refers to the situation where the distribution of the
exposure measurement X satisfies a specified criterion indicating, for example,
that the workplace is “safe,” or that the surfaces of a survey unit are
“clean.” The term survey unit
describes all or part of an entity (e.g., building, piece of equipment) that is
being evaluated. The term
“unacceptable” means that the distribution of the exposure measurements
indicates that the workplace or survey unit is “contaminated” or
“hazardous.” The formal statistical procedure used to demonstrate that an exposure distribution is acceptable is to state a null hypothesis of the form H0: θ ≥ L, where θ represents a parameter of the exposure distribution (e.g., the mean, a percentile, or the exceedance fraction), and θ ≥ L indicates that the exposure distribution is unacceptable. Then, based on a random sample from the exposure distribution, an estimate of θ and an upper confidence limit (UCL) with a specified confidence level, say γ, are calculated. If the 100γ% UCL is less than L, then the null hypothesis is rejected and the exposure distribution is declared acceptable. A Type I error occurs if H0 is rejected when it is true (i.e., the X distribution is incorrectly considered to be acceptable). This will occur with a probability (Type I error rate) that is less than or equal to α = (1 − γ), with α = 1 − γ when θ = L. These and other related procedures are described in detail in an occupational exposure context by Mulhausen and Damiano (1998).
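As a concrete illustration of this decision rule, the short R sketch below (not one of the AOED functions; the function name and numerical values are purely illustrative) compares a computed 100γ% UCL with the limit L:

test_acceptable <- function(ucl, L) {
  # Reject H0: theta >= L (declare the exposure profile acceptable)
  # only when the upper confidence limit falls below the limit L.
  if (ucl < L) "reject H0: exposure profile acceptable"
  else "cannot reject H0: acceptability not demonstrated"
}
test_acceptable(ucl = 0.15, L = 0.2)   # illustrative values only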
4. APPLICATIONS
In several situations of practical interest, statistical analysis of left censored data from a lognormal distribution is required. The "exact" results for
complete samples described in Section 2 have not been developed for censored
data. The methods presented here are "large sample" results and follow
directly from the properties of ML estimators described in Section 3. Each of the examples will describe the
censored data equivalent of the exact methods used with complete samples.
The emphasis here is on describing the methods and software. The focus of the examples is on two areas of application that are part of the Department of Energy (DOE) Chronic Beryllium Disease Prevention Program. The DOE is concerned with monitoring
objects (e.g., equipment, buildings) for beryllium contamination and workers for
exposure to beryllium in the workplace.
The first example describes the results of a survey to evaluate possible
beryllium “contamination” based on surface wipe sampling of a smelter facility
used to recycle metal. The second example describes the results of a beryllium worker-monitoring program using 8-hour time-weighted average (TWA) measurements. In both situations
“limit values” have been established to determine if exposure levels are
acceptable; i.e., the object is “clean” or the workplace is “safe.” In general, the limit value will depend
on the strategy that is being used, as described in the introduction. In the examples, the null hypothesis of interest is that the 95th percentile of the exposure distribution does not exceed the specified limit.
Hewett (1996) explains that occupational exposure limits (OELs) are
generally single shift limits used for day-to-day risk management that will also
constrain long-term, working lifetime exposures of each individual worker to
protective levels. OELs are based
on health or toxicology studies that establish protective mean exposure
levels. A work environment that
rarely exceeds the OEL will also maintain mean levels well below the OEL. Day-to-day exposure prevention is
achieved through investigations to determine cause and corrective actions for
exposure measurements above the OEL.
Ninety-five-percent confidence that fewer than 5% of
measurements are above a specified limit is a statistical definition of
compliance that has come into widespread use to determine the monitoring efforts
needed to demonstrate compliance [see chapter 7, Mulhausen and Damiano
(1998)]. In this
situation the upper tolerance limit (i.e., the 95% UCL for the 95th
percentile) and the UCL for the exceedance fraction are of primary interest (see
Section 2.3) to determine if the exposure distribution is acceptable. The exact results for samples from a
lognormal (or normal) distribution described in Section 2 and the Appendices of
Mulhausen and Damiano(1998) are
based on the assumption of complete samples; i.e., no left censored data. The statistical methods and
computer software for the analysis of left censored data described in Section 3
can be used to calculate the censored data equivalent of all of the statistics
described by Mulhausen and Damiano (1998).
Details describing R and the R driver functions used to obtain these
results are described in the Appendix and at the AOED website. All of the R functions can be downloaded
from the AOED web site.
In the examples, results obtained using R
interactively are shown in a monospaced font
like this (where “>” is the R prompt). To
duplicate these results in the examples read the Appendix and then visit the
AOED web site and complete steps 1-4. Note that Exhibit 1 is listed at
the end of readall.R at the AOED web site, and the data frame SESdata and
character string IpSESdata will be in the R working
directory.
4.1 Example 1. Surface Wipe Samples from Elevated Smelter Surfaces
The data in Exhibit 1 are 31 surface wipe
samples from elevated surfaces of a smelter with beryllium contamination. Exhibit 1 illustrates one
method that can be used to enter data into R in the format required for the ML
estimation. This would
normally be done by using a text editor to create a file (example1.txt). All characters on a line to the right of
the # sign are comments. The data are entered into the R working directory using the R function source; i.e., if the file is in the directory (folder) where the R session was initiated, source("example1.txt") will input the vectors x and det and the data frame SESdata.
Exhibit 1 of Section 4.1
# Illustrates one method that can be used to enter
# data into R in the format required for ML estimates
#
# Surface Wipe Samples from Smelter
IpSESdata <- "SESdata: Smelter-Elevated Surfaces"
# IpSESdata is a character string for use by qqlognB()
x <- c(15,15,15,25,25,40,40,40,45,50,50,70,75,95,100,125,125,
       145,145,150,150,165,270,290,345,395,395,420,495,840,1140)
x <- x/1000                   # wipe samples, micrograms per 100 cm^2
det <- c(0,0,0,rep(1,28))     # first three values are censored (non-detects)
SESdata <- data.frame(x,det)  # R data frame for mlndln()
ML estimates of µ, σ, log E(X), and σ² are obtained using:

> mlndln(SESdata)
              mu     sigma       logE      sig2       -2Log(L) Conver
mle   -2.2907643 1.2760000 -1.4766777 1.6281796  -12.852885390      0
semle  0.2311395 0.1754489  0.3137301 0.4477474   -0.002005525     28
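The mlndln function obtains these estimates by direct maximization of the censored-data lognormal likelihood using the base R function optim (see the Appendix). As an independent check, the same fit can be obtained from the survreg function in the standard R survival package; this is not how mlndln is implemented, only an alternative route to the ML estimates of µ and σ, which should agree with the values above:

library(survival)   # survreg handles left censored lognormal data
# det = 1 for a detected value, 0 for a non-detect (left censored at x)
fit <- survreg(Surv(x, det, type = "left") ~ 1, data = SESdata,
               dist = "lognormal")
coef(fit)    # ML estimate of mu, approximately -2.29
fit$scale    # ML estimate of sigma, approximately 1.28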
The R function mlndln is described in the Appendix and is available at the AOED website. The data in Exhibit 1 are shown graphically in Figure 1. The ML estimates of µ and σ are shown in the title of the plot. To obtain Figure 1, use the following at the R prompt:

> qqlognB(SESdata,IpSESdata,L=0.2,unit="mug/100 cm^2",p=0.95,gam=0.95)
If equipment is being evaluated for release to the public or for non-beryllium use, the DOE has established a release limit for removable beryllium contamination of L95 = 0.2 µg/100 cm².

Figure 1. Results for Surface Wipe Samples in Example 1.

The ML estimate of the exceedance fraction (see Section 2.3), the 95% LCL, and the 95% UCL are obtained using the R function efclnd, based on the method described in Section 3.4:
> efclnd(SESdata,gam=0.95,L=0.2)
   f_MLE  LCL_0.95  UCL_0.95
29.66864  19.45963  41.80762
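The lognormal point estimate of the exceedance fraction is simply the upper tail probability of the fitted distribution at the limit, f = 100 × Pr[X > L] = 100 × {1 − Φ[(log L − µ)/σ]} evaluated at the ML estimates, and can be verified directly from the mlndln output above (the confidence limits, however, require the ML covariance matrix as described in Section 3.4):

# point estimate of the exceedance fraction from the ML estimates
100 * pnorm( (log(0.2) - (-2.2907643)) / 1.2760000, lower.tail = FALSE )
# approximately 29.67, in agreement with f_MLE above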
The nonparametric estimates of the exceedance fraction, 95% LCL, and 95% UCL are obtained using efclnp, based on the method described in Section 3.5:

> efclnp(SESdata,gam=0.95,L=0.2)
     fnp   LCL_95   UCL_95
29.03226 16.06111 45.19044
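The nonparametric point estimate is the observed proportion of measurements above the limit, and the confidence limits are exact one-sided binomial limits [Clopper and Pearson (1934)]. A minimal sketch that reproduces these numbers, assuming efclnp uses the usual beta-quantile form of the exact binomial limits, is:

L <- 0.2
k <- sum(SESdata$x > L & SESdata$det == 1)    # number of exceedances (9 of 31)
n <- nrow(SESdata)
fnp <- 100 * k / n                            # point estimate, about 29.03
lcl <- 100 * qbeta(0.05, k, n - k + 1)        # one-sided 95% lower limit
ucl <- 100 * qbeta(0.95, k + 1, n - k)        # one-sided 95% upper limit
c(fnp, lcl, ucl)                              # compare with the efclnp output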
The exceedance fraction is an estimate of the percentage of the surface area that is expected to exceed the release limit Lp = 0.2 µg/100 cm² with p = 0.95. Both the point estimate and the UCL for F exceed F0 = 100(1 − p) = 5%, indicating that the equipment is not acceptable. In fact, the 95% LCL indicates that, at the 95% confidence level, at least 19.5% of the surface area exceeds the release limit.
These lognormal based and nonparametric estimates of the exceedance
fraction and 95% CLs are shown in the lower right of Figure 1 along with the
lognormal based estimate of the 95th percentile Xp = 0.825, the lower 95% CL =
LX(0.95,0.95) = 0.446, and upper 95% CL= UX(0.95,0.95)= 1.526 (see Section 3.3). The GM, GSD, ML estimates of the
(arithmetic) mean of X (with
confidence limits) based on the lognormal model, the distribution free
Kaplan-Meier mean (with confidence limits), the percent non-detects, the sample
size (n), the number of detects (m),
R2 (as defined in
Section 3.5), and the z value for the limit Lp are in the upper left
corner of Figure 1.
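Before turning to the full summary produced by allss below, note that the lognormal point estimate of the 95th percentile shown in Figure 1 follows directly from the ML estimates, Xp = exp(µ + zp σ); the confidence limits LX and UX require the delta-method variance of the log-scale percentile, which is worked out numerically for Example 2:

# lognormal point estimate of the 95th percentile for Example 1
exp(-2.2907643 + qnorm(0.95) * 1.2760000)   # approximately 0.825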
> allss(dd=SESdata,L=0.2,p=0.95,gam=0.95)
            sstat                                            Sec
mu         -2.291  ML estimate of mean of y=log(x)           Sec 3.1
se.mu       0.231  Estimate of standard error of mu          Sec 3.1
...
Fnp_0.2    29.030  Nonparametric estimate of F for limit L   Sec 3.6
FnLCL_95   16.060  Nonparametric estimate of LCL for F       Sec 3.6
FnUCL_95   45.190  Nonparametric estimate of UCL for F       Sec 3.6
>
4.2 Example 2. TWA Beryllium Exposure
Data
As part of a chronic disease prevention program, the DOE adopted an 8-hour TWA OEL limit value of 0.2 micrograms per cubic meter proposed by the American Conference of Governmental Industrial Hygienists (DOE 10 CFR Part 850 and ACGIH 2004). Figure 2 summarizes the results of 280 personal 8-hour TWA beryllium exposure readings at a DOE facility. This data set contains 175 non-detects that
range in value from 0.005 to 0.100 µg/m3. This is an example of random
(progressive) left censored data (available at the AOED web site in file
beTWA.txt). The q-q plot in Figure
2 is based on the PLE as described in Section 3.5 and was calculated using the R
function plend(beTWA). Figure 2 can be obtained using the R utility function qqlognB. To obtain Figure 2, use the following at the R prompt:

> beTWA <- read.table("beTWA.txt")
> qqlognB(beTWA,IpbeTWA,L=0.2,unit="mug/m^3")
ML estimates of µ, σ, log E(X), and σ² are obtained using:

> mlndln(beTWA)
              mu     sigma       logE      sig2       -2Log(L) Conver
mle   -5.1786787 1.5357165 -3.9994324 2.3585614  -2.175955e+02      0
semle  0.1340638 0.1155163  0.1485077 0.3548366  -8.918476e-03    105
>
The ML estimate of the 95-95 geometric upper tolerance limit is calculated using the results in Section 3.3, equation (9); i.e., the ML estimate of the log-scale 95th percentile is

   y(0.95) = µ + z0.95 σ = -5.179 + 1.645(1.536) = -2.652,

with

   var[y(0.95)] = var(µ) + z0.95² var(σ) + 2 z0.95 cov(µ, σ)
                = (0.1341)² + (1.645)²(0.1155)² + 2(1.645)(-0.008918)
                = 0.0247,

where µ and σ denote their ML estimates. Then, using equation (6), UX(0.95,0.95) = exp[-2.652 + 1.659(0.0247)^(1/2)] = 0.091. The nonparametric upper tolerance limit from Section 3.6 is 0.13. Both estimates are well below the limit L0.95 = 0.2 µg/m³, indicating that the workplace is acceptable (in compliance).
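The same calculation can be scripted directly from the mlndln output; the sketch below uses the ML estimates, their standard errors, and the covariance term reported above, together with the multiplier 1.659 used in the text:

mu <- -5.1786787; sig <- 1.5357165        # ML estimates for the beTWA data
se.mu <- 0.1340638; se.sig <- 0.1155163   # standard errors from mlndln
cov.ms <- -8.918476e-03                   # estimated cov(mu, sigma)
zp <- qnorm(0.95)
yp <- mu + zp * sig                       # log-scale 95th percentile, about -2.652
var.yp <- se.mu^2 + zp^2 * se.sig^2 + 2 * zp * cov.ms   # about 0.0247
exp(yp + 1.659 * sqrt(var.yp))            # about 0.091, well below L = 0.2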
The lognormal based estimate of the
exceedance fraction for L0.95
= 0.2 is 1.01%, and the 95% upper confidence limit Uf(0.2,0.95) =
3.24% < 5% indicating the
workplace is acceptable. The
non-parametric estimate of the exceedance fraction is 1.43% with 95% UCL =
3.24%. All of the above results
provide strong support for the decision that the workplace is in
compliance.
Figure 2. Results for 8-hour TWA Data in Example
2.
APPENDIX
R (2004) is available as Free
Software under the terms of the Free Software Foundation's
GNU General Public License in
source code form. It compiles and runs on a wide variety of UNIX platforms and
similar systems (including FreeBSD and Linux), Windows and MacOS.
Detailed documentation on all aspects of R is available at the R
home page http://www.r-project.org/.
Sources, binaries, and documentation for R can be obtained via
CRAN, the “Comprehensive R Archive Network”
(click on CRAN on the R home
page menu ). An Introduction to R
and related manuals edited by the R
Development Core Team are
provided under the “Manuals” link.
Additional manuals, tutorials, etc. are provided by users of R under the “Contributed” link. References are provided under the “Publications” link. [Venables and Ripley (2002)] is a highly recommended book on statistical data analysis using R. All of the R functions discussed
in this report and the data used in the examples in Section 4 are available at
the website http://www.csm.ornl.gov/esh/aoed (AOED). Most of the serious computing is done by
R base functions optim and uniroot. The R driver functions at AOED are provided to assist the reader who may not have experience with R. They are not “formal” R functions; i.e., there is limited error checking, and the usual R online “help” files are not provided. Documentation for each function is provided in this report and as comments in the function. All of the files at the AOED web site are ASCII (txt) files and can be modified using any text editor (e.g., XEmacs, WordPad, vi). The most important functions, with more detailed documentation, are combined into one file, oedmain.R (Exhibit 3). Additional functions that reflect the authors’ interests and that may require revision for other applications are provided in the file oedutil.R (Exhibit 4). Both of these files are available at the AOED web site.
In Section 4 several examples
of the interactive use of these functions for the analysis of left censored data
were provided. These functions can
be used for complete data sets by providing an input matrix with the data in
column 1 and the censoring indicator for each data value (all equal to 1) in
column 2. The results will be
the ML estimates. For complete data sets when n is small, the “exact” results described in Section 2 may be of interest. The R function lnexact in Exhibit 2 illustrates the use of the functions extol and efcl (see Exhibit 3) for the exact analysis of complete samples. It is used here to provide an introduction to R (see below). These functions are appropriate for complete data sets when the “upper tail” of the lognormal distribution is of interest (see Mulhausen and Damiano, 1998, Appendix VII). In this situation the industrial hygienist picks an upper percentile Xp (often the 95th percentile) and specifies a limit Lp (e.g., the OEL). The value of p is the minimum proportion of the exposure distribution that must fall below the limit Lp for the exposure profile to be considered “acceptable.” The exact 100γ% upper confidence limit for the pth percentile Xp is UX(p, γ) = exp(ȳ + K sy) and is referred to as the upper tolerance limit (see Section 2.2). The exact 100γ% lower confidence limit for Xp is LX(p, γ) = exp(ȳ + K′ sy), where ȳ and sy are the mean and standard deviation of y = log(x). The point estimate of the exceedance fraction FL and the 100γ% lower and upper confidence limits are calculated using efcl(x,gam,L) as described in Section 2.3. The Shapiro-Wilk W statistic recommended by the AIHA can be computed for complete data sets using the R function shapiro.test for n between 3 and 5000.
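The exact factors K and K′ are supplied by the extol function from the AOED site. For readers who want to see where they come from, the standard construction uses a quantile of the noncentral t distribution [Johnson and Welch (1940)]; the one-line sketch below reflects that construction and is not necessarily the exact code inside extol:

# One-sided tolerance factor K such that UX(p,gam) = exp(ybar + K*sy):
# K = qt(gam, df = n - 1, ncp = qnorm(p)*sqrt(n)) / sqrt(n)
Kfactor <- function(n, p, gam) qt(gam, df = n - 1, ncp = qnorm(p) * sqrt(n)) / sqrt(n)
Kfactor(n = 5, p = 0.95, gam = 0.95)   # about 4.20, cf. the Odeh and Owen (1980) tables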
To duplicate the results that
follow visit the AOED web site and complete steps 1-4. The R working directory AIHD will
contain all of the functions described in this report and data frames
aiha (example data from Mulhausen and Damiano, 1998, page 259) and
aihand (example data from Mulhausen and Damiano, 1998, page 244 with 3
non-detected values). Note that
aihand is used as the default data set in all functions that require a
two column matrix or data frame (e.g., mlndln). The input to lnexact(x,p,gam,L) in Exhibit 2 is a vector x of positive data xi, i = 1,…, n, and the values of p, gam (γ), and L. The first line is the function name and arguments. Lines 2–18 are comments that describe what the function does, the arguments, and the values returned. All characters on a line to the right of the # sign are comments. In a formal R function this information is obtained from a “help” file by entering ? followed by the name of the function; e.g., typing ?pt at the R prompt (the symbol >) describes the Student’s t distribution functions that are used by extol and efcl. Line 20 describes the error check on line 21. The mean and standard deviation of y = log(x) and the sample size are calculated on line 23 (note that the semicolon separates executable statements on the same line). The next three lines calculate the point estimate and confidence limits for Xp, using extol to calculate the values of K and K′. Line 27 combines the three values into the vector xp and creates the row label nx. Line 28 uses efcl to calculate the point estimate and confidence limits for the exceedance fraction, and the next line combines the vectors xp and fl into a table with names based on the input values of p, gam, and L. These functions can be revised using the R functions fix or edit, or by using a text editor (e.g., Notepad or XEmacs) to make a new file.
Exhibit 2 in the Appendix
lnexact <- function(x=aiha[,1], p=0.95, gam=0.95, L=5){
# Find point estimates and confidence limits for Xp, the pth
# percentile, and the exceedance fraction F_L = Pr[ x > L ]
# for a (complete) sample from a lognormal distribution
# USAGE: lnexact(x,p,gam,L)
# ARGUMENTS:
#   x   is a vector of positive lognormal data
#   p   is the proportion that should fall below L
#   gam is the one-sided confidence level
#   L   is the specified limit of interest
# VALUE: data frame containing point estimates
#   of Xp and F_L in column 1,
#   100gam% lower confidence limits in column 2,
#   100gam% upper confidence limits in column 3:
#      Xp   LX(p,gam)  UX(p,gam)
#      F_L  Lf(L,gam)  Uf(L,gam)
# NOTE: the combined confidence limits are an approximate
#   100*[1 - (1-gam)*2] percent confidence interval
#
# The next line is an example of "error checking"
if( any(x <= 0) ) stop("all data values must be positive")
# calculate mean (yb) and standard deviation (sy) of y = log(x)
yb <- mean(log(x)) ; sy <- sd(log(x)) ; n <- length(x)
xp  <- exp( yb + qnorm(p)*sy )          # point estimate of Xp
lxp <- exp( yb + extol(n,p,1-gam)*sy )  # exact LCL for Xp
uxp <- exp( yb + extol(n,p,gam)*sy )    # exact UCL for Xp
xp <- c(xp,lxp,uxp) ; nx <- paste("X",100*p,sep="")
fl <- efcl(x,gam,L,T) ; nf <- paste("F",L,sep="_")
out <- rbind(xp,fl)
dimnames(out) <- list( c(nx,nf), c("Est","LCL","UCL") )
out
}
The following script demonstrates the use of lnexact:

> # Use data from Hewett and Ganser (HG) 1997 page 135 to
> # illustrate use of lnexact to obtain exact confidence
> # limits for the Xp percentile and the exceedance fraction
> # for complete samples from a lognormal distribution; the
> # next line assigns the data to the vector variable xhg
>
> xhg <- c(4.25,1.38,3.11,2.20,2.82)
>
> # now use function lnexact with xhg as input
>
> lnexact(xhg,p=0.95,gam=0.95,L=5)
         Est    LCL.95   UCL.95
X95 5.145787 3.6328368 15.10336
F_5 5.744611 0.3795139 35.55304
>
> # HG results--- F_5 = 5.7  LCL = 0.38  UCL = 35.55
> #               X95 not calculated
>
> # Next use aiha[,1] to check results in Appendix VII
> # of Mulhausen and Damiano, 1998 (DM)
>
> Ipaiha
[1] "Strategy for Assessing & Managing Occ Exp page 259"
>
> lnexact(aiha$x,p=0.95,gam=0.95)
         Est       LCL       UCL
X95 4.842716 3.9015308  7.045908
F_5 4.241070 0.8570198 15.282684
>
> # DM Appendix VII pages 278-280 results: UTL = 7.11
> # F_5 = 4.4%  95% LCL = 2%  95% UCL = 15%
> # the difference is due to graphical interpolation and
> # rounding of the mean and SD of log(x)
>
> mean( log(aiha[,1]) ) ; sd( log(aiha[,1]) )
[1] 0.9079064   # rounded to 0.91
[1] 0.4070693   # rounded to 0.41
>
> # DM Appendix VII page 277: mean = 0.91  SD = 0.41
>
The results from lnexact will agree with those from Odeh and Owen (1980) for any reasonable values of p, gam, and L. Note that if the value of L is changed to the 95% UCL for Xp:
> # change L to 7.046, the 95% UCL for X95
>
> lnexact(aiha[,1],p=0.95, gam=0.95, L=7.046)
              Est     LCL.95   UCL.95
X95     4.8427163 3.90153084 7.045908
F_7.046 0.5143435 0.02923116 4.999718
# and the 95% UCL for F is 5% as expected

This shows the equivalence of the relationship between the upper tolerance limit and the exceedance fraction; i.e., with γ = 0.95, p = 0.95, L = 7.046, and F0 = 5%, the upper tolerance limit is UX(0.95,0.95) = 7.04559 and the upper confidence limit for the exceedance fraction is Uf(7.046,0.95) = 4.99972:
H0: Xp ≥ L       reject if UX(p, γ) < L     REJECT H0
H0: FL ≥ 1 − p   reject if Uf(L, γ) < F0    REJECT H0
Likewise, if the value p = 1 − Uf(5,0.95)/100 = 1 − 0.1528 = 0.8472 is used:

> # change p to 1 - 0.1528 = 0.8472
>
> lnexact(aiha[,1],p=0.8472, gam=0.95, L=5)
           Est    LCL.95    UCL.95
X84.72 3.76199 3.1313547  5.000301
F_5    4.24107 0.8570198 15.282684
# and the 95% UCL for Xp is 5 as expected

This is equivalent to γ = 0.95, p = 0.8472, L = 5, and F0 = 15.28%; the upper tolerance limit is UX(0.8472,0.95) = 4.999157, and the upper confidence limit for the exceedance fraction is Uf(5,0.95) = 15.282684:
H0: Xp ≥ L       reject if UX(p, γ) < L     REJECT H0
H0: FL ≥ 1 − p   reject if Uf(L, γ) < F0    REJECT H0
Exhibit 3 in the Appendix: oedmain.R
Exhibit 4 in the Appendix: oedutil.R
REFERENCES
Aitchison, J. and
J.A.C. Brown, 1969, The Lognormal Distribution, Cambridge, U.K., Cambridge
University Press.
Akritas, M.G., T. F. Ruscitti, and G.S. Patil, 1994. "Statistical Analysis of Censored Environmental Data," Handbook of Statistics, Vol. 12, G.P. Patil and C.R. Rao (eds), 221-242, Elsevier Science, New York.
American Conference
of Governmental Industrial Hygienists (ACGIH), Notice of Intended Change In:
2004 TLVs® and BEIs®, p. 60, ACGIH, Cincinnati, OH.
Armstrong, B. G., 1992. “Confidence Intervals for Arithmetic
Means of Lognormally Distributed Exposures,” American Industrial Hygiene Association
Journal, 53(8), 481-485.
Chambers, J. M., W. S.
Cleveland, B. Kleiner and P. A. Tukey, 1983. Graphical Methods for Data Analysis,
Duxbury Press, Boston.
Clopper, C. J. and E. S. Pearson, 1934. "The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial," Biometrika, 26, 404-413.
Cohen, A. C., 1991. Truncated and Censored Samples. Marcel Dekker, Inc., New York.
Crow, E. L. and K. Shimizu, 1988. Lognormal Distributions: Theory and Applications, Marcel Dekker, New York.
Cox, D. R. and D. V. Hinkley, 1979. Theoretical Statistics. Chapman &
Hall, New York.
Cox, D.R. and
D. Oakes, 1984. Analysis of
Survival Data. Chapman and Hall, New York.
Department
of Energy, 10 CFR Part 850, Chronic Beryllium Disease Prevention Program,
Federal Register, Vol. 64, No. 235, 68854-68914, December
1999.
Fowlkes,
E. B. 1979. “Some Methods for Studying the Mixture of Two Normal (Lognormal)
Distributions,” Journal of the American Statistical Association, 74,
561-575.
Hahn,
G.J. and W.Q. Meeker, 1991. Statistical Intervals. John Wiley and Sons, New York.
Hewett, P. and G. H. Ganser,
1997. “Simple Procedures for
Calculating Confidence Intervals Around the Sample Mean and Exceedance Fraction
Derived from Lognormally Distributed Data,” Applied Occupational and Environmental
Hygiene, 12(2), 132-147.
Helsel, D., 1990. "Less Than Obvious: Statistical Treatment of Data Below the Detection Limit," Environmental Science and Technology, 24(12), 1767-1774.
Helsel, D.R. and T.A. Cohn, 1988. "Estimation of Descriptive Statistics for Multiply Censored Water Quality Data," Water Resources Research, 24, 1997-2004.
Johnson, N. L. and B. L. Welch,
1940. “Application of the Non-Central t-Distribution,” Biometrika, 31(3/4),
362-389.
Kalbfleisch, J.D. and R.L. Prentice, 1980. The Statistical Analysis of Failure Time Data. John Wiley and Sons, New York.
Kaplan, E. L. and P. Meier, 1958. "Nonparametric Estimation from Incomplete Observations," Journal of the American Statistical Association, 53, 457-481.
Land, C. E., 1972. “An Evaluation of Approximate Confidence
Interval Estimation Methods for Lognormal Means,” Technometrics, 14(1),
145-158.
Meeker, W.Q and L.A. Escobar,
1998. Statistical Methods for Reliability Data. John Wiley and Sons, New
York.
Moulton, L.H. and N.A. Halsey,
1995. “A Mixture Model with
Detection Limits for Regression Analysis of Antibody Response on Vaccine,”
Biometrics, 51, 1570-1578.
Mulhausen, J. R. and J. Damiano, 1998. A Strategy for Assessing and Managing Occupational Exposures, Second Edition, AIHA Press, Fairfax, VA.
Newman, M.C., P.M. Dixon, B.B. Looney, and J.E. Pinder, 1989. "Estimating Mean and Variance for Environmental Samples with Below Detection Limit Observations," Water Resources Bulletin, 25, 905-916.
Odeh, R. E. and D. B. Owen, 1980. Tables for Normal Tolerance Limits, Sampling Plans, and Screening, Marcel Dekker, New York.
R Development Core Team (2004). R: A language
and environment for statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-00-3, URL http://www.R-project.org.
Schmee,
J., D. Gladstein and W. Nelson, 1985.
“Confidence Limits for
Parameters of a Normal Distribution From Singly Censored Samples,
Using
Maximum Likelihood.” Technometrics,
27, 119-128.
Schmoyer, R. L., J. J.
Beauchamp, C. C. Brandt and F. O. Hoffman, Jr., 1996. “Difficulties with the Lognormal Model
in Mean Estimation and Testing,” Environmental and Ecological Statistics,
3, 81-97.
Sommerville, P. N., 1958. “Tables for Obtaining Non-Parametric
Confidence Limits,” Annals of
Mathematical Statistics, 29, 599-601.
Taylor, D. J., L. L. Kupper, S.
M. Rappaport, and R. H. Lyles, 2001. “A Mixture Model for Occupational Exposure
Mean Testing with a Limit of Detection”,
Biometrics, 57, 681-688.
Tuggle, R. M., 1982. “Assessment of Occupational Exposure
Using One-Sided Tolerance Limits,” American Industrial Hygiene Association
Journal, 43, 338-346.
Turnbull, B. W., 1976. “The
Empirical Distribution Function with Arbitrarily Grouped, Censored and Truncated
Data,” Journal of the Royal Statistical
Society, Series B (Methodological), 38(3), 290-295.
Venables, W. N. and B. D.
Ripley, 2002. “Modern Applied Statistics with S,”
4th edition.
Springer-Verlag, New York.
Verrill, S. and R. A. Johnson, 1988. "Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality," Journal of the American Statistical Association, 83(404), 1192-1197.
Waller, L. A., and B. W.
Turnbull, 1992. “Probability Plotting with Censored Data,”
The American Statistician, 46(1), 5-12.