10  Resampling Methods

10.1 Introduction

So far in this textbook we have discussed two approaches to inference: exact and asymptotic. Both have their strengths and weaknesses. Exact theory provides a useful benchmark but is based on the unrealistic and stringent assumption of the homoskedastic normal regression model. Asymptotic theory provides a more flexible distribution theory but is an approximation with uncertain accuracy.

In this chapter we introduce a set of alternative inference methods which are based around the concept of resampling - which means using sampling information extracted from the empirical distribution of the data. These are powerful methods, widely applicable, and often more accurate than exact methods and asymptotic approximations. Two disadvantages, however, are (1) resampling methods typically require more computation power; and (2) the theory is considerably more challenging. A consequence of the computation requirement is that most empirical researchers use asymptotic approximations for routine calculations while resampling approximations are used for final reporting.

We will discuss two categories of resampling methods used in statistical and econometric practice: jackknife and bootstrap. Most of our attention will be given to the bootstrap as it is the most commonly used resampling method in econometric practice.

The jackknife is the distribution obtained from the n leave-one-out estimators (see Section 3.20). The jackknife is most commonly used for variance estimation.

The bootstrap is the distribution obtained by estimation on samples created by i.i.d. sampling with replacement from the dataset. (There are other variants of bootstrap sampling, including parametric sampling and residual sampling.) The bootstrap is commonly used for variance estimation, confidence interval construction, and hypothesis testing.

There is a third category of resampling methods known as sub-sampling which we will not cover in this textbook. Sub-sampling is the distribution obtained by estimation on sub-samples (sampling without replacement) of the dataset. Sub-sampling can be used for most of the same purposes as the bootstrap. See the excellent monograph by Politis, Romano and Wolf (1999).

10.2 Example

To motivate our discussion we focus on the application presented in Section 3.7, which is a bivariate regression applied to the CPS subsample of married Black female wage earners with 12 years potential work experience and displayed in Table 3.1. The regression equation is

$$\log(wage) = \beta_1\, education + \beta_2 + e.$$

The estimates as reported in (4.44) are

$$\widehat{\log(wage)} = \underset{(0.031)}{0.155}\, education + \underset{(0.493)}{0.698} + \hat{e}, \qquad \hat\sigma^2 = \underset{(0.043)}{0.144}, \qquad n = 20.$$

We focus on four estimates constructed from this regression. The first two are the coefficient estimates $\hat\beta_1$ and $\hat\beta_2$. The third is the variance estimate $\hat\sigma^2$. The fourth is an estimate of the expected level of wages for an individual with 16 years of education (a college graduate), which turns out to be a nonlinear function of the parameters. Under the simplifying assumption that the error $e$ is independent of the level of education and normally distributed we find that the expected level of wages is

$$\mu = E\left[wage \mid education = 16\right] = E\left[\exp\left(16\beta_1 + \beta_2 + e\right)\right] = \exp\left(16\beta_1 + \beta_2\right) E\left[\exp(e)\right] = \exp\left(16\beta_1 + \beta_2 + \sigma^2/2\right).$$

The final equality uses $E[\exp(e)] = \exp(\sigma^2/2)$, which can be obtained from the normal moment generating function. The parameter $\mu$ is a nonlinear function of the coefficients. The natural estimator of $\mu$ replaces the unknowns by the point estimators. Thus

$$\hat\mu = \exp\left(16\hat\beta_1 + \hat\beta_2 + \hat\sigma^2/2\right) = 25.80.$$

The standard error for μ^ can be found by extending Exercise 7.8 to find the joint asymptotic distribution of σ^2 and the slope estimates, and then applying the delta method.

We are interested in calculating standard errors and confidence intervals for the four estimates described above.

10.3 Jackknife Estimation of Variance

The jackknife estimates moments of estimators using the distribution of the leave-one-out estimators. The jackknife estimators of bias and variance were introduced by Quenouille (1949) and Tukey (1958), respectively. The idea was expanded further in the monographs of Efron (1982) and Shao and Tu (1995).

Let $\hat\theta$ be any estimator of a vector-valued parameter $\theta$ which is a function of a random sample of size $n$. Let $V_{\hat\theta} = \operatorname{var}[\hat\theta]$ be the variance of $\hat\theta$. Define the leave-one-out estimators $\hat\theta_{(i)}$, which are computed using the same formula as $\hat\theta$ except that observation $i$ is deleted. Tukey's jackknife estimator for $V_{\hat\theta}$ is defined as a scale of the sample variance of the leave-one-out estimators:

$$\hat{V}_{\hat\theta}^{\mathrm{jack}} = \frac{n-1}{n}\sum_{i=1}^n \left(\hat\theta_{(i)} - \bar\theta\right)\left(\hat\theta_{(i)} - \bar\theta\right)' \tag{10.1}$$

where $\bar\theta$ is the sample mean of the leave-one-out estimators, $\bar\theta = n^{-1}\sum_{i=1}^n \hat\theta_{(i)}$. For scalar estimators $\hat\theta$ the jackknife standard error is the square root of (10.1): $s_{\hat\theta}^{\mathrm{jack}} = \sqrt{\hat{V}_{\hat\theta}^{\mathrm{jack}}}$.

A convenient feature of the jackknife estimator $\hat{V}_{\hat\theta}^{\mathrm{jack}}$ is that the formula (10.1) is quite general and does not require any technical (exact or asymptotic) calculations. A downside is that it can require $n$ separate estimations, which in some cases can be computationally costly.
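To make the calculation concrete, here is a minimal sketch of formula (10.1) in Python (not part of the original text); the estimator function theta_fn and the data array data are hypothetical placeholders for any estimator computed from the sample.

import numpy as np

def jackknife_variance(data, theta_fn):
    """Jackknife variance estimate (10.1): scaled sample variance of leave-one-out estimates."""
    n = data.shape[0]
    # Re-estimate with observation i deleted, for each i
    theta_loo = np.array([theta_fn(np.delete(data, i, axis=0)) for i in range(n)])
    dev = theta_loo - theta_loo.mean(axis=0)
    return (n - 1) / n * dev.T @ dev

# Example check: for the sample mean this reproduces the usual variance estimate of the mean.
# rng = np.random.default_rng(0); data = rng.normal(size=(20, 1))
# V = jackknife_variance(data, lambda d: d.mean(axis=0))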

In most cases V^θ^jack  will be similar to a robust asymptotic covariance matrix estimator. The main attractions of the jackknife estimator are that it can be used when an explicit asymptotic variance formula is not available and that it can be used as a check on the reliability of an asymptotic formula.

The formula (10.1) is not immediately intuitive so may benefit from some motivation. We start by examining the sample mean $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ for $Y \in \mathbb{R}^m$. The leave-one-out estimator is

$$\bar{Y}_{(i)} = \frac{1}{n-1}\sum_{j \neq i} Y_j = \frac{n}{n-1}\bar{Y} - \frac{1}{n-1}Y_i. \tag{10.2}$$

The sample mean of the leave-one-out estimators is

$$\frac{1}{n}\sum_{i=1}^n \bar{Y}_{(i)} = \frac{n}{n-1}\bar{Y} - \frac{1}{n-1}\bar{Y} = \bar{Y}.$$

The difference is

$$\bar{Y}_{(i)} - \bar{Y} = \frac{1}{n-1}\left(\bar{Y} - Y_i\right).$$

The jackknife estimate of variance (10.1) is then

$$\hat{V}_{\bar{Y}}^{\mathrm{jack}} = \frac{n-1}{n}\sum_{i=1}^n \left(\frac{1}{n-1}\right)^2 \left(\bar{Y} - Y_i\right)\left(\bar{Y} - Y_i\right)' = \frac{1}{n(n-1)}\sum_{i=1}^n \left(Y_i - \bar{Y}\right)\left(Y_i - \bar{Y}\right)'. \tag{10.3}$$

This is identical to the conventional estimator for the variance of $\bar{Y}$. Indeed, Tukey proposed the $(n-1)/n$ scaling in (10.1) so that $\hat{V}_{\bar{Y}}^{\mathrm{jack}}$ precisely equals the conventional estimator.

We next examine the case of the least squares regression coefficient estimator. Recall from (3.43) that the leave-one-out OLS estimator equals

$$\hat\beta_{(i)} = \hat\beta - \left(X'X\right)^{-1} X_i \tilde{e}_i \tag{10.4}$$

where $\tilde{e}_i = (1 - h_{ii})^{-1}\hat{e}_i$ and $h_{ii} = X_i'(X'X)^{-1}X_i$. The sample mean of the leave-one-out estimators is $\bar\beta = \hat\beta - (X'X)^{-1}\tilde\mu$ where $\tilde\mu = n^{-1}\sum_{i=1}^n X_i\tilde{e}_i$. Thus $\hat\beta_{(i)} - \bar\beta = -(X'X)^{-1}\left(X_i\tilde{e}_i - \tilde\mu\right)$. The jackknife estimate of variance for $\hat\beta$ is

$$\hat{V}_{\hat\beta}^{\mathrm{jack}} = \frac{n-1}{n}\sum_{i=1}^n \left(\hat\beta_{(i)} - \bar\beta\right)\left(\hat\beta_{(i)} - \bar\beta\right)' = \frac{n-1}{n}\left(X'X\right)^{-1}\left(\sum_{i=1}^n X_i X_i'\tilde{e}_i^2 - n\tilde\mu\tilde\mu'\right)\left(X'X\right)^{-1} = \frac{n-1}{n}\hat{V}_{\hat\beta}^{\mathrm{HC3}} - (n-1)\left(X'X\right)^{-1}\tilde\mu\tilde\mu'\left(X'X\right)^{-1} \tag{10.5}$$

where $\hat{V}_{\hat\beta}^{\mathrm{HC3}}$ is the HC3 covariance estimator (4.39) based on prediction errors. The second term in (10.5) is typically quite small since $\tilde\mu$ is typically small in magnitude. Thus $\hat{V}_{\hat\beta}^{\mathrm{jack}} \approx \hat{V}_{\hat\beta}^{\mathrm{HC3}}$. Indeed, the HC3 estimator was originally motivated as a simplification of the jackknife estimator. This shows that for regression coefficients the jackknife estimator of variance is similar to a conventional robust estimator. This is accomplished without the user "knowing" the form of the asymptotic covariance matrix. This is further confirmation that the jackknife is making a reasonable calculation.

Third, we examine the jackknife estimator for a function $\hat\theta = r(\hat\beta)$ of a least squares estimator. The leave-one-out estimator of $\theta$ is

$$\hat\theta_{(i)} = r\left(\hat\beta_{(i)}\right) = r\left(\hat\beta - (X'X)^{-1}X_i\tilde{e}_i\right) \simeq \hat\theta - \hat{R}'(X'X)^{-1}X_i\tilde{e}_i.$$

The second equality is (10.4). The final approximation is obtained by a mean-value expansion, using $r(\hat\beta) = \hat\theta$ and setting $\hat{R} = \left(\partial/\partial\beta\right) r(\hat\beta)'$. This approximation holds in large samples because $\hat\beta_{(i)}$ are uniformly consistent for $\beta$. The jackknife variance estimator for $\hat\theta$ thus equals

$$\hat{V}_{\hat\theta}^{\mathrm{jack}} = \frac{n-1}{n}\sum_{i=1}^n \left(\hat\theta_{(i)} - \bar\theta\right)\left(\hat\theta_{(i)} - \bar\theta\right)' \simeq \frac{n-1}{n}\hat{R}'\left(X'X\right)^{-1}\left(\sum_{i=1}^n X_i X_i'\tilde{e}_i^2 - n\tilde\mu\tilde\mu'\right)\left(X'X\right)^{-1}\hat{R} = \hat{R}'\hat{V}_{\hat\beta}^{\mathrm{jack}}\hat{R} \approx \hat{R}'\hat{V}_{\hat\beta}^{\mathrm{HC3}}\hat{R}.$$

The final line equals a delta-method estimator for the variance of $\hat\theta$ constructed with the covariance estimator (4.39). This shows that the jackknife estimator of variance for $\hat\theta$ is approximately an asymptotic delta-method estimator. While this is an asymptotic approximation, it again shows that the jackknife produces an estimator which is asymptotically similar to one produced by asymptotic methods. This is despite the fact that the jackknife estimator is calculated without reference to asymptotic theory and does not require calculation of the derivatives of $r(\beta)$.

This argument extends directly to any "smooth function" estimator. Most of the estimators discussed so far in this textbook take the form $\hat\theta = g(\bar{W})$ where $\bar{W} = n^{-1}\sum_{i=1}^n W_i$ and $W_i$ is some vector-valued function of the data. For any such estimator $\hat\theta$ the leave-one-out estimator equals $\hat\theta_{(i)} = g(\bar{W}_{(i)})$ and its jackknife estimator of variance is (10.1). Using (10.2) and a mean-value expansion we have the large-sample approximation

$$\hat\theta_{(i)} = g\left(\bar{W}_{(i)}\right) = g\left(\frac{n}{n-1}\bar{W} - \frac{1}{n-1}W_i\right) \simeq g\left(\bar{W}\right) - \frac{1}{n-1}G\left(\bar{W}\right)'\left(W_i - \bar{W}\right)$$

where $G(x) = \frac{\partial}{\partial x}g(x)'$. Thus

$$\hat\theta_{(i)} - \bar\theta \simeq -\frac{1}{n-1}G\left(\bar{W}\right)'\left(W_i - \bar{W}\right)$$

and the jackknife estimator of the variance of $\hat\theta$ approximately equals

$$\hat{V}_{\hat\theta}^{\mathrm{jack}} = \frac{n-1}{n}\sum_{i=1}^n \left(\hat\theta_{(i)} - \bar\theta\right)\left(\hat\theta_{(i)} - \bar\theta\right)' \simeq \frac{n-1}{n}G\left(\bar{W}\right)'\left(\frac{1}{(n-1)^2}\sum_{i=1}^n \left(W_i - \bar{W}\right)\left(W_i - \bar{W}\right)'\right)G\left(\bar{W}\right) = G\left(\bar{W}\right)'\hat{V}_{\bar{W}}^{\mathrm{jack}}G\left(\bar{W}\right)$$

where $\hat{V}_{\bar{W}}^{\mathrm{jack}}$, as defined in (10.3), is the conventional (and jackknife) estimator for the variance of $\bar{W}$. Thus $\hat{V}_{\hat\theta}^{\mathrm{jack}}$ is approximately the delta-method estimator. Once again, we see that the jackknife estimator automatically calculates what is effectively the delta-method variance estimator, but without requiring the user to explicitly calculate the derivative of $g(x)$.

10.4 Example

We illustrate by reporting the asymptotic and jackknife standard errors for the four parameter estimates given earlier. In Table 10.1 we report the actual values of the leave-one-out estimates for each of the twenty observations in the sample. The jackknife standard errors are calculated as the scaled square roots of the sample variances of these leave-one-out estimates and are reported in the second-to-last row. For comparison the asymptotic standard errors are reported in the final row.

For all estimates the jackknife and asymptotic standard errors are quite similar. This reinforces the credibility of both standard error estimates. The largest differences arise for $\hat\beta_2$ and $\hat\mu$, whose jackknife standard errors are about 5% larger than the asymptotic standard errors.

The take-away from our presentation is that the jackknife is a simple and flexible method for variance and standard error calculation. Circumventing technical asymptotic and exact calculations, the jackknife produces estimates which in many cases are similar to asymptotic delta-method counterparts. The jackknife is especially appealing in cases where asymptotic standard errors are not available or are difficult to calculate. Jackknife standard errors can also be used as a double-check on the reasonableness of asymptotic delta-method calculations.

In Stata, jackknife standard errors for coefficient estimates in many models are obtained by the vce(jackknife) option. For nonlinear functions of the coefficients or other estimators the jackknife command can be combined with any other command to obtain jackknife standard errors.

To illustrate, below we list the Stata commands which calculate the jackknife standard errors listed above. The first line is least squares estimation with standard errors calculated by the jackknife. The second line calculates the error variance estimate $\hat\sigma^2$ with a jackknife standard error. The third line does the same for the estimate $\hat\mu$.
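(The exact command lines were lost in this copy; the following reconstruction follows the same syntax pattern as the bootstrap commands listed in Section 10.7 and uses the same variable names.)

Stata Commands

reg wage education if mbf12==1, vce(jackknife)

jackknife (e(rss)/e(N)): reg wage education if mbf12==1

jackknife (exp(16*_b[education]+_b[_cons]+e(rss)/e(N)/2)): reg wage education if mbf12==1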

Table 10.1: Leave-one-out Estimators and Jackknife Standard Errors

Observation β^1(i) β^2(i) σ^2(i) μ^(i)
1 0.150 0.764 0.150 25.63
2 0.148 0.798 0.149 25.48
3 0.153 0.739 0.151 25.97
4 0.156 0.695 0.144 26.31
5 0.154 0.701 0.146 25.38
6 0.158 0.655 0.151 26.05
7 0.152 0.705 0.114 24.32
8 0.146 0.822 0.147 25.37
9 0.162 0.588 0.151 25.75
10 0.157 0.693 0.139 26.40
11 0.168 0.510 0.141 26.40
12 0.158 0.691 0.118 26.48
13 0.139 0.974 0.141 26.56
14 0.169 0.451 0.131 26.26
15 0.146 0.852 0.150 24.93
16 0.156 0.696 0.148 26.06
17 0.165 0.513 0.140 25.22
18 0.155 0.698 0.151 25.90
19 0.152 0.742 0.151 25.73
20 0.155 0.697 0.151 25.95
s_jack 0.032 0.514 0.046 2.39
s_asy 0.031 0.493 0.043 2.29

10.5 Jackknife for Clustered Observations

In Section 4.21 we introduced the clustered regression model, cluster-robust variance estimators, and cluster-robust standard errors. Jackknife variance estimation can also be used for clustered samples but with some natural modifications. Recall that the least squares estimator in the clustered sample context can be written as

$$\hat\beta = \left(\sum_{g=1}^G X_g'X_g\right)^{-1}\left(\sum_{g=1}^G X_g'Y_g\right)$$

where $g = 1, \ldots, G$ indexes the cluster. Instead of leave-one-out estimators, it is natural to use delete-cluster estimators, which delete one cluster at a time. They take the form (4.58):

$$\hat\beta_{(g)} = \hat\beta - \left(X'X\right)^{-1}X_g'\tilde{e}_g$$

where

$$\tilde{e}_g = \left(I_{n_g} - X_g\left(X'X\right)^{-1}X_g'\right)^{-1}\hat{e}_g, \qquad \hat{e}_g = Y_g - X_g\hat\beta.$$

The delete-cluster jackknife estimator of the variance of $\hat\beta$ is

$$\hat{V}_{\hat\beta}^{\mathrm{jack}} = \frac{G-1}{G}\sum_{g=1}^G \left(\hat\beta_{(g)} - \bar\beta\right)\left(\hat\beta_{(g)} - \bar\beta\right)', \qquad \bar\beta = \frac{1}{G}\sum_{g=1}^G \hat\beta_{(g)}.$$

We call V^β^jack  a cluster-robust jackknife estimator of variance.

Using the same approximations as the previous section we can show that the delete-cluster jackknife estimator is asymptotically equivalent to the cluster-robust covariance matrix estimator (4.59) calculated with the delete-cluster prediction errors. This verifies that the delete-cluster jackknife is the appropriate jackknife approach for clustered dependence.
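A minimal sketch of the delete-cluster jackknife in Python (not part of the text; the arrays X, y, and the cluster identifier array g are hypothetical) re-estimates the regression with one cluster deleted at a time:

import numpy as np

def cluster_jackknife(X, y, g):
    """Delete-cluster jackknife variance for OLS; g holds cluster identifiers."""
    clusters = np.unique(g)
    G = len(clusters)
    # Delete one cluster at a time and re-estimate the coefficients
    beta_del = np.array([np.linalg.lstsq(X[g != c], y[g != c], rcond=None)[0]
                         for c in clusters])
    dev = beta_del - beta_del.mean(axis=0)
    return (G - 1) / G * dev.T @ dev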

For parameters which are functions θ^=r(β^) of the least squares estimator, the delete-cluster jackknife estimator of the variance of θ^ is

$$\hat{V}_{\hat\theta}^{\mathrm{jack}} = \frac{G-1}{G}\sum_{g=1}^G \left(\hat\theta_{(g)} - \bar\theta\right)\left(\hat\theta_{(g)} - \bar\theta\right)', \qquad \hat\theta_{(g)} = r\left(\hat\beta_{(g)}\right), \qquad \bar\theta = \frac{1}{G}\sum_{g=1}^G \hat\theta_{(g)}.$$

Using a mean-value expansion we can show that this estimator is asymptotically equivalent to the delta-method cluster-robust covariance matrix estimator for $\hat\theta$. This shows that the jackknife estimator is appropriate for covariance matrix estimation.

As in the context of i.i.d. samples, one advantage of the jackknife covariance matrix estimators is that they do not require the user to make a technical calculation of the asymptotic distribution. A downside is an increase in computation cost, as $G$ separate regressions are effectively estimated.

In Stata, jackknife standard errors for coefficient estimates with clustered observations are obtained by using the options cluster(id) vce(jackknife) where id denotes the cluster variable.

10.6 The Bootstrap Algorithm

The bootstrap is a powerful approach to inference and is due to the pioneering work of Efron (1979). There are many textbook and monograph treatments of the bootstrap, including Efron (1982), Hall (1992), Efron and Tibshirani (1993), Shao and Tu (1995), and Davison and Hinkley (1997). Reviews for econometricians are provided by Hall (1994) and Horowitz (2001).

There are several ways to describe or define the bootstrap and there are several forms of the bootstrap. We start in this section by describing the basic nonparametric bootstrap algorithm. In subsequent sections we give more formal definitions of the bootstrap as well as theoretical justifications.

Briefly, the bootstrap distribution is obtained by estimation on independent samples created by i.i.d. sampling (sampling with replacement) from the original dataset.

To understand this it is useful to start with the concept of sampling with replacement from the dataset. To continue the empirical example used earlier in the chapter we focus on the dataset displayed in Table 3.1, which has $n = 20$ observations. Sampling from this distribution means randomly selecting one row from this table. Mathematically this is the same as randomly selecting an integer from the set $\{1, 2, \ldots, 20\}$. To illustrate, MATLAB has a random integer generator (the function randi). Using the random number seed of 13 (an arbitrary choice) we obtain the random draw 16. This means that we draw observation number 16 from Table 3.1. Examining the table we can see that this is an individual with wage $18.75 and education of 16 years. We repeat by drawing another random integer on the set $\{1, 2, \ldots, 20\}$ and this time obtain 5. This means we take observation 5 from Table 3.1, which is an individual with wage $33.17 and education of 16 years. We continue until we have $n = 20$ such draws. This random set of observations is $\{16, 5, 17, 20, 20, 10, 13, 16, 13, 15, 1, 6, 2, 18, 8, 14, 6, 7, 1, 8\}$. We call this the bootstrap sample.

Notice that the observations 1, 6, 8, 13, 16, and 20 each appear twice in the bootstrap sample, and the observations 3, 4, 9, 11, 12, and 19 do not appear at all. That is okay. In fact, it is necessary for the bootstrap to work. This is because we are drawing with replacement. (If we instead made draws without replacement then the constructed dataset would have exactly the same observations as in Table 3.1, only in different order.) We can also ask the question "What is the probability that an individual observation will appear at least once in the bootstrap sample?" The answer is

$$P\left[\text{Observation in Bootstrap Sample}\right] = 1 - \left(1 - \frac{1}{n}\right)^n \to 1 - e^{-1} \simeq 0.632. \tag{10.6}$$

The limit holds as $n \to \infty$. The approximation 0.632 is excellent even for small $n$. For example, when $n = 20$ the probability (10.6) is 0.641. These calculations show that an individual observation is in the bootstrap sample with probability near 2/3.

Once again, the bootstrap sample is the constructed dataset with the 20 observations drawn randomly from the original sample. Notationally, we write the $i^{th}$ bootstrap observation as $(Y_i^*, X_i^*)$ and the bootstrap sample as $\{(Y_1^*, X_1^*), \ldots, (Y_n^*, X_n^*)\}$. In our present example with $Y$ denoting the log wage the bootstrap sample is

$$\{(Y_1^*, X_1^*), \ldots, (Y_n^*, X_n^*)\} = \{(2.93, 16), (3.50, 16), \ldots, (3.76, 18)\}.$$

The bootstrap estimate $\hat\beta^*$ is obtained by applying the least squares estimation formula to the bootstrap sample. Thus we regress $Y^*$ on $X^*$. The other bootstrap estimates, in our example $\hat\sigma^{2*}$ and $\hat\mu^*$, are obtained by applying their estimation formulae to the bootstrap sample as well. Writing $\hat\theta^* = (\hat\beta_1^*, \hat\beta_2^*, \hat\sigma^{2*}, \hat\mu^*)$ we have the bootstrap estimate of the parameter vector $\theta = (\beta_1, \beta_2, \sigma^2, \mu)$. In our example (the bootstrap sample described above) $\hat\theta^* = (0.195, 0.113, 0.107, 26.7)$. This is one draw from the bootstrap distribution of the estimates.

The estimate $\hat\theta^*$ as described is one random draw from the distribution of estimates obtained by i.i.d. sampling from the original data. With one draw we can say relatively little. But we can repeat this exercise to obtain multiple draws from this bootstrap distribution. To distinguish between these draws we index the bootstrap samples by $b = 1, \ldots, B$, and write the bootstrap estimates as $\hat\theta_b^*$ or $\hat\theta^*(b)$.

To continue our illustration we draw 20 more random integers $\{19, 5, 7, 19, 1, 2, 13, 18, 1, 15, 17, 2, 14, 11, 10, 20, 1, 5, 15, 7\}$ and construct a second bootstrap sample. On this sample we again estimate the parameters and obtain $\hat\theta^*(2) = (0.175, 0.52, 0.124, 29.3)$. This is a second random draw from the distribution of $\hat\theta^*$. We repeat this $B$ times, storing the parameter estimates $\hat\theta^*(b)$. We have thus created a new dataset of bootstrap draws $\{\hat\theta^*(b) : b = 1, \ldots, B\}$. By construction the draws are independent across $b$ and identically distributed.
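The algorithm just described is a short loop. Here is a minimal sketch in Python (not from the text; the function estimate and the data matrix data are hypothetical stand-ins for the regression and the four statistics of the example):

import numpy as np

def bootstrap_draws(data, estimate, B=10_000, seed=None):
    """Generate B bootstrap draws of the estimator by i.i.d. sampling with replacement."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    draws = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)    # n row indices drawn with replacement
        draws.append(estimate(data[idx]))   # apply the estimation formula to the bootstrap sample
    return np.array(draws)                  # B x k array of bootstrap estimates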

The number of bootstrap draws, $B$, is often called the "number of bootstrap replications". Typical choices for $B$ are 1000, 5000, and 10,000. We discuss selecting $B$ later, but roughly speaking, larger $B$ results in a more precise estimate at an increased computation cost. For our application we set $B = 10{,}000$. To illustrate, Figure 10.1 displays the densities of the distributions of the bootstrap estimates $\hat\beta_1^*$ and $\hat\mu^*$ across the 10,000 draws. The dashed lines show the point estimates. You can notice that the density for $\hat\beta_1^*$ is slightly skewed to the left.

Figure 10.1: Bootstrap Distributions of β^1 and μ^

10.7 Bootstrap Variance and Standard Errors

Given the bootstrap draws we can estimate features of the bootstrap distribution. The bootstrap estimator of variance of an estimator θ^ is the sample variance across the bootstrap draws θ^(b). It equals

$$\hat{V}_{\hat\theta}^{\mathrm{boot}} = \frac{1}{B-1}\sum_{b=1}^B \left(\hat\theta^*(b) - \bar\theta^*\right)\left(\hat\theta^*(b) - \bar\theta^*\right)', \qquad \bar\theta^* = \frac{1}{B}\sum_{b=1}^B \hat\theta^*(b). \tag{10.7}$$

For a scalar estimator $\hat\theta$ the bootstrap standard error is the square root of the bootstrap estimator of variance:

$$s_{\hat\theta}^{\mathrm{boot}} = \sqrt{\hat{V}_{\hat\theta}^{\mathrm{boot}}}.$$

This is a very simple statistic to calculate and is the most common use of the bootstrap in applied econometric practice. A caveat (discussed in more detail in Section 10.15) is that in many cases it is better to use a trimmed estimator.
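Given the B x k array of bootstrap draws (for example from the sketch in Section 10.6), the bootstrap variance and standard errors are simply the sample variance across draws; a minimal Python sketch:

import numpy as np

def bootstrap_variance(draws):
    """Bootstrap variance matrix (10.7) and standard errors from a B x k array of draws."""
    draws = np.asarray(draws, dtype=float)
    dev = draws - draws.mean(axis=0)
    V_boot = dev.T @ dev / (draws.shape[0] - 1)
    return V_boot, np.sqrt(np.diag(V_boot))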

Standard errors are conventionally reported to convey the precision of the estimator. They are also commonly used to construct confidence intervals. Bootstrap standard errors can be used for this purpose. The normal-approximation bootstrap confidence interval is

$$C^{\mathrm{nb}} = \left[\hat\theta - z_{1-\alpha/2}\, s_{\hat\theta}^{\mathrm{boot}},\ \hat\theta + z_{1-\alpha/2}\, s_{\hat\theta}^{\mathrm{boot}}\right]$$

where z1α/2 is the 1α/2 quantile of the N(0,1) distribution. This interval Cnb is identical in format to an asymptotic confidence interval, but with the bootstrap standard error replacing the asymptotic standard error. Cnb is the default confidence interval reported by Stata when the bootstrap has been used to calculate standard errors. However, the normal-approximation interval is in general a poor choice for confidence interval construction as it relies on the normal approximation to the t-ratio which can be inaccurate in finite samples. There are other methods - such as the bias-corrected percentile method to be discussed in Section 10.17 - which are just as simple to compute but have better performance. In general, bootstrap standard errors should be used as estimates of precision rather than as tools to construct confidence intervals.

Since B is finite, all bootstrap statistics, such as V^θ^boot , are estimates and hence random. Their values will vary across different choices for B and simulation runs (depending on how the simulation seed is set). Thus you should not expect to obtain the exact same bootstrap standard errors as other researchers when replicating their results. They should be similar (up to simulation sampling error) but not precisely the same.

In Table 10.2 we report the four parameter estimates introduced in Section 10.2 along with asymptotic, jackknife and bootstrap standard errors. We also report four bootstrap confidence intervals which will be introduced in subsequent sections.

For these four estimators we can see that the bootstrap standard errors are quite similar to the asymptotic and jackknife standard errors. The most noticeable difference arises for $\hat\beta_2$, where the bootstrap standard error is about 10% larger than the asymptotic standard error.

Table 10.2: Comparison of Methods

β^1 β^2 σ^2 μ^
Estimate 0.155 0.698 0.144 25.80
Asymptotic s.e. (0.031) (0.493) (0.043) (2.29)
Jackknife s.e. (0.032) (0.514) (0.046) (2.39)
Bootstrap s.e. (0.034) (0.548) (0.041) (2.38)
95% Percentile Interval [0.08, 0.21] [−0.27, 1.91] [0.06, 0.22] [21.4, 30.7]
95% BC Percentile Interval [0.08, 0.21] [−0.25, 1.93] [0.09, 0.28] [22.0, 31.5]

In Stata, bootstrap standard errors for coefficient estimates in many models are obtained by the vce(bootstrap, reps(#)) option, where # is the number of bootstrap replications. For nonlinear functions of the coefficients or other estimators the bootstrap command can be combined with any other command to obtain bootstrap standard errors. Synonyms for bootstrap are bstrap and bs.

To illustrate, below we list the Stata commands which will calculate the bootstrap standard errors listed above.¹

1 They will not precisely replicate the standard errors since those in Table 10.2 were produced in Matlab which uses a different random number sequence.

Stata Commands

reg wage education if mbf12==1, vce(bootstrap, reps(10000))

bs (e(rss)/e(N)), reps(10000): reg wage education if mbf12==1

bs (exp(16*_b[education]+_b[_cons]+e(rss)/e(N)/2)), reps(10000): ///
    reg wage education if mbf12==1

10.8 Percentile Interval

The second most common use of bootstrap methods is for confidence intervals. There are multiple bootstrap methods to form confidence intervals. A popular and simple method is called the percentile interval. It is based on the quantiles of the bootstrap distribution.

In Section 10.6 we described the bootstrap algorithm which creates an i.i.d. sample of bootstrap estimates $\{\hat\theta_1^*, \hat\theta_2^*, \ldots, \hat\theta_B^*\}$ corresponding to an estimator $\hat\theta$ of a parameter $\theta$. We focus on the case of a scalar parameter $\theta$.

For any $0 < \alpha < 1$ we can calculate the empirical quantile $q_\alpha^*$ of these bootstrap estimates. This is the number such that $B\alpha$ bootstrap estimates are smaller than $q_\alpha^*$, and is typically calculated by taking the $B\alpha^{th}$ order statistic of the $\hat\theta_b^*$. See Section 11.13 of Probability and Statistics for Economists for a precise discussion of empirical quantiles and common quantile estimators.

The percentile bootstrap $100(1-\alpha)\%$ confidence interval is

$$C^{\mathrm{pc}} = \left[q_{\alpha/2}^*,\ q_{1-\alpha/2}^*\right]. \tag{10.8}$$

For example, if $B = 1000$, $\alpha = 0.05$, and the empirical quantile estimator is used, then $C^{\mathrm{pc}} = [\hat\theta_{(25)}^*, \hat\theta_{(975)}^*]$.
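A minimal Python sketch of the percentile interval computed from the bootstrap draws (the quantile routine used here is one of several conventions for the empirical quantile):

import numpy as np

def percentile_interval(draws, alpha=0.05):
    """Percentile interval (10.8): empirical alpha/2 and 1-alpha/2 quantiles of the draws."""
    return np.quantile(draws, [alpha / 2, 1 - alpha / 2], axis=0)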

To illustrate, the 0.025 and 0.975 quantiles of the bootstrap distributions of $\hat\beta_1^*$ and $\hat\mu^*$ are indicated in Figure 10.1 by the arrows. The intervals between the arrows are the 95% percentile intervals.

The percentile interval has the convenience that it does not require calculation of a standard error. This is particularly convenient in contexts where asymptotic standard error calculation is complicated, burdensome, or unknown. Cpc is a simple by-product of the bootstrap algorithm and does not require meaningful computational cost above that required to calculate the bootstrap standard error.

The percentile interval has the useful property that it is transformation-respecting. Take a monotone parameter transformation m(θ). The percentile interval for m(θ) is simply the percentile interval for θ mapped by m(θ). That is, if [qα/2,q1α/2] is the percentile interval for θ, then [m(qα/2),m(q1α/2)] is the percentile interval for m(θ). This property follows directly from the equivariance property of sample quantiles. Many confidence-interval methods, such as the delta-method asymptotic interval and the normal-approximation interval Cnb, do not share this property.

To illustrate the usefulness of the transformation-respecting property consider the variance $\sigma^2$. In some cases it is useful to report the variance $\sigma^2$ and in other cases it is useful to report the standard deviation $\sigma$. Thus we may be interested in confidence intervals for $\sigma^2$ or $\sigma$. To illustrate, the asymptotic 95% normal confidence interval for $\sigma^2$ which we calculate from Table 10.2 is [0.060, 0.228]. Taking square roots we obtain an interval for $\sigma$ of [0.244, 0.477]. Alternatively, the delta method standard error for $\hat\sigma = 0.379$ is 0.057, leading to an asymptotic 95% confidence interval for $\sigma$ of [0.265, 0.493], which is different. This shows that the delta method is not transformation-respecting. In contrast, the 95% percentile interval for $\sigma^2$ is [0.062, 0.220] and that for $\sigma$ is [0.249, 0.469], which is identical to the square roots of the interval for $\sigma^2$.

The bootstrap percentile intervals for the four estimators are reported in Table 10.2. In Stata, percentile confidence intervals can be obtained by using the command estat bootstrap, percentile or the command estat bootstrap, all after an estimation command which calculates standard errors via the bootstrap.

10.9 The Bootstrap Distribution

For applications it is often sufficient if one understands the bootstrap as an algorithm. However, for theory it is more useful to view the bootstrap as a specific estimator of the sampling distribution. For this it is useful to introduce some additional notation.

The key is that the distribution of any estimator or statistic is determined by the distribution of the data. While the latter is unknown it can be estimated by the empirical distribution of the data. This is what the bootstrap does.

To fix notation, let $F$ denote the distribution of an individual observation $W$. (In regression, $W$ is the pair $(Y, X)$.) Let $G_n(u, F)$ denote the distribution of an estimator $\hat\theta$. That is,

$$G_n(u, F) = P\left[\hat\theta \le u \mid F\right].$$

We write the distribution Gn as a function of n and F since the latter (generally) affect the distribution of θ^. We are interested in the distribution Gn. For example, we want to know its variance to calculate a standard error or its quantiles to calculate a percentile interval.

In principle, if we knew the distribution F we should be able to determine the distribution Gn. In practice there are two barriers to implementation. The first barrier is that the calculation of Gn(u,F) is generally infeasible except in certain special cases such as the normal regression model. The second barrier is that in general we do not know F.

The bootstrap simultaneously circumvents these two barriers by two clever ideas. First, the bootstrap proposes estimation of $F$ by the empirical distribution function (EDF) $F_n$, which is the simplest nonparametric estimator of the joint distribution of the observations. The EDF is $F_n(w) = n^{-1}\sum_{i=1}^n \mathbb{1}\{W_i \le w\}$. (See Section 11.2 of Probability and Statistics for Economists for details and properties.) Replacing $F$ with $F_n$ we obtain the idealized bootstrap estimator of the distribution of $\hat\theta$:

$$G_n^*(u) = G_n(u, F_n). \tag{10.9}$$

The bootstrap's second clever idea is to estimate $G_n^*$ by simulation. This is the bootstrap algorithm described in the previous sections. The essential idea is that simulation from $F_n$ is sampling with replacement from the original data, which is computationally simple. Applying the estimation formula for $\hat\theta$ we obtain i.i.d. draws from the distribution $G_n^*(u)$. By making a large number $B$ of such draws we can estimate any feature of $G_n^*$ of interest. The bootstrap combines these two ideas: (1) estimate $G_n(u, F)$ by $G_n(u, F_n)$; (2) estimate $G_n(u, F_n)$ by simulation. These ideas are intertwined. Only by considering these steps together do we obtain a feasible method.

The way to think about the connection between $G_n$ and $G_n^*$ is as follows. $G_n$ is the distribution of the estimator $\hat\theta$ obtained when the observations are sampled i.i.d. from the population distribution $F$. $G_n^*$ is the distribution of the same statistic, denoted $\hat\theta^*$, obtained when the observations are sampled i.i.d. from the empirical distribution $F_n$. It is useful to conceptualize the "universe" which separately generates the dataset and the bootstrap sample. The "sampling universe" is the population distribution $F$. In this universe the true parameter is $\theta$. The "bootstrap universe" is the empirical distribution $F_n$. When drawing from the bootstrap universe we are treating $F_n$ as if it is the true distribution. Thus anything which is true about $F_n$ should be treated as true in the bootstrap universe. In the bootstrap universe the "true" value of the parameter $\theta$ is the value determined by the EDF $F_n$. In most cases this is the estimate $\hat\theta$. It is the true value of the coefficient when the true distribution is $F_n$. We now carefully explain the connection with the bootstrap algorithm as previously described.

First, observe that sampling with replacement from the sample {Y1,,Yn} is identical to sampling from the EDF Fn. This is because the EDF is the probability distribution which puts probability mass 1/n on each observation. Thus sampling from Fn means sampling an observation with probability 1/n, which is sampling with replacement.

Second, observe that the bootstrap estimator $\hat\theta^*$ described here is identical to the bootstrap algorithm described in Section 10.6. That is, $\hat\theta^*$ is the random vector generated by applying the estimator formula $\hat\theta$ to samples obtained by random sampling from $F_n$.

Third, observe that the distribution of these bootstrap estimators is the bootstrap distribution (10.9). This is a precise equality. That is, the bootstrap algorithm generates i.i.d. samples from $F_n$, and when the estimators are applied we obtain random variables $\hat\theta^*$ with the distribution $G_n^*$.

Fourth, observe that the bootstrap statistics described earlier - bootstrap variance, standard error, and quantiles - are estimators of the corresponding features of the bootstrap distribution $G_n^*$.

This discussion is meant to carefully describe why the notation $G_n^*(u)$ is useful to help understand the properties of the bootstrap algorithm. Since $F_n$ is the natural nonparametric estimator of the unknown distribution $F$, $G_n^*(u) = G_n(u, F_n)$ is the natural plug-in estimator of the unknown $G_n(u, F)$. Furthermore, because $F_n$ is uniformly consistent for $F$ by the Glivenko-Cantelli Lemma (Theorem 18.8 in Probability and Statistics for Economists) we also can expect $G_n^*(u)$ to be consistent for $G_n(u)$. Making this precise is a bit challenging since $F_n$ and $G_n^*$ are functions. In the next several sections we develop an asymptotic distribution theory for the bootstrap distribution based on extending asymptotic theory to the case of conditional distributions.

10.10 The Distribution of the Bootstrap Observations

Let $Y^*$ be a random draw from the sample $\{Y_1, \ldots, Y_n\}$. What is the distribution of $Y^*$?

Since we are fixing the observations, the correct question is: What is the conditional distribution of $Y^*$, conditional on the observed data? The empirical distribution function $F_n$ summarizes the information in the sample, so equivalently we are talking about the distribution conditional on $F_n$. Consequently we will write the bootstrap probability function and expectation as

$$P^*\left[Y^* \le x\right] = P\left[Y^* \le x \mid F_n\right], \qquad E^*\left[Y^*\right] = E\left[Y^* \mid F_n\right].$$

Notationally, the starred distribution and expectation are conditional given the data.

The (conditional) distribution of $Y^*$ is the empirical distribution function $F_n$, which is a discrete distribution with mass points $1/n$ on each observation $Y_i$. Thus even if the original data come from a continuous distribution, the bootstrap data distribution is discrete.

The (conditional) mean and variance of $Y^*$ are calculated from the EDF, and equal the sample mean and variance of the data. The mean is

$$E^*\left[Y^*\right] = \sum_{i=1}^n Y_i\, P^*\left[Y^* = Y_i\right] = \sum_{i=1}^n Y_i \frac{1}{n} = \bar{Y} \tag{10.10}$$

and the variance is

$$\operatorname{var}^*\left[Y^*\right] = E^*\left[Y^* Y^{*\prime}\right] - \left(E^*\left[Y^*\right]\right)\left(E^*\left[Y^*\right]\right)' = \sum_{i=1}^n Y_i Y_i' \frac{1}{n} - \bar{Y}\bar{Y}' = \hat\Sigma. \tag{10.11}$$

To summarize, the conditional distribution of $Y^*$, given $F_n$, is the discrete distribution on $\{Y_1, \ldots, Y_n\}$ with mean $\bar{Y}$ and covariance matrix $\hat\Sigma$.

We can extend this analysis to any integer moment $r$. Assume $Y$ is scalar. The $r^{th}$ moment of $Y^*$ is

$$E^*\left[Y^{*r}\right] = \sum_{i=1}^n Y_i^r\, P^*\left[Y^* = Y_i\right] = \frac{1}{n}\sum_{i=1}^n Y_i^r,$$

the $r^{th}$ sample moment. The $r^{th}$ central moment of $Y^*$ is

$$E^*\left[\left(Y^* - \bar{Y}\right)^r\right] = \frac{1}{n}\sum_{i=1}^n \left(Y_i - \bar{Y}\right)^r = \hat\mu_r,$$

the $r^{th}$ central sample moment. Similarly, the $r^{th}$ cumulant of $Y^*$ is $\kappa_r^* = \hat\kappa_r$, the $r^{th}$ sample cumulant.

10.11 The Distribution of the Bootstrap Sample Mean

The bootstrap sample mean is

$$\bar{Y}^* = \frac{1}{n}\sum_{i=1}^n Y_i^*.$$

We can calculate its (conditional) mean and variance. The mean is

$$E^*\left[\bar{Y}^*\right] = E^*\left[\frac{1}{n}\sum_{i=1}^n Y_i^*\right] = \frac{1}{n}\sum_{i=1}^n E^*\left[Y_i^*\right] = \frac{1}{n}\sum_{i=1}^n \bar{Y} = \bar{Y}$$

using (10.10). Thus the bootstrap sample mean $\bar{Y}^*$ has a distribution centered at the sample mean $\bar{Y}$. This is because the bootstrap observations $Y_i^*$ are drawn from the bootstrap universe, which treats the EDF as the truth, and the mean of the latter distribution is $\bar{Y}$.

The (conditional) variance of the bootstrap sample mean is

$$\operatorname{var}^*\left[\bar{Y}^*\right] = \operatorname{var}^*\left[\frac{1}{n}\sum_{i=1}^n Y_i^*\right] = \frac{1}{n^2}\sum_{i=1}^n \operatorname{var}^*\left[Y_i^*\right] = \frac{1}{n^2}\sum_{i=1}^n \hat\Sigma = \frac{1}{n}\hat\Sigma$$

using (10.11). In the scalar case, $\operatorname{var}^*[\bar{Y}^*] = \hat\sigma^2/n$. This shows that the bootstrap variance of $\bar{Y}^*$ is precisely described by the sample variance of the original observations. Again, this is because the bootstrap observations $Y_i^*$ are drawn from the bootstrap universe.

We can extend this to any integer moment $r$. Assume $Y$ is scalar. Define the normalized bootstrap sample mean $Z_n^* = \sqrt{n}\left(\bar{Y}^* - \bar{Y}\right)$. Using expressions from Section 6.17 of Probability and Statistics for Economists, the 3rd through 6th conditional moments of $Z_n^*$ are

$$\begin{aligned} E^*\left[Z_n^{*3}\right] &= \hat\kappa_3/n^{1/2} \\ E^*\left[Z_n^{*4}\right] &= \hat\kappa_4/n + 3\hat\kappa_2^2 \\ E^*\left[Z_n^{*5}\right] &= \hat\kappa_5/n^{3/2} + 10\hat\kappa_3\hat\kappa_2/n^{1/2} \\ E^*\left[Z_n^{*6}\right] &= \hat\kappa_6/n^2 + \left(15\hat\kappa_4\hat\kappa_2 + 10\hat\kappa_3^2\right)/n + 15\hat\kappa_2^3 \end{aligned} \tag{10.14}$$

where κ^r is the rth sample cumulant. Similar expressions can be derived for higher moments. The moments (10.14) are exact, not approximations.

10.12 Bootstrap Asymptotics

The bootstrap mean $\bar{Y}^*$ is a sample average over $n$ i.i.d. random variables, so we might expect it to converge in probability to its expectation. Indeed, this is the case, but we have to be a bit careful since the bootstrap mean has a conditional distribution (given the data), so we need to define convergence in probability for conditional distributions.

Definition 10.1 We say that a random vector $Z_n^*$ converges in bootstrap probability to $Z$ as $n \to \infty$, denoted $Z_n^* \xrightarrow{p^*} Z$, if for all $\epsilon > 0$

$$P^*\left[\left\|Z_n^* - Z\right\| > \epsilon\right] \xrightarrow{p} 0.$$

To understand this definition recall that conventional convergence in probability, $Z_n \xrightarrow{p} Z$, means that for a sufficiently large sample size $n$, the probability is high that $Z_n$ is arbitrarily close to its limit $Z$. In contrast, Definition 10.1 says $Z_n^* \xrightarrow{p^*} Z$ means that for a sufficiently large $n$, the probability is high that the conditional probability that $Z_n^*$ is close to its limit $Z$ is high. Note that there are two uses of probability - both unconditional and conditional.

Our label "convergence in bootstrap probability" is a bit unusual. The label used in much of the statistical literature is "convergence in probability, in probability" but that seems like a mouthful. That literature more often focuses on the related concept of "convergence in probability, almost surely" which holds if we replace the "$\xrightarrow{p}$" convergence with almost sure convergence. We do not use this concept in this chapter as it is an unnecessary complication.

While we have stated Definition 10.1 for the specific conditional probability distribution $P^*$, the idea is more general and can be used for any conditional distribution and any sequence of random vectors.

The following may seem obvious but it is useful to state for clarity. Its proof is given in Section 10.31.

Theorem 10.1 If $Z_n \xrightarrow{p} Z$ as $n \to \infty$ then $Z_n \xrightarrow{p^*} Z$.

Given Definition 10.1, we can establish a law of large numbers for the bootstrap sample mean.

Theorem 10.2 Bootstrap WLLN. If $Y_i$ are independent and uniformly integrable then $\bar{Y}^* - \bar{Y} \xrightarrow{p^*} 0$ and $\bar{Y}^* \xrightarrow{p^*} \mu = E[Y]$ as $n \to \infty$.

The proof (presented in Section 10.31) is somewhat different from the classical case as it is based on the Marcinkiewicz WLLN (Theorem 10.20, presented in Section 10.31).

Notice that the conditions for the bootstrap WLLN are the same as for the conventional WLLN. Notice as well that we state two related but slightly different results. The first is that the difference between the bootstrap sample mean $\bar{Y}^*$ and the sample mean $\bar{Y}$ diminishes as the sample size diverges. The second result is that the bootstrap sample mean converges to the population mean $\mu$. The latter is not surprising (since the sample mean $\bar{Y}$ converges in probability to $\mu$) but it is constructive to be precise since we are dealing with a new convergence concept.

Theorem 10.3 Bootstrap Continuous Mapping Theorem. If $Z_n^* \xrightarrow{p^*} c$ as $n \to \infty$ and $g(\cdot)$ is continuous at $c$, then $g(Z_n^*) \xrightarrow{p^*} g(c)$ as $n \to \infty$.

The proof is essentially identical to that of Theorem 6.6 so is omitted.

We next would like to show that the bootstrap sample mean is asymptotically normally distributed, but for that we need a definition of convergence for conditional distributions.

Definition 10.2 Let $Z_n^*$ be a sequence of random vectors with conditional distributions $G_n^*(x) = P^*\left[Z_n^* \le x\right]$. We say that $Z_n^*$ converges in bootstrap distribution to $Z$ as $n \to \infty$, denoted $Z_n^* \xrightarrow{d^*} Z$, if for all $x$ at which $G(x) = P\left[Z \le x\right]$ is continuous, $G_n^*(x) \xrightarrow{p} G(x)$ as $n \to \infty$.

The difference with the conventional definition is that Definition 10.2 treats the conditional distribution as random. An alternative label for Definition 10.2 is "convergence in distribution, in probability".

We now state a CLT for the bootstrap sample mean, with a proof given in Section 10.31.

Theorem 10.4 Bootstrap CLT. If $Y_i$ are i.i.d., $E\left\|Y\right\|^2 < \infty$, and $\Sigma = \operatorname{var}[Y] > 0$, then as $n \to \infty$, $\sqrt{n}\left(\bar{Y}^* - \bar{Y}\right) \xrightarrow{d^*} N(0, \Sigma)$.

Theorem 10.4 shows that the normalized bootstrap sample mean has the same asymptotic distribution as the sample mean. Thus the bootstrap distribution is asymptotically the same as the sampling distribution. A notable difference, however, is that the bootstrap sample mean is normalized by centering at the sample mean, not at the population mean. This is because Y¯ is the true mean in the bootstrap universe.

We next state the distributional form of the continuous mapping theorem for bootstrap distributions and the Bootstrap Delta Method.

Theorem 10.5 Bootstrap Continuous Mapping Theorem. If $Z_n^* \xrightarrow{d^*} Z$ as $n \to \infty$ and $g: \mathbb{R}^m \to \mathbb{R}^k$ has the set of discontinuity points $D_g$ such that $P\left[Z \in D_g\right] = 0$, then $g(Z_n^*) \xrightarrow{d^*} g(Z)$ as $n \to \infty$.

Theorem 10.6 Bootstrap Delta Method: If $\hat\mu \xrightarrow{p} \mu$, $\sqrt{n}\left(\hat\mu^* - \hat\mu\right) \xrightarrow{d^*} \xi$, and $g(u)$ is continuously differentiable in a neighborhood of $\mu$, then as $n \to \infty$

$$\sqrt{n}\left(g\left(\hat\mu^*\right) - g\left(\hat\mu\right)\right) \xrightarrow{d^*} G'\xi$$

where $G(x) = \frac{\partial}{\partial x}g(x)'$ and $G = G(\mu)$. In particular, if $\xi \sim N(0, V)$ then as $n \to \infty$

$$\sqrt{n}\left(g\left(\hat\mu^*\right) - g\left(\hat\mu\right)\right) \xrightarrow{d^*} N\left(0, G'VG\right).$$

For a proof, see Exercise 10.7.

We state an analog of Theorem 6.10, which presented the asymptotic distribution for general smooth functions of sample means, which covers most econometric estimators.

Theorem 10.7 Under the assumptions of Theorem 6.10, that is, if $Y_i$ is i.i.d., $\mu = E\left[h(Y)\right]$, $\theta = g(\mu)$, $E\left\|h(Y)\right\|^2 < \infty$, and $G(x) = \frac{\partial}{\partial x}g(x)'$ is continuous in a neighborhood of $\mu$, then for $\hat\theta = g(\hat\mu)$ with $\hat\mu = \frac{1}{n}\sum_{i=1}^n h(Y_i)$ and $\hat\theta^* = g(\hat\mu^*)$ with $\hat\mu^* = \frac{1}{n}\sum_{i=1}^n h(Y_i^*)$, as $n \to \infty$

$$\sqrt{n}\left(\hat\theta^* - \hat\theta\right) \xrightarrow{d^*} N\left(0, V_\theta\right)$$

where $V_\theta = G'VG$, $V = E\left[\left(h(Y) - \mu\right)\left(h(Y) - \mu\right)'\right]$, and $G = G(\mu)$.

For a proof, see Exercise 10.8.

Theorem 10.7 shows that the asymptotic distribution of the bootstrap estimator $\hat\theta^*$ is identical to that of the sample estimator $\hat\theta$. This means that we can learn the distribution of $\hat\theta$ from the bootstrap distribution, and hence perform asymptotically correct inference.

For some bootstrap applications we use bootstrap estimates of variance. The plug-in estimator of $V_\theta$ is $\hat{V}_\theta = \hat{G}'\hat{V}\hat{G}$ where $\hat{G} = G(\hat\mu)$ and

$$\hat{V} = \frac{1}{n}\sum_{i=1}^n \left(h(Y_i) - \hat\mu\right)\left(h(Y_i) - \hat\mu\right)'.$$

The bootstrap version is

$$\hat{V}_\theta^* = \hat{G}^{*\prime}\hat{V}^*\hat{G}^*, \qquad \hat{G}^* = G\left(\hat\mu^*\right), \qquad \hat{V}^* = \frac{1}{n}\sum_{i=1}^n \left(h(Y_i^*) - \hat\mu^*\right)\left(h(Y_i^*) - \hat\mu^*\right)'.$$

Application of the bootstrap WLLN and bootstrap CMT show that $\hat{V}_\theta^*$ is consistent for $V_\theta$.

Theorem 10.8 Under the assumptions of Theorem 10.7, $\hat{V}_\theta^* \xrightarrow{p^*} V_\theta$ as $n \to \infty$.

For a proof, see Exercise 10.9.

10.13 Consistency of the Bootstrap Estimate of Variance

Recall the definition (10.7) of the bootstrap estimator of variance V^θ^boot  of an estimator θ^. In this section we explore conditions under which V^θ^boot  is consistent for the asymptotic variance of θ^.

To do so it is useful to focus on a normalized version of the estimator so that the asymptotic variance is not degenerate. Suppose that for some sequence an we have

$$Z_n = a_n\left(\hat\theta - \theta\right) \xrightarrow{d} \xi \tag{10.15}$$

and

$$Z_n^* = a_n\left(\hat\theta^* - \hat\theta\right) \xrightarrow{d^*} \xi \tag{10.16}$$

for some limit distribution $\xi$. That is, for some normalization, both $\hat\theta$ and $\hat\theta^*$ have the same asymptotic distribution. This is quite general as it includes the smooth function model. The conventional bootstrap estimator of the variance of $Z_n$ is the sample variance of the bootstrap draws $\{Z_n^*(b) : b = 1, \ldots, B\}$. This equals the estimator (10.7) multiplied by $a_n^2$. Thus it is equivalent (up to scale) whether we discuss estimating the variance of $\hat\theta$ or of $Z_n$.

The bootstrap estimator of variance of $Z_n$ is

$$\hat{V}_\theta^{\mathrm{boot},B} = \frac{1}{B-1}\sum_{b=1}^B \left(Z_n^*(b) - \bar{Z}_n^*\right)\left(Z_n^*(b) - \bar{Z}_n^*\right)', \qquad \bar{Z}_n^* = \frac{1}{B}\sum_{b=1}^B Z_n^*(b).$$

Notice that we index the estimator by the number of bootstrap replications B.

Since $Z_n^*$ converges in bootstrap distribution to the same asymptotic distribution as $Z_n$, it seems reasonable to guess that the variance of $Z_n^*$ will converge to that of $\xi$. However, convergence in distribution is not sufficient for convergence in moments. For the variance to converge it is also necessary for the sequence $\left\|Z_n^*\right\|^2$ to be uniformly integrable.

Theorem 10.9 If (10.15) and (10.16) hold for some sequence $a_n$ and $\left\|Z_n^*\right\|^2$ is uniformly integrable, then as $B \to \infty$

$$\hat{V}_\theta^{\mathrm{boot},B} \xrightarrow{p^*} \hat{V}_\theta^{\mathrm{boot}} = \operatorname{var}^*\left[Z_n^*\right]$$

and as $n \to \infty$

$$\hat{V}_\theta^{\mathrm{boot}} \xrightarrow{p} V_\theta = \operatorname{var}[\xi].$$

This raises the question: Is the normalized sequence $Z_n^*$ uniformly square integrable? We spend the remainder of this section exploring this question and turn in the next section to trimmed variance estimators which do not require uniform integrability.

This condition is reasonably straightforward to verify for the case of a scalar sample mean with a finite variance. That is, suppose $Z_n^* = \sqrt{n}\left(\bar{Y}^* - \bar{Y}\right)$ and $E\left[Y^2\right] < \infty$. In (10.14) we calculated the exact fourth central moment of $Z_n^*$:

$$E^*\left[Z_n^{*4}\right] = \frac{\hat\kappa_4}{n} + 3\hat\sigma^4 = \frac{\hat\mu_4 - 3\hat\sigma^4}{n} + 3\hat\sigma^4$$

where $\hat\sigma^2 = n^{-1}\sum_{i=1}^n (Y_i - \bar{Y})^2$ and $\hat\mu_4 = n^{-1}\sum_{i=1}^n (Y_i - \bar{Y})^4$. The assumption $E\left[Y^2\right] < \infty$ implies that $E\left[\hat\sigma^2\right] = O(1)$ so $\hat\sigma^2 = O_p(1)$. Furthermore, $n^{-1}\hat\mu_4 = n^{-2}\sum_{i=1}^n (Y_i - \bar{Y})^4 = o_p(1)$ by the Marcinkiewicz WLLN (Theorem 10.20). It follows that

$$E^*\left[Z_n^{*4}\right] = n^2 E^*\left[\left(\bar{Y}^* - \bar{Y}\right)^4\right] = O_p(1).$$

Theorem 6.13 shows that this implies that $Z_n^{*2}$ is uniformly integrable. Thus if $Y$ has a finite variance the normalized bootstrap sample mean is uniformly square integrable and the bootstrap estimate of variance is consistent by Theorem 10.9.

Now consider the smooth function model of Theorem 10.7. We can establish the following result.

Theorem 10.10 In the smooth function model of Theorem 10.7, if for some $p \ge 1$ the $p^{th}$-order derivatives of $g(x)$ are bounded, then $Z_n^* = \sqrt{n}\left(\hat\theta^* - \hat\theta\right)$ is uniformly square integrable and the bootstrap estimator of variance is consistent as in Theorem 10.9.

For a proof see Section 10.31.

For a proof see Section 10.31.

This shows that the bootstrap estimate of variance is consistent for a reasonably broad class of estimators. The class of functions g(x) covered by this result includes all pth-order polynomials.

10.14 Trimmed Estimator of Bootstrap Variance

Theorem 10.10 showed that the bootstrap estimator of variance is consistent for smooth functions with a bounded $p^{th}$-order derivative. This is a fairly broad class but excludes many important applications. An example is $\theta = \mu_1/\mu_2$ where $\mu_1 = E[Y_1]$ and $\mu_2 = E[Y_2]$. This function does not have a bounded derivative (unless $\mu_2$ is bounded away from zero) so is not covered by Theorem 10.10. This is more than a technical issue. When $(Y_1, Y_2)$ are jointly normally distributed then it is known that $\hat\theta = \bar{Y}_1/\bar{Y}_2$ does not possess a finite variance. Consequently we cannot expect the bootstrap estimator of variance to perform well. (It is attempting to estimate the variance of $\hat\theta^*$, which is infinite.)

In these cases it is preferred to use a trimmed estimator of bootstrap variance. Let $\tau_n$ be a sequence of positive trimming numbers satisfying $\tau_n = O\left(e^{n/8}\right)$. Define the trimmed statistic

$$Z_n^{**} = Z_n^* \mathbb{1}\left\{\left\|Z_n^*\right\| \le \tau_n\right\}.$$

The trimmed bootstrap estimator of variance is

$$\hat{V}_\theta^{\mathrm{boot},B,\tau} = \frac{1}{B-1}\sum_{b=1}^B \left(Z_n^{**}(b) - \bar{Z}_n^{**}\right)\left(Z_n^{**}(b) - \bar{Z}_n^{**}\right)', \qquad \bar{Z}_n^{**} = \frac{1}{B}\sum_{b=1}^B Z_n^{**}(b).$$

We first examine the behavior of $\hat{V}_\theta^{\mathrm{boot},B,\tau}$ as the number of bootstrap replications $B$ grows to infinity. It is a sample variance of independent bounded random vectors. Thus by the bootstrap WLLN (Theorem 10.2) $\hat{V}_\theta^{\mathrm{boot},B,\tau}$ converges in bootstrap probability to the variance of $Z_n^{**}$, which we denote $\hat{V}_\theta^{\mathrm{boot},\tau}$.

We next examine the behavior of the bootstrap estimator $\hat{V}_\theta^{\mathrm{boot},\tau}$ as $n$ grows to infinity. We focus on the smooth function model of Theorem 10.7, which showed that $Z_n^* = \sqrt{n}\left(\hat\theta^* - \hat\theta\right) \xrightarrow{d^*} Z \sim N(0, V_\theta)$. Since the trimming is asymptotically negligible, it follows that $Z_n^{**} \xrightarrow{d^*} Z$ as well. If we can show that $Z_n^{**}$ is uniformly square integrable, Theorem 10.9 shows that $\operatorname{var}^*\left[Z_n^{**}\right] \to \operatorname{var}[Z] = V_\theta$ as $n \to \infty$. This is shown in the following result, whose proof is presented in Section 10.31.

Theorem 10.12 Under the assumptions of Theorem 10.7, $\hat{V}_\theta^{\mathrm{boot},\tau} \xrightarrow{p} V_\theta$.

Theorems 10.11 and 10.12 show that the trimmed bootstrap estimator of variance is consistent for the asymptotic variance in the smooth function model, which includes most econometric estimators. This justifies bootstrap standard errors as consistent estimators for the asymptotic distribution.

An important caveat is that these results critically rely on the trimmed variance estimator. This matters because conventional statistical packages (e.g. Stata) calculate bootstrap standard errors using the untrimmed estimator (10.7). Thus there is no guarantee that the reported standard errors are consistent. The untrimmed variance estimator works in the context of Theorem 10.10 and whenever the bootstrap statistic is uniformly square integrable, but not necessarily in general applications.

In practice, it may be difficult to know how to select the trimming sequence $\tau_n$. The rule $\tau_n = O\left(e^{n/8}\right)$ does not provide practical guidance. Instead, it may be useful to think about trimming in terms of percentages of the bootstrap draws. Thus we can set $\tau_n$ so that a given small percentage $\gamma_n$ of the draws is trimmed. For theoretical interpretation we would set $\gamma_n \to 0$ as $n \to \infty$. In practice we might set $\gamma_n = 1\%$. A sketch of this percentage-based trimming is given below.
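The following Python sketch (not part of the text) implements the percentage-based version: the threshold is set at the empirical $(1-\gamma)$ quantile of the draws' distances from the point estimate, and trimmed draws are zeroed out as in the definition of $Z_n^{**}$. The arrays draws (B x k) and theta_hat are hypothetical, and the scale is that of $\hat\theta$ (i.e. $a_n = 1$).

import numpy as np

def trimmed_bootstrap_variance(draws, theta_hat, gamma=0.01):
    """Trimmed bootstrap variance: zero out draws beyond the (1-gamma) quantile of |theta*-theta_hat|."""
    Z = draws - theta_hat                           # centered bootstrap draws
    norms = np.linalg.norm(Z, axis=1)
    tau = np.quantile(norms, 1 - gamma)             # trim roughly a fraction gamma of the draws
    Ztrim = np.where(norms[:, None] <= tau, Z, 0.0) # Z** = Z* 1{||Z*|| <= tau}
    dev = Ztrim - Ztrim.mean(axis=0)
    return dev.T @ dev / (draws.shape[0] - 1)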

10.15 Unreliability of Untrimmed Bootstrap Standard Errors

In the previous section we presented a trimmed bootstrap variance estimator which should be used to form bootstrap standard errors for nonlinear estimators. Otherwise, the untrimmed estimator is potentially unreliable.

This is an unfortunate situation, because reporting of bootstrap standard errors is commonplace in contemporary applied econometric practice, and standard applications (including Stata) use the untrimmed estimator.

To illustrate the seriousness of the problem we return to the simple wage regression (7.31), estimated on the subsample of married Black women with 982 observations.

We are interested in the experience level which maximizes expected log wages, $\theta_3 = -50\beta_2/\beta_3$, where $\beta_2$ and $\beta_3$ are the coefficients on experience and experience$^2/100$ in (7.31). The point estimate and standard errors calculated with different methods are reported in Table 10.3 below.

The point estimate of the experience level with maximum earnings is $\hat\theta_3 = 35$. The asymptotic and jackknife standard errors are about 7. The bootstrap standard error, however, is 825! Confused by this unusual value we rerun the bootstrap and obtain a standard error of 544. Each was computed with 10,000 bootstrap replications. The fact that the two bootstrap standard errors are considerably different when recomputed (with different starting seeds) is indicative of moment failure. When there is an enormous discrepancy like this between the asymptotic and bootstrap standard error, and between bootstrap runs, it is a signal that there may be moment failure and consequently bootstrap standard errors are unreliable.

A trimmed bootstrap with τ=25 (set to slightly exceed three asymptotic standard errors) produces a more reasonable standard error of 10.

One message from this application is that when different methods produce very different standard errors we should be cautious about trusting any single method. The large discrepancies indicate poor asymptotic approximations, rendering all methods inaccurate. Another message is to be cautious about reporting conventional bootstrap standard errors. Trimmed versions are preferred, especially for nonlinear functions of estimated coefficients.

Table 10.3: Experience Level Which Maximizes Expected log Wages

Estimate 35.2
Asymptotic s.e. (7.0)
Jackknife s.e. (7.0)
Bootstrap s.e. (standard) (825)
Bootstrap s.e. (repeat) (544)
Bootstrap s.e. (trimmed) (10.1)

10.16 Consistency of the Percentile Interval

Recall the percentile interval (10.8). We now provide conditions under which it has asymptotically correct coverage.

Theorem 10.13 Assume that for some sequence $a_n$

$$a_n\left(\hat\theta - \theta\right) \xrightarrow{d} \xi \tag{10.18}$$

and

$$a_n\left(\hat\theta^* - \hat\theta\right) \xrightarrow{d^*} \xi \tag{10.19}$$

where $\xi$ is continuously distributed and symmetric about zero. Then $P\left[\theta \in C^{\mathrm{pc}}\right] \to 1 - \alpha$ as $n \to \infty$.

The assumptions (10.18)-(10.19) hold for the smooth function model of Theorem 10.7, so this result incorporates many applications. The beauty of Theorem 10.13 is that the simple confidence interval Cpc - which does not require technical calculation of asymptotic standard errors - has asymptotically valid coverage for any estimator which falls in the smooth function class, as well as any other estimator satisfying the convergence results (10.18)-(10.19) with ξ symmetrically distributed. The conditions are weaker than those required for consistent bootstrap variance estimation (and normal-approximation confidence intervals) because it is not necessary to verify that θ^ is uniformly integrable, nor necessary to employ trimming.

The proof of Theorem 10.13 is not difficult. The convergence assumption (10.19) implies that the $\alpha^{th}$ quantile of $a_n\left(\hat\theta^* - \hat\theta\right)$, which is $a_n\left(q_\alpha^* - \hat\theta\right)$ by quantile equivariance, converges in probability to the $\alpha^{th}$ quantile of $\xi$, which we can denote as $\bar{q}_\alpha$. Thus

$$a_n\left(q_\alpha^* - \hat\theta\right) \xrightarrow{p} \bar{q}_\alpha. \tag{10.20}$$

Let $H(x) = P\left[\xi \le x\right]$ be the distribution function of $\xi$. The assumption of symmetry implies $H(-x) = 1 - H(x)$. Then the percentile interval has coverage

$$\begin{aligned} P\left[\theta \in C^{\mathrm{pc}}\right] &= P\left[q_{\alpha/2}^* \le \theta \le q_{1-\alpha/2}^*\right] \\ &= P\left[a_n\left(q_{\alpha/2}^* - \hat\theta\right) \le a_n\left(\theta - \hat\theta\right) \le a_n\left(q_{1-\alpha/2}^* - \hat\theta\right)\right] \\ &\to P\left[\bar{q}_{\alpha/2} \le -\xi \le \bar{q}_{1-\alpha/2}\right] \\ &= H\left(-\bar{q}_{\alpha/2}\right) - H\left(-\bar{q}_{1-\alpha/2}\right) \\ &= H\left(\bar{q}_{1-\alpha/2}\right) - H\left(\bar{q}_{\alpha/2}\right) = 1 - \alpha. \end{aligned}$$

The convergence holds by (10.18) and (10.20). The following equality uses the definition of $H$, the next-to-last is the symmetry of $H$, and the final equality is the definition of $\bar{q}_\alpha$. This establishes Theorem 10.13.

Theorem 10.13 seems quite general, but it critically rests on the assumption that the asymptotic distribution $\xi$ is symmetrically distributed about zero. This may seem innocuous since conventional asymptotic distributions are normal and hence symmetric, but it deserves further scrutiny. It is not merely a technical assumption - an examination of the steps in the preceding argument isolates quite clearly that if the symmetry assumption is violated then the asymptotic coverage will not be $1 - \alpha$. While Theorem 10.13 does show that the percentile interval is asymptotically valid for a conventional asymptotically normal estimator, the reliance on symmetry in the argument suggests that the percentile method will work poorly when the finite sample distribution is asymmetric. This turns out to be the case and leads us to consider alternative methods in the following sections. It is also worthwhile to investigate a finite sample justification for the percentile interval based on a heuristic analogy due to Efron.

Assume that there exists an unknown but strictly increasing transformation $\psi(\theta)$ such that $\psi(\hat\theta) - \psi(\theta)$ has a pivotal distribution $H(u)$ (one which does not vary with $\theta$) which is symmetric about zero. For example, if $\hat\theta \sim N(\theta, \sigma^2)$ we can set $\psi(\theta) = \theta/\sigma$. Alternatively, if $\hat\theta = \exp(\hat\mu)$ and $\hat\mu \sim N(\mu, \sigma^2)$ then we can set $\psi(\theta) = \log(\theta)/\sigma$.

To assess the coverage of the percentile interval, observe that since the distribution $H$ is pivotal the bootstrap statistic $\psi(\hat\theta^*) - \psi(\hat\theta)$ also has distribution $H(u)$. Let $\bar{q}_\alpha$ be the $\alpha^{th}$ quantile of the distribution $H$. Since $q_\alpha^*$ is the $\alpha^{th}$ quantile of the distribution of $\hat\theta^*$ and $\psi(\hat\theta^*) - \psi(\hat\theta)$ is a monotonic transformation of $\hat\theta^*$, by the quantile equivariance property we deduce that $\bar{q}_\alpha + \psi(\hat\theta) = \psi(q_\alpha^*)$. The percentile interval has coverage

$$\begin{aligned} P\left[\theta \in C^{\mathrm{pc}}\right] &= P\left[q_{\alpha/2}^* \le \theta \le q_{1-\alpha/2}^*\right] \\ &= P\left[\psi\left(q_{\alpha/2}^*\right) \le \psi(\theta) \le \psi\left(q_{1-\alpha/2}^*\right)\right] \\ &= P\left[\psi(\hat\theta) - \psi\left(q_{\alpha/2}^*\right) \ge \psi(\hat\theta) - \psi(\theta) \ge \psi(\hat\theta) - \psi\left(q_{1-\alpha/2}^*\right)\right] \\ &= P\left[-\bar{q}_{\alpha/2} \ge \psi(\hat\theta) - \psi(\theta) \ge -\bar{q}_{1-\alpha/2}\right] \\ &= H\left(-\bar{q}_{\alpha/2}\right) - H\left(-\bar{q}_{1-\alpha/2}\right) \\ &= H\left(\bar{q}_{1-\alpha/2}\right) - H\left(\bar{q}_{\alpha/2}\right) \\ &= 1 - \alpha. \end{aligned}$$

The second equality applies the monotonic transformation $\psi(u)$ to all elements. The fourth uses the relationship $\bar{q}_\alpha + \psi(\hat\theta) = \psi(q_\alpha^*)$. The fifth uses the definition of $H$. The sixth uses the symmetry property of $H$, and the final is by the definition of $\bar{q}_\alpha$ as the $\alpha^{th}$ quantile of $H$.

This calculation shows that under these assumptions the percentile interval has exact coverage $1 - \alpha$. The nice thing about this argument is the introduction of the unknown transformation $\psi(u)$ for which the percentile interval automatically adapts. The unpleasant feature is the assumption of symmetry. Similar to the asymptotic argument, the calculation strongly relies on the symmetry of the distribution $H(x)$. Without symmetry the coverage will be incorrect.

Intuitively, we expect that when the assumptions are approximately true then the percentile interval will have approximately correct coverage. Thus so long as there is a transformation $\psi(u)$ such that $\psi(\hat\theta) - \psi(\theta)$ is approximately pivotal and symmetric about zero, then the percentile interval should work well.

This argument has the following application. Suppose that the parameter of interest is θ=exp(μ) where μ=E[Y] and suppose Y has a pivotal symmetric distribution about μ. Then even though θ^= exp(Y¯) does not have a symmetric distribution, the percentile interval applied to θ^ will have the correct coverage, because the monotonic transformation log(θ^) has a pivotal symmetric distribution.

10.17 Bias-Corrected Percentile Interval

The accuracy of the percentile interval depends critically upon the assumption that the sampling distribution is approximately symmetrically distributed. This excludes finite sample bias, for an estimator which is biased cannot be symmetrically distributed about the true parameter value. Many contexts in which we want to apply bootstrap methods (rather than asymptotic approximations) are ones where the parameter of interest is a nonlinear function of the model parameters, and nonlinearity typically induces estimation bias. Consequently it is difficult to expect the percentile method to generally have accurate coverage.

To reduce the bias problem Efron (1982) introduced the bias-corrected (BC) percentile interval. The justification is heuristic but there is considerable evidence that the bias-corrected method is an important improvement on the percentile interval. The construction is based on the assumption that there is an unknown but strictly increasing transformation $\psi(\theta)$ and an unknown constant $z_0$ such that

$$Z = \psi(\hat\theta) - \psi(\theta) + z_0 \sim N(0, 1). \tag{10.21}$$

(The assumption that $Z$ is normal is not critical. It could be replaced by any known symmetric and invertible distribution.) Let $\Phi(x)$ denote the normal distribution function, $\Phi^{-1}(p)$ its quantile function, and $z_\alpha = \Phi^{-1}(\alpha)$ the normal critical values. Then the BC interval can be constructed from the bootstrap estimators $\hat\theta_b^*$ and bootstrap quantiles $q_\alpha^*$ as follows. Set

$$p^* = \frac{1}{B}\sum_{b=1}^B \mathbb{1}\left\{\hat\theta_b^* \le \hat\theta\right\} \tag{10.22}$$

and

$$z_0 = \Phi^{-1}\left(p^*\right). \tag{10.23}$$

$p^*$ is a measure of median bias, and $z_0$ is $p^*$ transformed into normal units. If the bias of $\hat\theta$ is zero then $p^* = 0.5$ and $z_0 = 0$. If $\hat\theta$ is upwards biased then $p^* < 0.5$ and $z_0 < 0$. Conversely, if $\hat\theta$ is downward biased then $p^* > 0.5$ and $z_0 > 0$. Define for any $\alpha$ the adjusted version

$$x(\alpha) = \Phi\left(z_\alpha + 2z_0\right). \tag{10.24}$$

If $z_0 = 0$ then $x(\alpha) = \alpha$. If $z_0 > 0$ then $x(\alpha) > \alpha$, and conversely when $z_0 < 0$. The BC interval is

$$C^{\mathrm{bc}} = \left[q^*_{x(\alpha/2)},\ q^*_{x(1-\alpha/2)}\right]. \tag{10.25}$$

Essentially, rather than going from the 2.5% to the 97.5% quantile, the BC interval uses adjusted quantiles, with the degree of adjustment depending on the extent of the bias.
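A minimal Python sketch of the BC interval for a scalar parameter, computed from the bootstrap draws (scipy's normal distribution functions stand in for $\Phi$ and $\Phi^{-1}$; array names are hypothetical):

import numpy as np
from scipy.stats import norm

def bc_interval(draws, theta_hat, alpha=0.05):
    """Bias-corrected percentile interval (10.25) for a scalar parameter."""
    p_star = np.mean(draws <= theta_hat)           # median-bias measure (10.22)
    z0 = norm.ppf(p_star)                          # (10.23)
    x = lambda a: norm.cdf(norm.ppf(a) + 2 * z0)   # adjusted quantile levels (10.24)
    return np.quantile(draws, [x(alpha / 2), x(1 - alpha / 2)])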

The construction of the BC interval is not intuitive. We now show that assumption (10.21) implies that the BC interval has exact coverage. (10.21) implies that

P[ψ(θ^) − ψ(θ) + z_0 ≤ x] = Φ(x).

Since the distribution is pivotal the result carries over to the bootstrap distribution

P*[ψ(θ^*) − ψ(θ^) + z_0 ≤ x] = Φ(x).   (10.26)

Evaluating (10.26) at x = z_0 we find P*[ψ(θ^*) − ψ(θ^) ≤ 0] = Φ(z_0), which implies P*[θ^* ≤ θ^] = Φ(z_0). Inverting, we obtain

z_0 = Φ^{-1}(P*[θ^* ≤ θ^])   (10.27)

which is the probability limit of (10.23) as B → ∞. Thus the unknown z_0 is recovered by (10.23), and we can treat z_0 as if it were known.

From (10.26) we deduce that

x(α) = Φ(z_α + 2z_0) = P*[ψ(θ^*) − ψ(θ^) ≤ z_α + z_0] = P*[θ^* ≤ ψ^{-1}(ψ(θ^) + z_0 + z_α)].

This equation shows that ψ^{-1}(ψ(θ^) + z_0 + z_α) equals the x(α)th bootstrap quantile. That is, q*_{x(α)} = ψ^{-1}(ψ(θ^) + z_0 + z_α). Hence we can write (10.25) as

C^bc = [ψ^{-1}(ψ(θ^) + z_0 + z_{α/2}), ψ^{-1}(ψ(θ^) + z_0 + z_{1−α/2})].

It has coverage probability

P[θ ∈ C^bc] = P[ψ^{-1}(ψ(θ^) + z_0 + z_{α/2}) ≤ θ ≤ ψ^{-1}(ψ(θ^) + z_0 + z_{1−α/2})]
 = P[ψ(θ^) + z_0 + z_{α/2} ≤ ψ(θ) ≤ ψ(θ^) + z_0 + z_{1−α/2}]
 = P[z_{α/2} ≤ ψ(θ^) − ψ(θ) + z_0 ≤ z_{1−α/2}]
 = P[z_{α/2} ≤ Z ≤ z_{1−α/2}]
 = Φ(z_{1−α/2}) − Φ(z_{α/2}) = 1 − α.

The second equality applies the transformation ψ(θ). The third rearranges, using the fact z_α = −z_{1−α}. The fourth and fifth use the model (10.21). This shows that the BC interval (10.25) has exact coverage under the assumption (10.21).

Furthermore, under the assumptions of Theorem 10.13, the BC interval has asymptotic coverage probability 1α, since the bias correction is asymptotically negligible.

An important property of the BC percentile interval is that it is transformation-respecting (like the percentile interval). To see this, observe that p is invariant to transformations because it is a probability, and thus z0 and x(α) are invariant. Since the interval is constructed from the x(α/2) and x(1α/2) quantiles, the quantile equivariance property shows that the interval is transformation-respecting.

The bootstrap BC percentile intervals for the four estimators are reported in Table 13.2. They are generally similar to the percentile intervals, though the intervals for σ2 and μ are somewhat shifted to the right.

In Stata, BC percentile confidence intervals can be obtained by using the command estat bootstrap after an estimation command which calculates standard errors via the bootstrap.

10.18 BCa Percentile Interval

A further improvement on the BC interval was made by Efron (1987) to account for skewness in the sampling distribution, which can be modeled by specifying that the variance of the estimator depends on the parameter. The resulting bootstrap accelerated bias-corrected percentile interval (BCa) has improved performance relative to the BC interval, but requires a bit more computation and is less intuitive to understand.

The construction is a generalization of that for the BC intervals. The assumption is that there is an unknown but strictly increasing transformation ψ(θ) and unknown constants a and z0 such that

Z = (ψ(θ^) − ψ(θ))/(1 + aψ(θ)) + z_0 ~ N(0,1).   (10.28)

(As before, the assumption that Z is normal could be replaced by any known symmetric and invertible distribution.)

The constant z0 is estimated by (10.23) just as for the BC interval. There are several possible estimators of a. Efron’s suggestion is a scaled jackknife estimator of the skewness of θ^ :

a^ = ∑_{i=1}^n (θ¯ − θ^_(i))³ / (6 (∑_{i=1}^n (θ¯ − θ^_(i))²)^{3/2}),   θ¯ = (1/n) ∑_{i=1}^n θ^_(i).

The jackknife estimation of a makes the BCa interval more computationally costly than the other intervals.

Define for any α the adjusted version

x(α) = Φ(z_0 + (z_α + z_0)/(1 − a(z_α + z_0))).

The BCa percentile interval is

C^bca = [q*_{x(α/2)}, q*_{x(1−α/2)}].

Note that x(α) simplifies to (10.24) and C^bca simplifies to C^bc when a = 0. While C^bc improves on C^pc by correcting the median bias, C^bca makes a further correction for skewness.

The BCa interval is only well-defined for values of α such that a(z_α + z_0) < 1. (Or equivalently, if α < Φ(a^{-1} − z_0) for a > 0 and α > Φ(a^{-1} − z_0) for a < 0.)

The BCa interval, like the BC and percentile intervals, is transformation-respecting. Thus if [q*_{x(α/2)}, q*_{x(1−α/2)}] is the BCa interval for θ, then [m(q*_{x(α/2)}), m(q*_{x(1−α/2)})] is the BCa interval for ϕ = m(θ) when m(θ) is monotone.
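The following Python sketch (with illustrative input names) adds the jackknife estimate of the acceleration constant and the BCa quantile adjustment; it reduces to the BC construction above when a^ = 0. Here theta_jack is assumed to hold the n leave-one-out estimates θ^_(i).

```python
import numpy as np
from scipy.stats import norm

def bca_interval(theta_hat, theta_boot, theta_jack, alpha=0.05):
    """BCa percentile interval (sketch)."""
    theta_boot = np.asarray(theta_boot)
    theta_jack = np.asarray(theta_jack)
    z0 = norm.ppf(np.mean(theta_boot <= theta_hat))       # as for the BC interval
    d = np.mean(theta_jack) - theta_jack                   # theta-bar minus theta_(i)
    a_hat = np.sum(d**3) / (6 * np.sum(d**2) ** 1.5)       # jackknife skewness estimate
    def x(level):
        z = norm.ppf(level)
        return norm.cdf(z0 + (z + z0) / (1 - a_hat * (z + z0)))
    return np.quantile(theta_boot, [x(alpha / 2), x(1 - alpha / 2)])
```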

We now give a justification for the BCa interval. The most difficult feature to understand is the estimator a^ for a. This involves higher-order approximations which are too advanced for our treatment, so we instead refer readers to Chapter 4.1.4 of Shao and Tu (1995) and simply assume that a is known.

We now show that assumption (10.28) with a known implies that Cbca  has exact coverage. The argument is essentially the same as that given in the previous section. Assumption (10.28) implies that the bootstrap distribution satisfies

P*[(ψ(θ^*) − ψ(θ^))/(1 + aψ(θ^)) + z_0 ≤ x] = Φ(x).   (10.29)

Evaluating at x=z0 and inverting we obtain (10.27) which is the same as for the BC interval. Thus the estimator (10.23) is consistent as B and we can treat z0 as if it were known.

From (10.29) we deduce that

x(α) = P*[(ψ(θ^*) − ψ(θ^))/(1 + aψ(θ^)) ≤ (z_α + z_0)/(1 − a(z_α + z_0))] = P*[θ^* ≤ ψ^{-1}(ψ(θ^) + (z_α + z_0)/(1 − a(z_α + z_0)))].

This shows that ψ^{-1}(ψ(θ^) + (z_α + z_0)/(1 − a(z_α + z_0))) equals the x(α)th bootstrap quantile. Hence we can write C^bca as

C^bca = [ψ^{-1}(ψ(θ^) + (z_{α/2} + z_0)/(1 − a(z_{α/2} + z_0))), ψ^{-1}(ψ(θ^) + (z_{1−α/2} + z_0)/(1 − a(z_{1−α/2} + z_0)))].

It has coverage probability

P[θ ∈ C^bca] = P[ψ^{-1}(ψ(θ^) + (z_{α/2}+z_0)/(1 − a(z_{α/2}+z_0))) ≤ θ ≤ ψ^{-1}(ψ(θ^) + (z_{1−α/2}+z_0)/(1 − a(z_{1−α/2}+z_0)))]
 = P[ψ(θ^) + (z_{α/2}+z_0)/(1 − a(z_{α/2}+z_0)) ≤ ψ(θ) ≤ ψ(θ^) + (z_{1−α/2}+z_0)/(1 − a(z_{1−α/2}+z_0))]
 = P[z_{α/2} ≤ (ψ(θ^) − ψ(θ))/(1 + aψ(θ)) + z_0 ≤ z_{1−α/2}]
 = P[z_{α/2} ≤ Z ≤ z_{1−α/2}]
 = 1 − α.

The second equality applies the transformation ψ(θ). The third rearranges, using the fact z_α = −z_{1−α}. The remaining equalities use the model (10.28). This shows that the BCa interval C^bca has exact coverage under the assumption (10.28) with a known.

The bootstrap BCa percentile intervals for the four estimators are reported in Table 13.2. They are generally similar to the BC intervals, though the intervals for σ2 and μ are slightly shifted to the right.

In Stata, BCa intervals can be obtained by using the command estat bootstrap, bca or the command estat bootstrap, all after an estimation command which calculates standard errors via the bootstrap using the bca option.

10.19 Percentile-t Interval

In many cases we can obtain improvement in accuracy by bootstrapping a studentized statistic such as a t-ratio. Let θ^ be an estimator of a scalar parameter θ and s(θ^) a standard error. The sample t-ratio is

T = (θ^ − θ)/s(θ^).

The bootstrap t-ratio is

T* = (θ^* − θ^)/s(θ^*)

where s(θ^) is the standard error calculated on the bootstrap sample. Notice that the bootstrap t-ratio is centered at the parameter estimator θ^. This is because θ^ is the “true value” in the bootstrap universe.

The percentile-t interval is formed using the distribution of T*. This can be calculated via the bootstrap algorithm. On each bootstrap sample the estimator θ^* and its standard error s(θ^*) are calculated, and the t-ratio T* = (θ^* − θ^)/s(θ^*) is calculated and stored. This is repeated B times. The αth quantile q*_α is estimated by the αth empirical quantile (or any other quantile estimator) from the B bootstrap draws of T*.

The bootstrap percentile-t confidence interval is defined as

C^pt = [θ^ − s(θ^) q*_{1−α/2}, θ^ − s(θ^) q*_{α/2}].

The form may appear unusual when compared with the percentile interval. The left endpoint is determined by the upper quantile of the distribution of T*, and the right endpoint is determined by the lower quantile. As we show below, this construction is important for the interval to have correct coverage when the distribution is not symmetric.
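A minimal sketch of the percentile-t construction, assuming arrays of bootstrap estimates and bootstrap standard errors are already available (names are illustrative):

```python
import numpy as np

def percentile_t_interval(theta_hat, se_hat, theta_boot, se_boot, alpha=0.05):
    """Percentile-t interval (sketch). theta_boot and se_boot hold the B bootstrap
    estimates and their bootstrap-sample standard errors."""
    t_boot = (np.asarray(theta_boot) - theta_hat) / np.asarray(se_boot)   # T* draws
    q_lo, q_hi = np.quantile(t_boot, [alpha / 2, 1 - alpha / 2])
    # note the reversal: the upper quantile sets the lower endpoint and vice versa
    return theta_hat - se_hat * q_hi, theta_hat - se_hat * q_lo
```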

When the estimator is asymptotically normal and the standard error a reliable estimator of the standard deviation of the distribution, we would expect the t-ratio T to be approximately normally distributed. In this case we would expect q*_{0.975} ≈ −q*_{0.025} ≈ 2. Departures from this baseline occur as the distribution becomes skewed or fat-tailed. If the bootstrap quantiles depart substantially from this baseline it is evidence of a substantial departure from normality. (It may also indicate a programming error, so in these cases it is wise to triple-check!)

The percentile-t has the following advantages. First, when the standard error s(θ^) is reasonably reliable, the percentile-t bootstrap makes use of the information in the standard error, thereby reducing the role of the bootstrap. This can improve the precision of the method relative to other methods. Second, as we show later, the percentile-t intervals achieve higher-order accuracy than the percentile and BC percentile intervals. Third, the percentile-t intervals correspond to the set of parameter values “not rejected” by one-sided t-tests using bootstrap critical values (bootstrap tests are presented in Section 10.21).

The percentile-t interval has the following disadvantages. First, it may be infeasible when standard error formulas are unknown. Second, it may be practically infeasible when standard error calculations are computationally costly (since the standard error calculation needs to be performed on each bootstrap sample). Third, the percentile-t may be unreliable if the standard errors s(θ^*) are unreliable and thus add more noise than clarity. Fourth, the percentile-t interval is not transformation-respecting, unlike the percentile, BC percentile, and BCa percentile intervals.

It is typical to calculate percentile-t intervals with t-ratios constructed with conventional asymptotic standard errors. But this is not the only possible implementation. The percentile-t interval can be constructed with any data-dependent measure of scale. For example, if θ^ is a two-step estimator for which it is unclear how to construct a correct asymptotic standard error, but we know how to calculate a standard error s(θ^) appropriate for the second step alone, then s(θ^) can be used for a percentile-t-type interval as described above. It will not possess the higher-order accuracy properties of the following section, but it will satisfy the conditions for first-order validity.

Furthermore, percentile-t intervals can be constructed using bootstrap standard errors. That is, the statistics T and T* can be computed using bootstrap standard errors s^boot(θ^) and s^boot(θ^*). This is computationally costly as it requires what we call a “nested bootstrap”. Specifically, for each bootstrap replication, a random sample is drawn, the bootstrap estimate θ^* computed, and then B additional bootstrap sub-samples drawn from the bootstrap sample to compute the bootstrap standard error for the bootstrap estimate θ^*. Effectively B² bootstrap samples are drawn and estimated, which increases the computational requirement by an order of magnitude.

We now describe the distribution theory for first-order validity of the percentile-t bootstrap.

First, consider the smooth function model, where θ^ = g(μ^) and s(θ^) = √(n^{-1} G^′V^G^), with bootstrap analogs θ^* = g(μ^*) and s(θ^*) = √(n^{-1} G^*′V^*G^*). From Theorems 6.10, 10.7, and 10.8,

T = √n(θ^ − θ)/√(G^′V^G^) →_d Z

and

T* = √n(θ^* − θ^)/√(G^*′V^*G^*) →_d Z

where Z ~ N(0,1). This shows that the sample and bootstrap t-ratios have the same asymptotic distribution.

This motivates considering the broader situation where the sample and bootstrap t-ratios have the same asymptotic distribution but not necessarily normal. Thus assume that

T →_d ξ   (10.30)
T* →_d ξ   (10.31)

for some continuous distribution ξ. (10.31) implies that the quantiles of T* converge in probability to those of ξ, that is, q*_α →_p q_α where q_α is the αth quantile of ξ. This and (10.30) imply

P[θ ∈ C^pt] = P[θ^ − s(θ^)q*_{1−α/2} ≤ θ ≤ θ^ − s(θ^)q*_{α/2}] = P[q*_{α/2} ≤ T ≤ q*_{1−α/2}] → P[q_{α/2} ≤ ξ ≤ q_{1−α/2}] = 1 − α.

Thus the percentile-t interval is asymptotically valid.

Theorem 10.14 If (10.30) and (10.31) hold where ξ is continuously distributed, then P[θ ∈ C^pt] → 1 − α as n → ∞.

The bootstrap percentile-t intervals for the four estimators are reported in Table 13.2. They are similar to, but somewhat different from, the percentile-type intervals, and generally wider. The largest difference arises with the interval for σ², which is noticeably wider than the other intervals.

10.20 Percentile-t Asymptotic Refinement

This section uses the theory of Edgeworth and Cornish-Fisher expansions introduced in Chapter 9.8-9.10 of Probability and Statistics for Economists. This theory will not be familiar to most students. If you are interested in the following refinement theory it is advisable to start by reading these sections of Probability and Statistics for Economists.

The percentile-t interval can be viewed as the intersection of two one-sided confidence intervals. In our discussion of Edgeworth expansions for the coverage probability of one-sided asymptotic confidence intervals (following Theorem 7.15 in the context of functions of regression coefficients) we found that one-sided asymptotic confidence intervals have accuracy of order O(n^{-1/2}). We now show that the percentile-t interval has improved accuracy.

Theorem 9.13 of Probability and Statistics for Economists showed that the Cornish-Fisher expansion for the quantile qα of a t-ratio T in the smooth function model takes the form

q_α = z_α + n^{-1/2} p_{11}(z_α) + O(n^{-1})

where p_{11}(x) is an even polynomial of order 2 with coefficients depending on the moments up to order 8. The bootstrap quantile q*_α has a similar Cornish-Fisher expansion

q*_α = z_α + n^{-1/2} p*_{11}(z_α) + O_p(n^{-1})

where p*_{11}(x) is the same as p_{11}(x) except that the population moments are replaced by the corresponding sample moments. Sample moments converge at the rate n^{-1/2}. Thus we can replace p*_{11} with p_{11} without affecting the order of this expansion:

q*_α = z_α + n^{-1/2} p_{11}(z_α) + O_p(n^{-1}) = q_α + O_p(n^{-1}).

This shows that the bootstrap quantiles q*_α of the studentized t-ratio are within O_p(n^{-1}) of the exact quantiles q_α.

By the Edgeworth expansion Delta method (Theorem 9.12 of Probability and Statistics for Economists), T and T + (q_α − q*_α) = T + O_p(n^{-1}) have the same Edgeworth expansion to order O(n^{-1}). Thus

P[T ≤ q*_α] = P[T + (q_α − q*_α) ≤ q_α] = P[T ≤ q_α] + O(n^{-1}) = α + O(n^{-1}).

Thus the coverage of the percentile-t interval is

P[θ ∈ C^pt] = P[q*_{α/2} ≤ T ≤ q*_{1−α/2}] = P[q_{α/2} ≤ T ≤ q_{1−α/2}] + O(n^{-1}) = 1 − α + O(n^{-1}).

This is an improved rate of convergence relative to the one-sided asymptotic confidence interval.

Theorem 10.15 Under the assumptions of Theorem 9.11 of Probability and Statistics for Economists, P[θ ∈ C^pt] = 1 − α + O(n^{-1}).

The following definition of the accuracy of a confidence interval is useful.

Definition 10.3 A confidence set C for θ is kth-order accurate if

P[θ ∈ C] = 1 − α + O(n^{-k/2}).

Examining our results we find that one-sided asymptotic confidence intervals are first-order accurate while percentile-t intervals are second-order accurate. When a bootstrap confidence interval (or test) achieves higher-order accuracy than the analogous asymptotic interval (or test), we say that the bootstrap method achieves an asymptotic refinement. Here, we have shown that the percentile-t interval achieves an asymptotic refinement.

In order to achieve this asymptotic refinement it is important that the t-ratio T (and its bootstrap counterpart T*) be constructed with asymptotically valid standard errors. This is because the first term in the Edgeworth expansion is the standard normal distribution, and this requires that the t-ratio is asymptotically normal. This also has the practical finite-sample implication that the accuracy of the percentile-t interval in practice depends on the accuracy of the standard errors used to construct the t-ratio.

We do not go through the details, but normal-approximation bootstrap intervals, percentile bootstrap intervals, and bias-corrected percentile bootstrap intervals are all first-order accurate and do not achieve an asymptotic refinement.

The BCa interval, however, can be shown to be asymptotically equivalent to the percentile-t interval, and thus achieves an asymptotic refinement. We do not make this demonstration here as it is advanced. See Section 3.10.4 of Hall (1992).

Peter Hall
Peter Gavin Hall (1951-2016) of Australia was one of the most influential and
prolific theoretical statisticians in history. He made wide-ranging contributions.
Some of the most relevant for econometrics are theoretical investigations of
bootstrap methods and nonparametric kernel methods.

10.21 Bootstrap Hypothesis Tests

To test the hypothesis H0: θ = θ0 against H1: θ ≠ θ0 the most common approach is a t-test. We reject H0 in favor of H1 for large absolute values of the t-statistic T = (θ^ − θ0)/s(θ^), where θ^ is an estimator of θ and s(θ^) is a standard error for θ^. For a bootstrap test we use the bootstrap algorithm to calculate the critical value. The bootstrap algorithm samples with replacement from the dataset. Given a bootstrap sample the bootstrap estimator θ^* and standard error s(θ^*) are calculated. Given these values the bootstrap t-statistic is T* = (θ^* − θ^)/s(θ^*). There are two important features of the bootstrap t-statistic. First, T* is centered at the sample estimate θ^, not at the hypothesized value θ0. This is done because θ^ is the true value in the bootstrap universe, and the distribution of the t-statistic must be centered at the true value within the bootstrap sampling framework. Second, T* is calculated using the bootstrap standard error s(θ^*). This allows the bootstrap to incorporate the randomness in standard error estimation.

The failure to properly center the bootstrap statistic at θ^ is a common error in applications. Often this is because the hypothesis to be tested is H0: θ = 0, so the test statistic is T = θ^/s(θ^). This intuitively suggests the bootstrap statistic T* = θ^*/s(θ^*), but this is wrong. The correct bootstrap statistic is T* = (θ^* − θ^)/s(θ^*).

The bootstrap algorithm creates B draws T*(b) = (θ^*(b) − θ^)/s(θ^*(b)), b = 1, …, B. The bootstrap critical value is q*_{1−α}, where q*_α is the αth quantile of the absolute values of the bootstrap t-ratios |T*(b)|. A test with size α rejects H0: θ = θ0 in favor of H1: θ ≠ θ0 if |T| > q*_{1−α} and fails to reject if |T| ≤ q*_{1−α}.

It is generally better to report p-values rather than critical values. Recall that a p-value is p = 1 − G_n(|T|) where G_n(u) is the null distribution of the statistic |T|. The bootstrap p-value is defined as p* = 1 − G*_n(|T|), where G*_n(u) is the bootstrap distribution of |T*|. This is estimated from the bootstrap algorithm as

p* = (1/B) ∑_{b=1}^B 1{|T*(b)| > |T|},

the percentage of bootstrap t-statistics that are larger in absolute value than the observed t-statistic. Intuitively, we want to know how “unusual” is the observed statistic T when the null hypothesis is true. The bootstrap algorithm generates a large number of independent draws from the distribution of T* (which is an approximation to the unknown distribution of T). If the percentage of the |T*(b)| that exceed |T| is very small (say 1%) this tells us that |T| is an unusually large value. However, if the percentage is larger, say 15%, then we cannot interpret |T| as unusually large.
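As an illustration, a two-sided bootstrap t-test p-value can be computed along the following lines (a sketch with hypothetical input names; note the centering of the bootstrap t-ratios at θ^):

```python
import numpy as np

def bootstrap_t_pvalue(theta_hat, se_hat, theta_boot, se_boot, theta0=0.0):
    """Two-sided bootstrap t-test p-value (sketch)."""
    t_stat = (theta_hat - theta0) / se_hat                              # sample t-ratio
    t_boot = np.abs((np.asarray(theta_boot) - theta_hat) / np.asarray(se_boot))  # |T*(b)|
    return np.mean(t_boot > abs(t_stat))                                # bootstrap p-value p*
```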

If desired, the bootstrap test can be implemented as a one-sided test. In this case the statistic is the signed version of the t-ratio, and bootstrap critical values are calculated from the upper tail of the distribution for the alternative H1: θ > θ0, and from the lower tail for the alternative H1: θ < θ0. There is a connection between the one-sided tests and the percentile-t confidence interval. The latter is the set of parameter values θ which are not rejected by either one-sided α/2-level bootstrap t-test.

Bootstrap tests can also be conducted with other statistics. When standard errors are not available or are not reliable we can use the non-studentized statistic T = θ^ − θ0. The bootstrap version is T* = θ^* − θ^. Let q*_α be the αth quantile of the bootstrap statistics |θ^*(b) − θ^|. A bootstrap test with size α rejects H0: θ = θ0 if |θ^ − θ0| > q*_{1−α}. The bootstrap p-value is

p* = (1/B) ∑_{b=1}^B 1{|θ^*(b) − θ^| > |θ^ − θ0|}.

Theorem 10.16 If (10.30) and (10.31) hold where ξ is continuously distributed, then the bootstrap critical value satisfies q*_{1−α} →_p q_{1−α}, where q_{1−α} is the (1−α)th quantile of |ξ|. The bootstrap test “Reject H0 in favor of H1 if |T| > q*_{1−α}” has asymptotic size α: P[|T| > q*_{1−α} | H0] → α as n → ∞.

In the smooth function model the t-test (with correct standard errors) has the following performance.

Theorem 10.17 Under the assumptions of Theorem 9.11 of Probability and Statistics for Economists,

q*_{1−α} = z¯_{1−α} + O_p(n^{-1})

where z¯_α = Φ^{-1}((1+α)/2) is the αth quantile of |Z|. The asymptotic test “Reject H0 in favor of H1 if |T| > z¯_{1−α}” has accuracy

P[|T| > z¯_{1−α} | H0] = α + O(n^{-1})

and the bootstrap test “Reject H0 in favor of H1 if |T| > q*_{1−α}” has accuracy

P[|T| > q*_{1−α} | H0] = α + o(n^{-1}).

This shows that the bootstrap test achieves a refinement relative to the asymptotic test.

The reasoning is as follows. We have shown that the Edgeworth expansion for the absolute t-ratio takes the form

P[|T| ≤ x] = 2Φ(x) − 1 + n^{-1} 2p_2(x) + o(n^{-1}).

This means the asymptotic test has accuracy of order O(n^{-1}).

Given the Edgeworth expansion, the Cornish-Fisher expansion for the αth quantile qα of the distribution of |T| takes the form

q_α = z¯_α + n^{-1} p_{21}(z¯_α) + o(n^{-1}).

The bootstrap quantile q*_α has the Cornish-Fisher expansion

q*_α = z¯_α + n^{-1} p*_{21}(z¯_α) + o_p(n^{-1}) = z¯_α + n^{-1} p_{21}(z¯_α) + o_p(n^{-1}) = q_α + o_p(n^{-1})

where p*_{21}(x) is the same as p_{21}(x) except that the population moments are replaced by the corresponding sample moments. The bootstrap test has rejection probability, using the Edgeworth expansion Delta method (Theorem 9.12 of Probability and Statistics for Economists),

P[|T| > q*_{1−α} | H0] = P[|T| + (q_{1−α} − q*_{1−α}) > q_{1−α}] = P[|T| > q_{1−α}] + o(n^{-1}) = α + o(n^{-1})

as claimed.

10.22 Wald-Type Bootstrap Tests

If θ is a vector then to test H0: θ = θ0 against H1: θ ≠ θ0 at size α, a common test is based on the Wald statistic W = (θ^ − θ0)′ V^_θ^{-1} (θ^ − θ0), where θ^ is an estimator of θ and V^_θ is a covariance matrix estimator. For a bootstrap test we use the bootstrap algorithm to calculate the critical value. The bootstrap algorithm samples with replacement from the dataset. Given a bootstrap sample the bootstrap estimator θ^* and covariance matrix estimator V^*_θ are calculated. Given these values the bootstrap Wald statistic is

W* = (θ^* − θ^)′ V^*_θ^{-1} (θ^* − θ^).

As for the t-test it is essential that the bootstrap Wald statistic W* is centered at the sample estimator θ^ instead of the hypothesized value θ0. This is because θ^ is the true value in the bootstrap universe.

Based on B bootstrap replications we calculate the αth quantile q*_α of the distribution of the bootstrap Wald statistics W*(b). The bootstrap test rejects H0 in favor of H1 if W > q*_{1−α}. More commonly, we calculate a bootstrap p-value. This is

p* = (1/B) ∑_{b=1}^B 1{W*(b) > W}.
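A sketch of the corresponding computation for the Wald case, assuming the bootstrap estimates and covariance matrices have already been stored (input names are illustrative):

```python
import numpy as np

def bootstrap_wald_pvalue(theta_hat, V_hat, theta_boot, V_boot, theta0):
    """Bootstrap Wald test p-value (sketch). theta_boot has shape (B, k); V_boot (B, k, k)."""
    d = theta_hat - theta0
    W = d @ np.linalg.solve(V_hat, d)                       # sample Wald statistic
    W_boot = np.empty(len(theta_boot))
    for b, (tb, Vb) in enumerate(zip(theta_boot, V_boot)):
        db = tb - theta_hat                                 # centered at theta_hat, not theta0
        W_boot[b] = db @ np.linalg.solve(Vb, db)
    return np.mean(W_boot > W)                              # bootstrap p-value
```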

The asymptotic performance of the Wald test mimics that of the t-test. In general, the bootstrap Wald test is first-order correct (achieves the correct size asymptotically) and under conditions for which an Edgeworth expansion exists, has accuracy

P[W > q*_{1−α} | H0] = α + o(n^{-1})

and thus achieves a refinement relative to the asymptotic Wald test.

If a reliable covariance matrix estimator V^_θ is not available, a Wald-type test can be implemented with any positive-definite weight matrix instead of V^_θ. This includes simple choices such as the identity matrix. The bootstrap algorithm can be used to calculate critical values and p-values for the test. So long as the estimator θ^ has an asymptotic distribution this bootstrap test will be asymptotically first-order valid. The test will not achieve an asymptotic refinement but provides a simple method to test hypotheses when covariance matrix estimates are not available.

10.23 Criterion-Based Bootstrap Tests

A criterion-based estimator takes the form

β^=argminβJ(β)

for some criterion function J(β). This includes least squares, maximum likelihood, and minimum distance. Given a hypothesis H0: θ = θ0 where θ = r(β), the restricted estimator which satisfies H0 is

β~ = argmin_{r(β)=θ0} J(β).

A criterion-based statistic to test H0 is

J = min_{r(β)=θ0} J(β) − min_β J(β) = J(β~) − J(β^).

A criterion-based test rejects H0 for large values of J. A bootstrap test uses the bootstrap algorithm to calculate the critical value.

In this context we need to be a bit thoughtful about how to construct bootstrap versions of J. It might seem natural to construct the exact same statistic on the bootstrap samples as on the original sample, but this is incorrect. It makes the same error as calculating a t-ratio or Wald statistic centered at the hypothesized value. In the bootstrap universe, the true value of θ is not θ0, rather it is θ^=r(β^). Thus when using the nonparametric bootstrap, we want to impose the constraint r(β)=r(β^)=θ^ to obtain the bootstrap version of J.

Thus, the correct way to calculate a bootstrap version of J is as follows. Generate a bootstrap sample by random sampling from the dataset. Let J*(β) be the bootstrap version of the criterion. On the bootstrap sample calculate the unrestricted estimator β^* = argmin_β J*(β) and the restricted version β~* = argmin_{r(β)=θ^} J*(β), where θ^ = r(β^). The bootstrap statistic is

J* = min_{r(β)=θ^} J*(β) − min_β J*(β) = J*(β~*) − J*(β^*).

Calculate J* on each bootstrap sample. Take the (1−α)th quantile q*_{1−α}. The bootstrap test rejects H0 in favor of H1 if J > q*_{1−α}. The bootstrap p-value is

p* = (1/B) ∑_{b=1}^B 1{J*(b) > J}.

Special cases of criterion-based tests are minimum distance tests, F tests, and likelihood ratio tests. Take the F test for a linear hypothesis Rβ=θ0. The F statistic is

F = ((σ~² − σ^²)/q) / (σ^²/(n−k))

where σ^2 is the unrestricted estimator of the error variance, σ~2 is the restricted estimator, q is the number of restrictions and k is the number of estimated coefficients. The bootstrap version of the F statistic is

F* = ((σ~*² − σ^*²)/q) / (σ^*²/(n−k))

where σ^*² is the unrestricted estimator of the error variance on the bootstrap sample, and σ~*² is the restricted estimator which imposes the restriction Rβ = θ^ = Rβ^.
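To illustrate how the re-centered restriction enters, the following sketch computes a bootstrap F statistic for a linear hypothesis Rβ = θ, imposing Rβ = θ^ = Rβ^ on the bootstrap sample via constrained least squares. This is an illustrative implementation under the stated assumptions (the restricted estimator is computed with the standard CLS formula), not the only way to do it.

```python
import numpy as np

def boot_f_stat(Y_star, X_star, R, theta_hat):
    """Bootstrap F statistic (sketch) for R beta = theta, re-centered at theta_hat = R beta_hat."""
    n, k = X_star.shape
    q = R.shape[0]
    beta_u = np.linalg.lstsq(X_star, Y_star, rcond=None)[0]        # unrestricted OLS
    sig2_u = np.mean((Y_star - X_star @ beta_u) ** 2)
    # constrained least squares:
    # beta_r = beta_u - (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (R beta_u - theta_hat)
    XtX_inv = np.linalg.inv(X_star.T @ X_star)
    A = XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T)
    beta_r = beta_u - A @ (R @ beta_u - theta_hat)
    sig2_r = np.mean((Y_star - X_star @ beta_r) ** 2)
    return ((sig2_r - sig2_u) / q) / (sig2_u / (n - k))
```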

Take the likelihood ratio (LR) test for the hypothesis r(β)=θ0. The LR test statistic is

LR = 2(ℓ_n(β^) − ℓ_n(β~))

where β^ is the unrestricted MLE and β~ is the restricted MLE (imposing r(β)=θ0 ). The bootstrap version is

LR* = 2(ℓ*_n(β^*) − ℓ*_n(β~*))

where ℓ*_n(β) is the log-likelihood function calculated on the bootstrap sample, β^* is the unrestricted maximizer, and β~* is the restricted maximizer imposing the restriction r(β) = r(β^).

10.24 Parametric Bootstrap

Throughout this chapter we have described the most popular form of the bootstrap known as the nonparametric bootstrap. However there are other forms of the bootstrap algorithm including the parametric bootstrap. This is appropriate when there is a full parametric model for the distribution as in likelihood estimation.

First, consider the context where the model specifies the full distribution of the random vector Y, e.g. Y ~ F(y | β) where the distribution function F is known but the parameter β is unknown. Let β^ be an estimator of β such as the maximum likelihood estimator. The parametric bootstrap algorithm generates bootstrap observations Yi* by drawing random vectors from the distribution function F(y | β^). When this is done, the true value of β in the bootstrap universe is β^. Everything which has been discussed in this chapter can be applied using this bootstrap algorithm.

Second, consider the context where the model specifies the conditional distribution of the random vector Y given the random vector X, e.g. Y | X ~ F(y | X, β). An example is the normal linear regression model, where Y | X ~ N(X′β, σ²). In this context we can hold the regressors Xi fixed and then draw the bootstrap observations Yi* from the conditional distribution F(y | Xi, β^). In the example of the normal regression model this is equivalent to drawing a normal error ei* ~ N(0, σ^²) and then setting Yi* = Xi′β^ + ei*. Again, in this algorithm the true value of β is β^ and everything which is discussed in this chapter can be applied as before.
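A sketch of the parametric bootstrap for the normal regression case just described, holding the regressors fixed and drawing normal errors (function and variable names are illustrative):

```python
import numpy as np

def parametric_boot_normal_reg(Y, X, B=1000, rng=None):
    """Parametric bootstrap for the normal regression model (sketch):
    Y* = X beta_hat + e*, with e* drawn i.i.d. N(0, sigma2_hat)."""
    rng = rng or np.random.default_rng()
    n, k = X.shape
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    sig2_hat = np.mean((Y - X @ beta_hat) ** 2)               # MLE of the error variance
    beta_boot = np.empty((B, k))
    for b in range(B):
        Y_star = X @ beta_hat + rng.normal(0.0, np.sqrt(sig2_hat), n)
        beta_boot[b] = np.linalg.lstsq(X, Y_star, rcond=None)[0]
    return beta_boot
```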

Third, consider tests of the hypothesis r(β) = θ0. In this context we can also construct a restricted estimator β~ (for example the restricted MLE) which satisfies the hypothesis r(β~) = θ0. Then we can generate bootstrap samples by simulating from the distribution Yi* ~ F(y | β~), or in the conditional context from Yi* ~ F(y | Xi, β~). When this is done the true value of β in the bootstrap is β~, which satisfies the hypothesis. So in this context the correct values of the bootstrap statistics are

T* = (θ^* − θ0)/s(θ^*)
W* = (θ^* − θ0)′ V^*_θ^{-1} (θ^* − θ0)
J* = min_{r(β)=θ0} J*(β) − min_β J*(β)
LR* = 2(max_β ℓ*_n(β) − max_{r(β)=θ0} ℓ*_n(β))

and

F* = ((σ~*² − σ^*²)/q) / (σ^*²/(n−k))

where σ^*² is the unrestricted estimator on the bootstrap sample and σ~*² is the restricted estimator which imposes the restriction Rβ = θ0.

The primary advantage of the parametric bootstrap (relative to the nonparametric bootstrap) is that it will be more accurate when the parametric model is correct. This may be quite important in small samples. The primary disadvantage of the parametric bootstrap is that it can be inaccurate when the parametric model is incorrect.

10.25 How Many Bootstrap Replications?

How many bootstrap replications should be used? There is no universally correct answer as there is a trade-off between accuracy and computation cost. Computation cost is essentially linear in B. The simulation accuracy of bootstrap calculations (standard errors or p-values) improves at the rate B^{-1/2}. Improved accuracy can be obtained, but only at a higher computational cost.

In most empirical research, most calculations are quick and investigatory, not requiring full accuracy. But final results (those going into the final version of the paper) should be accurate. Thus it seems reasonable to use asymptotic and/or bootstrap methods with a modest number of replications for daily calculations, but use a much larger B for the final version.

In particular, for final calculations B = 10,000 is desired, with B = 1000 a minimal choice. In contrast, for daily quick calculations values as low as B = 100 may be sufficient for rough estimates. A useful way to think about the accuracy of bootstrap methods stems from the calculation of p-values. The bootstrap p-value p* is an average of B Bernoulli draws. The variance of the simulation estimator of p* is p*(1 − p*)/B, which is bounded above by 1/(4B). To calculate the p-value within, say, 0.01 of the true value with 95% probability requires a standard error below 0.005. This is ensured if B ≥ 10,000.
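For example, the bound 1/(4B) on the simulation variance translates into the following approximate worst-case simulation standard errors for a bootstrap p-value:

```python
import numpy as np

# Worst-case simulation standard error of a bootstrap p-value: sqrt(1/(4B)).
for B in (100, 1000, 10_000):
    print(B, np.sqrt(0.25 / B))   # prints roughly 0.05, 0.016, 0.005
```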

Stata by default sets B=50. This is useful for verification that a program runs but is a poor choice for empirical reporting. Make sure that you set B to the value you want.

10.26 Setting the Bootstrap Seed

Computers do not generate true random numbers but rather pseudo-random numbers generated by a deterministic algorithm. The algorithms generate sequences which are indistinguishable from random sequences so this is not a worry for bootstrap applications.

The methods, however, necessarily require a starting value known as a “seed”. Some packages (including Stata and MATLAB) implement this with a default seed which is reset each time the statistical package is started. This means if you start the package fresh, run a bootstrap program (e.g. a “do” file in Stata), exit the package, restart the package and then rerun the bootstrap program, you should obtain exactly the same results. If you instead run the bootstrap program (e.g. “do” file) twice sequentially without restarting the package, the seed is not reset so a different set of pseudo-random numbers will be generated and the results from the two runs will be different.

The R package has a different implementation. When R is loaded the random number seed is generated based on the computer’s clock (which results in an essentially random starting seed). Therefore if you run a bootstrap program in R, exit, restart, and rerun, you will obtain a different set of random draws and therefore a different bootstrap result.

Packages allow users to set their own seed. (In Stata, the command is set seed #. In MATLAB the command is rng(#). In R the command is set.seed(#).) If the seed is set to a specific number at the start of a file then the exact same pseudo-random numbers will be generated each time the program is run. If this is the case, the results of a bootstrap calculation (standard error or test) will be identical across computer runs.

The fact that the bootstrap results can be fixed by setting the seed in the replication file has motivated many researchers to follow this choice. They set the seed at the start of the replication file so that repeated executions result in the same numerical findings.

Fixing seeds, however, should be done cautiously. It may be a wise choice for a final calculation (when a paper is finished) but is an unwise choice for daily calculations. If you use a small number of replications in your preliminary work, say B = 100, the bootstrap calculations will be inaccurate. But as you run your results again and again (as is typical in empirical projects) you will obtain the same numerical standard errors and test results, giving you a false sense of stability and accuracy. If instead a different seed is used each time the program is run then the bootstrap results will vary across runs, and you will observe that the results vary across these runs, giving you important and meaningful information about the (lack of) accuracy in your results. One way to ensure this is to set the seed according to the current clock. In MATLAB use the command rng('shuffle'). In R use set.seed(seed = NULL). Stata does not have this option.

These considerations lead to a recommended hybrid approach. For daily empirical investigations do not fix the bootstrap seed in your program unless you have it set by the clock. For your final calculations set the seed to a specific arbitrary choice, and set B=10,000 so that the results are insensitive to the seed.

10.27 Bootstrap Regression

A major focus of this textbook has been on the least squares estimator β^ in the projection model. The bootstrap can be used to calculate standard errors and confidence intervals for smooth functions of the coefficient estimates.

The nonparametric bootstrap algorithm, as described before, samples observations randomly with replacement from the dataset, creating the bootstrap sample {(Y1*, X1*), …, (Yn*, Xn*)}, or in matrix notation (Y*, X*). It is important to recognize that entire observations (pairs of Yi and Xi) are sampled. This is often called the pairs bootstrap.

Given this bootstrap sample, we calculate the regression estimator

β^* = (X*′X*)^{-1}(X*′Y*).   (10.32)

This is repeated B times. The bootstrap standard errors are the standard deviations across the draws and confidence intervals are constructed from the empirical quantiles across the draws.
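A sketch of the pairs bootstrap for least squares (illustrative names; a production implementation would also guard against the near-singular bootstrap designs discussed below):

```python
import numpy as np

def pairs_bootstrap_ols(Y, X, B=1000, rng=None):
    """Pairs (nonparametric) bootstrap for OLS (sketch): resample whole
    observations (Y_i, X_i) with replacement and re-run least squares."""
    rng = rng or np.random.default_rng()
    n, k = X.shape
    beta_boot = np.empty((B, k))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                    # i.i.d. draws with replacement
        beta_boot[b] = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]
    se_boot = beta_boot.std(axis=0, ddof=1)                 # bootstrap standard errors
    return beta_boot, se_boot
```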

What is the nature of the bootstrap distribution of β^*? It is useful to start with the distribution of the bootstrap observations (Yi*, Xi*), which is the discrete distribution which puts mass 1/n on each observation pair (Yi, Xi). The bootstrap universe can be thought of as the empirical scatter plot of the observations. The true value of the projection coefficient in this bootstrap universe is

(E*[Xi* Xi*′])^{-1}(E*[Xi* Yi*]) = ((1/n) ∑_{i=1}^n Xi Xi′)^{-1}((1/n) ∑_{i=1}^n Xi Yi) = β^.

We see that the true value in the bootstrap distribution is the least squares estimator β^.

The bootstrap observations satisfy the projection equation

Yi* = Xi*′β^ + ei*,   E*[Xi* ei*] = 0.   (10.33)

For each bootstrap pair (Yi*, Xi*) = (Yj, Xj) the true error ei* = e^j equals the least squares residual from the original dataset. This is because each bootstrap pair corresponds to an actual observation.

A technical problem (which is typically ignored) is that it is possible for X*′X* to be singular in a simulated bootstrap sample, in which case the least squares estimator β^* is not uniquely defined. Indeed, the probability that X*′X* is singular is positive. For example, the probability that a bootstrap sample consists entirely of one observation repeated n times is n^{−(n−1)}. This is a small probability, but positive. A more significant example is sparse dummy variable designs where it is possible to draw an entire sample with only one observed value for the dummy variable. For example, if a sample has n = 20 observations with a dummy variable with treatment (equals 1) for only three of the 20 observations, the probability is 4% that a bootstrap sample contains entirely non-treated values (all 0’s). 4% is quite high!

The standard approach to circumvent this problem is to compute β^* only if X*′X* is non-singular as defined by a conventional numerical tolerance, and treat it as missing otherwise. A better solution is to define a tolerance which bounds X*′X* away from singularity. Define the ratio of the smallest eigenvalue of the bootstrap design matrix to that of the sample design matrix

λ* = λ_min(X*′X*) / λ_min(X′X).

If, in a given bootstrap replication, λ* is smaller than a given tolerance τ (Shao and Tu (1995, p. 291) recommend τ = 1/2), then the estimator can be treated as missing, or we can define the trimming rule

β^** = { β^* if λ* ≥ τ;  β^ if λ* < τ }.   (10.34)

This ensures that the bootstrap estimator β^** will be well behaved.

10.28 Bootstrap Regression Asymptotic Theory

Define the least squares estimator β^, its bootstrap version β^* as in (10.32), and the transformations θ^ = r(β^) and θ^* = r(β^*) for some smooth transformation r. Let V^_β and V^_θ denote heteroskedasticity-robust covariance matrix estimators for β^ and θ^, and let V^*_β and V^*_θ be their bootstrap versions. When θ is scalar define the standard errors s(θ^) = √(n^{-1} V^_θ) and s(θ^*) = √(n^{-1} V^*_θ). Define the t-ratio T = (θ^ − θ)/s(θ^) and its bootstrap version T* = (θ^* − θ^)/s(θ^*). We are interested in the asymptotic distributions of β^*, θ^*, and T*.

Since the bootstrap observations satisfy the model (10.33), we see by standard calculations that

√n(β^* − β^) = ((1/n) ∑_{i=1}^n Xi* Xi*′)^{-1} ((1/√n) ∑_{i=1}^n Xi* ei*).

By the bootstrap WLLN

(1/n) ∑_{i=1}^n Xi* Xi*′ →_p E[Xi Xi′] = Q

and by the bootstrap CLT

(1/√n) ∑_{i=1}^n Xi* ei* →_d N(0, Ω)

where Ω = E[X X′ e²]. Again applying the bootstrap WLLN we obtain

V^*_β →_p V_β = Q^{-1} Ω Q^{-1}

and

V^*_θ →_p V_θ = R′ V_β R

where R=R(β).

Combining with the bootstrap CMT and delta method we establish the asymptotic distribution of the bootstrap regression estimator.

Theorem 10.18 Under Assumption 7.2, as n

√n(β^* − β^) →_d N(0, V_β).

If Assumption 7.3 also holds then

√n(θ^* − θ^) →_d N(0, V_θ).

If Assumption 7.4 also holds then

T* →_d N(0,1).

This means that the bootstrap confidence interval and testing methods all apply for inference on β and θ. This includes the percentile, BC percentile, BCa, and percentile-t intervals, and hypothesis tests based on t-tests, Wald tests, MD tests, LR tests and F tests.

To justify bootstrap standard errors we also need to verify the uniform square integrability of β^* and θ^*. This is technically challenging because the least squares estimator involves matrix inversion, which is not globally continuous. A partial solution is to use the trimmed estimator (10.34). This bounds the moments of β^** by those of n^{-1} ∑_{i=1}^n Xi* ei*. Since this is a sample mean, Theorem 10.10 applies and V^*_β is bootstrap consistent for V_β. However, this does not ensure that V^*_θ will be consistent for V_θ unless the function r(x) satisfies the conditions of Theorem 10.10. For general applications use a trimmed estimator for the bootstrap variance. For some τ_n = O(e^{n/8}) define

Z_n* = √n(θ^* − θ^)
Z_n** = Z_n* 1{‖Z_n*‖ ≤ τ_n}
Z¯** = (1/B) ∑_{b=1}^B Z**(b)
V^_θ^{boot,τ} = (1/(B−1)) ∑_{b=1}^B (Z**(b) − Z¯**)(Z**(b) − Z¯**)′.

The matrix V^_θ^{boot,τ} is a trimmed bootstrap estimator of the variance of Z_n = √n(θ^ − θ). The associated bootstrap standard error for θ^ (in the scalar case) is s(θ^) = √(n^{-1} V^_θ^{boot,τ}).

By an application of Theorems 10.11 and 10.12, we find that this estimator V^_θ^{boot,τ} is consistent for the asymptotic variance.

Theorem 10.19 Under Assumptions 7.2 and 7.3, as n → ∞, V^_θ^{boot,τ} →_p V_θ.

Programs such as Stata use the untrimmed estimator V^_θ^{boot} rather than the trimmed estimator V^_θ^{boot,τ}. This means that we should be cautious about interpreting reported bootstrap standard errors, especially for nonlinear functions such as ratios.

10.29 Wild Bootstrap

Take the linear regression model

Y = X′β + e,   E[e | X] = 0.

What is special about this model is the conditional mean restriction. The nonparametric bootstrap (which samples the pairs (Yi,Xi) i.i.d. from the original observations) does not make use of this restriction. Consequently the bootstrap distribution for (Y,X) does not satisfy the conditional mean restriction and therefore does not satisfy the linear regression assumption. To improve precision it seems reasonable to impose the conditional mean restriction on the bootstrap distribution.

A natural approach is to hold the regressors Xi fixed and then draw the errors ei in some way which imposes a conditional mean of zero. The simplest approach is to draw the errors independent from the regressors, perhaps from the empirical distribution of the residuals. This procedure is known as the residual bootstrap. However, this imposes independence of the errors from the regressors which is much stronger than the conditional mean assumption. This is generally undesirable.

A method which imposes the conditional mean restriction while allowing general heteroskedasticity is the wild bootstrap. It was proposed by Liu (1988) and extended by Mammen (1993). The method uses auxiliary random variables ξi* which are i.i.d., mean zero, and variance 1. The bootstrap observations are then generated as Yi* = Xi′β^ + ei* with ei* = e^i ξi*, where the regressors Xi are held fixed at their sample values, β^ is the sample least squares estimator, and e^i are the least squares residuals, which are also held fixed at their sample values.

This algorithm generates bootstrap errors ei* which are conditionally mean zero. Thus the bootstrap pairs (Yi*, Xi) satisfy a linear regression with “true” coefficient β^. The conditional variance of the wild bootstrap errors ei* is E*[ei*² | Xi] = e^i². This means that the conditional variance of the bootstrap estimator β^* is

E*[(β^* − β^)(β^* − β^)′ | X] = (X′X)^{-1} (∑_{i=1}^n Xi Xi′ e^i²) (X′X)^{-1}

which is the White estimator of the variance of β^. Thus the wild bootstrap replicates the appropriate first and second moments of the distribution.

Two distributions have been proposed for the auxiliary variables ξi*, both of which are two-point discrete distributions. The first is the Rademacher distribution, which satisfies P[ξ = 1] = 1/2 and P[ξ = −1] = 1/2. The second is the Mammen (1993) two-point distribution

P[ξ = (1+√5)/2] = (√5 − 1)/(2√5),   P[ξ = (1−√5)/2] = (√5 + 1)/(2√5).

The reasoning behind the Mammen distribution is that this choice implies E[ξ3]=1, which implies that the third central moment of β^ matches the natural nonparametric estimator of the third central moment of β^. Since the wild bootstrap matches the first three moments, the percentile-t interval and one-sided t-tests can be shown to achieve asymptotic refinements.

The reasoning behind the Rademacher distribution is that this choice implies E[ξ4]=1, which implies that the fourth central moment of β^ matches the natural nonparametric estimator of the fourth central moment of β^. If the regression errors e are symmetrically distributed (so the third moment is zero) then the first four moments are matched. In this case the wild bootstrap should have even better performance, and additionally two-sided t-tests can be shown to achieve an asymptotic refinement. When the regression error is not symmetrically distributed these asymptotic refinements are not achieved. Limited simulation evidence for one-sided t-tests presented in Davidson and Flachaire (2008) suggests that the Rademacher distribution (used with the restricted wild bootstrap) has better performance and is their recommendation.
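A sketch of the (unrestricted) wild bootstrap with either Rademacher or Mammen weights, holding X and the residuals fixed (names are illustrative):

```python
import numpy as np

def wild_bootstrap_ols(Y, X, B=1000, weights="rademacher", rng=None):
    """Wild bootstrap for OLS (sketch): Y* = X beta_hat + e_hat * xi, with i.i.d.
    mean-zero, unit-variance auxiliary weights xi."""
    rng = rng or np.random.default_rng()
    n, k = X.shape
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    e_hat = Y - X @ beta_hat
    beta_boot = np.empty((B, k))
    for b in range(B):
        if weights == "rademacher":
            xi = rng.choice([-1.0, 1.0], size=n)
        else:  # Mammen two-point weights
            p = (np.sqrt(5) + 1) / (2 * np.sqrt(5))          # P[xi = (1 - sqrt 5)/2]
            xi = np.where(rng.random(n) < p, (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2)
        Y_star = X @ beta_hat + e_hat * xi
        beta_boot[b] = np.linalg.lstsq(X, Y_star, rcond=None)[0]
    return beta_boot
```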

For hypothesis testing improved precision can be obtained by the restricted wild bootstrap. Consider tests of the hypothesis H0: r(β) = 0. Let β~ be a CLS or EMD estimator of β subject to the restriction r(β~) = 0. Let e~i = Yi − Xi′β~ be the constrained residuals. The restricted wild bootstrap algorithm generates observations as Yi* = Xi′β~ + ei* with ei* = e~i ξi*. With this modification β~ is the true value in the bootstrap universe so the null hypothesis H0 holds. Thus bootstrap tests are constructed the same as for the parametric bootstrap using a restricted parameter estimator.

10.30 Bootstrap for Clustered Observations

Bootstrap methods can also be applied to clustered samples, though the methodological literature is relatively thin. Here we review methods discussed in Cameron, Gelbach and Miller (2008).

Let Yg = (Y_{1g}, …, Y_{ng g})′ and Xg = (X_{1g}, …, X_{ng g})′ denote the ng × 1 vector of dependent variables and the ng × k matrix of regressors for the gth cluster. A linear regression model using cluster notation is Yg = Xg β + eg where eg = (e_{1g}, …, e_{ng g})′ is an ng × 1 error vector. The sample has G cluster pairs (Yg, Xg).

The pairs cluster bootstrap samples G cluster pairs (Yg*, Xg*) with replacement to create the bootstrap sample. Least squares is applied to the bootstrap sample to obtain the coefficient estimators. By repeating this B times, bootstrap standard errors for the coefficient estimates, or functions of the coefficient estimates, can be calculated. Percentile, BC percentile, and BCa confidence intervals can be calculated.
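A sketch of the pairs cluster bootstrap, resampling whole clusters with replacement (illustrative names):

```python
import numpy as np

def pairs_cluster_bootstrap_ols(Y, X, cluster_id, B=1000, rng=None):
    """Pairs cluster bootstrap for OLS (sketch): resample whole clusters
    (Y_g, X_g) with replacement and re-estimate on the stacked sample."""
    rng = rng or np.random.default_rng()
    cluster_id = np.asarray(cluster_id)
    groups = [np.flatnonzero(cluster_id == g) for g in np.unique(cluster_id)]
    G, k = len(groups), X.shape[1]
    beta_boot = np.empty((B, k))
    for b in range(B):
        draw = rng.integers(0, G, size=G)                    # sample G clusters with replacement
        idx = np.concatenate([groups[g] for g in draw])
        beta_boot[b] = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]
    return beta_boot
```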

The BCa interval requires an estimator of the acceleration coefficient a which is a scaled jackknife estimate of the third moment of the estimator. In the context of clustered observations the delete-cluster jackknife should be used for estimation of a.

Furthermore, on each bootstrap sample the cluster-robust standard errors can be calculated and used to compute bootstrap t-ratios, from which percentile-t confidence intervals can be calculated.

The wild cluster bootstrap fixes the clusters and regressors, and generates the bootstrap observations as

Yg* = Xg β^ + eg*,   eg* = e^g ξg*

where ξg* is a scalar auxiliary random variable as described in the previous section. Notice that ξg* is interacted with the entire vector of residuals from cluster g. Cameron, Gelbach and Miller (2008) follow the recommendation of Davidson and Flachaire (2008) and use Rademacher random variables for ξg*.

For hypothesis testing, Cameron, Gelbach and Miller (2008) recommend the restricted wild cluster bootstrap. For tests of H0: r(β) = 0, let β~ be a CLS or EMD estimator of β subject to the restriction r(β~) = 0. Let e~g = Yg − Xg β~ be the constrained cluster-level residuals. The restricted wild cluster bootstrap algorithm generates observations as

Yg* = Xg β~ + eg*,   eg* = e~g ξg*.

On each bootstrap sample the test statistic for H0 (t-ratio, Wald, LR, or F) is calculated. Since the bootstrap algorithm satisfies H0, these statistics are centered at the hypothesized value. p-values are then calculated conventionally and used to assess the significance of the test statistic.

There are several reasons why conventional asymptotic approximations may work poorly with clustered observations. First, while the sample size n may be large, the effective sample size is the number of clusters G. This is because when the dependence structure within each cluster is unconstrained the central limit theorem effectively treats each cluster as a single observation. Thus, if G is small we should treat inference as a small sample problem. Second, cluster-robust covariance matrix estimation explicitly treats each cluster as a single observation. Consequently the accuracy of normal approximations to t-ratios and Wald statistics is more accurately viewed as a small sample distribution problem. Third, when cluster sizes ng are heterogeneous this means that the estimation problems just described also involve heterogeneous variances. Specifically, heterogeneous cluster sizes induce a high degree of effective heteroskedasticity (since the variance of a within-cluster sum is proportional to ng). When G is small this means that cluster-robust inference is similar to finite-sample inference with a small heteroskedastic sample. Fourth, interest often concerns a treatment which is applied at the level of a cluster (such as the effect of tracking discussed in Section 4.21). If the number of treated clusters is small this is equivalent to estimation with a highly sparse dummy variable design, in which case cluster-robust covariance matrix estimation can be unreliable.

These concerns suggest that conventional normal approximations may be poor in the context of clustered observations with a small number of groups G, motivating the use of bootstrap methods. However, these concerns also can cause challenges with the accuracy of bootstrap approximations. When the number of clusters G is small, the cluster sizes ng heterogeneous, or the number of treated clusters small, bootstrap methods may be inaccurate. In such cases inference should proceed cautiously.

To illustrate the use of the pairs cluster bootstrap, Table 10.4 reports the estimates of the example from Section 4.21 of the effect of tracking on test scores from Duflo, Dupas, and Kremer (2011). In addition to the asymptotic cluster standard error we report the cluster jackknife and cluster bootstrap standard errors as well as three percentile-type confidence intervals. We use 10,000 bootstrap replications. In this example the asymptotic, jackknife, and cluster bootstrap standard errors are identical, which reflects the good balance of this particular regression design.

Table 10.4: Comparison of Methods for Estimate of Effect of Tracking

Coefficient on Tracking 0.138
Asymptotic cluster s.e. (0.078)
Jackknife cluster s.e. (0.078)
Cluster Bootstrap s.e. (0.078)
95% Percentile Interval [−0.013, 0.291]
95% Percentile Interval [−0.015, 0.289]
95% Percentile Interval [−0.018, 0.286]

In Stata, to obtain cluster bootstrap standard errors and confidence intervals use the options cluster(id) vce(bootstrap, reps(#)), where id is the cluster variable and # is the number of replications.

10.31 Technical Proofs*

Some of the asymptotic results are facilitated by the following convergence result.

Theorem 10.20 (Marcinkiewicz WLLN) If ui are independent and uniformly integrable, then for any r > 1, as n → ∞, n^{-r} ∑_{i=1}^n |ui|^r →_p 0.

Proof of Theorem 10.20

n^{-r} ∑_{i=1}^n |ui|^r ≤ (n^{-1} max_{1≤i≤n} |ui|)^{r-1} (1/n) ∑_{i=1}^n |ui| →_p 0

by the WLLN, Theorem 6.15, and r>1.

Proof of Theorem 10.1 Fix ε > 0. Since Z_n →_p Z there is an n sufficiently large such that

P[‖Z_n − Z‖ > ε] < ε.

Since the event ‖Z_n − Z‖ > ε is non-random under the conditional probability P*, for such n,

P*[‖Z_n − Z‖ > ε] = { 0 with probability exceeding 1 − ε;  1 with probability less than ε }.

Since ε is arbitrary we conclude P*[‖Z_n − Z‖ > ε] →_p 0 as required.

Proof of Theorem 10.2 Fix ε > 0. By Markov’s inequality (B.36), the facts (10.12) and (10.13), and finally the Marcinkiewicz WLLN (Theorem 10.20) with r = 2 and ui = ‖Yi‖,

P*[‖Y¯* − Y¯‖ > ε] ≤ ε^{-2} E*‖Y¯* − Y¯‖² = ε^{-2} tr(var*[Y¯*]) = ε^{-2} tr((1/n) Σ^) ≤ ε^{-2} n^{-2} ∑_{i=1}^n ‖Yi‖² →_p 0.

This establishes that Y¯* − Y¯ →_p 0.

Since Y¯ − μ →_p 0 by the WLLN, Y¯ − μ →_p 0 under the bootstrap distribution by Theorem 10.1. Since Y¯* − μ = (Y¯* − Y¯) + (Y¯ − μ), we deduce that Y¯* − μ →_p 0.

Proof of Theorem 10.4 We verify the conditions for the multivariate Lindeberg CLT (Theorem 6.4). (We cannot use the Lindeberg-Lévy CLT because the conditional distribution depends on n.) Conditional on Fn, the bootstrap draws Yi* − Y¯ are i.i.d. with mean 0 and covariance matrix Σ^. Set v_n² = λ_min(Σ^). Note that by the WLLN, v_n² →_p v² = λ_min(Σ) > 0. Thus for n sufficiently large, v_n² > 0 with high probability. Fix ε > 0. Equation (6.2) equals

(1/(n v_n²)) ∑_{i=1}^n E*[‖Yi* − Y¯‖² 1{‖Yi* − Y¯‖² ≥ ε n v_n²}] = (1/v_n²) E*[‖Yi* − Y¯‖² 1{‖Yi* − Y¯‖² ≥ ε n v_n²}] ≤ (1/(ε n v_n⁴)) E*‖Yi* − Y¯‖⁴ ≤ (2⁴/(ε n v_n⁴)) E*‖Yi*‖⁴ = (2⁴/(ε n² v_n⁴)) ∑_{i=1}^n ‖Yi‖⁴ →_p 0.

The second inequality uses Minkowski’s inequality (B.34), Liapunov’s inequality (B.35), and the cr inequality (B.6). The following equality is E*‖Yi*‖⁴ = n^{-1} ∑_{i=1}^n ‖Yi‖⁴, which is similar to (10.10). The final convergence holds by the Marcinkiewicz WLLN (Theorem 10.20) with r = 2 and ui = ‖Yi‖². The conditions for Theorem 6.4 hold and we conclude

(Σ^)^{-1/2} √n(Y¯* − Y¯) →_d N(0, I).

Since Σ^ →_p Σ we deduce that √n(Y¯* − Y¯) →_d N(0, Σ) as claimed.

Proof of Theorem 10.10 For notational simplicity assume θ and μ are scalar. Set hi = h(Yi). The assumption that the pth derivative of g(u) is bounded implies |g^{(p)}(u)| ≤ C for some C < ∞. Taking a pth order Taylor series expansion,

θ^* − θ^ = g(h¯*) − g(h¯) = ∑_{j=1}^{p−1} (g^{(j)}(h¯)/j!) (h¯* − h¯)^j + (g^{(p)}(ζ_n*)/p!) (h¯* − h¯)^p

where ζ_n* lies between h¯* and h¯. This implies

|Z_n*| = √n |θ^* − θ^| ≤ √n ∑_{j=1}^p c_j |h¯* − h¯|^j

where c_j = |g^{(j)}(h¯)|/j! for j < p and c_p = C/p!. We find that the fourth central moment of the normalized bootstrap estimator Z_n* = √n(θ^* − θ^) satisfies the bound

E*[Z_n*⁴] ≤ ∑_{r=4}^{4p} a_r n² E*|h¯* − h¯|^r   (10.35)

where the coefficients a_r are products of the coefficients c_j and hence each O_p(1). We see that E*[Z_n*⁴] = O_p(1) if n² E*|h¯* − h¯|^r = O_p(1) for r = 4, …, 4p.

We show this holds for any r ≥ 4 using Rosenthal’s inequality (B.50), which states that for each r there is a constant R_r < ∞ such that

n² E*|h¯* − h¯|^r = n^{2−r} E*|∑_{i=1}^n (hi* − h¯)|^r ≤ n^{2−r} R_r {(n E*(hi* − h¯)²)^{r/2} + n E*|hi* − h¯|^r} = R_r {n^{2−r/2} (σ^²)^{r/2} + n^{−(r−2)} ∑_{i=1}^n |hi − h¯|^r}.   (10.36)

Since E[hi²] < ∞, σ^² = O_p(1), so the first term in (10.36) is O_p(1). Also, by the Marcinkiewicz WLLN (Theorem 10.20), n^{-r/2} ∑_{i=1}^n |hi − h¯|^r = o_p(1) for any r > 1, so the second term in (10.36) is o_p(1) for r ≥ 4. Thus for all r ≥ 4, (10.36) is O_p(1) and thus (10.35) is O_p(1). We deduce that Z_n* is uniformly square integrable, and the bootstrap estimate of variance is consistent.

This argument can be extended to vector-valued means and estimates.

Proof of Theorem 10.12 We show that EZn4=Op(1). Theorem 6.13 shows that Zn is uniformly square integrable. Since ZndZ, Theorem 6.14 implies that var[Zn]var[Z]=Vβ as stated.

Set hi=h(Yi). Since G(x)=xg(x) is continuous in a neighborhood of μ, there exists η>0 and M< such that xμ2η implies tr(G(x)G(x))M. By the WLLN and bootstrap WLLN there is an n sufficiently large such that h¯nμη and h¯nh¯nη with probability exceeding 1η. On this event, xh¯nη implies tr(G(x)G(x))M. Using the mean-value theorem at a point ζn intermediate between h¯n and h¯n

‖Z_n‖⁴ 1{‖h¯*_n − h¯_n‖ ≤ η} = n² ‖g(h¯*_n) − g(h¯_n)‖⁴ 1{‖h¯*_n − h¯_n‖ ≤ η} ≤ n² ‖G(ζ_n*)′(h¯*_n − h¯_n)‖⁴ ≤ M² n² ‖h¯*_n − h¯_n‖⁴.

Then

E*‖Z_n‖⁴ ≤ E*[‖Z_n‖⁴ 1{‖h¯*_n − h¯_n‖ ≤ η}] + τ_n⁴ E*[1{‖h¯*_n − h¯_n‖ > η}] ≤ M² n² E*‖h¯*_n − h¯_n‖⁴ + τ_n⁴ P*[‖h¯*_n − h¯_n‖ > η].   (10.37)

In (10.17) we showed that the first term in (10.37) is Op(1) in the scalar case. The vector case follows by element-by-element expansion.

Now take the second term in (10.37). We apply Bernstein’s inequality for vectors (B.41). Note that h¯nh¯n=n1i=1nui with ui=hih¯n and jth element uji=hjih¯jn. The ui are i.i.d., mean zero, E[uji2]=σ^j2=Op(1), and satisfy the bound |uji|2maxi,j|hji|=Bn, say. Bernstein’s inequality states that

P[h¯nh¯n>η]2mexp(n1/2η22m2n1/2maxjσ^j2+2mn1/2Bnη/3).

Theorem 6.15 shows that n1/2Bn=op(1). Thus the expression in the denominator of the parentheses in (10.38) is op (1) as n, . It follows that for n sufficiently large (10.38) is Op(exp(n1/2)). Hence the second term in (10.37) is Op(exp(n1/2))op(exp(n1/2))=op(1) by the assumption on τn.

We have shown that the two terms in (10.37) are each Op(1). This completes the proof.

10.32 Exercises

Exercise 10.1 Find the jackknife estimator of the variance of the estimator μ^_r = n^{-1} ∑_{i=1}^n Yi^r for μ_r = E[Y^r].

Exercise 10.2 Show that if the jackknife estimator of the variance of β^ is V^jack_β, then the jackknife estimator of the variance of θ^ = a + C′β^ is V^jack_θ = C′ V^jack_β C.

Exercise 10.3 A two-step estimator such as (12.49) is β^ = (∑_{i=1}^n W^i W^i′)^{-1}(∑_{i=1}^n W^i Yi) where W^i = A^′Zi and A^ = (Z′Z)^{-1}Z′X. Describe how to construct the jackknife estimator of the variance of β^.

Exercise 10.4 Show that if the bootstrap estimator of the variance of β^ is V^boot_β, then the bootstrap estimator of the variance of θ^ = a + C′β^ is V^boot_θ = C′ V^boot_β C.

Exercise 10.5 Show that if the percentile interval for β is [L,U] then the percentile interval for a+cβ is [a+cL,a+cU].

Exercise 10.6 Consider the following bootstrap procedure. Using the nonparametric bootstrap, generate bootstrap samples, calculate the estimate θ^ on these samples and then calculate

T* = (θ^* − θ^)/s(θ^),

where s(θ^) is the standard error computed on the original data. Let q*_{α/2} and q*_{1−α/2} denote the α/2th and (1−α/2)th quantiles of T*, and define the bootstrap confidence interval

C = [θ^ + s(θ^) q*_{α/2}, θ^ + s(θ^) q*_{1−α/2}].

Show that C exactly equals the percentile interval.

Exercise 10.7 Prove Theorem 10.6.

Exercise 10.8 Prove Theorem 10.7.

Exercise 10.9 Prove Theorem 10.8.

Exercise 10.10 Let Yi be i.i.d., μ = E[Y] > 0, and θ = 1/μ. Let μ^ = Y¯_n be the sample mean and θ^ = 1/μ^.

  1. Is θ^ unbiased for θ ?

  2. If θ^ is biased, can you determine the direction of the bias E[θ^θ] (up or down)?

  3. Is the percentile interval appropriate in this context for confidence interval construction?

Exercise 10.11 Consider the following bootstrap procedure for a regression of Y on X. Let β^ denote the OLS estimator and e^i=YiXiβ^ the OLS residuals.

  1. Draw a random vector (X*, e*) from the pairs {(Xi, e^i): i = 1, …, n}. That is, draw a random integer i from {1, 2, …, n}, and set X* = Xi and e* = e^i. Set Y* = X*′β^ + e*. Draw (with replacement) n such vectors, creating a random bootstrap data set (Y*, X*).

  2. Regress Y on X, yielding OLS estimator β^ and any other statistic of interest.

Show that this bootstrap procedure is (numerically) identical to the nonparametric bootstrap.

Exercise 10.12 Take p as defined in (10.22) for the BC percentile interval. Show that it is invariant to replacing θ with g(θ) for any strictly monotonically increasing transformation g(θ). Does this extend to z0 as defined in (10.23)?

Exercise 10.13 Show that if the percentile-t interval for β is [L, U] then the percentile-t interval for a + cβ is [a + cL, a + cU].

Exercise 10.14 You want to test H0: θ = 0 against H1: θ > 0. The test is to reject H0 if T_n = θ^/s(θ^) > c where c is picked so that the Type I error is α. You do this as follows. Using the nonparametric bootstrap, you generate bootstrap samples, calculate the estimates θ^* on these samples and then calculate T* = θ^*/s(θ^*). Let q*_{1−α} denote the (1−α)th quantile of T*. You replace c with q*_{1−α}, and thus reject H0 if T_n = θ^/s(θ^) > q*_{1−α}. What is wrong with this procedure?

Exercise 10.15 Suppose that in an application, θ^ = 1.2 and s(θ^) = 0.2. Using the nonparametric bootstrap, 1000 samples are generated from the bootstrap distribution, and θ^* is calculated on each sample. The θ^* are sorted, and the 0.025th and 0.975th quantiles of the θ^* are 0.75 and 1.3, respectively.

  1. Report the 95% percentile interval for θ.

  2. With the given information, can you calculate the 95% BC percentile interval or percentile-t interval for θ ?

Exercise 10.16 Take the normal regression model Y=Xβ+e with eXN(0,σ2) where we know the MLE equals the least squares estimators β^ and σ^2.

  1. Describe the parametric regression bootstrap for this model. Show that the conditional distribution of the bootstrap observations is YiFnN(Xiβ^,σ^2). (b) Show that the distribution of the bootstrap least squares estimator is β^FnN(β^,(XX)1σ^2).

  3. Show that the distribution of the bootstrap t-ratio with a homoskedastic standard error is $T^*\sim t_{n-k}$.

Exercise 10.17 Consider the model $Y=X'\beta+e$ with $E[e\mid X]=0$, $Y$ scalar, and $X$ a $k$-vector. You have a random sample $(Y_i,X_i: i=1,\ldots,n)$. You are interested in estimating the regression function $m(x)=E[Y\mid X=x]$ at a fixed vector $x$ and constructing a 95% confidence interval.

  1. Write down the standard estimator and asymptotic confidence interval for m(x).

  2. Describe the percentile bootstrap confidence interval for m(x).

  3. Describe the percentile-t bootstrap confidence interval for m(x).

Exercise 10.18 The observed data is $\{Y_i,X_i\}\in\mathbb{R}\times\mathbb{R}^k$, $k>1$, $i=1,\ldots,n$. Take the model $Y=X'\beta+e$ with $E[Xe]=0$.

  1. Write down an estimator for $\mu_3=E[e^3]$.

  2. Explain how to use the percentile method to construct a 90% confidence interval for $\mu_3$ in this specific model.

Exercise 10.19 Take the model $Y=X'\beta+e$ with $E[Xe]=0$. Describe the bootstrap percentile confidence interval for $\sigma^2=E[e^2]$.

Exercise 10.20 The model is $Y=X_1'\beta_1+X_2\beta_2+e$ with $E[Xe]=0$ and $X_2$ scalar. Describe how to test $H_0:\beta_2=0$ against $H_1:\beta_2\neq 0$ using the nonparametric bootstrap.

Exercise 10.21 The model is $Y=X_1'\beta_1+X_2'\beta_2+e$ with $E[Xe]=0$, and both $X_1$ and $X_2$ are $k\times 1$. Describe how to test $H_0:\beta_1=\beta_2$ against $H_1:\beta_1\neq\beta_2$ using the nonparametric bootstrap.

Exercise 10.22 Suppose a Ph.D. student has a sample $(Y_i,X_i,Z_i: i=1,\ldots,n)$ and estimates by OLS the equation $Y=Z\alpha+X'\beta+e$ where $\alpha$ is the coefficient of interest. She is interested in testing $H_0:\alpha=0$ against $H_1:\alpha\neq 0$. She obtains $\hat{\alpha}=2.0$ with standard error $s(\hat{\alpha})=1.0$ so the value of the t-ratio for $H_0$ is $T=\hat{\alpha}/s(\hat{\alpha})=2.0$. To assess significance, the student decides to use the bootstrap. She uses the following algorithm:

  1. Samples $(Y_i^*,X_i^*,Z_i^*)$ randomly from the observations (random sampling with replacement). Creates a random sample with $n$ observations.

  2. On this pseudo-sample, estimates the equation $Y_i^*=Z_i^*\alpha+X_i^{*\prime}\beta+e_i^*$ by OLS and computes standard errors, including $s(\hat{\alpha}^*)$. The t-ratio for $H_0$, $T^*=\hat{\alpha}^*/s(\hat{\alpha}^*)$, is computed and stored.

  3. This is repeated B=10,000 times.

  4. The 0.95 empirical quantile $q^*_{.95}=3.5$ of the bootstrap absolute t-ratios $|T^*|$ is computed.

  5. The student notes that while $|T|=2>1.96$ (and thus an asymptotic 5% size test rejects $H_0$), $|T|=2<q^*_{.95}=3.5$ and thus the bootstrap test does not reject $H_0$. As the bootstrap is more reliable, the student concludes that $H_0$ cannot be rejected in favor of $H_1$. Question: Do you agree with the student’s method and reasoning? Do you see an error in her method?
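For reference, here is a minimal Python sketch of the student's algorithm exactly as described above, assuming the regressors are stacked in a matrix `W` with the column for $Z$ (the coefficient $\alpha$) first; all names are illustrative, and the sketch takes no position on whether the procedure is valid:

```python
import numpy as np

def t_ratio(Y, W, j=0):
    """OLS t-ratio for coefficient j in a regression of Y on W (homoskedastic s.e.)."""
    n, k = W.shape
    coef = np.linalg.solve(W.T @ W, W.T @ Y)
    resid = Y - W @ coef
    s2 = resid @ resid / (n - k)
    se = np.sqrt(s2 * np.linalg.inv(W.T @ W)[j, j])
    return coef[j] / se

def student_critical_value(Y, W, B=10_000, seed=0):
    """The student's procedure: pairs bootstrap of T* = alpha_hat*/s(alpha_hat*),
    returning the 0.95 empirical quantile of |T*|."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    T_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # resample (Y_i, X_i, Z_i) with replacement
        T_star[b] = t_ratio(Y[idx], W[idx])       # t-ratio for alpha on the pseudo-sample
    return np.quantile(np.abs(T_star), 0.95)      # q*_{.95} of the absolute bootstrap t-ratios
```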

Exercise 10.23 Take the model $Y=X_1\beta_1+X_2\beta_2+e$ with $E[Xe]=0$ and scalar $X_1$ and $X_2$. The parameter of interest is $\theta=\beta_1\beta_2$. Show how to construct a confidence interval for $\theta$ using the following three methods.

  1. Asymptotic Theory.

  2. Percentile Bootstrap.

  3. Percentile-t Bootstrap.

Your answer should be specific to this problem, not general.

Exercise 10.24 Take the model $Y=X_1\beta_1+X_2\beta_2+e$ with i.i.d. observations, $E[Xe]=0$, and scalar $X_1$ and $X_2$. Describe how you would construct the percentile-t bootstrap confidence interval for $\theta=\beta_1/\beta_2$.

Exercise 10.25 The model is i.i.d. data, $i=1,\ldots,n$, $Y=X'\beta+e$, and $E[e\mid X]=0$. Does the presence of conditional heteroskedasticity invalidate the application of the nonparametric bootstrap? Explain.

Exercise 10.26 The RESET specification test for nonlinearity in a random sample (due to Ramsey (1969)) is the following. The null hypothesis is a linear regression $Y=X'\beta+e$ with $E[e\mid X]=0$. The parameter $\beta$ is estimated by OLS, yielding predicted values $\hat{Y}_i$. Then a second-stage least squares regression is estimated including both $X_i$ and $\hat{Y}_i$:

$$Y_i=X_i'\tilde{\beta}+(\hat{Y}_i)^2\tilde{\gamma}+\tilde{e}_i.$$

The RESET test statistic $R$ is the squared t-ratio on $\tilde{\gamma}$.

A colleague suggests obtaining the critical value for the test using the bootstrap. He proposes the following bootstrap implementation.

  • Draw $n$ observations $(Y_i^*,X_i^*)$ randomly from the observed sample pairs $(Y_i,X_i)$ to create a bootstrap sample.

  • Compute the statistic $R^*$ on this bootstrap sample as described above.

  • Repeat this $B$ times. Sort the bootstrap statistics $R^*$, take the 0.95 quantile, and use this as the critical value.

  • Reject the null hypothesis if R exceeds this critical value, otherwise do not reject.

Is this procedure a correct implementation of the bootstrap in this context? If not, propose a modification.
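The following Python sketch implements the colleague's proposal as stated, with the RESET statistic computed from a homoskedastic t-ratio; the names are illustrative, and the sketch is not an endorsement of the procedure (that is the question being asked):

```python
import numpy as np

def reset_statistic(Y, X):
    """RESET statistic R: squared t-ratio on gamma in the regression of Y on (X, Yhat^2)."""
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]
    yhat = X @ beta                                    # first-stage fitted values
    W = np.column_stack([X, yhat ** 2])                # second-stage regressors (X_i, Yhat_i^2)
    n, k = W.shape
    coef = np.linalg.solve(W.T @ W, W.T @ Y)
    resid = Y - W @ coef
    s2 = resid @ resid / (n - k)
    se = np.sqrt(s2 * np.linalg.inv(W.T @ W)[-1, -1])
    return (coef[-1] / se) ** 2                        # squared t-ratio on the added regressor

def proposed_critical_value(Y, X, B=1000, seed=0):
    """The colleague's proposal: pairs bootstrap of R*, 0.95 quantile as the critical value."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    R_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)               # draw (Y_i*, X_i*) pairs with replacement
        R_star[b] = reset_statistic(Y[idx], X[idx])    # recompute R on the bootstrap sample
    return np.quantile(R_star, 0.95)
```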

Exercise 10.27 The model is $Y=X'\beta+e$ with $E[Xe]\neq 0$. We know that in this case the least squares estimator may be biased for the parameter $\beta$. We also know that the nonparametric BC percentile interval is (generally) a good method for confidence interval construction in the presence of bias. Explain whether or not you expect the BC percentile interval, applied to the least squares estimator, to have accurate coverage in this context.

Exercise 10.28 In Exercise 9.26 you estimated a cost function for 145 electric companies and tested the restriction $\theta=\beta_3+\beta_4+\beta_5=1$.

  1. Estimate the regression by unrestricted least squares and report standard errors calculated by the asymptotic, jackknife, and bootstrap methods.

  2. Estimate $\theta=\beta_3+\beta_4+\beta_5$ and report standard errors calculated by the asymptotic, jackknife, and bootstrap methods.

  3. Report confidence intervals for $\theta$ using the percentile and BCa methods.

Exercise 10.29 In Exercise 9.27 you estimated the Mankiw, Romer, and Weil (1992) unrestricted regression. Let θ be the sum of the second, third, and fourth coefficients.

  1. Estimate the regression by unrestricted least squares and report standard errors calculated by the asymptotic, jackknife, and bootstrap methods.

  2. Estimate θ and report standard errors calculated by the asymptotic, jackknife, and bootstrap methods.

  3. Report confidence intervals for θ using the percentile and BC methods.

Exercise 10.30 In Exercise 7.28 you estimated a wage regression with the cps09mar dataset and the subsample of white male Hispanics. Further restrict the sample to those who are never married and live in the Midwest region. (This sample has 99 observations.) As in subquestion (b), let θ be the ratio of the return to one year of education to the return to one year of experience.

  1. Estimate θ and report standard errors calculated by the asymptotic, jackknife, and bootstrap methods.

  2. Explain the discrepancy between the standard errors.

  3. Report confidence intervals for θ using the BC percentile method.

Exercise 10.31 In Exercise 4.26 you extended the work from Duflo, Dupas, and Kremer (2011). Repeat that regression, now calculating the standard errors by cluster bootstrap. Report a BCa confidence interval for each coefficient.
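As a computational starting point, here is a minimal, generic Python sketch of a pairs-cluster bootstrap; all names and the data layout are placeholders, and the BCa interval additionally requires the bias-correction and acceleration constants (typically obtained from a delete-one-cluster jackknife), which are not shown:

```python
import numpy as np

def cluster_bootstrap(Y, X, cluster_ids, B=1000, seed=0):
    """Pairs-cluster bootstrap: resample whole clusters with replacement and re-run OLS.
    Returns the B x k matrix of bootstrap coefficients and their bootstrap standard errors."""
    rng = np.random.default_rng(seed)
    groups = [np.flatnonzero(cluster_ids == g) for g in np.unique(cluster_ids)]
    G = len(groups)
    beta_star = np.empty((B, X.shape[1]))
    for b in range(B):
        draw = rng.integers(0, G, size=G)                    # draw G clusters with replacement
        idx = np.concatenate([groups[g] for g in draw])      # stack the drawn clusters' rows
        beta_star[b] = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]
    return beta_star, beta_star.std(axis=0, ddof=1)          # draws and bootstrap s.e.'s
```

The matrix of bootstrap draws `beta_star` can then be used to form percentile-type intervals; the BCa adjustment would be applied to its empirical quantiles.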