7  Asymptotic Theory for Least Squares

7.1 Introduction

It turns out that the asymptotic theory of least squares estimation applies equally to the projection model and the linear CEF model. Therefore the results in this chapter will be stated for the broader projection model described in Section 2.18. Recall that the model is Y = X'β + e with the linear projection coefficient β = (E[XX'])^{-1} E[XY].

Maintained assumptions in this chapter will be random sampling (Assumption 1.2) and finite second moments (Assumption 2.1). We restate these here for clarity.

Assumption 7.1

  1. The variables (Y_i, X_i), i = 1, …, n, are i.i.d.

  2. E[Y²] < ∞.

  3. E‖X‖² < ∞.

  4. Q_XX = E[XX'] is positive definite.

The distributional results will require a strengthening of these assumptions to finite fourth moments. We discuss the specific conditions in Section 7.3.

7.2 Consistency of Least Squares Estimator

In this section we use the weak law of large numbers (WLLN, Theorem 6.1 and Theorem 6.2) and continuous mapping theorem (CMT, Theorem 6.6) to show that the least squares estimator β^ is consistent for the projection coefficient β.

This derivation is based on three key components. First, the OLS estimator can be written as a continuous function of a set of sample moments. Second, the WLLN shows that sample moments converge in probability to population moments. And third, the CMT states that continuous functions preserve convergence in probability. We now explain each step in brief and then in greater detail. First, observe that the OLS estimator

$$\hat{\beta}=\left(\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}X_iY_i\right)=\hat{Q}_{XX}^{-1}\hat{Q}_{XY}$$

is a function of the sample moments Q̂_XX = (1/n)Σᵢ XᵢXᵢ' and Q̂_XY = (1/n)Σᵢ XᵢYᵢ.

Second, by an application of the WLLN these sample moments converge in probability to their population expectations. Specifically, the fact that (Y_i, X_i) are mutually i.i.d. implies that any function of (Y_i, X_i) is i.i.d., including X_iX_i' and X_iY_i. These variables also have finite expectations under Assumption 7.1. Under these conditions, the WLLN implies that as n → ∞,

$$\hat{Q}_{XX}=\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\xrightarrow{p}E\left[XX'\right]=Q_{XX}$$

and

$$\hat{Q}_{XY}=\frac{1}{n}\sum_{i=1}^{n}X_iY_i\xrightarrow{p}E\left[XY\right]=Q_{XY}.$$

Third, the CMT allows us to combine these equations to show that β̂ converges in probability to β. Specifically, as n → ∞,

$$\hat{\beta}=\hat{Q}_{XX}^{-1}\hat{Q}_{XY}\xrightarrow{p}Q_{XX}^{-1}Q_{XY}=\beta.$$

We have shown that β̂ →_p β as n → ∞. In words, the OLS estimator converges in probability to the projection coefficient vector β as the sample size n gets large.
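This argument is easy to see numerically. The following minimal sketch (the design and coefficient values are my own assumptions, not from the text) computes the OLS estimate from the sample moments Q̂_XX and Q̂_XY and shows it settling down near β as n grows.

```python
# Minimal simulation sketch (assumed design, not from the text):
# OLS written as a function of the sample moments Qxx_hat and Qxy_hat.
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, -0.5])          # assumed true projection coefficient

def ols_from_moments(n):
    X = rng.normal(size=(n, 2))       # regressors with finite second moments
    e = rng.normal(size=n)            # projection error with E[Xe] = 0
    Y = X @ beta + e
    Qxx_hat = X.T @ X / n             # (1/n) sum X_i X_i'
    Qxy_hat = X.T @ Y / n             # (1/n) sum X_i Y_i
    return np.linalg.solve(Qxx_hat, Qxy_hat)

for n in [5, 50, 500, 5000, 50000]:
    print(n, ols_from_moments(n))     # estimates settle down near beta
```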

To fully understand the application of the CMT we walk through it in detail. We can write

$$\hat{\beta}=g\left(\hat{Q}_{XX},\hat{Q}_{XY}\right)$$

where g(A, b) = A^{-1}b is a function of A and b. The function g(A, b) is continuous in A and b at all values of the arguments such that A^{-1} exists. Assumption 7.1 specifies that Q_XX is positive definite, which means that Q_XX^{-1} exists. Thus g(A, b) is continuous at A = Q_XX. This justifies the application of the CMT in (7.2).

For a slightly different demonstration of (7.2) recall that (4.6) implies that

$$\hat{\beta}-\beta=\hat{Q}_{XX}^{-1}\hat{Q}_{Xe}$$

where

$$\hat{Q}_{Xe}=\frac{1}{n}\sum_{i=1}^{n}X_ie_i.$$

The WLLN and (2.25) imply

$$\hat{Q}_{Xe}\xrightarrow{p}E[Xe]=0.$$

Therefore

$$\hat{\beta}-\beta=\hat{Q}_{XX}^{-1}\hat{Q}_{Xe}\xrightarrow{p}Q_{XX}^{-1}\cdot0=0$$

which is the same as β̂ →_p β.

Theorem 7.1 Consistency of Least Squares. Under Assumption 7.1, Q̂_XX →_p Q_XX, Q̂_XY →_p Q_XY, Q̂_XX^{-1} →_p Q_XX^{-1}, Q̂_Xe →_p 0, and β̂ →_p β as n → ∞.

Theorem 7.1 states that the OLS estimator β^ converges in probability to β as n increases and thus β^ is consistent for β. In the stochastic order notation, Theorem 7.1 can be equivalently written as

$$\hat{\beta}=\beta+o_p(1).$$

To illustrate the effect of sample size on the least squares estimator consider the least squares regression

$$\log(wage)=\beta_1\,education+\beta_2\,experience+\beta_3\,experience^2+\beta_4+e.$$

We use the sample of 24,344 white men from the March 2009 CPS. We randomly sorted the observations and sequentially estimated the model by least squares, starting with the first 5 observations and continuing until the full sample is used. The sequence of estimates is displayed in Figure 7.1. You can see how the least squares estimate changes with the sample size. As the number of observations increases it settles down to the full-sample estimate β̂₁ = 0.114.

Figure 7.1: The Least-Squares Estimator as a Function of Sample Size

7.3 Asymptotic Normality

We started this chapter discussing the need for an approximation to the distribution of the OLS estimator β^. In Section 7.2 we showed that β^ converges in probability to β. Consistency is a good first step, but in itself does not describe the distribution of the estimator. In this section we derive an approximation typically called the asymptotic distribution.

The derivation starts by writing the estimator as a function of sample moments. One of the moments must be written as a sum of zero-mean random vectors and normalized so that the central limit theorem can be applied. The steps are as follows.

Take equation (7.3) and multiply it by √n. This yields the expression

$$\sqrt{n}\left(\hat{\beta}-\beta\right)=\left(\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}X_ie_i\right).$$

This shows that the normalized and centered estimator √n(β̂ − β) is a function of the sample average (1/n)Σᵢ XᵢXᵢ' and the normalized sample average (1/√n)Σᵢ Xᵢeᵢ.

The random pairs (Y_i, X_i) are i.i.d., meaning that they are independent across i and identically distributed. Any function of (Y_i, X_i) is also i.i.d. This includes e_i = Y_i − X_i'β and the product X_ie_i. The latter is mean-zero (E[Xe] = 0) and has the k × k covariance matrix

$$\Omega=E\left[(Xe)(Xe)'\right]=E\left[XX'e^2\right].$$

We show below that Ω has finite elements under a strengthening of Assumption 7.1. Since X_ie_i is i.i.d., mean zero, and has finite variance, the central limit theorem (Theorem 6.3) implies

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}X_ie_i\xrightarrow{d}N(0,\Omega).$$

We state the required conditions here.

Assumption 7.2

  1. The variables (Y_i, X_i), i = 1, …, n, are i.i.d.

  2. E[Y⁴] < ∞.

  3. E‖X‖⁴ < ∞.

  4. Q_XX = E[XX'] is positive definite.

Assumption 7.2 implies that ‖Ω‖ < ∞. To see this, take its jl-th element, E[X_jX_l e²]. Theorem 2.9.6 shows that E[e⁴] < ∞. By the expectation inequality (B.30), the jl-th element of Ω is bounded by

$$\left|E\left[X_jX_le^2\right]\right|\le E\left|X_jX_le^2\right|=E\left[|X_j||X_l|e^2\right].$$

By two applications of the Cauchy-Schwarz inequality (B.32), this is smaller than

$$\left(E\left[X_j^2X_l^2\right]\right)^{1/2}\left(E\left[e^4\right]\right)^{1/2}\le\left(E\left[X_j^4\right]\right)^{1/4}\left(E\left[X_l^4\right]\right)^{1/4}\left(E\left[e^4\right]\right)^{1/2}<\infty$$

where the finiteness holds under Assumptions 7.2.2 and 7.2.3. Thus ‖Ω‖ < ∞.

An alternative way to show that the elements of Ω are finite is by using a matrix norm ‖·‖ (see Appendix A.23). Then by the expectation inequality, the Cauchy-Schwarz inequality, Assumption 7.2.3, and E[e⁴] < ∞,

$$\|\Omega\|\le E\left\|XX'e^2\right\|=E\left[\|X\|^2e^2\right]\le\left(E\|X\|^4\right)^{1/2}\left(E\left[e^4\right]\right)^{1/2}<\infty.$$

This is a more compact argument (often described as more elegant) but such manipulations should not be done without understanding the notation and the applicability of each step of the argument.

Regardless, the finiteness of the covariance matrix means that we can apply the multivariate CLT (Theorem 6.3).

Theorem 7.2 Assumption 7.2 implies that

$$\|\Omega\|<\infty$$

and

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}X_ie_i\xrightarrow{d}N(0,\Omega)$$

as n → ∞.

Putting together (7.1), (7.5), and (7.7),

$$\sqrt{n}\left(\hat{\beta}-\beta\right)\xrightarrow{d}Q_{XX}^{-1}N(0,\Omega)=N\left(0,Q_{XX}^{-1}\Omega Q_{XX}^{-1}\right)$$

as n → ∞. The final equality follows from the property that linear combinations of normal vectors are also normal (Theorem 5.2).

We have derived the asymptotic normal approximation to the distribution of the least squares estimator.

Theorem 7.3 Asymptotic Normality of Least Squares Estimator. Under Assumption 7.2, as n → ∞,

$$\sqrt{n}\left(\hat{\beta}-\beta\right)\xrightarrow{d}N\left(0,V_\beta\right)$$

where Q_XX = E[XX'], Ω = E[XX'e²], and

$$V_\beta=Q_{XX}^{-1}\Omega Q_{XX}^{-1}.$$

In the stochastic order notation, Theorem 7.3 implies that β̂ = β + O_p(n^{-1/2}), which is stronger than (7.4).

The matrix V_β = Q_XX^{-1}ΩQ_XX^{-1} is the variance of the asymptotic distribution of √n(β̂ − β). Consequently, V_β is often referred to as the asymptotic covariance matrix of β̂. The expression V_β = Q_XX^{-1}ΩQ_XX^{-1} is called a sandwich form, as the matrix Ω is sandwiched between two copies of Q_XX^{-1}. It is useful to compare the variance of the asymptotic distribution given in (7.8) and the finite-sample conditional variance in the CEF model as given in (4.10):

$$V_{\hat{\beta}}=\operatorname{var}\left[\hat{\beta}\mid X\right]=\left(X'X\right)^{-1}\left(X'DX\right)\left(X'X\right)^{-1}.$$

Notice that V_β̂ is the exact conditional variance of β̂ and V_β is the asymptotic variance of √n(β̂ − β). Thus V_β should be (roughly) n times as large as V_β̂, or V_β ≈ nV_β̂. Indeed, multiplying (7.9) by n and distributing we find

$$nV_{\hat{\beta}}=\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'DX\right)\left(\frac{1}{n}X'X\right)^{-1}$$

which looks like an estimator of V_β. Indeed, as n → ∞, nV_β̂ →_p V_β. The expression V_β̂ is useful for practical inference (such as computation of standard errors and tests) because it is the variance of the estimator β̂, while V_β is useful for asymptotic theory as it is well defined in the limit as n goes to infinity. We will make use of both symbols and it will be advisable to adhere to this convention.

There is a special case where Ω and Vβ simplify. Suppose that

$$\operatorname{cov}\left(XX',e^2\right)=0.$$

Condition (7.10) holds in the homoskedastic linear regression model but is somewhat broader. Under (7.10) the asymptotic variance formulae simplify as

$$\Omega=E\left[XX'\right]E\left[e^2\right]=Q_{XX}\sigma^2$$
$$V_\beta=Q_{XX}^{-1}\Omega Q_{XX}^{-1}=Q_{XX}^{-1}\sigma^2\equiv V_\beta^0.$$

In (7.11) we define V_β⁰ = Q_XX^{-1}σ² whether (7.10) is true or false. When (7.10) is true then V_β = V_β⁰, otherwise V_β ≠ V_β⁰. We call V_β⁰ the homoskedastic asymptotic covariance matrix.

Theorem 7.3 states that the sampling distribution of the least squares estimator, after rescaling, is approximately normal when the sample size n is sufficiently large. This holds true for all joint distributions of (Y,X) which satisfy the conditions of Assumption 7.2. Consequently, asymptotic normality is routinely used to approximate the finite sample distribution of n(β^β).

A difficulty is that for any fixed n the sampling distribution of β̂ can be arbitrarily far from the normal distribution. The normal approximation improves as n increases, but how large should n be in order for the approximation to be useful? Unfortunately, there is no simple answer to this reasonable question. The trouble is that no matter how large is the sample size, the normal approximation is arbitrarily poor for some data distribution satisfying the assumptions. We illustrate this problem using a simulation. Let Y = β₁X + β₂ + e where X is N(0,1) and e is independent of X with the Double Pareto density f(e) = (α/2)|e|^{−α−1}, |e| ≥ 1. If α > 2 the error e has zero mean and variance α/(α − 2). As α approaches 2, however, its variance diverges to infinity. In this context the normalized least squares slope estimator √(n(α − 2)/α)(β̂₁ − β₁) has the N(0,1) asymptotic distribution for any α > 2. In Figure 7.2(a) we display the finite sample densities of the normalized estimator √(n(α − 2)/α)(β̂₁ − β₁), setting n = 100 and varying the parameter α. For α = 3.0 the density is very close to the N(0,1) density. As α diminishes the density changes significantly, concentrating most of the probability mass around zero.

Another example is shown in Figure 7.2(b). Here the model is Y=β+e where

$$e=\frac{u^r-E\left[u^r\right]}{\left(E\left[u^{2r}\right]-\left(E\left[u^r\right]\right)^2\right)^{1/2}}$$

and u ~ N(0,1). We show the sampling distribution of √n(β̂ − β) for n = 100, varying r = 1, 4, 6, and 8. As r increases, the sampling distribution becomes highly skewed and non-normal. The lesson from Figure 7.2 is that the N(0,1) asymptotic approximation is never guaranteed to be accurate.
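A sketch of the first simulation is below. The inverse-CDF draw |e| = U^{−1/α} is an implementation detail I have inferred from the stated Double Pareto density; the replication count is likewise an assumption.

```python
# Sketch of the Double Pareto simulation (implementation details assumed).
# f(e) = (alpha/2)|e|^(-alpha-1) for |e| >= 1, so |e| = U^(-1/alpha) with U ~ Uniform(0,1).
import numpy as np

rng = np.random.default_rng(0)
n, alpha, reps = 100, 3.0, 2000
beta1, beta2 = 1.0, 0.0
stats = np.empty(reps)

for r in range(reps):
    X = rng.normal(size=n)
    e = rng.uniform(size=n) ** (-1.0 / alpha) * rng.choice([-1.0, 1.0], size=n)
    Y = beta1 * X + beta2 + e
    Z = np.column_stack([X, np.ones(n)])
    b = np.linalg.lstsq(Z, Y, rcond=None)[0]
    # normalized slope estimator sqrt(n(alpha-2)/alpha)(b1 - beta1)
    stats[r] = np.sqrt(n * (alpha - 2) / alpha) * (b[0] - beta1)

print(stats.mean(), stats.std())      # roughly 0 and 1 when alpha = 3
```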

Figure 7.2: Density of Normalized OLS Estimator. (a) Double Pareto Error; (b) Error Process (7.12).

7.4 Joint Distribution

Theorem 7.3 gives the joint asymptotic distribution of the coefficient estimators. We can use the result to study the covariance between the coefficient estimators. For simplicity, take the case of two regressors, no intercept, and homoskedastic error. Assume the regressors are mean zero, variance one, with correlation ρ. Then using the formula for inversion of a 2×2 matrix,

$$V_\beta^0=\sigma^2Q_{XX}^{-1}=\frac{\sigma^2}{1-\rho^2}\begin{bmatrix}1&-\rho\\-\rho&1\end{bmatrix}.$$

Thus if X₁ and X₂ are positively correlated (ρ > 0) then β̂₁ and β̂₂ are negatively correlated (and vice-versa).

For illustration, Figure 7.3(a) displays the probability contours of the joint asymptotic distribution of β^1β1 and β^2β2 when β1=β2=0 and ρ=0.5. The coefficient estimators are negatively correlated because the regressors are positively correlated. This means that if β^1 is unusually negative, it is likely that β^2 is unusually positive, or conversely. It is also unlikely that we will observe both β^1 and β^2 unusually large and of the same sign.

This finding that the correlation of the regressors is of opposite sign of the correlation of the coefficient estimates is sensitive to the assumption of homoskedasticity. If the errors are heteroskedastic then this relationship is not guaranteed.

This can be seen through a simple constructed example. Suppose that X₁ and X₂ only take the values {−1, +1}, symmetrically, with P[X₁ = X₂ = 1] = P[X₁ = X₂ = −1] = 3/8 and P[X₁ = 1, X₂ = −1] = P[X₁ = −1, X₂ = 1] = 1/8. You can check that the regressors are mean zero, unit variance and correlation 0.5, which is identical with the setting displayed in Figure 7.3(a).

Now suppose that the error is heteroskedastic. Specifically, suppose that E[e² | X₁ = X₂] = 5/4 and E[e² | X₁ ≠ X₂] = 1/4. You can check that E[e²] = 1, E[X₁²e²] = E[X₂²e²] = 1, and E[X₁X₂e²] = 7/8. Therefore

$$V_\beta=Q_{XX}^{-1}\Omega Q_{XX}^{-1}=\frac{16}{9}\begin{bmatrix}1&-\tfrac{1}{2}\\-\tfrac{1}{2}&1\end{bmatrix}\begin{bmatrix}1&\tfrac{7}{8}\\\tfrac{7}{8}&1\end{bmatrix}\begin{bmatrix}1&-\tfrac{1}{2}\\-\tfrac{1}{2}&1\end{bmatrix}=\frac{2}{3}\begin{bmatrix}1&\tfrac{1}{4}\\\tfrac{1}{4}&1\end{bmatrix}$$

Figure 7.3: Contours of Joint Distribution of β̂₁ and β̂₂. (a) Homoskedastic Case; (b) Heteroskedastic Case.

Thus the coefficient estimators β̂₁ and β̂₂ are positively correlated (their correlation is 1/4). The joint probability contours of their asymptotic distribution are displayed in Figure 7.3(b). We can see how the two estimators are positively associated.

What we found through this example is that in the presence of heteroskedasticity there is no simple relationship between the correlation of the regressors and the correlation of the parameter estimators.
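The moments in this example can be verified by direct enumeration over the four support points. The following sketch (my own check, not part of the text) computes Q_XX, Ω, the sandwich V_β, and the implied correlation of the coefficient estimators.

```python
# Direct enumeration of the two-regressor example (a verification sketch).
import numpy as np

# support points (x1, x2), probabilities, and conditional E[e^2 | x1, x2]
pts  = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
prob = [3/8, 3/8, 1/8, 1/8]
ee2  = [5/4, 5/4, 1/4, 1/4]           # 5/4 when x1 = x2, 1/4 otherwise

Qxx = np.zeros((2, 2))
Omega = np.zeros((2, 2))
for (x1, x2), p, s2 in zip(pts, prob, ee2):
    x = np.array([x1, x2], dtype=float)
    Qxx += p * np.outer(x, x)
    Omega += p * np.outer(x, x) * s2

V = np.linalg.inv(Qxx) @ Omega @ np.linalg.inv(Qxx)
corr = V[0, 1] / np.sqrt(V[0, 0] * V[1, 1])
print(Qxx)      # [[1, 1/2], [1/2, 1]]
print(Omega)    # [[1, 7/8], [7/8, 1]]
print(V, corr)  # correlation of the coefficient estimators is 1/4
```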

We can extend the above analysis to study the covariance between coefficient sub-vectors. For example, partitioning X = (X₁', X₂')' and β = (β₁', β₂')', we can write the general model as

$$Y=X_1'\beta_1+X_2'\beta_2+e$$

and the coefficient estimates as β̂ = (β̂₁', β̂₂')'. Make the partitions

$$Q_{XX}=\begin{bmatrix}Q_{11}&Q_{12}\\Q_{21}&Q_{22}\end{bmatrix},\qquad\Omega=\begin{bmatrix}\Omega_{11}&\Omega_{12}\\\Omega_{21}&\Omega_{22}\end{bmatrix}.$$

From (2.43)

$$Q_{XX}^{-1}=\begin{bmatrix}Q_{11\cdot2}^{-1}&-Q_{11\cdot2}^{-1}Q_{12}Q_{22}^{-1}\\-Q_{22\cdot1}^{-1}Q_{21}Q_{11}^{-1}&Q_{22\cdot1}^{-1}\end{bmatrix}$$

where Q_{11·2} = Q_{11} − Q_{12}Q_{22}^{-1}Q_{21} and Q_{22·1} = Q_{22} − Q_{21}Q_{11}^{-1}Q_{12}. Thus when the error is homoskedastic,

$$\operatorname{cov}\left(\hat{\beta}_1,\hat{\beta}_2\right)=-\sigma^2Q_{11\cdot2}^{-1}Q_{12}Q_{22}^{-1}$$

which is a matrix generalization of the two-regressor case.

In general you can show that (Exercise 7.5)

$$V_\beta=\begin{bmatrix}V_{11}&V_{12}\\V_{21}&V_{22}\end{bmatrix}$$

where

$$V_{11}=Q_{11\cdot2}^{-1}\left(\Omega_{11}-Q_{12}Q_{22}^{-1}\Omega_{21}-\Omega_{12}Q_{22}^{-1}Q_{21}+Q_{12}Q_{22}^{-1}\Omega_{22}Q_{22}^{-1}Q_{21}\right)Q_{11\cdot2}^{-1}$$
$$V_{21}=Q_{22\cdot1}^{-1}\left(\Omega_{21}-Q_{21}Q_{11}^{-1}\Omega_{11}-\Omega_{22}Q_{22}^{-1}Q_{21}+Q_{21}Q_{11}^{-1}\Omega_{12}Q_{22}^{-1}Q_{21}\right)Q_{11\cdot2}^{-1}$$
$$V_{22}=Q_{22\cdot1}^{-1}\left(\Omega_{22}-Q_{21}Q_{11}^{-1}\Omega_{12}-\Omega_{21}Q_{11}^{-1}Q_{12}+Q_{21}Q_{11}^{-1}\Omega_{11}Q_{11}^{-1}Q_{12}\right)Q_{22\cdot1}^{-1}$$

Unfortunately, these expressions are not easily interpretable.

7.5 Consistency of Error Variance Estimators

Using the methods of Section 7.2 we can show that the estimators σ̂² = n^{-1}Σᵢ êᵢ² and s² = (n − k)^{-1}Σᵢ êᵢ² are consistent for σ².

The trick is to write the residual e^i as equal to the error ei plus a deviation

$$\hat{e}_i=Y_i-X_i'\hat{\beta}=e_i-X_i'\left(\hat{\beta}-\beta\right).$$

Thus the squared residual equals the squared error plus a deviation

$$\hat{e}_i^2=e_i^2-2e_iX_i'\left(\hat{\beta}-\beta\right)+\left(\hat{\beta}-\beta\right)'X_iX_i'\left(\hat{\beta}-\beta\right).$$

So when we take the average of the squared residuals we obtain the average of the squared errors, plus two terms which are (hopefully) asymptotically negligible. This average is:

$$\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^{n}e_i^2-2\left(\frac{1}{n}\sum_{i=1}^{n}e_iX_i'\right)\left(\hat{\beta}-\beta\right)+\left(\hat{\beta}-\beta\right)'\left(\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\right)\left(\hat{\beta}-\beta\right).$$

The WLLN implies that

$$\frac{1}{n}\sum_{i=1}^{n}e_i^2\xrightarrow{p}\sigma^2$$
$$\frac{1}{n}\sum_{i=1}^{n}e_iX_i'\xrightarrow{p}E\left[eX'\right]=0$$
$$\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\xrightarrow{p}E\left[XX'\right]=Q_{XX}$$

Theorem 7.1 shows that β̂ →_p β. Hence (7.18) converges in probability to σ² as desired.

Finally, since n/(n − k) → 1 as n → ∞, it follows that s² = (n/(n − k))σ̂² →_p σ². Thus both estimators are consistent.

Theorem 7.4 Under Assumption 7.1, σ̂² →_p σ² and s² →_p σ² as n → ∞.

7.6 Homoskedastic Covariance Matrix Estimation

Theorem 7.3 shows that √n(β̂ − β) is asymptotically normal with asymptotic covariance matrix V_β. For asymptotic inference (confidence intervals and tests) we need a consistent estimator of V_β. Under homoskedasticity V_β simplifies to V_β⁰ = Q_XX^{-1}σ², and in this section we consider the simplified problem of estimating V_β⁰.

The standard moment estimator of Q_XX is Q̂_XX defined in (7.1), and thus an estimator for Q_XX^{-1} is Q̂_XX^{-1}. The standard estimator of σ² is the unbiased estimator s² defined in (4.31). Thus a natural plug-in estimator for V_β⁰ = Q_XX^{-1}σ² is V̂_β⁰ = Q̂_XX^{-1}s².

Consistency of V̂_β⁰ for V_β⁰ follows from consistency of the moment estimators Q̂_XX and s² and an application of the continuous mapping theorem. Specifically, Theorem 7.1 established Q̂_XX →_p Q_XX, and Theorem 7.4 established s² →_p σ². The function V_β⁰ = Q_XX^{-1}σ² is a continuous function of Q_XX and σ² so long as Q_XX > 0, which holds true under Assumption 7.1.4. It follows by the CMT that

$$\hat{V}_\beta^0=\hat{Q}_{XX}^{-1}s^2\xrightarrow{p}Q_{XX}^{-1}\sigma^2=V_\beta^0$$

so that V^β0 is consistent for Vβ0.

Theorem 7.5 Under Assumption 7.1, V̂_β⁰ →_p V_β⁰ as n → ∞.

It is instructive to notice that Theorem 7.5 does not require the assumption of homoskedasticity. That is, V̂_β⁰ is consistent for V_β⁰ regardless of whether the regression is homoskedastic or heteroskedastic. However, V_β⁰ = V_β = avar[β̂] only under homoskedasticity. Thus, in the general case V̂_β⁰ is consistent for a well-defined but non-useful object.

7.7 Heteroskedastic Covariance Matrix Estimation

Theorem 7.3 established that the asymptotic covariance matrix of √n(β̂ − β) is V_β = Q_XX^{-1}ΩQ_XX^{-1}. We now consider estimation of this covariance matrix without imposing homoskedasticity. The standard approach is to use a plug-in estimator which replaces the unknowns with sample moments.

As described in the previous section, a natural estimator for Q_XX^{-1} is Q̂_XX^{-1}, where Q̂_XX is defined in (7.1). The moment estimator for Ω is

$$\hat{\Omega}=\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\hat{e}_i^2,$$

leading to the plug-in covariance matrix estimator

$$\hat{V}_\beta^{\mathrm{HC0}}=\hat{Q}_{XX}^{-1}\hat{\Omega}\hat{Q}_{XX}^{-1}.$$

You can check that V̂_β^{HC0} = nV̂_β̂^{HC0}, where V̂_β̂^{HC0} is the HC0 covariance matrix estimator from (4.36).
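The plug-in construction translates directly into matrix code. Below is a minimal NumPy sketch (the data-generating design and variable names are assumptions, not the text's code) computing Q̂_XX, Ω̂, the asymptotic estimator V̂_β^{HC0}, and the finite-sample version V̂_β̂^{HC0} = V̂_β^{HC0}/n.

```python
# HC0 sandwich covariance sketch (assumed data-generating design).
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3
X = rng.normal(size=(n, k))
e = rng.normal(size=n) * (1 + np.abs(X[:, 0]))   # heteroskedastic errors
Y = X @ np.array([1.0, 0.5, -0.25]) + e

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
ehat = Y - X @ beta_hat

Qxx_hat = X.T @ X / n                            # (1/n) sum X_i X_i'
Omega_hat = (X * ehat[:, None]**2).T @ X / n     # (1/n) sum X_i X_i' ehat_i^2
Qxx_inv = np.linalg.inv(Qxx_hat)

V_beta_hc0 = Qxx_inv @ Omega_hat @ Qxx_inv       # estimates avar of sqrt(n)(beta_hat - beta)
V_betahat_hc0 = V_beta_hc0 / n                   # estimates var of beta_hat itself
se = np.sqrt(np.diag(V_betahat_hc0))
print(beta_hat, se)
```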

As shown in Theorem 7.1, Q̂_XX^{-1} →_p Q_XX^{-1}, so we just need to verify the consistency of Ω̂. The key is to replace the squared residual êᵢ² with the squared error eᵢ², and then show that the difference is asymptotically negligible.

Specifically, observe that

$$\hat{\Omega}=\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\hat{e}_i^2=\frac{1}{n}\sum_{i=1}^{n}X_iX_i'e_i^2+\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\left(\hat{e}_i^2-e_i^2\right).$$

The first term is an average of the i.i.d. random variables XiXiei2, and therefore by the WLLN converges in probability to its expectation, namely,

$$\frac{1}{n}\sum_{i=1}^{n}X_iX_i'e_i^2\xrightarrow{p}E\left[XX'e^2\right]=\Omega.$$

Technically, this requires that Ω has finite elements, which was shown in (7.6).

To establish that Ω^ is consistent for Ω it remains to show that

$$\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\left(\hat{e}_i^2-e_i^2\right)\xrightarrow{p}0.$$

There are multiple ways to do this. A reasonably straightforward yet slightly tedious derivation is to start by applying the triangle inequality (B.16) using a matrix norm:

$$\left\|\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\left(\hat{e}_i^2-e_i^2\right)\right\|\le\frac{1}{n}\sum_{i=1}^{n}\left\|X_iX_i'\left(\hat{e}_i^2-e_i^2\right)\right\|=\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2\left|\hat{e}_i^2-e_i^2\right|.$$

Then recalling the expression for the squared residual (7.17), apply the triangle inequality (B.1) and then the Schwarz inequality (B.12) twice

$$\left|\hat{e}_i^2-e_i^2\right|\le2\left|e_iX_i'\left(\hat{\beta}-\beta\right)\right|+\left(\hat{\beta}-\beta\right)'X_iX_i'\left(\hat{\beta}-\beta\right)=2|e_i|\left|X_i'\left(\hat{\beta}-\beta\right)\right|+\left|X_i'\left(\hat{\beta}-\beta\right)\right|^2\le2|e_i|\,\|X_i\|\,\|\hat{\beta}-\beta\|+\|X_i\|^2\|\hat{\beta}-\beta\|^2.$$

Combining (7.21) and (7.22), we find

$$\frac{1}{n}\sum_{i=1}^{n}\left\|X_iX_i'\left(\hat{e}_i^2-e_i^2\right)\right\|\le2\left(\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^3|e_i|\right)\|\hat{\beta}-\beta\|+\left(\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^4\right)\|\hat{\beta}-\beta\|^2=o_p(1).$$

The expression is o_p(1) because β̂ − β →_p 0 and both averages in parentheses are averages of random variables with finite expectation under Assumption 7.2 (and are thus O_p(1)). Indeed, by Hölder's inequality (B.31),

$$E\left[\|X\|^3|e|\right]\le\left(E\left[\left(\|X\|^3\right)^{4/3}\right]\right)^{3/4}\left(E\left[e^4\right]\right)^{1/4}=\left(E\|X\|^4\right)^{3/4}\left(E\left[e^4\right]\right)^{1/4}<\infty.$$

We have established (7.20) as desired.

Theorem 7.6 Under Assumption 7.2, as n → ∞, Ω̂ →_p Ω and V̂_β^{HC0} →_p V_β.

For an alternative proof of this result, see Section 7.20.

7.8 Summary of Covariance Matrix Notation

The notation we have introduced may be somewhat confusing so it is helpful to write it down in one place.

The exact variance of β̂ (under the assumptions of the linear regression model) and the asymptotic variance of √n(β̂ − β) (under the more general assumptions of the linear projection model) are

$$V_{\hat{\beta}}=\operatorname{var}\left[\hat{\beta}\mid X\right]=\left(X'X\right)^{-1}\left(X'DX\right)\left(X'X\right)^{-1}$$
$$V_\beta=\operatorname{avar}\left[\sqrt{n}\left(\hat{\beta}-\beta\right)\right]=Q_{XX}^{-1}\Omega Q_{XX}^{-1}$$

The HC0 estimators of these two covariance matrices are

$$\hat{V}_{\hat{\beta}}^{\mathrm{HC0}}=\left(X'X\right)^{-1}\left(\sum_{i=1}^{n}X_iX_i'\hat{e}_i^2\right)\left(X'X\right)^{-1}$$
$$\hat{V}_\beta^{\mathrm{HC0}}=\hat{Q}_{XX}^{-1}\hat{\Omega}\hat{Q}_{XX}^{-1}$$

and satisfy the simple relationship V̂_β^{HC0} = nV̂_β̂^{HC0}.

Similarly, under the assumption of homoskedasticity the exact and asymptotic variances simplify to

$$V_{\hat{\beta}}^0=\left(X'X\right)^{-1}\sigma^2$$
$$V_\beta^0=Q_{XX}^{-1}\sigma^2.$$

Their standard estimators are

$$\hat{V}_{\hat{\beta}}^0=\left(X'X\right)^{-1}s^2$$
$$\hat{V}_\beta^0=\hat{Q}_{XX}^{-1}s^2$$

which also satisfy the relationship V̂_β⁰ = nV̂_β̂⁰.

The exact formulas and estimators are useful when constructing test statistics and standard errors. However, for theoretical purposes the asymptotic formulas (variances and their estimators) are more useful as these retain non-degenerate limits as the sample size diverges. That is why both sets of notation are useful.

7.9 Alternative Covariance Matrix Estimators*

In Section 7.7 we introduced V̂_β^{HC0} as an estimator of V_β. V̂_β^{HC0} is a scaled version of V̂_β̂^{HC0} from Section 4.14, where we also introduced the alternative HC1, HC2, and HC3 heteroskedasticity-robust covariance matrix estimators. We now discuss the consistency properties of these estimators.

To do so we introduce their scaled versions, e.g. V̂_β^{HC1} = nV̂_β̂^{HC1}, V̂_β^{HC2} = nV̂_β̂^{HC2}, and V̂_β^{HC3} = nV̂_β̂^{HC3}. These are (alternative) estimators of the asymptotic covariance matrix V_β. First, consider V̂_β^{HC1}. Notice that V̂_β^{HC1} = nV̂_β̂^{HC1} = (n/(n − k))V̂_β^{HC0}, where V̂_β^{HC0} was defined in (7.19) and shown consistent for V_β in Theorem 7.6. If k is fixed as n → ∞, then n/(n − k) → 1 and thus

$$\hat{V}_\beta^{\mathrm{HC1}}=(1+o(1))\hat{V}_\beta^{\mathrm{HC0}}\xrightarrow{p}V_\beta.$$

Thus V^βHC1 is consistent for Vβ.

The leverage-adjusted estimators take the form (7.19) but with Ω̂ replaced by

$$\tilde{\Omega}=\frac{1}{n}\sum_{i=1}^{n}\left(1-h_{ii}\right)^{-2}X_iX_i'\hat{e}_i^2$$

for HC3 and

$$\bar{\Omega}=\frac{1}{n}\sum_{i=1}^{n}\left(1-h_{ii}\right)^{-1}X_iX_i'\hat{e}_i^2$$

for HC2. To show that these estimators are also consistent for V_β, given Ω̂ →_p Ω, it is sufficient to show that the differences Ω̃ − Ω̂ and Ω̄ − Ω̂ converge in probability to zero as n → ∞.

The trick is the fact that the leverage values are asymptotically negligible:

$$h_n=\max_{1\le i\le n}h_{ii}=o_p(1).$$

(See Theorem 7.17 in Section 7.21.) Then using the triangle inequality (B.16)

$$\left\|\bar{\Omega}-\hat{\Omega}\right\|\le\frac{1}{n}\sum_{i=1}^{n}\left\|X_iX_i'\right\|\hat{e}_i^2\left|\left(1-h_{ii}\right)^{-1}-1\right|\le\left(\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2\hat{e}_i^2\right)\left|\left(1-h_n\right)^{-1}-1\right|.$$

The sum in parentheses can be shown to be O_p(1) under Assumption 7.2 by the same argument as in the proof of Theorem 7.6. (In fact, it can be shown to converge in probability to E[‖X‖²e²].) The term in absolute values is o_p(1) by (7.24). Thus the product is o_p(1), which means that Ω̄ = Ω̂ + o_p(1) →_p Ω.

Similarly,

$$\left\|\tilde{\Omega}-\hat{\Omega}\right\|\le\frac{1}{n}\sum_{i=1}^{n}\left\|X_iX_i'\right\|\hat{e}_i^2\left|\left(1-h_{ii}\right)^{-2}-1\right|\le\left(\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2\hat{e}_i^2\right)\left|\left(1-h_n\right)^{-2}-1\right|=o_p(1).$$

Theorem 7.7 Under Assumption 7.2, as n → ∞, Ω̃ →_p Ω, Ω̄ →_p Ω, V̂_β^{HC1} →_p V_β, V̂_β^{HC2} →_p V_β, and V̂_β^{HC3} →_p V_β.

Theorem 7.7 shows that the alternative covariance matrix estimators are also consistent for the asymptotic covariance matrix.
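The scaled HC1, HC2, and HC3 versions differ from HC0 only through a degrees-of-freedom factor and the leverage weights described above. The sketch below (assumed design and variable names; not the textbook's code) shows the adjustments side by side.

```python
# Leverage-adjusted covariance estimators (sketch; design assumed).
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 2
X = rng.normal(size=(n, k))
Y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
ehat = Y - X @ beta_hat
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)       # leverage values h_ii

def omega(weights):
    # (1/n) sum weights_i * ehat_i^2 * X_i X_i'
    return (X * (weights * ehat**2)[:, None]).T @ X / n

Qxx_inv = np.linalg.inv(X.T @ X / n)
def sandwich(Om):
    return Qxx_inv @ Om @ Qxx_inv

V_hc0 = sandwich(omega(np.ones(n)))
V_hc1 = n / (n - k) * V_hc0                        # degrees-of-freedom scaling
V_hc2 = sandwich(omega(1.0 / (1.0 - h)))           # weights (1 - h_ii)^(-1)
V_hc3 = sandwich(omega(1.0 / (1.0 - h)**2))        # weights (1 - h_ii)^(-2)
print(np.sqrt(np.diag(V_hc0 / n)), np.sqrt(np.diag(V_hc3 / n)))
```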

To simplify notation, for the remainder of the chapter we will use the notation V̂_β and V̂_β̂ to refer to any of the heteroskedasticity-consistent covariance matrix estimators HC0, HC1, HC2, and HC3, as they all have the same asymptotic limits.

7.10 Functions of Parameters

In most serious applications a researcher is actually interested in a specific transformation of the coefficient vector β = (β₁, …, β_k). For example, the researcher may be interested in a single coefficient β_j or a ratio β_j/β_l. More generally, interest may focus on a quantity such as consumer surplus which could be a complicated function of the coefficients. In any of these cases we can write the parameter of interest θ as a function of the coefficients, e.g. θ = r(β) for some function r : ℝ^k → ℝ^q. The estimator of θ is

$$\hat{\theta}=r(\hat{\beta}).$$

By the continuous mapping theorem (Theorem 6.6) and the fact that β̂ →_p β, we can deduce that θ̂ is consistent for θ if the function r(·) is continuous.

Theorem 7.8 Under Assumption 7.1, if r(β) is continuous at the true value of β, then as n → ∞, θ̂ →_p θ.

Furthermore, if the transformation is sufficiently smooth, by the Delta Method (Theorem 6.8) we can show that θ^ is asymptotically normal.

Assumption 7.3 r(β) : ℝ^k → ℝ^q is continuously differentiable at the true value of β and R = (∂/∂β) r(β)' has rank q.

Theorem 7.9 Asymptotic Distribution of Functions of Parameters. Under Assumptions 7.2 and 7.3, as n → ∞,

$$\sqrt{n}\left(\hat{\theta}-\theta\right)\xrightarrow{d}N\left(0,V_\theta\right)$$

where V_θ = R'V_βR.

In many cases the function r(β) is linear:

$$r(\beta)=R'\beta$$

for some k×q matrix R. In particular if R is a “selector matrix”

$$R=\begin{pmatrix}I\\0\end{pmatrix}$$

then we can partition β = (β₁', β₂')' so that R'β = β₁. Then

$$V_\theta=\begin{pmatrix}I&0\end{pmatrix}V_\beta\begin{pmatrix}I\\0\end{pmatrix}=V_{11},$$

the upper-left sub-matrix V₁₁ given in (7.14). In this case (7.25) states that

$$\sqrt{n}\left(\hat{\beta}_1-\beta_1\right)\xrightarrow{d}N\left(0,V_{11}\right).$$

That is, subsets of β̂ are approximately normal with variances given by the conformable subcomponents of V_β.

To illustrate the case of a nonlinear transformation, take the example θ = β_j/β_l for j ≠ l. Then

$$R=\frac{\partial}{\partial\beta}r(\beta)=\begin{pmatrix}\partial\left(\beta_j/\beta_l\right)/\partial\beta_1\\\vdots\\\partial\left(\beta_j/\beta_l\right)/\partial\beta_j\\\vdots\\\partial\left(\beta_j/\beta_l\right)/\partial\beta_l\\\vdots\\\partial\left(\beta_j/\beta_l\right)/\partial\beta_k\end{pmatrix}=\begin{pmatrix}0\\\vdots\\1/\beta_l\\\vdots\\-\beta_j/\beta_l^2\\\vdots\\0\end{pmatrix}$$

so

$$V_\theta=\frac{V_{jj}}{\beta_l^2}+\frac{V_{ll}\beta_j^2}{\beta_l^4}-\frac{2V_{jl}\beta_j}{\beta_l^3}$$

where Vab denotes the abth element of Vβ.

For inference we need an estimator of the asymptotic covariance matrix V_θ = R'V_βR. For this it is typical to use the plug-in estimator

$$\hat{R}=\frac{\partial}{\partial\beta}r(\hat{\beta})'.$$

The derivative in (7.27) may be calculated analytically or numerically. By analytically, we mean working out the formula for the derivative and replacing the unknowns by point estimates. For example, if θ = β_j/β_l then ∂r(β)/∂β is (7.26). However, in some cases the function r(β) may be extremely complicated and a formula for the analytic derivative may not be easily available. In this case numerical differentiation may be preferable. Let δ_l = (0 ⋯ 1 ⋯ 0)' be the unit vector with the "1" in the l-th place. The jl-th element of a numerical derivative R̂ is

$$\hat{R}_{jl}=\frac{r_j\left(\hat{\beta}+\delta_l\epsilon\right)-r_j(\hat{\beta})}{\epsilon}$$

for some small ε.
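The numerical-derivative recipe is straightforward to code. The sketch below (all estimates, covariance values, and the step size are hypothetical, chosen only for illustration) builds R̂ by one-sided differences for the ratio transformation θ = β_j/β_l and forms the delta-method variance estimate R̂'V̂_β̂R̂.

```python
# Delta-method sketch with a numerical Jacobian (values assumed for illustration).
import numpy as np

def numerical_jacobian(r, beta_hat, eps=1e-6):
    """R_hat[l, j] approximates d r_j / d beta_l (a k x q matrix)."""
    r0 = np.atleast_1d(r(beta_hat))
    k, q = beta_hat.size, r0.size
    R = np.empty((k, q))
    for l in range(k):
        delta = np.zeros(k)
        delta[l] = eps
        R[l, :] = (np.atleast_1d(r(beta_hat + delta)) - r0) / eps
    return R

beta_hat = np.array([1.0, 0.5, -0.25, 2.0])          # hypothetical estimates
V_betahat = np.diag([0.04, 0.01, 0.01, 0.09])        # hypothetical var(beta_hat)

r = lambda b: b[1] / b[2]                            # theta = beta_2 / beta_3
R_hat = numerical_jacobian(r, beta_hat)
V_thetahat = R_hat.T @ V_betahat @ R_hat             # delta-method variance estimate
print(r(beta_hat), np.sqrt(V_thetahat))              # point estimate and standard error
```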

The estimator of Vθ is

$$\hat{V}_\theta=\hat{R}'\hat{V}_\beta\hat{R}$$

Alternatively, the homoskedastic covariance matrix estimator could be used, leading to a homoskedastic covariance matrix estimator for θ:

$$\hat{V}_\theta^0=\hat{R}'\hat{V}_\beta^0\hat{R}=\hat{R}'\hat{Q}_{XX}^{-1}\hat{R}\,s^2.$$

Given (7.27), (7.28) and (7.29) are simple to calculate using matrix operations.

As the primary justification for V^θ is the asymptotic approximation (7.25), V^θ is often called an asymptotic covariance matrix estimator.

The estimator V̂_θ is consistent for V_θ under the conditions of Theorem 7.9 because V̂_β →_p V_β by Theorem 7.6 and

$$\hat{R}=\frac{\partial}{\partial\beta}r(\hat{\beta})'\xrightarrow{p}\frac{\partial}{\partial\beta}r(\beta)'=R$$

because β̂ →_p β and the function (∂/∂β)r(β)' is continuous in β.

Theorem 7.10 Under Assumptions 7.2 and 7.3, as n → ∞, V̂_θ →_p V_θ.

Theorem 7.10 shows that V^θ is consistent for Vθ and thus may be used for asymptotic inference. In practice we may set

$$\hat{V}_{\hat{\theta}}=\hat{R}'\hat{V}_{\hat{\beta}}\hat{R}=n^{-1}\hat{R}'\hat{V}_\beta\hat{R}$$

as an estimator of the variance of θ^.

7.11 Asymptotic Standard Errors

As described in Section 4.15, a standard error is an estimator of the standard deviation of the distribution of an estimator. Thus if V^β^ is an estimator of the covariance matrix of β^ then standard errors are the square roots of the diagonal elements of this matrix. These take the form

$$s(\hat{\beta}_j)=\sqrt{\hat{V}_{\hat{\beta}_j}}=\sqrt{\left[\hat{V}_{\hat{\beta}}\right]_{jj}}.$$

Standard errors for θ̂ are constructed similarly. Supposing that θ = r(β) is real-valued, then the standard error for θ̂ is the square root of (7.30):

$$s(\hat{\theta})=\sqrt{\hat{R}'\hat{V}_{\hat{\beta}}\hat{R}}=\sqrt{n^{-1}\hat{R}'\hat{V}_\beta\hat{R}}$$

When the justification is based on asymptotic theory we call s(β^j) or s(θ^) an asymptotic standard error for β^j or θ^. When reporting your results it is good practice to report standard errors for each reported estimate and this includes functions and transformations of your parameter estimates. This helps users of the work (including yourself) assess the estimation precision.

We illustrate using the log wage regression

$$\log(wage)=\beta_1\,education+\beta_2\,experience+\beta_3\,experience^2/100+\beta_4+e.$$

Consider the following three parameters of interest.

  1. Percentage return to education:

$$\theta_1=100\beta_1$$

(100 times the partial derivative of the conditional expectation of log( wage) with respect to education.)

  2. Percentage return to experience for individuals with 10 years of experience:

$$\theta_2=100\beta_2+20\beta_3$$

(100 times the partial derivative of the conditional expectation of log wages with respect to experience, evaluated at experience = 10.)

  3. Experience level which maximizes expected log wages:

$$\theta_3=-50\beta_2/\beta_3$$

(The level of experience at which the partial derivative of the conditional expectation of log(wage) with respect to experience equals 0 .)

The 4×1 vector R for these three parameters is

$$R=\begin{pmatrix}100\\0\\0\\0\end{pmatrix},\qquad R=\begin{pmatrix}0\\100\\20\\0\end{pmatrix},\qquad R=\begin{pmatrix}0\\-50/\beta_3\\50\beta_2/\beta_3^2\\0\end{pmatrix},$$

respectively.

We use the subsample of married Black women (all experience levels), which has 982 observations. The point estimates and standard errors are

$$\widehat{\log(wage)}=0.118\,education+0.016\,experience-0.022\,experience^2/100+0.947$$

with standard errors 0.008, 0.006, 0.012, and 0.157, respectively.

The standard errors are the square roots of the HC2 covariance matrix estimate

$$\hat{V}_{\hat{\beta}}=\begin{pmatrix}0.632&0.131&-0.143&-11.1\\0.131&0.390&-0.731&-6.25\\-0.143&-0.731&1.48&9.43\\-11.1&-6.25&9.43&246\end{pmatrix}\times10^{-4}.$$

We calculate that

$$\hat{\theta}_1=100\hat{\beta}_1=100\times0.118=11.8$$
$$s(\hat{\theta}_1)=\sqrt{100^2\times0.632\times10^{-4}}=0.8$$
$$\hat{\theta}_2=100\hat{\beta}_2+20\hat{\beta}_3=100\times0.016-20\times0.022=1.16$$
$$s(\hat{\theta}_2)=\sqrt{\begin{pmatrix}100&20\end{pmatrix}\begin{pmatrix}0.390&-0.731\\-0.731&1.48\end{pmatrix}\begin{pmatrix}100\\20\end{pmatrix}\times10^{-4}}=0.40$$
$$\hat{\theta}_3=-50\hat{\beta}_2/\hat{\beta}_3=50\times0.016/0.022=35.2$$

The calculations show that the estimate of the percentage return to education is about 12% per year with a standard error of 0.8. The estimate of the percentage return to experience for those with 10 years of experience is 1.2% per year with a standard error of 0.4. The estimate of the experience level which maximizes expected log wages is 35 years with a standard error of 7.
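These delta-method calculations can be reproduced directly from the covariance matrix (7.32); the sketch below repeats the arithmetic in code using the rounded point estimates reported above (which is why the last two printed numbers differ slightly from the reported 35.2 and 7).

```python
# Reproducing the delta-method calculations from the reported estimates (7.32).
import numpy as np

b = np.array([0.118, 0.016, -0.022, 0.947])          # reported point estimates
V = np.array([[ 0.632,  0.131, -0.143, -11.1 ],
              [ 0.131,  0.390, -0.731,  -6.25],
              [-0.143, -0.731,  1.48,    9.43],
              [-11.1,  -6.25,   9.43,  246.  ]]) * 1e-4

def se(R):                                           # s(theta_hat) = sqrt(R' V R)
    return float(np.sqrt(R @ V @ R))

R1 = np.array([100, 0, 0, 0])
R2 = np.array([0, 100, 20, 0])
R3 = np.array([0, -50 / b[2], 50 * b[1] / b[2]**2, 0])

print(100 * b[0], se(R1))                            # about 11.8 and 0.8
print(100 * b[1] + 20 * b[2], se(R2))                # about 1.16 and 0.4
print(-50 * b[1] / b[2], se(R3))                     # about 36 and 7.5 with rounded inputs
```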

In Stata the nlcom command can be used after estimation to perform the same calculations. To illustrate, after estimation of (7.31) use the commands given below. In each case, Stata reports the coefficient estimate, asymptotic standard error, and 95% confidence interval.

Stata Commands

nlcom 100*_b[education]
nlcom 100*_b[experience] + 20*_b[exp2]
nlcom -50*_b[experience]/_b[exp2]

7.12 t-statistic

Let θ = r(β) : ℝ^k → ℝ be a parameter of interest, θ̂ its estimator, and s(θ̂) its asymptotic standard error. Consider the statistic

$$T(\theta)=\frac{\hat{\theta}-\theta}{s(\hat{\theta})}.$$

Different writers call (7.33) a t-statistic, a t-ratio, a z-statistic, or a studentized statistic, sometimes using the different labels to distinguish between finite-sample and asymptotic inference. As the statistics themselves are always (7.33) we won’t make this distinction, and will simply refer to T(θ) as a t-statistic or a t-ratio. We also often suppress the parameter dependence, writing it as T. The t-statistic is a function of the estimator, its standard error, and the parameter.

By Theorems 7.9 and 7.10, √n(θ̂ − θ) →_d N(0, V_θ) and V̂_θ →_p V_θ. Thus

$$T(\theta)=\frac{\hat{\theta}-\theta}{s(\hat{\theta})}=\frac{\sqrt{n}\left(\hat{\theta}-\theta\right)}{\sqrt{\hat{V}_\theta}}\xrightarrow{d}\frac{N\left(0,V_\theta\right)}{\sqrt{V_\theta}}=Z\sim N(0,1).$$

The last equality is the property that affine functions of normal variables are normal (Theorem 5.2).

This calculation requires that Vθ>0, otherwise the continuous mapping theorem cannot be employed. In practice this is an innocuous requirement as it only excludes degenerate sampling distributions. Formally we add the following assumption.

Assumption 7.4 V_θ = R'V_βR > 0.

Assumption 7.4 states that Vθ is positive definite. Since R is full rank under Assumption 7.3 a sufficient condition is that Vβ>0. Since QXX>0 a sufficient condition is Ω>0. Thus Assumption 7.4 could be replaced by the assumption Ω>0. Assumption 7.4 is weaker so this is what we use.

Thus the asymptotic distribution of the t-ratio T(θ) is standard normal. Since this distribution does not depend on the parameters we say that T(θ) is asymptotically pivotal. In finite samples T(θ) is not necessarily pivotal, but the property means that the dependence on unknowns diminishes as n increases. It is also useful to consider the distribution of the absolute t-ratio |T(θ)|. Since T(θ) →_d Z, the continuous mapping theorem yields |T(θ)| →_d |Z|. Letting Φ(u) = P[Z ≤ u] denote the standard normal distribution function, we calculate that the distribution of |Z| is

$$P[|Z|\le u]=P[-u\le Z\le u]=P[Z\le u]-P[Z<-u]=\Phi(u)-\Phi(-u)=2\Phi(u)-1.$$

Theorem 7.11 Under Assumptions 7.2, 7.3, and 7.4, T(θ) →_d Z ~ N(0,1) and |T(θ)| →_d |Z|.

The asymptotic normality of Theorem 7.11 is used to justify confidence intervals and tests for the parameters.

7.13 Confidence Intervals

The estimator θ̂ is a point estimator for θ, meaning that θ̂ is a single value in ℝ^q. A broader concept is a set estimator Ĉ which is a collection of values in ℝ^q. When the parameter θ is real-valued then it is common to focus on sets of the form Ĉ = [L̂, Û], which is called an interval estimator for θ.

An interval estimator Ĉ is a function of the data and hence is random. The coverage probability of the interval Ĉ = [L̂, Û] is P[θ ∈ Ĉ]. The randomness comes from Ĉ, as the parameter θ is treated as fixed. In Section 5.10 we introduced confidence intervals for the normal regression model which used the finite sample distribution of the t-statistic. When we are outside the normal regression model we cannot rely on the exact normal distribution theory but instead use asymptotic approximations. A benefit is that we can construct confidence intervals for general parameters of interest θ, not just regression coefficients.

An interval estimator Ĉ is called a confidence interval when the goal is to set the coverage probability to equal a pre-specified target such as 90% or 95%. Ĉ is called a 1 − α confidence interval if inf_θ P_θ[θ ∈ Ĉ] = 1 − α.

When θ^ is asymptotically normal with standard error s(θ^) the conventional confidence interval for θ takes the form

$$\hat{C}=\left[\hat{\theta}-c\times s(\hat{\theta}),\;\hat{\theta}+c\times s(\hat{\theta})\right]$$

where c equals the 1 − α quantile of the distribution of |Z|. Using (7.34) we calculate that c is equivalently the 1 − α/2 quantile of the standard normal distribution. Thus, c solves

$$2\Phi(c)-1=1-\alpha.$$

This can be computed by, for example, norminv(1−α/2) in MATLAB. The confidence interval (7.35) is symmetric about the point estimator θ̂ and its length is proportional to the standard error s(θ̂).
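The same critical value is available from scipy.stats in Python; in the sketch below the point estimate and standard error are illustrative assumptions.

```python
# Normal critical value and confidence interval (7.35); example numbers assumed.
from scipy.stats import norm

alpha = 0.05
c = norm.ppf(1 - alpha / 2)          # 1 - alpha/2 quantile of N(0,1), about 1.96

theta_hat, se = 11.8, 0.8            # assumed point estimate and standard error
ci = (theta_hat - c * se, theta_hat + c * se)
print(c, ci)
```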

Equivalently, (7.35) is the set of parameter values for θ such that the t-statistic T(θ) is smaller (in absolute value) than c, that is

$$\hat{C}=\left\{\theta:|T(\theta)|\le c\right\}=\left\{\theta:-c\le\frac{\hat{\theta}-\theta}{s(\hat{\theta})}\le c\right\}.$$

The coverage probability of this confidence interval is

$$P[\theta\in\hat{C}]=P[|T(\theta)|\le c]\longrightarrow P[|Z|\le c]=1-\alpha$$

where the limit is taken as n → ∞, and holds because |T(θ)| is asymptotically distributed as |Z| by Theorem 7.11. We call the limit the asymptotic coverage probability and call Ĉ an asymptotic 1 − α confidence interval for θ. Since the t-ratio is asymptotically pivotal, the asymptotic coverage probability is independent of the parameter θ.

It is useful to contrast the confidence interval (7.35) with (5.8) for the normal regression model. They are similar, but there are differences. The normal regression interval (5.8) only applies to regression coefficients β, not to functions θ of the coefficients. The normal interval (5.8) also is constructed with the homoskedastic standard error, while (7.35) can be constructed with a heteroskedastic-robust standard error. Furthermore, the constants c in (5.8) are calculated using the student t distribution, while c in (7.35) is calculated using the normal distribution. The difference between the student t and normal values is typically small in practice (since sample sizes are large in typical economic applications). However, since the student t values are larger, they result in slightly larger confidence intervals, which is reasonable. (A practical rule of thumb is that if the sample sizes are sufficiently small that it makes a difference, then neither (5.8) nor (7.35) should be trusted.) Despite these differences, the similarity of the intervals means that inference on regression coefficients is generally robust to using either the exact normal sampling assumption or the asymptotic large sample approximation, at least in large samples.

Stata by default reports 95% confidence intervals for each coefficient where the critical values c are calculated using the t_{n−k} distribution. This is done for all standard error methods even though it is only exact for homoskedastic standard errors and under normality.

The standard coverage probability for confidence intervals is 95%, leading to the choice c = 1.96 for the constant in (7.35). Rounding 1.96 to 2, we obtain the most commonly used confidence interval in applied econometric practice:

$$\hat{C}=\left[\hat{\theta}-2s(\hat{\theta}),\;\hat{\theta}+2s(\hat{\theta})\right].$$

This is a useful rule of thumb. This asymptotic 95% confidence interval Ĉ is simple to compute and can be roughly calculated from tables of coefficient estimates and standard errors. (Technically, it is an asymptotic 95.4% interval due to the substitution of 2.0 for 1.96, but this distinction is overly precise.)

Theorem 7.12 Under Assumptions 7.2, 7.3 and 7.4, for Ĉ defined in (7.35) with c = Φ^{-1}(1 − α/2), P[θ ∈ Ĉ] → 1 − α. For c = 1.96, P[θ ∈ Ĉ] → 0.95.

Confidence intervals are a simple yet effective tool to assess estimation uncertainty. When reading a set of empirical results, look at the estimated coefficient estimates and the standard errors. For a parameter of interest, compute the confidence interval Ĉ and consider the meaning of the spread of the suggested values. If the range of values in the confidence interval is too wide to learn about θ, then do not jump to a conclusion about θ based on the point estimate alone.

For illustration, consider the three examples presented in Section 7.11 based on the log wage regression for married Black women.

Percentage return to education. A 95% asymptotic confidence interval is 11.8±1.96×0.8=[10.2, 13.3]. This is reasonably tight.

Percentage return to experience (per year) for individuals with 10 years of experience. A 90% asymptotic confidence interval is 1.1 ± 1.645 × 0.4 = [0.5, 1.8]. The interval is positive but broad. This indicates that the return to experience is positive, but of uncertain magnitude.

Experience level which maximizes expected log wages. An 80% asymptotic confidence interval is 35 ± 1.28 × 7 = [26, 44]. This is rather imprecise, indicating that the estimates are not very informative regarding this parameter.

7.14 Regression Intervals

In the linear regression model the conditional expectation of Y given X=x is

$$m(x)=E[Y\mid X=x]=x'\beta.$$

In some cases we want to estimate m(x) at a particular point x. Notice that this is a linear function of β. Letting r(β) = x'β and θ = r(β), we see that m̂(x) = θ̂ = x'β̂ and R = x, so s(θ̂) = √(x'V̂_β̂x). Thus an asymptotic 95% confidence interval for m(x) is

$$\left[x'\hat{\beta}\pm1.96\sqrt{x'\hat{V}_{\hat{\beta}}x}\right].$$

It is interesting to observe that if this is viewed as a function of x the width of the confidence interval is dependent on x.

To illustrate we return to the log wage regression (3.12) of Section 3.7. The estimated regression equation is

$$\widehat{\log(wage)}=x'\hat{\beta}=0.155x+0.698$$

where x=education. The covariance matrix estimate from (4.43) is

$$\hat{V}_{\hat{\beta}}=\begin{pmatrix}0.001&-0.015\\-0.015&0.243\end{pmatrix}.$$

Thus the 95% confidence interval for the regression is

$$0.155x+0.698\pm1.96\sqrt{0.001x^2-0.030x+0.243}.$$

The estimated regression and 95% intervals are shown in Figure 7.4(a). Notice that the confidence bands take a hyperbolic shape. This means that the regression line is less precisely estimated for large and small values of education.
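The pointwise interval can be traced over values of education using the estimates just reported; a sketch:

```python
# Pointwise regression interval sketch: x' beta_hat +/- 1.96 sqrt(x' V_hat x),
# using the slope/intercept and covariance matrix reported above.
import numpy as np

beta_hat = np.array([0.155, 0.698])                 # slope and intercept
V_hat = np.array([[0.001, -0.015],
                  [-0.015, 0.243]])                 # covariance matrix of beta_hat

for educ in (8.0, 12.0, 16.0):
    x = np.array([educ, 1.0])
    fit = x @ beta_hat
    se = np.sqrt(x @ V_hat @ x)
    print(educ, fit - 1.96 * se, fit + 1.96 * se)   # interval widens away from the center
```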

Plots of the estimated regression line and confidence intervals are especially useful when the regression includes nonlinear terms. To illustrate consider the log wage regression (7.31) which includes experience and its square and covariance matrix estimate (7.32). We are interested in plotting the regression estimate and regression intervals as a function of experience. Since the regression also includes education, to plot the estimates in a simple graph we fix education at a specific value. We select education=12. This only affects the level of the estimated regression since education enters without an interaction. Define the points of evaluation

$$z(x)=\begin{pmatrix}12\\x\\x^2/100\\1\end{pmatrix}$$

where x= experience.

Figure 7.4: Regression Intervals. (a) Wage on Education; (b) Wage on Experience.

The 95% regression interval for education = 12, as a function of x = experience, is

$$\begin{aligned}
0.118\times12&+0.016x-0.022x^2/100+0.947\\
&\pm1.96\sqrt{z(x)'\begin{pmatrix}0.632&0.131&-0.143&-11.1\\0.131&0.390&-0.731&-6.25\\-0.143&-0.731&1.48&9.43\\-11.1&-6.25&9.43&246\end{pmatrix}z(x)\times10^{-4}}\\
&=0.016x-0.00022x^2+2.36\\
&\quad\pm0.0196\sqrt{70.608-9.356x+0.54428x^2-0.01462x^3+0.000148x^4}
\end{aligned}$$

The estimated regression and 95% intervals are shown in Figure 7.4(b). The regression interval widens greatly for small and large values of experience indicating considerable uncertainty about the effect of experience on mean wages for this population. The confidence bands take a more complicated shape than in Figure 7.4(a) due to the nonlinear specification.

7.15 Forecast Intervals

Suppose we are given a value of the regressor vector X_{n+1} for an individual outside the sample and we want to forecast (guess) Y_{n+1} for this individual. This is equivalent to forecasting Y_{n+1} given X_{n+1} = x, which will generally be a function of x. A reasonable forecasting rule is the conditional expectation m(x) as it is the mean-square-minimizing forecast. A point forecast is the estimated conditional expectation m̂(x) = x'β̂. We would also like a measure of uncertainty for the forecast.

The forecast error is ê_{n+1} = Y_{n+1} − m̂(x) = e_{n+1} − x'(β̂ − β). As the out-of-sample error e_{n+1} is independent of the in-sample estimator β̂, this has conditional variance

$$\begin{aligned}
E\left[\hat{e}_{n+1}^2\mid X_{n+1}=x\right]&=E\left[e_{n+1}^2-2x'\left(\hat{\beta}-\beta\right)e_{n+1}+x'\left(\hat{\beta}-\beta\right)\left(\hat{\beta}-\beta\right)'x\mid X_{n+1}=x\right]\\
&=E\left[e_{n+1}^2\mid X_{n+1}=x\right]+x'E\left[\left(\hat{\beta}-\beta\right)\left(\hat{\beta}-\beta\right)'\right]x\\
&=\sigma^2(x)+x'V_{\hat{\beta}}x.
\end{aligned}$$

Under homoskedasticity, E[e²_{n+1} | X_{n+1}] = σ². In this case a simple estimator of (7.36) is σ̂² + x'V̂_β̂x, so a standard error for the forecast is ŝ(x) = √(σ̂² + x'V̂_β̂x). Notice that this is different from the standard error for the conditional expectation.
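A sketch of this forecast standard error under homoskedasticity; the value of σ̂² is assumed and the coefficient estimates reuse the illustrative numbers from Section 7.14.

```python
# Forecast standard error sketch: s_hat(x) = sqrt(sigma2_hat + x' V_betahat x).
import numpy as np

sigma2_hat = 0.25                                   # assumed error variance estimate
V_betahat = np.array([[0.001, -0.015],
                      [-0.015, 0.243]])             # var(beta_hat) estimate (illustrative)
beta_hat = np.array([0.155, 0.698])

x = np.array([12.0, 1.0])                           # out-of-sample regressor value
point = x @ beta_hat                                # point forecast x' beta_hat
s_hat = np.sqrt(sigma2_hat + x @ V_betahat @ x)
print(point - 2 * s_hat, point + 2 * s_hat)         # approximate forecast interval
```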

The conventional 95% forecast interval for Y_{n+1} uses a normal approximation and equals [x'β̂ ± 2ŝ(x)]. It is difficult, however, to fully justify this choice. It would be correct if we had a normal approximation to the ratio

$$\frac{e_{n+1}-x'\left(\hat{\beta}-\beta\right)}{\hat{s}(x)}.$$

The difficulty is that the equation error e_{n+1} is generally non-normal and asymptotic theory cannot be applied to a single observation. The only special exception is the case where e_{n+1} has the exact distribution N(0, σ²), which is generally invalid.

An accurate forecast interval would use the conditional distribution of e_{n+1} given X_{n+1} = x, which is more challenging to estimate. Due to this difficulty many applied forecasters use the simple approximate interval [x'β̂ ± 2ŝ(x)] despite the lack of a convincing justification.

7.16 Wald Statistic

Let θ = r(β) : ℝ^k → ℝ^q be any parameter vector of interest, θ̂ its estimator, and V̂_θ̂ its covariance matrix estimator. Consider the quadratic form

$$W(\theta)=\left(\hat{\theta}-\theta\right)'\hat{V}_{\hat{\theta}}^{-1}\left(\hat{\theta}-\theta\right)=n\left(\hat{\theta}-\theta\right)'\hat{V}_\theta^{-1}\left(\hat{\theta}-\theta\right)$$

where V̂_θ = nV̂_θ̂. When q = 1, W(θ) = T(θ)² is the square of the t-ratio. When q > 1, W(θ) is typically called a Wald statistic as it was proposed by Wald (1943). We are interested in its sampling distribution.

The asymptotic distribution of W(θ) is simple to derive given Theorem 7.9 and Theorem 7.10, which show that √n(θ̂ − θ) →_d Z ~ N(0, V_θ) and V̂_θ →_p V_θ. It follows that

$$W(\theta)=\sqrt{n}\left(\hat{\theta}-\theta\right)'\hat{V}_\theta^{-1}\sqrt{n}\left(\hat{\theta}-\theta\right)\xrightarrow{d}Z'V_\theta^{-1}Z,$$

a quadratic form in the normal random vector Z. As shown in Theorem 5.3.5, the distribution of this quadratic form is χ²_q, a chi-square random variable with q degrees of freedom.

Theorem 7.13 Under Assumptions 7.2, 7.3 and 7.4, as n → ∞, W(θ) →_d χ²_q.

Theorem 7.13 is used to justify multivariate confidence regions and multivariate hypothesis tests.
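A minimal sketch of the Wald statistic and its asymptotic chi-square p-value follows; the estimates and covariance matrix are taken from the example in Section 7.18, while the hypothesized value θ₀ is my own placeholder.

```python
# Wald statistic sketch: W = (theta_hat - theta0)' V_thetahat^{-1} (theta_hat - theta0).
import numpy as np
from scipy.stats import chi2

theta_hat = np.array([11.8, 1.2])                   # estimates (Section 7.18 example)
V_thetahat = np.array([[0.632, 0.103],
                       [0.103, 0.157]])             # var(theta_hat) (Section 7.18 example)
theta0 = np.array([10.0, 1.0])                      # hypothesized value (placeholder)

d = theta_hat - theta0
W = d @ np.linalg.solve(V_thetahat, d)
q = theta_hat.size
print(W, 1 - chi2.cdf(W, df=q))                     # statistic and asymptotic p-value
```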

7.17 Homoskedastic Wald Statistic

Under the conditional homoskedasticity assumption E[e² | X] = σ² we can construct the Wald statistic using the homoskedastic covariance matrix estimator V̂_θ⁰ defined in (7.29). This yields a homoskedastic Wald statistic

$$W^0(\theta)=\left(\hat{\theta}-\theta\right)'\left(\hat{V}_{\hat{\theta}}^0\right)^{-1}\left(\hat{\theta}-\theta\right)=n\left(\hat{\theta}-\theta\right)'\left(\hat{V}_\theta^0\right)^{-1}\left(\hat{\theta}-\theta\right).$$

Under the assumption of conditional homoskedasticity it has the same asymptotic distribution as W(θ).

Theorem 7.14 Under Assumptions 7.2, 7.3, and E[e² | X] = σ² > 0, as n → ∞, W⁰(θ) →_d χ²_q.

7.18 Confidence Regions

A confidence region Ĉ is a set estimator for θ ∈ ℝ^q when q > 1. A confidence region Ĉ is a set in ℝ^q intended to cover the true parameter value with a pre-selected probability 1 − α. Thus an ideal confidence region has the coverage probability P[θ ∈ Ĉ] = 1 − α. In practice it is typically not possible to construct a region with exact coverage, but we can calculate its asymptotic coverage.

When the parameter estimator satisfies the conditions of Theorem 7.13 a good choice for a confidence region is the ellipse

$$\hat{C}=\left\{\theta:W(\theta)\le c_{1-\alpha}\right\}$$

with c_{1−α} the 1 − α quantile of the χ²_q distribution. (Thus F_q(c_{1−α}) = 1 − α.) It can be computed by, for example, chi2inv(1−α, q) in MATLAB.

Theorem 7.13 implies

$$P[\theta\in\hat{C}]\longrightarrow P\left[\chi_q^2\le c_{1-\alpha}\right]=1-\alpha$$

which shows that Ĉ has asymptotic coverage 1 − α.

To illustrate the construction of a confidence region, consider the estimated regression (7.31) of Section 7.11.

Suppose that the two parameters of interest are the percentage return to education θ1=100β1 and the percentage return to experience for individuals with 10 years experience θ2=100β2+20β3. These two parameters are a linear transformation of the regression parameters with point estimates

$$\hat{\theta}=\begin{pmatrix}100&0&0&0\\0&100&20&0\end{pmatrix}\hat{\beta}=\begin{pmatrix}11.8\\1.2\end{pmatrix},$$

and have the covariance matrix estimate

$$\hat{V}_{\hat{\theta}}=\begin{pmatrix}100&0&0&0\\0&100&20&0\end{pmatrix}\hat{V}_{\hat{\beta}}\begin{pmatrix}100&0\\0&100\\0&20\\0&0\end{pmatrix}=\begin{pmatrix}0.632&0.103\\0.103&0.157\end{pmatrix}$$

with inverse

$$\hat{V}_{\hat{\theta}}^{-1}=\begin{pmatrix}1.77&-1.16\\-1.16&7.13\end{pmatrix}.$$

Thus the Wald statistic is

$$\begin{aligned}
W(\theta)&=\left(\hat{\theta}-\theta\right)'\hat{V}_{\hat{\theta}}^{-1}\left(\hat{\theta}-\theta\right)\\
&=\begin{pmatrix}11.8-\theta_1\\1.2-\theta_2\end{pmatrix}'\begin{pmatrix}1.77&-1.16\\-1.16&7.13\end{pmatrix}\begin{pmatrix}11.8-\theta_1\\1.2-\theta_2\end{pmatrix}\\
&=1.77\left(11.8-\theta_1\right)^2-2.32\left(11.8-\theta_1\right)\left(1.2-\theta_2\right)+7.13\left(1.2-\theta_2\right)^2.
\end{aligned}$$

The 90% quantile of the χ²₂ distribution is 4.605 (we use the χ²₂ distribution as the dimension of θ is two), so an asymptotic 90% confidence region for the two parameters is the interior of the ellipse W(θ) = 4.605, which is displayed in Figure 7.5. Since the estimated correlation of the two coefficient estimates is modest (about 0.3), the region is modestly elliptical.
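The ingredients of this region can be reproduced from the covariance matrix (7.32); a sketch:

```python
# Reproducing the confidence-region ingredients from (7.32).
import numpy as np
from scipy.stats import chi2

V_beta = np.array([[ 0.632,  0.131, -0.143, -11.1 ],
                   [ 0.131,  0.390, -0.731,  -6.25],
                   [-0.143, -0.731,  1.48,    9.43],
                   [-11.1,  -6.25,   9.43,  246.  ]]) * 1e-4
Rt = np.array([[100,   0,  0, 0],
               [  0, 100, 20, 0]])                   # rows select theta1 and theta2

V_theta = Rt @ V_beta @ Rt.T
print(V_theta)                                       # [[0.632, 0.103], [0.103, 0.157]]
print(np.linalg.inv(V_theta))                        # [[1.77, -1.16], [-1.16, 7.13]]
print(chi2.ppf(0.90, df=2))                          # 4.605, the 90% chi-square quantile
```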

Figure 7.5: Confidence Region for Return to Experience and Return to Education

7.19 Edgeworth Expansion*

Theorem 7.11 showed that the t-ratio T(θ) is asymptotically normal. In practice this means that we use the normal distribution to approximate the finite sample distribution of T. How good is this approximation? Some insight into the accuracy of the normal approximation can be obtained by an Edgeworth expansion which is a higher-order approximation to the distribution of T. The following result is an application of Theorem 9.11 of Probability and Statistics for Economists.

Theorem 7.15 Under Assumptions 7.2 and 7.3, if Ω > 0, E[e^16] < ∞, E‖X‖^16 < ∞, r(β) has five continuous derivatives in a neighborhood of β, and E[exp(t(e⁴ + ‖X‖⁴))] ≤ B < ∞, then as n → ∞,

$$P[T(\theta)\le x]=\Phi(x)+n^{-1/2}p_1(x)\phi(x)+n^{-1}p_2(x)\phi(x)+o\left(n^{-1}\right)$$

uniformly in x, where p1(x) is an even polynomial of order 2 and p2(x) is an odd polynomial of degree 5 with coefficients depending on the moments of e and X up to order 16.

Theorem 7.15 shows that the finite sample distribution of the t-ratio can be approximated up to o(n^{-1}) by the sum of three terms, the first being the standard normal distribution, the second an O(n^{-1/2}) adjustment, and the third an O(n^{-1}) adjustment.

Consider a one-sided confidence interval Ĉ = [θ̂ − z_{1−α}s(θ̂), ∞) where z_{1−α} is the 1 − α quantile of Z ~ N(0,1), thus Φ(z_{1−α}) = 1 − α. Then

$$P[\theta\in\hat{C}]=P\left[T(\theta)\le z_{1-\alpha}\right]=\Phi\left(z_{1-\alpha}\right)+n^{-1/2}p_1\left(z_{1-\alpha}\right)\phi\left(z_{1-\alpha}\right)+O\left(n^{-1}\right)=1-\alpha+O\left(n^{-1/2}\right).$$

This means that the actual coverage is within O(n^{-1/2}) of the desired 1 − α level.

Now consider a two-sided interval Ĉ = [θ̂ − z_{1−α/2}s(θ̂), θ̂ + z_{1−α/2}s(θ̂)]. It has coverage

$$P[\theta\in\hat{C}]=P\left[|T(\theta)|\le z_{1-\alpha/2}\right]=2\Phi\left(z_{1-\alpha/2}\right)-1+2n^{-1}p_2\left(z_{1-\alpha/2}\right)\phi\left(z_{1-\alpha/2}\right)+o\left(n^{-1}\right)=1-\alpha+O\left(n^{-1}\right).$$

This means that the actual coverage is within O(n^{-1}) of the desired 1 − α level. The accuracy is better than for the one-sided interval because the O(n^{-1/2}) term in the Edgeworth expansion has offsetting effects in the two tails of the distribution.

7.20 Uniformly Consistent Residuals*

It seems natural to view the residuals e^i as estimators of the unknown errors ei. Are they consistent? In this section we develop a convergence result.

We can write the residual as

$$\hat{e}_i=Y_i-X_i'\hat{\beta}=e_i-X_i'\left(\hat{\beta}-\beta\right).$$

Since β̂ − β →_p 0, it seems reasonable to guess that ê_i will be close to e_i if n is large.

We can bound the difference in (7.39) using the Schwarz inequality (B.12) to find

$$\left|\hat{e}_i-e_i\right|=\left|X_i'\left(\hat{\beta}-\beta\right)\right|\le\|X_i\|\,\|\hat{\beta}-\beta\|.$$

To bound (7.40) we can use ‖β̂ − β‖ = O_p(n^{-1/2}) from Theorem 7.3. We also need to bound the random variable ‖X_i‖. If the regressor is bounded, that is, ‖X_i‖ ≤ B < ∞, then |ê_i − e_i| ≤ B‖β̂ − β‖ = O_p(n^{-1/2}). However, if the regressor does not have bounded support then we have to be more careful.

The key is Theorem 6.15 which shows that E‖X‖^r < ∞ implies ‖X_i‖ = o_p(n^{1/r}) uniformly in i, or

$$n^{-1/r}\max_{1\le i\le n}\|X_i\|\xrightarrow{p}0.$$

Applied to (7.40) we obtain

$$\max_{1\le i\le n}\left|\hat{e}_i-e_i\right|\le\max_{1\le i\le n}\|X_i\|\,\|\hat{\beta}-\beta\|=o_p\left(n^{-1/2+1/r}\right).$$

We have shown the following.

Theorem 7.16 Under Assumption 7.2 and E‖X‖^r < ∞,

$$\max_{1\le i\le n}\left|\hat{e}_i-e_i\right|=o_p\left(n^{-1/2+1/r}\right).$$

The rate of convergence in (7.41) depends on r. Assumption 7.2 requires r ≥ 4, so the rate of convergence is at least o_p(n^{-1/4}). As r increases the rate improves.

We mentioned in Section 7.7 that there are multiple ways to prove the consistency of the covariance matrix estimator Ω̂. We now show that Theorem 7.16 provides one simple method to establish (7.23) and thus Theorem 7.6. Let q_n = max_{1≤i≤n}|ê_i − e_i| = o_p(n^{-1/4}). Since ê_i² − e_i² = 2e_i(ê_i − e_i) + (ê_i − e_i)², then

$$\begin{aligned}
\left\|\frac{1}{n}\sum_{i=1}^{n}X_iX_i'\left(\hat{e}_i^2-e_i^2\right)\right\|&\le\frac{1}{n}\sum_{i=1}^{n}\left\|X_iX_i'\right\|\left|\hat{e}_i^2-e_i^2\right|\\
&\le\frac{2}{n}\sum_{i=1}^{n}\|X_i\|^2|e_i|\left|\hat{e}_i-e_i\right|+\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2\left|\hat{e}_i-e_i\right|^2\\
&\le\frac{2}{n}\sum_{i=1}^{n}\|X_i\|^2|e_i|\,q_n+\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2q_n^2\\
&\le o_p\left(n^{-1/4}\right).
\end{aligned}$$

7.21 Asymptotic Leverage*

Recall the definition of leverage from (3.40): h_ii = X_i'(X'X)^{-1}X_i. These are the diagonal elements of the projection matrix P and appear in the formula for leave-one-out prediction errors and the HC2 and HC3 covariance matrix estimators. We can show that under i.i.d. sampling the leverage values are uniformly asymptotically small.

Let λ_min(A) and λ_max(A) denote the smallest and largest eigenvalues of a symmetric square matrix A and note that λ_max(A^{-1}) = (λ_min(A))^{-1}. Since (1/n)X'X →_p Q_XX > 0, by the CMT, λ_min((1/n)X'X) →_p λ_min(Q_XX) > 0. (The latter is positive since Q_XX is positive definite and thus all its eigenvalues are positive.) Then by the Quadratic Inequality (B.18),

$$\begin{aligned}
h_{ii}&=X_i'\left(X'X\right)^{-1}X_i\\
&\le\lambda_{\max}\left(\left(X'X\right)^{-1}\right)\left(X_i'X_i\right)\\
&=\left(\lambda_{\min}\left(\frac{1}{n}X'X\right)\right)^{-1}\frac{1}{n}\|X_i\|^2\\
&\le\left(\lambda_{\min}\left(Q_{XX}\right)+o_p(1)\right)^{-1}\frac{1}{n}\max_{1\le i\le n}\|X_i\|^2.
\end{aligned}$$

Theorem 6.15 shows that E‖X‖^r < ∞ implies max_{1≤i≤n}‖X_i‖² = (max_{1≤i≤n}‖X_i‖)² = o_p(n^{2/r}), and thus (7.42) is o_p(n^{2/r−1}).

Theorem 7.17 If X_i is i.i.d., Q_XX > 0, and E‖X‖^r < ∞ for some r ≥ 2, then max_{1≤i≤n} h_ii = o_p(n^{2/r−1}).

For any r ≥ 2, h_ii = o_p(1) (uniformly in i ≤ n). Larger r implies a faster rate of convergence. For example, r = 4 implies h_ii = o_p(n^{-1/2}).

Theorem 7.17 implies that under random sampling with finite variances and large samples, no individual observation should have a large leverage value. Consequently, individual observations should not be influential unless one of these conditions is violated.
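Leverage values are inexpensive to compute directly. The sketch below (assumed design, not from the text) extracts h_ii as the diagonal of the projection matrix and shows the maximum shrinking as n grows.

```python
# Leverage sketch: h_ii = X_i' (X'X)^{-1} X_i, the diagonal of the projection matrix.
import numpy as np

rng = np.random.default_rng(0)
for n in [50, 500, 5000]:
    X = rng.normal(size=(n, 3))
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    print(n, h.max())                 # maximum leverage falls toward zero
```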

7.22 Exercises

Exercise 7.1 Take the model Y = X₁'β₁ + X₂'β₂ + e with E[Xe] = 0. Suppose that β₁ is estimated by regressing Y on X₁ only. Find the probability limit of this estimator. In general, is it consistent for β₁? If not, under what conditions is this estimator consistent for β₁?

Exercise 7.2 Take the model Y=Xβ+e with E[Xe]=0. Define the ridge regression estimator

$$\hat{\beta}=\left(\sum_{i=1}^{n}X_iX_i'+\lambda I_k\right)^{-1}\left(\sum_{i=1}^{n}X_iY_i\right)$$

where λ > 0 is a fixed constant. Find the probability limit of β̂ as n → ∞. Is β̂ consistent for β?

Exercise 7.3 For the ridge regression estimator (7.43), set λ = cn where c > 0 is fixed as n → ∞. Find the probability limit of β̂ as n → ∞.

Exercise 7.4 Verify some of the calculations reported in Section 7.4. Specifically, suppose that X₁ and X₂ only take the values {−1, +1}, symmetrically, with

$$\begin{aligned}
P\left[X_1=X_2=1\right]&=P\left[X_1=X_2=-1\right]=3/8\\
P\left[X_1=1,X_2=-1\right]&=P\left[X_1=-1,X_2=1\right]=1/8\\
E\left[e^2\mid X_1=X_2\right]&=\frac{5}{4}\\
E\left[e^2\mid X_1\ne X_2\right]&=\frac{1}{4}.
\end{aligned}$$

Verify the following:

(a) E[X₁] = 0

(b) E[X₁²] = 1

(c) E[X₁X₂] = 1/2

(d) E[e²] = 1

(e) E[X₁²e²] = 1

(f) E[X₁X₂e²] = 7/8

Exercise 7.5 Show (7.13)-(7.16).

Exercise 7.6 The model is

$$Y=X'\beta+e$$
$$E[Xe]=0$$
$$\Omega=E\left[XX'e^2\right].$$

Find the method of moments estimators (β^,Ω^) for (β,Ω).

Exercise 7.7 Of the variables (Y*, Y, X) only the pair (Y, X) is observed. In this case we say that Y* is a latent variable. Suppose

$$Y^*=X'\beta+e$$
$$E[Xe]=0$$
$$Y=Y^*+u$$

where u is a measurement error satisfying

$$E[Xu]=0$$
$$E\left[Y^*u\right]=0.$$

Let β^ denote the OLS coefficient from the regression of Y on X.

  (a) Is β the coefficient from the linear projection of Y on X?

  (b) Is β̂ consistent for β as n → ∞?

  (c) Find the asymptotic distribution of √n(β̂ − β) as n → ∞.

Exercise 7.8 Find the asymptotic distribution of √n(σ̂² − σ²) as n → ∞.

Exercise 7.9 The model is Y = Xβ + e with E[e | X] = 0 and X ∈ ℝ. Consider the two estimators

$$\hat{\beta}=\frac{\sum_{i=1}^{n}X_iY_i}{\sum_{i=1}^{n}X_i^2},\qquad\tilde{\beta}=\frac{1}{n}\sum_{i=1}^{n}\frac{Y_i}{X_i}.$$

  1. Under the stated assumptions are both estimators consistent for β ?

  2. Are there conditions under which either estimator is efficient?

Exercise 7.10 In the homoskedastic regression model Y = X'β + e with E[e | X] = 0 and E[e² | X] = σ², suppose β̂ is the OLS estimator of β with covariance matrix estimator V̂_β̂, based on a sample of size n. Let σ̂² be the estimator of σ². You wish to forecast an out-of-sample value of Y_{n+1} given that X_{n+1} = x. Thus the available information is the sample, the estimates (β̂, V̂_β̂, σ̂²), the residuals ê_i, and the out-of-sample value of the regressors X_{n+1}.

  1. Find a point forecast of Yn+1.

  2. Find an estimator of the variance of this forecast.

Exercise 7.11 Take a regression model with i.i.d. observations (Y_i, X_i), X ∈ ℝ, and

$$Y=X\beta+e$$
$$E[e\mid X]=0$$
$$\Omega=E\left[X^2e^2\right].$$

Let β̂ be the OLS estimator of β with residuals ê_i = Y_i − X_iβ̂. Consider the estimators of Ω

$$\tilde{\Omega}=\frac{1}{n}\sum_{i=1}^{n}X_i^2e_i^2,\qquad\hat{\Omega}=\frac{1}{n}\sum_{i=1}^{n}X_i^2\hat{e}_i^2.$$

  (a) Find the asymptotic distribution of √n(Ω̃ − Ω) as n → ∞.

  (b) Find the asymptotic distribution of √n(Ω̂ − Ω) as n → ∞.

  (c) How do you use the regression assumption E[e_i | X_i] = 0 in your answer to (b)?

Exercise 7.12 Consider the model

$$Y=\alpha+\beta X+e$$
$$E[e]=0$$
$$E[Xe]=0$$

with both Y and X scalar. Assuming α > 0 and β < 0, suppose the parameter of interest is the area under the regression curve (e.g. consumer surplus), which is A = −α²/2β.

Let θ̂ = (α̂, β̂)' be the least squares estimators of θ = (α, β)' so that √n(θ̂ − θ) →_d N(0, V_θ) and let V̂_θ be a standard estimator for V_θ.

  1. Given the above, describe an estimator of A.

  2. Construct an asymptotic 1 − η confidence interval for A.

Exercise 7.13 Consider an i.i.d. sample {Y_i, X_i}, i = 1, …, n, where Y and X are scalar. Consider the reverse projection model X = Yγ + u with E[Yu] = 0 and define the parameter of interest as θ = 1/γ.

  1. Propose an estimator γ^ of γ.

  2. Propose an estimator θ^ of θ.

  3. Find the asymptotic distribution of θ^.

  4. Find an asymptotic standard error for θ^.

Exercise 7.14 Take the model

$$Y=X_1\beta_1+X_2\beta_2+e$$
$$E[Xe]=0$$

with both β₁ ∈ ℝ and β₂ ∈ ℝ, and define the parameter θ = β₁β₂.

  1. What is the appropriate estimator θ^ for θ ?

  2. Find the asymptotic distribution of θ^ under standard regularity conditions.

  3. Show how to calculate an asymptotic 95% confidence interval for θ.

Exercise 7.15 Take the linear model Y = Xβ + e with E[e | X] = 0 and X ∈ ℝ. Consider the estimator

$$\hat{\beta}=\frac{\sum_{i=1}^{n}X_i^3Y_i}{\sum_{i=1}^{n}X_i^4}.$$

Find the asymptotic distribution of √n(β̂ − β) as n → ∞.

Exercise 7.16 From an i.i.d. sample (Yi,Xi) of size n you randomly take half the observations. You estimate a least squares regression of Y on X using only this sub-sample. Is the estimated slope coefficient β^ consistent for the population projection coefficient? Explain your reasoning.

Exercise 7.17 An economist reports a set of parameter estimates, including the coefficient estimates β^1=1.0,β^2=0.8, and standard errors s(β^1)=0.07 and s(β^2)=0.07. The author writes “The estimates show that β1 is larger than β2."

  (a) Write down the formula for an asymptotic 95% confidence interval for θ = β₁ − β₂, expressed as a function of β̂₁, β̂₂, s(β̂₁), s(β̂₂) and ρ̂, where ρ̂ is the estimated correlation between β̂₁ and β̂₂.

  (b) Can ρ̂ be calculated from the reported information?

  (c) Is the author correct? Does the reported information support the author's claim?

Exercise 7.18 Suppose an economic model suggests

$$m(x)=E[Y\mid X=x]=\beta_0+\beta_1x+\beta_2x^2$$

where X ∈ ℝ. You have a random sample (Y_i, X_i), i = 1, …, n.

  1. Describe how to estimate m(x) at a given value x.

  2. Describe (be specific) an appropriate confidence interval for m(x).

Exercise 7.19 Take the model Y = X'β + e with E[Xe] = 0 and suppose you have observations i = 1, …, 2n. (The number of observations is 2n.) You randomly split the sample in half (each has n observations), calculate β̂₁ by least squares on the first sample, and β̂₂ by least squares on the second sample. What is the asymptotic distribution of √n(β̂₁ − β̂₂)?

Exercise 7.20 The variables {Yi,Xi,Wi} are a random sample. The parameter β is estimated by minimizing the criterion function

$$S(\beta)=\sum_{i=1}^{n}W_i\left(Y_i-X_i'\beta\right)^2.$$

That is, β̂ = argmin_β S(β).

  1. Find an explicit expression for β^.

  2. What population parameter β is β^ estimating? Be explicit about any assumptions you need to impose. Do not make more assumptions than necessary.

  3. Find the probability limit of β̂ as n → ∞.

  4. Find the asymptotic distribution of √n(β̂ − β) as n → ∞.

Exercise 7.21 Take the model

$$Y=X'\beta+e$$
$$E[e\mid X]=0$$
$$E\left[e^2\mid X\right]=Z'\gamma$$

where Z is a (vector) function of X. The sample is i = 1, …, n with i.i.d. observations. Assume that Z'γ > 0 for all Z. Suppose you want to forecast Y_{n+1} given X_{n+1} = x and Z_{n+1} = z for an out-of-sample observation n+1. Describe how you would construct a point forecast and a forecast interval for Y_{n+1}.

Exercise 7.22 Take the model

$$Y=X'\beta+e$$
$$E[e\mid X]=0$$
$$Z=\left(X'\beta\right)\gamma+u$$
$$E[u\mid X]=0$$

where X is a k-vector and Z is scalar. Your goal is to estimate the scalar parameter γ. You use a two-step estimator:

  • Estimate β̂ by least squares of Y on X.

  • Estimate γ̂ by least squares of Z on X'β̂.
  1. Show that γ^ is consistent for γ.

  2. Find the asymptotic distribution of γ^ when γ=0

Exercise 7.23 The model is Y = Xβ + e with E[e | X] = 0 and X ∈ ℝ. Consider the estimator

$$\tilde{\beta}=\frac{1}{n}\sum_{i=1}^{n}\frac{Y_i}{X_i}.$$

Find conditions under which β̃ is consistent for β as n → ∞.

Exercise 7.24 The parameter β is defined in the model Y = X*β + e where e is independent of X* > 0, E[e] = 0, E[e²] = σ². The observables are (Y, X) where X = X*v and v > 0 is a random scale measurement error, independent of X* and e. Consider the least squares estimator β̂ for β.

  1. Find the plim of β̂ expressed in terms of β and moments of (X*, v, e).

  2. Can you find a non-trivial condition under which β^ is consistent for β ? (By non-trivial we mean something other than v=1.)

Exercise 7.25 Take the projection model Y = X'β + e with E[Xe] = 0. For a positive function w(x) let W_i = w(X_i). Consider the estimator

$$\tilde{\beta}=\left(\sum_{i=1}^{n}W_iX_iX_i'\right)^{-1}\left(\sum_{i=1}^{n}W_iX_iY_i\right).$$

Find the probability limit (as n → ∞) of β̃. Do you need to add an assumption? Is β̃ consistent for β? If not, under what assumption is β̃ consistent for β?

Exercise 7.26 Take the regression model

$$Y=X'\beta+e$$
$$E[e\mid X]=0$$
$$E\left[e^2\mid X=x\right]=\sigma^2(x)$$

with X ∈ ℝ^k. Assume that P[e = 0] = 0. Consider the infeasible estimator

$$\tilde{\beta}=\left(\sum_{i=1}^{n}e_i^{-2}X_iX_i'\right)^{-1}\left(\sum_{i=1}^{n}e_i^{-2}X_iY_i\right).$$

This is a WLS estimator using the weights e_i^{-2}.

  1. Find the asymptotic distribution of β~.

  2. Contrast your result with the asymptotic distribution of infeasible GLS.

Exercise 7.27 The model is Y = X'β + e with E[e | X] = 0. An econometrician is worried about the impact of some unusually large values of the regressors. The model is thus estimated on the subsample for which ‖X_i‖ ≤ c for some fixed c. Let β̃ denote the OLS estimator on this subsample. It equals

$$\tilde{\beta}=\left(\sum_{i=1}^{n}X_iX_i'\mathbb{1}\left\{\|X_i\|\le c\right\}\right)^{-1}\left(\sum_{i=1}^{n}X_iY_i\mathbb{1}\left\{\|X_i\|\le c\right\}\right).$$

  1. Show that β̃ →_p β.

  2. Find the asymptotic distribution of √n(β̃ − β).

Exercise 7.28 As in Exercise 3.26, use the cps09mar dataset and the subsample of white male Hispanics. Estimate the regression

$$\log(wage)=\beta_1\,education+\beta_2\,experience+\beta_3\,experience^2/100+\beta_4+e.$$

  1. Report the coefficient estimates and robust standard errors.

  2. Let θ be the ratio of the return to one year of education to the return to one year of experience for experience =10. Write θ as a function of the regression coefficients and variables. Compute θ^ from the estimated model.

  3. Write out the formula for the asymptotic standard error for θ^ as a function of the covariance matrix for β^. Compute s(θ^) from the estimated model.

  4. Construct a 90% asymptotic confidence interval for θ from the estimated model.

  5. Compute the regression function at education =12 and experience =20. Compute a 95% confidence interval for the regression function at this point.

  6. Consider an out-of-sample individual with 16 years of education and 5 years of experience. Construct an 80% forecast interval for their log wage and wage. [To obtain the forecast interval for the wage, apply the exponential function to both endpoints.]