20  Series Regression

20.1 Introduction

Chapter 19 studied nonparametric regression by kernel smoothing methods. In this chapter we study an alternative class of nonparametric methods known as series regression.

The basic model is identical to that examined in Chapter 19. We assume that there are random variables $(Y,X)$ such that $\mathbb{E}[Y^2] < \infty$ which satisfy the regression model

$$Y = m(X) + e, \qquad \mathbb{E}[e \mid X] = 0, \qquad \mathbb{E}\left[e^2 \mid X\right] = \sigma^2(X). \tag{20.1}$$

The goal is to estimate the CEF m(x). We start with the simple setting where X is scalar and consider more general cases later.

A series regression model is a sequence $K = 1, 2, \dots$ of approximating models $m_K(x)$ with $K$ parameters. In this chapter we exclusively focus on linear series models, and in particular polynomials and splines. This is because these are simple, convenient, and cover most applications of series methods in applied economics. Other series models include trigonometric polynomials, wavelets, orthogonal wavelets, B-splines, and neural networks. For a detailed review see Chen (2007).

Linear series regression models take the form

$$Y = X_K'\beta_K + e_K \tag{20.2}$$

where $X_K = X_K(X)$ is a vector of regressors obtained by making transformations of $X$ and $\beta_K$ is a coefficient vector. There are multiple possible definitions of the coefficient $\beta_K$. We define it by projection¹

$$\beta_K = \mathbb{E}\left[X_KX_K'\right]^{-1}\mathbb{E}\left[X_KY\right] = \mathbb{E}\left[X_KX_K'\right]^{-1}\mathbb{E}\left[X_Km(X)\right]. \tag{20.3}$$

The series regression error eK is defined by (20.2) and (20.3), is distinct from the regression error e in (20.1), and is indexed by K because it depends on the regressors XK. The series approximation to m(x) is

$$m_K(x) = X_K(x)'\beta_K. \tag{20.4}$$

1 An alternative is to define $\beta_K$ as the best uniform approximation coefficient $\beta_K^*$ as in (20.8). It is not critical so long as we are careful to be consistent with our notation.

The coefficient is typically estimated² by least squares

$$\widehat\beta_K = \left(\sum_{i=1}^n X_{Ki}X_{Ki}'\right)^{-1}\left(\sum_{i=1}^n X_{Ki}Y_i\right) = \left(X_K'X_K\right)^{-1}\left(X_K'Y\right). \tag{20.5}$$

The estimator for m(x) is

$$\widehat m_K(x) = X_K(x)'\widehat\beta_K. \tag{20.6}$$

The difference between specific models arises due to the different choices of transformations XK(x).

The theoretical issues we will explore in this chapter are: (1) Approximation properties of polynomials and splines; (2) Consistent estimation of m(x); (3) Asymptotic normal approximations; (4) Selection of K; (5) Extensions.

For a textbook treatment of series regression see Li and Racine (2007). For an advanced treatment see Chen (2007). Two seminal contributions are Andrews (1991a) and Newey (1997). Two recent important papers are Belloni, Chernozhukov, Chetverikov, and Kato (2015) and Chen and Christensen (2015).

20.2 Polynomial Regression

The prototypical series regression model for m(x) is a pth order polynomial

$$m_K(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p.$$

We can write it in vector notation as (20.4) where

$$X_K(x) = \left(1,\ x,\ \dots,\ x^p\right)'.$$

The number of parameters is K=p+1. Notice that we index XK(x) and βK by K as their dimensions and values vary with K.

The implied polynomial regression model for the random pair (Y,X) is (20.2) with

$$X_K = X_K(X) = \left(1,\ X,\ \dots,\ X^p\right)'.$$

The degree of flexibility of a polynomial regression is controlled by the polynomial order $p$. A larger $p$ yields a more flexible model while a smaller $p$ typically results in an estimator with a smaller variance.

In general, a linear series regression model takes the form

$$m_K(x) = \beta_1\tau_1(x) + \beta_2\tau_2(x) + \cdots + \beta_K\tau_K(x)$$

where the functions $\tau_j(x)$ are called the basis transformations. The polynomial regression model uses the power basis $\tau_j(x) = x^{j-1}$. The model $m_K(x)$ is called a series regression because it is obtained by sequentially adding the series of variables $\tau_j(x)$.
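To make the construction concrete, here is a minimal R sketch (the simulated data and the order $p=4$ are illustrative assumptions, not recommendations) which builds the power basis by hand and computes the least squares estimates (20.5)-(20.6).

```r
# Simulated example: polynomial series regression with the power basis.
set.seed(1)
n <- 500
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)    # hypothetical CEF m(x) = sin(2*pi*x)

p  <- 4                                      # polynomial order, K = p + 1
XK <- outer(x, 0:p, `^`)                     # n x K matrix with columns x^0, ..., x^p
betahat <- solve(crossprod(XK), crossprod(XK, y))   # (X'X)^{-1} X'Y as in (20.5)

# Estimated CEF on a grid, as in (20.6)
xg   <- seq(0, 1, by = 0.01)
mhat <- outer(xg, 0:p, `^`) %*% betahat
```

In practice the raw power basis can be numerically ill-conditioned for large $p$; Section 20.4 discusses rescaling and orthogonal polynomials.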

2 Penalized estimators have also been recommended. We do not review these methods here.

20.3 Illustrating Polynomial Regression

Consider the cps09mar dataset and a regression of log( wage ) on experience for women with a college education (education=16), separately for white women and Black women. The classical Mincer model uses a quadratic in experience. Given the large sample sizes (4682 for white women and 517 for Black women) we can consider higher order polynomials. In Figure 20.1 we plot least squares estimates of the CEFs using polynomials of order 2, 4, 8, and 12.

Examine panel (a) which shows the estimates for the sub-sample of white women. The quadratic specification appears mis-specified with a shape noticeably different from the other estimates. The difference between the polynomials of order 4, 8, and 12 is relatively minor, especially for experience levels below 20.

Now examine panel (b) which shows the estimates for the sub-sample of Black women. This panel is quite different from panel (a). The estimates are erratic and increasingly so as the polynomial order increases. Assuming we are expecting a concave (or nearly concave) experience profile the only estimate which satisfies this is the quadratic.

Why the difference between panels (a) and (b)? The most likely explanation is the different sample sizes. The sub-sample of Black women has far fewer observations so the CEF is much less precisely estimated, giving rise to the erratic plots. This suggests (informally) that it may be preferred to use a smaller polynomial order p in the second sub-sample, or equivalently to use a larger p when the sample size n is larger. The idea that model complexity - the number of coefficients K - should vary with sample size n is an important feature of series regression.

The erratic nature of the estimated polynomial regressions in Figure 20.1(b) is a common feature of higher-order estimated polynomial regressions. Better results can sometimes be obtained by a spline regression which is described in Section 20.5.

Figure 20.1: Polynomial Estimates of Experience Profile. (a) White Women; (b) Black Women.

20.4 Orthogonal Polynomials

Standard implementation of the least squares estimator (20.5) of a polynomial regression may return a computational error message when $p$ is large. (See Section 3.24.) This is because the moments of $X^j$ can be highly heterogeneous across $j$ and because the variables $X^j$ can be highly correlated. These two factors imply in practice that the matrix $X_K'X_K$ can be ill-conditioned (the ratio of the largest to smallest eigenvalue can be quite large) and some packages will return error messages rather than compute $\widehat\beta_K$.

In most cases the condition of $X_K'X_K$ can be dramatically improved by rescaling the observations. As discussed in Section 3.24 a simple method for non-negative regressors is to rescale each by its sample mean, e.g. replace $X_i^j$ with $X_i^j/\left(n^{-1}\sum_{i=1}^n X_i^j\right)$. Even better conditioning can often be obtained by rescaling $X_i$ to lie in $[-1,1]$ before applying powers. In most applications one of these methods will be sufficient for a well-conditioned regression.

A computationally more robust implementation can be obtained by using orthogonal polynomials. These are linear combinations of the polynomial basis functions and produce identical regression estimators (20.6). The goal of orthogonal polynomials is to produce regressors which are either orthogonal or close to orthogonal and have similar variances, so that $X_K'X_K$ is close to diagonal with similar diagonal elements. These orthogonalized regressors $X_K^* = A_KX_K$ can be written as linear combinations of the original variables $X_K$. If the regressors are orthogonalized then the regression estimator (20.6) is modified by replacing $X_K(x)$ with $X_K^*(x) = A_KX_K(x)$.

One approach is to use sample orthogonalization. This is done by a sequence of regressions of Xij on the previously orthogonalized variables and then rescaling. This will result in perfectly orthogonalized variables. This is what is implemented in many statistical packages under the label “orthogonal polynomials”, for example, the function poly in R. If this is done then the least squares coefficients have no meaning outside this specific sample and it is not convenient for calculation of m^K(x) for values of x other than sample values. This is the approach used for the examples presented in the previous section.
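A hedged illustration of this approach in R (with simulated data; the order $p=8$ is arbitrary): the poly() function performs the sample orthogonalization just described, and because the fitted basis object stores its orthogonalization constants, predict() can evaluate the estimate at non-sample values of $x$.

```r
# Orthogonal polynomial regression via sample orthogonalization (poly() in R).
set.seed(2)
n <- 500
x <- runif(n)
y <- exp(-2 * x) + rnorm(n, sd = 0.2)

p <- 8
fit_raw  <- lm(y ~ poly(x, degree = p, raw = TRUE))   # raw power basis (may be ill-conditioned)
fit_orth <- lm(y ~ poly(x, degree = p))               # orthogonalized basis (well-conditioned)

# Both parameterizations span the same column space, so the fitted CEF is identical.
xg   <- data.frame(x = seq(0, 1, length.out = 101))
mhat <- predict(fit_orth, newdata = xg)               # valid at new x values
```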

Another approach is to use an algebraic orthogonal polynomial. This is a polynomial family which is orthogonal with respect to a known weight function $w(x)$. Specifically, it is a sequence $p_j(x)$, $j = 0, 1, 2, \dots$, with the property that $\int p_j(x)\,p_\ell(x)\,w(x)\,dx = 0$ for $j \ne \ell$. This means that if $w(x) = f(x)$, the marginal density of $X$, then the basis transformations $p_j(X)$ will be mutually orthogonal (in expectation). Since we do not know the density of $X$ this is not feasible in practice, but if $w(x)$ is close to the density of $X$ then we can expect that the basis transformations will be close to mutually orthogonal. To implement an algebraic orthogonal polynomial you first should rescale your $X$ variable so that its support matches that of the weight function $w(x)$.

The following three choices are most relevant for economic applications.

Legendre Polynomial. These are orthogonal with respect to the uniform density on $[-1,1]$. (So should be applied to regressors scaled to have support in $[-1,1]$.)

$$p_j(x) = \frac{1}{2^j}\sum_{\ell=0}^{j}\binom{j}{\ell}^2 (x-1)^{j-\ell}(x+1)^{\ell}.$$

For example, the first four are $p_0(x) = 1$, $p_1(x) = x$, $p_2(x) = (3x^2-1)/2$, and $p_3(x) = (5x^3-3x)/2$. The best computational method is the recurrence relationship

$$p_{j+1}(x) = \frac{(2j+1)\,x\,p_j(x) - j\,p_{j-1}(x)}{j+1}.$$

Laguerre Polynomial. These are orthogonal with respect to the exponential density $e^{-x}$ on $[0,\infty)$. (So should be applied to non-negative regressors scaled if possible to have approximately unit mean and/or variance.)

$$p_j(x) = \sum_{\ell=0}^{j}\binom{j}{\ell}\frac{(-x)^{\ell}}{\ell!}.$$

For example, the first four are $p_0(x) = 1$, $p_1(x) = 1-x$, $p_2(x) = (x^2-4x+2)/2$, and $p_3(x) = (-x^3+9x^2-18x+6)/6$. The best computational method is the recurrence relationship

$$p_{j+1}(x) = \frac{(2j+1-x)\,p_j(x) - j\,p_{j-1}(x)}{j+1}.$$

Hermite Polynomial. These are orthogonal with respect to the standard normal density on $(-\infty,\infty)$. (So should be applied to regressors scaled to have mean zero and variance one.)

$$p_j(x) = j!\sum_{\ell=0}^{\lfloor j/2\rfloor}\frac{(-1/2)^{\ell}}{\ell!\,(j-2\ell)!}\,x^{j-2\ell}.$$

For example, the first four are $p_0(x) = 1$, $p_1(x) = x$, $p_2(x) = x^2-1$, and $p_3(x) = x^3-3x$. The best computational method is the recurrence relationship

$$p_{j+1}(x) = x\,p_j(x) - j\,p_{j-1}(x).$$

The R package orthopolynom provides a convenient set of commands to compute many orthogonal polynomials including the above.
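The recurrences above are also easy to implement directly. The following minimal sketch evaluates the Legendre basis by its three-term recurrence; the Laguerre and Hermite recurrences can be coded the same way.

```r
# Evaluate Legendre polynomials p_0(x), ..., p_p(x) by the three-term recurrence.
# A minimal sketch; x should already be rescaled to [-1, 1].
legendre_basis <- function(x, p) {
  P <- matrix(NA_real_, nrow = length(x), ncol = p + 1)
  P[, 1] <- 1                       # p_0(x) = 1
  if (p >= 1) P[, 2] <- x           # p_1(x) = x
  if (p >= 2) {
    for (j in 1:(p - 1)) {          # p_{j+1} = ((2j+1) x p_j - j p_{j-1}) / (j+1)
      P[, j + 2] <- ((2 * j + 1) * x * P[, j + 1] - j * P[, j]) / (j + 1)
    }
  }
  P
}

# Check against the closed forms: p_2(x) = (3x^2 - 1)/2, p_3(x) = (5x^3 - 3x)/2
legendre_basis(seq(-1, 1, by = 0.5), 3)
```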

20.5 Splines

A spline is a piecewise polynomial. Typically the order of the polynomial is pre-selected to be linear, quadratic, or cubic. The flexibility of the model is determined by the number of polynomial segments. The join points between the segments are called knots.

To impose smoothness and parsimony it is common to constrain the spline function to have continuous derivatives up to the order of the spline. Thus a linear spline is constrained to be continuous, a quadratic spline is constrained to have a continuous first derivative, and a cubic spline is constrained to have continuous first and second derivatives.

A simple way to construct a regression spline is as follows. A linear spline with one knot τ is

$$m_K(x) = \beta_0 + \beta_1 x + \beta_2 (x-\tau)\,\mathbb{1}\{x \ge \tau\}.$$

To see that this is a linear spline, observe that for $x \le \tau$ the function $m_K(x) = \beta_0 + \beta_1 x$ is linear with slope $\beta_1$; for $x \ge \tau$ the function $m_K(x)$ is linear with slope $\beta_1 + \beta_2$; and the function is continuous at $x = \tau$. Note that $\beta_2$ is the change in the slope at $\tau$. A linear spline with two knots $\tau_1 < \tau_2$ is

$$m_K(x) = \beta_0 + \beta_1 x + \beta_2 (x-\tau_1)\,\mathbb{1}\{x \ge \tau_1\} + \beta_3 (x-\tau_2)\,\mathbb{1}\{x \ge \tau_2\}.$$

A quadratic spline with one knot is

$$m_K(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 (x-\tau)^2\,\mathbb{1}\{x \ge \tau\}.$$

To see that this is a quadratic spline, observe that for $x \le \tau$ the function is the quadratic $\beta_0 + \beta_1 x + \beta_2 x^2$ with second derivative $m_K''(x) = 2\beta_2$; for $x \ge \tau$ the second derivative is $m_K''(x) = 2(\beta_2 + \beta_3)$; so $2\beta_3$ is the change in the second derivative at $\tau$. The first derivative at $x = \tau$ is the continuous function $m_K'(\tau) = \beta_1 + 2\beta_2\tau$.

In general, a pth-order spline with N knots τ1<τ2<<τN is

$$m_K(x) = \sum_{j=0}^{p}\beta_j x^j + \sum_{k=1}^{N}\beta_{p+k}(x-\tau_k)^p\,\mathbb{1}\{x \ge \tau_k\}$$

which has K=N+p+1 coefficients.

The implied spline regression model for the random pair (Y,X) is (20.2) where

$$X_K = X_K(X) = \left(1,\ X,\ \dots,\ X^p,\ (X-\tau_1)^p\mathbb{1}\{X\ge\tau_1\},\ \dots,\ (X-\tau_N)^p\mathbb{1}\{X\ge\tau_N\}\right)'.$$
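A minimal R sketch of this construction follows. The helper name spline_basis and the commented usage (variables experience and log_wage) are hypothetical.

```r
# Truncated-power spline basis of order p (assumes p >= 1) with knots tau.
# Returns the n x (N + p + 1) regressor matrix X_K described above.
spline_basis <- function(x, p, knots) {
  powers <- outer(x, 0:p, `^`)                                  # 1, x, ..., x^p
  trunc  <- sapply(knots, function(tau) pmax(x - tau, 0)^p)     # (x - tau_k)^p 1{x >= tau_k}
  cbind(powers, trunc)
}

# Example usage (hypothetical variables):
# XK  <- spline_basis(experience, p = 2, knots = c(10, 20, 30, 40))
# fit <- lm(log_wage ~ XK - 1)
```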

In practice a spline will depend critically on the choice of the knots τk. When X is bounded with an approximately uniform distribution it is common to space the knots evenly so all segments have the same length. When the distribution of X is not uniform an alternative is to set the knots at the quantiles j/(N+1) so that the probability mass is equalized across segments. A third alternative is to set the knots at the points where m(x) has the greatest change in curvature (see Schumaker (2007), Chapter 7). In all cases the set of knots τj can change with K. Therefore a spline is a special case of an approximation of the form

$$m_K(x) = \beta_1\tau_{1K}(x) + \beta_2\tau_{2K}(x) + \cdots + \beta_K\tau_{KK}(x)$$

where the basis transformations τjK(x) depend on both j and K. Many authors call such approximations a sieve rather than a series because the basis transformations change with K. This distinction is not critical to our treatment so for simplicity we refer to splines as series regression models.

20.6 Illustrating Spline Regression

In Section 20.3 we illustrated regressions of log(wage) on experience for white and Black women with a college education. Now we consider a similar regression for Black men with a college education, a sub-sample with 394 observations.

We use a quadratic spline with four knots at experience levels of 10,20,30, and 40 . This is a regression model with seven coefficients. The estimated regression function is displayed in Figure 20.2(a). An estimated 6th  order polynomial regression is also displayed for comparison (a 6th  order polynomial is an appropriate comparison because it also has seven coefficients).

While the spline is a quadratic over each segment, what you can see is that the first two segments (experience levels between 0-10 and 10-20 years) are essentially linear. Most of the curvature occurs in the third and fourth segments (20-30 and 30-40 years) where the estimated regression function peaks and twists into a negative slope. The estimated regression function is smooth.

A quadratic or cubic spline is useful when it is desired to impose smoothness as in Figure 20.2(a). In contrast, a linear spline is useful when it is desired to allow for sharp changes in slope.

To illustrate we consider the data set CHJ2004 which is a sample of 8684 urban Filipino households from Cox, B. E. Hansen, and Jimenez (2004). This paper studied the crowding-out impact of a family's income on non-governmental (e.g., extended family) income transfers³. A model of altruistic transfers predicts that extended families will make gifts (transfers) when the recipient family's income is sufficiently low, but will not make transfers if the recipient family's income exceeds a threshold. A pure altruistic model predicts that the regression of transfers received on family income should have a slope of $-1$ up to this threshold and be flat above this threshold. We estimated this regression (including the same controls as the authors⁴) using a linear spline with knots at 10000, 20000, 50000, 100000, and 150000 pesos. These knots were selected to give flexibility for low income levels where there are more observations. This model has a total of 22 coefficients.

Figure 20.2: Spline Regression Estimates. (a) Experience Profile; (b) Effect of Income on Transfers.

The estimated regression function (as a function of household income) is displayed in Figure 20.2(b). For the first two segments (income levels below 20000 pesos) the regression function is negatively sloped as predicted, with a slope of about $-0.7$ from 0 to 10000 pesos and about $-0.3$ from 10000 to 20000 pesos. The estimated regression function is effectively flat for income levels above 20000 pesos. This shape is consistent with the pure altruism model. A linear spline model is particularly well suited for this application as it allows for discontinuous changes in slope.

Linear spline models with a single knot have been recently popularized by Card, Lee, Pei, and Weber (2015) with the label regression kink design.

20.7 The Global/Local Nature of Series Regression

Recall from Section 19.18 that we described kernel regression as inherently local in nature. The Nadaraya-Watson, Local Linear, and Local Polynomial estimators of the CEF m(x) are weighted averages of Yi for observations for which Xi is close to x.

3 Defined as the sum of transfers received domestically, from abroad, and in-kind, less gifts.

4 The controls are: age of household head, education (5 dummy categories), married, female, married female, number of children (3 dummies), size of household, employment status (2 dummies).

In contrast, series regression is typically described as global in nature. The estimator $\widehat m_K(x) = X_K(x)'\widehat\beta_K$ is a function of the entire sample. The coefficients of a fitted polynomial (or spline) are affected by the global shape of the function $m(x)$ and thus affect the estimator $\widehat m_K(x)$ at any point $x$.

While this description has some merit it is not a complete description. As we now show, series regression estimators share the local smoothing property of kernel regression. As the number of series terms $K$ increases, a series estimator $\widehat m_K(x) = X_K(x)'\widehat\beta_K$ also becomes a local weighted average estimator.

To see this, observe that we can write the estimator as

$$\widehat m_K(x) = X_K(x)'\left(X_K'X_K\right)^{-1}\left(X_K'Y\right) = \frac{1}{n}\sum_{i=1}^n X_K(x)'\widehat Q_K^{-1}X_K(X_i)\,Y_i = \frac{1}{n}\sum_{i=1}^n \widehat w_K(x,X_i)\,Y_i$$

where $\widehat Q_K = n^{-1}X_K'X_K$ and $\widehat w_K(x,u) = X_K(x)'\widehat Q_K^{-1}X_K(u)$. Thus $\widehat m_K(x)$ is a weighted average of $Y_i$ using the weights $\widehat w_K(x,X_i)$. The weight function $\widehat w_K(x,X_i)$ appears to be maximized at $X_i = x$, so $\widehat m_K(x)$ puts more weight on observations for which $X_i$ is close to $x$, similarly to kernel regression.
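The following sketch computes the empirical weight function $\widehat w_K(x,u)$ for a polynomial basis with a simulated uniform design (the sample size, order, and evaluation point are illustrative assumptions):

```r
# Empirical series weight function w_hat_K(x, u) = X_K(x)' Qhat^{-1} X_K(u)
# for a polynomial basis with a simulated uniform design.
set.seed(3)
n <- 1000
X <- runif(n)
p <- 8
XK   <- outer(X, 0:p, `^`)
Qhat <- crossprod(XK) / n

w_hat <- function(x, u) {
  xk <- outer(x, 0:p, `^`)
  uk <- outer(u, 0:p, `^`)
  drop(xk %*% solve(Qhat, t(uk)))    # X_K(x)' Qhat^{-1} X_K(u)
}

# Weights placed on u in [0,1] by the estimator of m(0.5):
u <- seq(0, 1, by = 0.01)
w <- w_hat(0.5, u)   # peaks near u = 0.5; as noted below, it is not non-negative
```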

Figure 20.3: Kernel Representation of Polynomial Weight Function. (a) x = 0.5; (b) x = 0.25.

To see this more precisely, observe that because $\widehat Q_K$ will be close in large samples to $Q_K = \mathbb{E}\left[X_KX_K'\right]$, $\widehat w_K(x,u)$ will be close to the deterministic weight function

$$w_K(x,u) = X_K(x)'Q_K^{-1}X_K(u).$$

Take the case $X \sim U[0,1]$. In Figure 20.3 we plot the weight function $w_K(x,u)$ as a function of $u$ for $x = 0.5$ (panel (a)) and $x = 0.25$ (panel (b)), for $p = 4, 8, 12$ in panel (a) and $p = 4, 12$ in panel (b). First, examine panel (a). Here you can see that the weight function $w_K(x,u)$ is symmetric in $u$ about $x$. For $p = 4$ the weight function appears similar to a quadratic in $u$, and as $p$ increases the weight function concentrates its main weight around $x$. However, the weight function is not non-negative. It is quite similar in shape to what are known as higher-order (or bias-reducing) kernels, which were not reviewed in the previous chapter but are part of the kernel estimation toolkit. Second, examine panel (b). Again the weight function is maximized at $x$, but now it is asymmetric in $u$ about the point $x$. Still, the general features from panel (a) carry over to panel (b). Namely, as $p$ increases the polynomial estimator puts most weight on observations for which $X$ is close to $x$ (just as for kernel regression), but is different from conventional kernel regression in that the weight function is not non-negative. Qualitatively similar plots are obtained for spline regression.

There is little formal theory (of which I am aware) which makes a formal link between series regression and kernel regression so the comments presented here are illustrative⁵. However, the point is that statements of the form "Series regression is a global method; kernel regression is a local method" may not be complete. Both are global in nature when $h$ is large (kernels) or $K$ is small (series), and are local in nature when $h$ is small (kernels) or $K$ is large (series).

20.8 Stone-Weierstrass and Jackson Approximation Theory

A good series approximation mK(x) has the property that it gets close to the true CEF m(x) as the complexity K increases. Formal statements can be derived from the mathematical theory of the approximation of functions.

An elegant and famous theorem is the Stone-Weierstrass Theorem (Weierstrass, 1885; Stone, 1948) which states that any continuous function can be uniformly well approximated by a polynomial of sufficiently high order. Specifically, the theorem states that if $m(x)$ is continuous on a compact set $S$ then for any $\epsilon > 0$ there is some $K$ sufficiently large such that

$$\inf_\beta\ \sup_{x\in S}\left|m(x) - X_K(x)'\beta\right| \le \epsilon. \tag{20.7}$$

Thus the true unknown m(x) can be arbitrarily well approximated by selecting a suitable polynomial.

Jackson (1912) strengthened this result to give convergence rates which depend on the smoothness of m(x). The basic result has been extended to spline functions. The following notation will be useful. Define the β which minimizes the left-side of (20.7) as

$$\beta_K^* = \underset{\beta}{\mathrm{argmin}}\ \sup_{x\in S}\left|m(x) - X_K(x)'\beta\right|, \tag{20.8}$$

define the approximation error

$$r_K^*(x) = m(x) - X_K(x)'\beta_K^*, \tag{20.9}$$

and define the minimized value of (20.7)

$$\delta_K^* \overset{\mathrm{def}}{=} \inf_\beta\ \sup_{x\in S}\left|m(x) - X_K(x)'\beta\right| = \sup_{x\in S}\left|m(x) - X_K(x)'\beta_K^*\right| = \sup_{x\in S}\left|r_K^*(x)\right|. \tag{20.10}$$

5 Similar connections are made in the appendix of Chen, Liao, and Sun (2012).

Theorem 20.1 If for some $\alpha \ge 0$, $m^{(\alpha)}(x)$ is uniformly continuous on a compact set $S$ and $X_K(x)$ is either a polynomial basis or a spline basis (with uniform knot spacing) of order $s \ge \alpha$, then as $K \to \infty$

$$\delta_K^* \le o\left(K^{-\alpha}\right). \tag{20.11}$$

Furthermore, if $m^{(2)}(x)$ is uniformly continuous on $S$ and $X_K(x)$ is a linear spline basis, then $\delta_K^* \le O\left(K^{-2}\right)$.

For a proof for the polynomial case see Theorem 4.3 of Lorentz (1986) or Theorem 3.12 of Schumaker (2007) plus his equations (2.119) and (2.121). For the spline case see Theorem 6.27 of Schumaker (2007) plus his equations (2.119) and (2.121). For the linear spline case see Theorem 6.15 of Schumaker, equation (6.28).

Theorem 20.1 is more useful than the classic Stone-Weierstrass Theorem as it gives an approximation rate which depends on the smoothness order $\alpha$. The rate $o(K^{-\alpha})$ in (20.11) means that the approximation error (20.10) decreases as $K$ increases, and decreases at a faster rate when $\alpha$ is large. The standard interpretation is that when $m(x)$ is smoother it is possible to approximate it with fewer terms.

It will turn out that for our distribution theory it is sufficient to consider the case that $m^{(2)}(x)$ is uniformly continuous. For this case Theorem 20.1 shows that polynomials and quadratic/cubic splines achieve the rate $o(K^{-2})$ and linear splines achieve the rate $O(K^{-2})$. For most of our results the latter bound will be sufficient.

More generally, Theorem 20.1 makes a distinction between polynomials and splines as polynomials achieve the rate $o(K^{-\alpha})$ adaptively (without input from the user) while splines achieve the rate $o(K^{-\alpha})$ only if the spline order $s$ is appropriately chosen. This is an advantage for polynomials. However, as emphasized by Schumaker (2007), splines simultaneously approximate the derivatives $m^{(q)}(x)$ for $q < \alpha$. Thus, for example, a quadratic spline simultaneously approximates the function $m(x)$ and its first derivative $m'(x)$. There is no comparable result for polynomials. This is an advantage for quadratic and cubic splines. Since economists are often more interested in marginal effects (derivatives) than in levels this may be a good reason to prefer splines over polynomials.

Theorem 20.1 is a bound on the best uniform approximation error. The coefficient $\beta_K^*$ which minimizes (20.7) is not, however, the projection coefficient $\beta_K$ as defined in (20.3). Thus Theorem 20.1 does not directly inform us concerning the approximation error obtained by series regression. It turns out, however, that the projection error can be easily deduced from (20.11).

Definition 20.1 The projection approximation error is

$$r_K(x) = m(x) - X_K(x)'\beta_K$$

where the coefficient $\beta_K$ is the projection coefficient (20.3). The realized projection approximation error is $r_K = r_K(X)$. The expected squared projection error is

$$\delta_K^2 = \mathbb{E}\left[r_K^2\right]. \tag{20.13}$$

The projection approximation error is similar to (20.9) but is evaluated using the projection coefficient $\beta_K$ rather than the minimizing coefficient $\beta_K^*$ of (20.8). Assuming that $X$ has compact support $S$, the expected squared projection error satisfies

$$\delta_K = \left(\int_S\left(m(x)-X_K(x)'\beta_K\right)^2 dF(x)\right)^{1/2} \le \left(\int_S\left(m(x)-X_K(x)'\beta_K^*\right)^2 dF(x)\right)^{1/2} \le \left(\int_S \delta_K^{*2}\, dF(x)\right)^{1/2} = \delta_K^*.$$

The first inequality holds because the projection coefficient $\beta_K$ minimizes the expected squared projection error (see Section 2.25). The second inequality follows from the definition of $\delta_K^*$. Combined with Theorem 20.1 we have established the following result.

Theorem 20.2 If $X$ has compact support $S$, for some $\alpha \ge 0$ $m^{(\alpha)}(x)$ is uniformly continuous on $S$, and $X_K(x)$ is either a polynomial basis or a spline basis of order $s \ge \alpha$, then as $K \to \infty$

$$\delta_K \le \delta_K^* \le o\left(K^{-\alpha}\right).$$

Furthermore, if $m^{(2)}(x)$ is uniformly continuous on $S$ and $X_K(x)$ is a linear spline basis, then $\delta_K \le O\left(K^{-2}\right)$.

The available theory of the approximation of functions goes beyond the results described here. For example, there is a theory of weighted polynomial approximation (Mhaskar, 1996) which provides an analog of Theorem 20.2 for the unbounded real line when X has a density with exponential tails.

20.9 Regressor Bounds

The approximation result in Theorem 20.2 assumes that the regressors X have bounded support S. This is conventional in series regression theory as it greatly simplifies the analysis. Bounded support implies that the regressor function XK(x) is bounded. Define

$$\zeta_K(x) = \left(X_K(x)'Q_K^{-1}X_K(x)\right)^{1/2}, \qquad \zeta_K = \sup_x \zeta_K(x), \tag{20.15}$$

where QK=E[XKXK] is the population design matrix given the regressors XK. This implies that for all realizations of XK

$$\left(X_K'Q_K^{-1}X_K\right)^{1/2} \le \zeta_K.$$

The constant ζK(x) is the normalized length of the regressor vector XK(x). The constant ζK is the maximum normalized length. Their values are determined by the basis function transformations and the distribution of X. They are invariant to rescaling XK or linear rotations.

For polynomials and splines we have explicit expressions for the rate at which $\zeta_K$ grows with $K$.

Theorem 20.3 If $X$ has compact support $S$ with a strictly positive density $f(x)$ on $S$ then

  1. $\zeta_K \le O(K)$ for polynomials

  2. $\zeta_K \le O\left(K^{1/2}\right)$ for splines.

For a proof of Theorem 20.3 see Newey (1997, Theorem 4).

Furthermore, when X is uniformly distributed then we can explicitly calculate for polynomials that ζK=K, so the polynomial bound ζKO(K) cannot be improved.

To illustrate, we plot in Figure 20.4 the values $\zeta_K(x)$ for the case $X \sim U[0,1]$. We plot $\zeta_K(x)$ for a polynomial of degree $p = 9$ and a quadratic spline with $N = 7$ knots (both satisfy $K = 10$). You can see that the values of $\zeta_K(x)$ are close to 3 for both basis transformations and most values of $x$, but $\zeta_K(x)$ increases sharply for $x$ near the boundary. The maximum values are $\zeta_K = 10$ for the polynomial and $\zeta_K = 7.4$ for the quadratic spline. While Theorem 20.3 shows the two have different rates for large $K$, we see for moderate $K$ that the differences are relatively minor.
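These magnitudes can be checked numerically. Since $\zeta_K(x)$ is invariant to linear rotations of the basis, under the uniform design $X \sim U[0,1]$ it can be computed from the orthonormalized Legendre basis, for which $\zeta_K(x)^2 = \sum_{j=0}^{p}(2j+1)P_j(2x-1)^2$. The sketch below (a check under this uniform-design assumption) reproduces the boundary value $\zeta_K = K = 10$ for the degree-9 polynomial.

```r
# zeta_K(x) for a degree-9 polynomial basis (K = 10) under X ~ U[0,1],
# via the orthonormal Legendre representation (assumes p >= 2).
legendre <- function(u, p) {
  P <- matrix(1, length(u), p + 1)        # P_0(u) = 1
  P[, 2] <- u                             # P_1(u) = u
  for (j in 1:(p - 1))                    # three-term recurrence
    P[, j + 2] <- ((2 * j + 1) * u * P[, j + 1] - j * P[, j]) / (j + 1)
  P
}
p <- 9
zeta_K <- function(x) sqrt(drop(legendre(2 * x - 1, p)^2 %*% (2 * (0:p) + 1)))
zeta_K(c(0.5, 1))   # roughly 2.5 in the interior; equal to K = 10 at the boundary
```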

Figure 20.4: Normalized Regressor Length

20.10 Matrix Convergence

One of the challenges which arise when developing a theory for the least squares estimator is how to describe the large-sample behavior of the sample design matrix

$$\widehat Q_K = \frac{1}{n}\sum_{i=1}^n X_{Ki}X_{Ki}'$$

as $K \to \infty$. The difficulty is that its dimension changes with $K$ so we cannot apply a standard WLLN.

It turns out to be convenient if we first rotate the regressor vector so that the elements are orthogonal in expectation. Thus we define the standardized regressors and design matrix as

$$\widetilde X_{Ki} = Q_K^{-1/2}X_{Ki}, \qquad \widetilde Q_K = \frac{1}{n}\sum_{i=1}^n \widetilde X_{Ki}\widetilde X_{Ki}'. \tag{20.18}$$

Note that E[X~KX~K]=IK. The standardized regressors are not used in practice; they are introduced only to simplify the theoretical derivations.

Our convergence theory will require the following fundamental rate bound on the number of coefficients K.

Assumption 20.1

  1. $\lambda_{\min}(Q_K) \ge \lambda > 0$

  2. $\zeta_K^2\log(K)/n \to 0$ as $n, K \to \infty$

Assumption 20.1.1 ensures that the transformation (20.18) is well defined⁶. Assumption 20.1.2 states that the squared maximum regressor length $\zeta_K^2$ grows more slowly than $n$. Since $\zeta_K$ increases with $K$ this is a bound on the rate at which $K$ can increase with $n$. By Theorem 20.3 the rate in Assumption 20.1.2 holds for polynomials if $K^2\log(K)/n \to 0$ and for splines if $K\log(K)/n \to 0$. In either case, this means that the number of coefficients $K$ grows at a rate slower than $n$.

We are now in a position to describe a convergence result for the standardized design matrix. The following is Lemma 6.2 of Belloni, Chernozhukov, Chetverikov, and Kato (2015).

Theorem 20.4 If Assumption 20.1 holds then

$$\left\|\widetilde Q_K - I_K\right\| \overset{p}{\longrightarrow} 0. \tag{20.19}$$

A proof of Theorem 20.4 using a stronger condition than Assumption 20.1 can be found in Section 20.31. The norm in (20.19) is the spectral norm

$$\left\|A\right\| = \left(\lambda_{\max}\left(A'A\right)\right)^{1/2}$$

where $\lambda_{\max}(B)$ denotes the largest eigenvalue of the matrix $B$. For a full description see Section A.23.

6 Technically, what is required is that $\lambda_{\min}\left(B_K'Q_KB_K\right) \ge \lambda > 0$ for some $K\times K$ sequence of matrices $B_K$, or equivalently that Assumption 20.1.1 holds after replacing $X_K$ with $B_K'X_K$.

For the least squares estimator what is particularly important is the inverse of the sample design matrix. Fortunately we can easily deduce consistency of its inverse from (20.19) when the regressors have been orthogonalized as described.

Theorem 20.5 If Assumption 20.1 holds then

$$\left\|\widetilde Q_K^{-1} - I_K\right\| \overset{p}{\longrightarrow} 0$$

and

$$\lambda_{\max}\left(\widetilde Q_K^{-1}\right) = 1/\lambda_{\min}\left(\widetilde Q_K\right) \overset{p}{\longrightarrow} 1.$$

The proof of Theorem 20.5 can be found in Section 20.31.

20.11 Consistent Estimation

In this section we give conditions for consistent estimation of m(x) by the series estimator m^K(x)= XK(x)β^K.

We know from standard regression theory that for any fixed $K$, $\widehat\beta_K \overset{p}{\longrightarrow} \beta_K$ and thus $\widehat m_K(x) = X_K(x)'\widehat\beta_K \overset{p}{\longrightarrow} X_K(x)'\beta_K$ as $n \to \infty$. Furthermore, from the Stone-Weierstrass Theorem we know that $X_K(x)'\beta_K \to m(x)$ as $K \to \infty$. It therefore seems reasonable to expect that $\widehat m_K(x) \overset{p}{\longrightarrow} m(x)$ as both $n \to \infty$ and $K \to \infty$ together. Making this argument rigorous, however, is technically challenging, in part because the dimensions of $\widehat\beta_K$ and its components are changing with $K$.

Since m^K(x) and m(x) are functions, convergence should be defined with respect to an appropriate metric. For kernel regression we focused on pointwise convergence (for each value of x separately) as that is the simplest to analyze. For series regression it turns out to be simplest to describe convergence with respect to integrated squared error (ISE). We define the latter as

$$ISE(K) = \int\left(\widehat m_K(x) - m(x)\right)^2 dF(x)$$

where F is the marginal distribution of X. ISE (K) is the average squared distance between m^K(x) and m(x), weighted by the marginal distribution of X. The ISE is random, depends on both sample size n and model complexity K, and its distribution is determined by the joint distribution of (Y,X). We can establish the following.

Theorem 20.6 Under Assumption 20.1 and $\delta_K = o(1)$, as $n, K \to \infty$,

$$ISE(K) = o_p(1).$$

The proof of Theorem 20.6 can be found in Section 20.31.

Theorem 20.6 shows that the series estimator $\widehat m_K(x)$ is consistent in the ISE norm under mild conditions. The assumption $\delta_K = o(1)$ holds for polynomials and splines if $K \to \infty$ and $m(x)$ is uniformly continuous. This result is analogous to Theorem 19.8, which showed that the kernel regression estimator is consistent if $m(x)$ is continuous.

20.12 Convergence Rate

We now give a rate of convergence.

Theorem 20.7 Under Assumption 20.1 and $\sigma^2(x) \le \bar\sigma^2 < \infty$, as $n, K \to \infty$,

$$ISE(K) \le O_p\left(\delta_K^2 + \frac{K}{n}\right)$$

where $\delta_K^2$ is the expected squared projection error (20.13). Furthermore, if $m^{(2)}(x)$ is uniformly continuous then for polynomial or spline basis functions

$$ISE(K) \le O_p\left(K^{-4} + \frac{K}{n}\right). \tag{20.25}$$

The proof of Theorem 20.7 can be found in Section 20.31. It is based on Newey (1997).

The bound (20.25) is particularly useful as it gives an explicit rate in terms of $K$ and $n$. The result shows that the integrated squared error is bounded in probability by two terms. The first, $K^{-4}$, is the squared bias. The second, $K/n$, is the estimation variance. This is analogous to the AIMSE for kernel regression (19.5). We can see that increasing the number of series terms $K$ affects the integrated squared error by decreasing the bias but increasing the variance. The fact that the estimation variance is of order $K/n$ can be intuitively explained by the fact that the regression model is estimating $K$ coefficients.

For polynomials and quadratic splines the bound (20.25) can be written as $o_p\left(K^{-4}\right) + O_p\left(K/n\right)$.

We are interested in the sequence $K$ which minimizes the trade-off in (20.25). By examining the first-order condition we find that the sequence which minimizes this bound is $K \sim n^{1/5}$. With this choice we obtain the optimal integrated squared error $ISE(K) \le O_p\left(n^{-4/5}\right)$. This is the same convergence rate as obtained by kernel regression under similar assumptions.

It is interesting to contrast the optimal rate $K \sim n^{1/5}$ for series regression with $h \sim n^{-1/5}$ for kernel regression. Essentially, one can view $K^{-1}$ in series regression as a "bandwidth" similar to kernel regression, or one can view $1/h$ in kernel regression as the effective number of coefficients.

The rate $K \sim n^{1/5}$ means that the optimal $K$ increases very slowly with the sample size. For example, doubling your sample size implies roughly a 15% increase in the optimal number of coefficients $K$. To obtain a doubling in the optimal number of coefficients you need to multiply the sample size by 32.

To illustrate, Figure 20.5 displays the ISE rate bounds $K^{-4} + K/n$ as a function of $K$ for $n = 10, 30, 150$. The filled circles mark the ISE-minimizing $K$, which are $K = 2$, $3$, and $4$ for the three functions. Notice that the ISE functions are steeply downward sloping for small $K$ and nearly flat for large $K$ (when $n$ is large). This is because the bias term $K^{-4}$ dominates for small values of $K$ while the variance term $K/n$ dominates for large values of $K$, and the latter flattens as $n$ increases.
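The minimizing values marked in the figure are easy to check (a quick numeric verification of the bound itself, not of any estimator):

```r
# The trade-off K^{-4} + K/n and its minimizing integer K for n = 10, 30, 150.
ise_bound <- function(K, n) K^(-4) + K / n
K <- 1:20
sapply(c(10, 30, 150), function(n) K[which.min(ise_bound(K, n))])
# returns 2, 3, 4 -- the minimizers marked in Figure 20.5
```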

20.13 Asymptotic Normality

Take a parameter $\theta = a(m)$ which is a real-valued linear function of the regression function. This includes the regression function $m(x)$ at a given point $x$, derivatives of $m(x)$, and integrals over $m(x)$. Given $\widehat m_K(x) = X_K(x)'\widehat\beta_K$ as an estimator for $m(x)$, the estimator for $\theta$ is $\widehat\theta_K = a\left(\widehat m_K\right) = a_K'\widehat\beta_K$ for some $K\times 1$ vector of constants $a_K \ne 0$. (The relationship $a\left(\widehat m_K\right) = a_K'\widehat\beta_K$ follows because $a$ is linear in $m$ and $\widehat m_K$ is linear in $\widehat\beta_K$.)

Figure 20.5: Integrated Squared Error

If $K$ were fixed as $n \to \infty$ then by standard asymptotic theory we would expect $\widehat\theta_K$ to be asymptotically normal with variance $V_K = a_K'Q_K^{-1}\Omega_KQ_K^{-1}a_K$ where $\Omega_K = \mathbb{E}\left[X_KX_K'e^2\right]$. The standard justification, however, is not valid in the nonparametric case. This is in part because $V_K$ may diverge as $K \to \infty$, and in part due to the finite sample bias caused by the approximation error. Therefore a new theory is required. Interestingly, it turns out that in the nonparametric case $\widehat\theta_K$ is still asymptotically normal and $V_K$ is still the appropriate variance for $\widehat\theta_K$. The proof is different from the parametric case as the dimensions of the matrices are increasing with $K$ and we need to be attentive to the estimator's bias due to the series approximation.

Assumption 20.2.1 is conditional square integrability. It implies that the conditional variance $\mathbb{E}\left[e^2\mid X\right]$ is bounded. It is used to verify the Lindeberg condition for the CLT. Assumption 20.2.2 states that the conditional variance is nowhere degenerate. Thus there is no $X$ for which $Y$ is perfectly predictable. This is a technical condition used to bound $V_K$ from below.

Assumption 20.2.3 states that the approximation error $\delta_K$ declines faster than the maximal regressor length $\zeta_K$ grows. For polynomials a sufficient condition for this assumption is that $m'(x)$ is uniformly continuous. For splines a sufficient condition is that $m(x)$ is uniformly continuous.

Theorem 20.8 Under Assumption 20.2, as $n \to \infty$,

$$\frac{\sqrt{n}\left(\widehat\theta_K - \theta + a\left(r_K\right)\right)}{V_K^{1/2}} \overset{d}{\longrightarrow} N(0,1). \tag{20.26}$$

The proof of Theorem 20.8 can be found in Section 20.31.

Theorem 20.8 shows that the estimator θ^K is approximately normal with bias a(rK) and variance VK/n. The variance is the same as in the parametric case. The asymptotic bias is similar to that found in kernel regression.

One useful message from Theorem 20.8 is that the classical variance formula VK for θ^K applies to series regression. This justifies conventional estimators for VK as will be discussed in Section 20.18.

Theorem 20.8 shows that the estimator $\widehat\theta_K$ has a bias $a(r_K)$. What is this? It is the same transformation of the function $r_K(x)$ as $\theta = a(m)$ is of the regression function $m(x)$. For example, if $\theta = m(x)$ is the regression at a fixed point $x$ then $a(r_K) = r_K(x)$, the approximation error at the same point. If $\theta = m'(x)$ is the regression derivative then $a(r_K) = r_K'(x)$ is the derivative of the approximation error.

This means that the bias in the estimator θ^K for θ shown in Theorem 20.8 is simply the approximation error transformed by the functional of interest. If we are estimating the regression function then the bias is the error in approximating the regression function; if we are estimating the regression derivative then the bias is the error in the derivative in the approximation error for the regression function.

20.14 Regression Estimation

A special yet important example of a linear estimator is the regression function at a fixed point x. In the notation of the previous section, a(m)=m(x) and aK=XK(x). The series estimator of m(x) is θ^K=m^K(x)=XK(x)β^K. As this is a key problem of interest we restate the asymptotic result of Theorem 20.8 for this estimator.

Theorem 20.9 Under Assumption 20.2, as $n \to \infty$,

$$\frac{\sqrt{n}\left(\widehat m_K(x) - m(x) + r_K(x)\right)}{V_K^{1/2}(x)} \overset{d}{\longrightarrow} N(0,1) \tag{20.27}$$

where $V_K(x) = X_K(x)'Q_K^{-1}\Omega_KQ_K^{-1}X_K(x)$.

There are several important features about the asymptotic distribution (20.27).

First, as mentioned in the previous section it shows that the classical variance formula VK(x) applies for the series estimator m^K(x). Second, (20.27) shows that the estimator has the asymptotic bias rK(x). This is due to the fact that the finite order series is an approximation to the unknown regression function m(x) and this results in finite sample bias.

There is another fascinating connection between the asymptotic variance of Theorem 20.9 and the regressor lengths $\zeta_K(x)$ of (20.15). Under conditional homoskedasticity we have the simplification $V_K(x) = \sigma^2\zeta_K(x)^2$. Thus the asymptotic variance of the regression estimator is proportional to the squared regressor length. From Figure 20.4 we learned that the regressor length $\zeta_K(x)$ is much higher at the edge of the support of the regressors, especially for polynomials. This means that the precision of the series regression estimator is considerably degraded at the edge of the support.

20.15 Undersmoothing

An unpleasant aspect about Theorem 20.9 is the bias term. An interesting trick is that this bias term can be made asymptotically negligible if we assume that K increases with n at a sufficiently fast rate.

Theorem 20.10 Under Assumption 20.2, if in addition $n\delta_K^2 \to 0$ then

$$\frac{\sqrt{n}\left(\widehat m_K(x) - m(x)\right)}{V_K^{1/2}(x)} \overset{d}{\longrightarrow} N(0,1). \tag{20.28}$$

The condition $n\delta_K^2 \to 0$ implies that the squared bias converges to zero faster than the estimation variance so the former is asymptotically negligible. If $m^{(2)}(x)$ is uniformly continuous then a sufficient condition for polynomials and quadratic splines is $K \sim n^{1/4}$. For linear splines a sufficient condition is for $K$ to diverge faster than $n^{1/4}$. The rate $K \sim n^{1/4}$ is somewhat faster than the ISE-optimal rate $K \sim n^{1/5}$.

The assumption $n\delta_K^2 \to 0$ is often stated by authors as an innocuous technical condition. This is misleading as it is a technical trick and should be discussed explicitly. The reason why the assumption eliminates the bias from (20.28) is that the assumption forces the estimation variance to dominate the squared bias so that the latter can be ignored. This means that the estimator itself is inefficient.

Because $n\delta_K^2 \to 0$ means that $K$ is larger than optimal we say that $\widehat m_K(x)$ is undersmoothed relative to the optimal series estimator.

Many authors like to focus their asymptotic theory on the assumptions in Theorem 20.10 as the distribution (20.28) appears cleaner. However, it is a poor use of asymptotic theory. There are three problems with the assumption $n\delta_K^2 \to 0$ and the approximation (20.28). First, the estimator $\widehat m_K(x)$ is inefficient. Second, while the assumption $n\delta_K^2 \to 0$ makes the bias of lower order than the variance it only makes the bias of slightly lower order, meaning that the accuracy of the asymptotic approximation is poor. Effectively, the estimator is still biased in finite samples. Third, $n\delta_K^2 \to 0$ is an assumption, not a rule for empirical practice. It is unclear what the statement "Assume $n\delta_K^2 \to 0$" means in a practical application. From this viewpoint the difference between (20.26) and (20.28) is in the assumptions, not in the actual reality nor in the actual empirical practice. Eliminating a nuisance (the asymptotic bias) through an assumption is a trick, not a substantive use of theory. My strong view is that the result (20.26) is more informative than (20.28). It shows that the asymptotic distribution is normal but has a non-trivial finite sample bias.

20.16 Residuals and Regression Fit

The fitted regression at $x = X_i$ is $\widehat m_K(X_i) = X_{Ki}'\widehat\beta_K$ and the fitted residual is $\widehat e_{Ki} = Y_i - \widehat m_K(X_i)$. The leave-one-out prediction errors are

$$\widetilde e_{Ki} = Y_i - \widehat m_{K,-i}(X_i) = Y_i - X_{Ki}'\widehat\beta_{K,-i}$$

where $\widehat\beta_{K,-i}$ is the least squares coefficient computed with the $i$th observation omitted. Using (3.44) we have the simple computational formula

$$\widetilde e_{Ki} = \widehat e_{Ki}\left(1 - X_{Ki}'\left(X_K'X_K\right)^{-1}X_{Ki}\right)^{-1}.$$

As for kernel regression the prediction errors e~Ki are better estimators of the errors than the fitted residuals e^Ki as the former do not have the tendency to over-fit when the number of series terms is large.

20.17 Cross-Validation Model Selection

A common method for selection of the number of series terms $K$ is cross-validation. The cross-validation criterion is the sum⁷ of squared prediction errors

$$CV(K) = \sum_{i=1}^n \widetilde e_{Ki}^2 = \sum_{i=1}^n \widehat e_{Ki}^2\left(1 - X_{Ki}'\left(X_K'X_K\right)^{-1}X_{Ki}\right)^{-2}.$$

The CV-selected value of K is the integer which minimizes CV(K).

As shown in Theorem 19.7, $CV(K)$ is an approximately unbiased estimator of the integrated mean-squared error (IMSE), which is the expected integrated squared error (ISE). The proof of the result is the same for all nonparametric estimators (series as well as kernels) so does not need to be repeated here. Therefore, finding the $K$ which produces the smallest value of $CV(K)$ is a good indicator that the estimator $\widehat m_K(x)$ has small IMSE.

For practical implementation we first designate a set of models (sets of basis transformations and numbers of variables $K$) over which to search. (For example, polynomials of order 1 through $K_{\max}$ for some pre-selected $K_{\max}$.) For each, there is a set of regressors $X_K$ which are obtained by transformations of the original variables $X$. For each set we estimate the regression by least squares, calculate the leave-one-out prediction errors, and compute the CV criterion. Since the errors are a linear operation this is a simple calculation. The CV-selected $K$ is the integer which produces the smallest value of $CV(K)$. Plots of $CV(K)$ against $K$ can aid assessment and interpretation. Since the model order $K$ is an integer the CV criterion for series regression is a discrete function, unlike the case of kernel regression.
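A minimal R sketch of this procedure for polynomial regression, using the leave-one-out shortcut above (the data are simulated and Kmax = 10 is an arbitrary illustration):

```r
# CV(K) for polynomial orders 1, ..., Kmax via the leave-one-out shortcut.
set.seed(4)
n <- 500
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

Kmax <- 10
cv <- sapply(1:Kmax, function(p) {
  fit  <- lm(y ~ poly(x, p))
  etil <- residuals(fit) / (1 - hatvalues(fit))   # leave-one-out prediction errors
  sum(etil^2)
})
p_cv <- which.min(cv)                             # CV-selected polynomial order
plot(1:Kmax, cv, type = "b", xlab = "polynomial order p", ylab = "CV(K)")
```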

If it is desired to produce an estimator m^K(x) with reduced bias it may be preferred to select a value of K slightly higher than that selected by CV alone.

To illustrate, in Figure 20.6 we plot the cross-validation functions for the polynomial regression estimates from Figure 20.1. The lowest point marks the polynomial order which minimizes the cross-validation function. In panel (a) we plot the CV function for the sub-sample of white women. Here we see that the CV-selected order is $p = 3$, a cubic polynomial. In panel (b) we plot the CV function for the sub-sample of Black women, and find that the CV-selected order is $p = 2$, a quadratic. As expected from visual examination of Figure 20.1, the selected model is more parsimonious for panel (b), most likely because it has a substantially smaller sample size. What may be surprising is that even for panel (a), which has a large sample and smooth estimates, the CV-selected model is still relatively parsimonious.

A user who desires a reduced bias estimator might increase the polynomial orders to p=4 or even p=5 for the subsample of white women and to p=3 or p=4 for the subsample of Black women. Both CV functions are relatively similar across these values.

7 Some authors define CV(K) as the average rather than the sum.

Figure 20.6: Cross-Validation Functions for Polynomial Estimates of Experience Profile. (a) White Women; (b) Black Women.

20.18 Variance and Standard Error Estimation

The exact conditional variance of the least squares estimator β^K under independent sampling is

$$V_{\widehat\beta} = \left(X_K'X_K\right)^{-1}\left(\sum_{i=1}^n X_{Ki}X_{Ki}'\sigma^2(X_i)\right)\left(X_K'X_K\right)^{-1}.$$

The exact conditional variance for the conditional mean estimator m^K(x)=XK(x)β^K is

$$V_K(x) = X_K(x)'\left(X_K'X_K\right)^{-1}\left(\sum_{i=1}^n X_{Ki}X_{Ki}'\sigma^2(X_i)\right)\left(X_K'X_K\right)^{-1}X_K(x).$$

Using the notation of Section 20.7 this equals

$$\frac{1}{n^2}\sum_{i=1}^n \widehat w_K(x,X_i)^2\sigma^2(X_i).$$

In the case of conditional homoskedasticity the latter simplifies to

$$\frac{1}{n}\widehat w_K(x,x)\,\sigma^2 \simeq \frac{1}{n}\zeta_K(x)^2\sigma^2,$$

where $\zeta_K(x)$ is the normalized regressor length defined in (20.15). Under conditional heteroskedasticity, large samples, and large $K$ (so that $\widehat w_K(x,X_i)$ is a local kernel) it approximately equals

$$\frac{1}{n}w_K(x,x)\,\sigma^2(x) = \frac{1}{n}\zeta_K(x)^2\sigma^2(x).$$

In either case we find that the variance is approximately

$$V_K(x) \approx \frac{1}{n}\zeta_K(x)^2\sigma^2(x).$$

This shows that the variance of the series regression estimator scales with $\zeta_K(x)^2$ and the conditional variance. From the plot of $\zeta_K(x)$ shown in Figure 20.4 we can deduce that the series regression estimator will be relatively imprecise at the boundary of the support of $X$.

The estimator of (20.31) recommended by Andrews (1991a) is the HC3 estimator

$$\widehat V_{\widehat\beta} = \left(X_K'X_K\right)^{-1}\left(\sum_{i=1}^n X_{Ki}X_{Ki}'\widetilde e_{Ki}^2\right)\left(X_K'X_K\right)^{-1} \tag{20.32}$$

where e~Ki is the leave-one-out prediction error (20.29). Alternatives include the HC1 or HC2 estimators.

Given (20.32) a variance estimator for m^K(x)=XK(x)β^K is

$$\widehat V_K(x) = X_K(x)'\left(X_K'X_K\right)^{-1}\left(\sum_{i=1}^n X_{Ki}X_{Ki}'\widetilde e_{Ki}^2\right)\left(X_K'X_K\right)^{-1}X_K(x). \tag{20.33}$$

A standard error for $\widehat m_K(x)$ is the square root of $\widehat V_K(x)$.
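Putting the pieces together, the following sketch computes $\widehat m_K(x)$ with HC3-type standard errors of the form above (simulated heteroskedastic data; the order p = 6 is an illustrative assumption). The last line forms the pointwise 95% bands discussed in Section 20.20.

```r
# Pointwise standard errors for mhat_K(x) using prediction errors (HC3-type).
set.seed(5)
n <- 500
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3 + 0.3 * x)   # heteroskedastic errors

p   <- 6
bas <- poly(x, p)                          # orthogonalized polynomial basis
XK  <- cbind(1, bas)
fit <- lm.fit(XK, y)
XtXinv <- solve(crossprod(XK))
h    <- rowSums((XK %*% XtXinv) * XK)      # leverage values X_Ki'(X_K'X_K)^{-1}X_Ki
etil <- fit$residuals / (1 - h)            # leave-one-out prediction errors
Vbeta <- XtXinv %*% crossprod(XK * etil) %*% XtXinv   # HC3-type variance of betahat

xg   <- seq(0, 1, length.out = 101)
XKg  <- cbind(1, predict(bas, xg))
mhat <- drop(XKg %*% fit$coefficients)
se   <- sqrt(rowSums((XKg %*% Vbeta) * XKg))          # sqrt of X_K(x)' Vbeta X_K(x)
band <- cbind(mhat - 1.96 * se, mhat + 1.96 * se)     # pointwise 95% confidence band
```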

20.19 Clustered Observations

Clustered observations are (Yig,Xig) for individuals i=1,,ng in cluster g=1,,G. The model is

$$Y_{ig} = m(X_{ig}) + e_{ig}, \qquad \mathbb{E}\left[e_{ig}\mid X_g\right] = 0$$

where Xg is the stacked Xig. Stack Yig and eig into cluster-level variables Yg and eg.

The series regression model using cluster-level notation is Yg=XgβK+eKg. We can write the series estimator as

$$\widehat\beta_K = \left(\sum_{g=1}^G X_g'X_g\right)^{-1}\left(\sum_{g=1}^G X_g'Y_g\right).$$

The cluster-level residual vector is e^g=YgXgβ^K.

As for parametric regression with clustered observations the standard assumption is that the clusters are mutually independent but dependence within each cluster is unstructured. We therefore use the same variance formulae as used for parametric regression. The standard estimator is

$$\widehat V_{\widehat\beta}^{\,\mathrm{CR1}} = \left(\frac{G}{G-1}\right)\left(X_K'X_K\right)^{-1}\left(\sum_{g=1}^G X_g'\widehat e_g\widehat e_g'X_g\right)\left(X_K'X_K\right)^{-1}.$$

An alternative is to use the delete-cluster prediction error $\widetilde e_g = Y_g - X_g\widetilde\beta_{K,-g}$ where

$$\widetilde\beta_{K,-g} = \left(\sum_{j\ne g}X_j'X_j\right)^{-1}\left(\sum_{j\ne g}X_j'Y_j\right)$$

leading to the estimator

$$\widehat V_{\widehat\beta}^{\,\mathrm{CR3}} = \left(X_K'X_K\right)^{-1}\left(\sum_{g=1}^G X_g'\widetilde e_g\widetilde e_g'X_g\right)\left(X_K'X_K\right)^{-1}.$$

There is no current theory on how to select the number of series terms $K$ for clustered observations. A reasonable choice is to minimize the delete-cluster cross-validation criterion $CV(K) = \sum_{g=1}^G \widetilde e_g'\widetilde e_g$.
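A sketch of this delete-cluster criterion (y, XK, and the cluster identifier g are hypothetical user-supplied objects):

```r
# Delete-cluster cross-validation criterion for a given regressor matrix XK.
cv_cluster <- function(y, XK, g) {
  crit <- 0
  for (cl in unique(g)) {
    out  <- g == cl
    beta <- solve(crossprod(XK[!out, , drop = FALSE]),
                  crossprod(XK[!out, , drop = FALSE], y[!out]))  # delete-cluster estimate
    etil <- y[out] - XK[out, , drop = FALSE] %*% beta            # delete-cluster prediction errors
    crit <- crit + sum(etil^2)
  }
  crit
}
# Compare cv_cluster(y, XK, g) across candidate bases XK (e.g., polynomial orders).
```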

20.20 Confidence Bands

When displaying nonparametric estimators such as $\widehat m_K(x)$ it is customary to display confidence intervals. An asymptotic pointwise 95% confidence interval for $m(x)$ is $\widehat m_K(x) \pm 1.96\,\widehat V_K^{1/2}(x)$. These confidence intervals can be plotted along with $\widehat m_K(x)$.

To illustrate, Figure 20.7 plots polynomial estimates of the regression of log( wage ) on experience using the selected estimates from Figure 20.1, plus 95% confidence bands. Panel (a) plots the estimate for the subsample of white women using p=5. Panel (b) plots the estimate for the subsample of Black women using p=3. The standard errors are calculated using the formula (20.33). You can see that the confidence bands widen at the boundaries. The confidence bands are tight for the larger subsample of white women, and significantly wider for the smaller subsample of Black women. Regardless, both plots indicate that the average wage rises for experience levels up to about 20 years and then flattens for experience levels above 20 years.

Figure 20.7: Polynomial Estimates with 95% Confidence Bands. (a) White Women; (b) Black Women.

There are two deficiencies with these confidence bands. First, they do not take into account the bias rK(x) of the series estimator. Consequently, we should interpret the confidence bounds as valid for the pseudo-true regression (the best finite K approximation) rather than the true regression function m(x). Second, the above confidence intervals are based on a pointwise (in x ) asymptotic distribution theory. Consequently we should interpret their coverage as having pointwise validity and be cautious about interpreting global shapes from the confidence bands.

20.21 Uniform Approximations

Since m^K(x) is a function it is desirable to have a distribution theory which applies to the entire function, not just the estimator at a point. This can be used, for example, to construct confidence bands with uniform (in x ) coverage properties. For those familiar with empirical process theory, it might be hoped that the stochastic process

$$\eta_K(x) = \frac{\sqrt{n}\left(\widehat m_K(x) - m(x)\right)}{V_K^{1/2}(x)}$$

might converge to a stochastic (Gaussian) process, but this is not the case. Effectively, the process ηK(x) is not stochastically equicontinuous so conventional empirical process theory does not apply.

To develop a uniform theory, Belloni, Chernozhukov, Chetverikov, and Kato (2015) have introduced what are known as strong approximations. Their method shows that ηK(x) is equal in distribution to a sequence of Gaussian processes plus a negligible error. Their theory (Theorem 4.4) takes the following form. Under stronger conditions than Assumption 20.2

$$\eta_K(x) \overset{d}{=} \frac{X_K(x)'\left(Q_K^{-1}\Omega_KQ_K^{-1}\right)^{1/2}}{V_K^{1/2}(x)}\,G_K + o_p(1)$$

uniformly in $x$, where "$\overset{d}{=}$" means equality in distribution and $G_K \sim N(0, I_K)$.

This shows the distributional result in Theorem 20.10 can be interpreted as holding uniformly in x. It can also be used to develop confidence bands (different from those from the previous section) with asymptotic uniform coverage.

20.22 Partially Linear Model

A common use of a series regression is to allow m(x) to be nonparametric with respect to one variable yet linear in the other variables. This allows flexibility in a particular variable of interest. A partially linear model with vector-valued regressor X1 and real-valued continuous X2 takes the form

$$m(x_1, x_2) = x_1'\beta_1 + m_2(x_2).$$

This model is common when X1 are discrete (e.g. binary) and X2 is continuously distributed.

Series methods are convenient for partially linear models as we can replace the unknown function m2(x2) with a series expansion to obtain

$$m(X) \approx m_K(X) = X_1'\beta_1 + X_{2K}(X_2)'\beta_{2K} = X_K'\beta_K$$

where $X_{2K} = X_{2K}(X_2)$ is a vector of basis transformations of $X_2$ (typically polynomials or splines). After transformation the regressors are $X_K = \left(X_1',\, X_{2K}'\right)'$ with coefficients $\beta_K = \left(\beta_1',\, \beta_{2K}'\right)'$.
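In R a partially linear series fit is a one-line regression. The sketch below uses hypothetical variable names (y, x1, x2 in a data frame dat) and an illustrative order 5 for the series component.

```r
# Partially linear series regression: linear in x1, polynomial series in x2.
fit <- lm(y ~ x1 + poly(x2, 5), data = dat)
coef(fit)["x1"]      # the coefficient on the linear part
```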

20.23 Panel Fixed Effects

The one-way error components nonparametric regression model is

$$Y_{it} = m(X_{it}) + u_i + \varepsilon_{it}$$

for i=1,,N and t=1,,T. It is standard to treat the individual effect ui as a fixed effect. This model can be interpreted as a special case of the partially linear model from the previous section though the dimension of ui is increasing with N.

A series estimator approximates the function m(x) with mK(x)=XK(x)βK as in (20.4). This leads to the series regression model Yit=XKitβK+ui+εKit where XKit=XK(Xit).

The fixed effects estimator is the same as in linear panel data regression. First, the within transformation is applied to $Y_{it}$ and to the elements of the basis transformations $X_{Kit}$. These are $\dot Y_{it} = Y_{it} - \bar Y_i$ and $\dot X_{Kit} = X_{Kit} - \bar X_{Ki}$. The transformed regression equation is $\dot Y_{it} = \dot X_{Kit}'\beta_K + \dot\varepsilon_{Kit}$. What is important about the within transformation for the regressors is that it is applied to the transformed variables $X_{Kit}$, not to the original regressor $X_{it}$. For example, in a polynomial regression the within transformation is applied to the powers $X_{it}^j$. It is inappropriate to apply the within transformation to $X_{it}$ and then construct the basis transformations.

The coefficient is estimated by least squares on the within transformed variables

$$\widehat\beta_K = \left(\sum_{i=1}^N\sum_{t=1}^T \dot X_{Kit}\dot X_{Kit}'\right)^{-1}\left(\sum_{i=1}^N\sum_{t=1}^T \dot X_{Kit}\dot Y_{it}\right).$$
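A sketch of the correct order of operations (hypothetical data frame df with columns id, x, y; polynomial order 4 is illustrative): construct the basis from the raw regressor first, then within-demean each basis column.

```r
# Fixed effects series regression: build the basis, then apply the within transformation.
p  <- 4
XK <- outer(df$x, 1:p, `^`)                            # power basis (intercept absorbed by demeaning)
XKdot <- apply(XK, 2, function(v) v - ave(v, df$id))   # within-demeaned basis columns
ydot  <- df$y - ave(df$y, df$id)                       # within-demeaned outcome
betahat <- solve(crossprod(XKdot), crossprod(XKdot, ydot))
```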

Variance estimators should be calculated using the clustered variance formulas, clustered at the level of the individual i, as described in Section 20.19.

For selection of the number of series terms K there is no current theory. A reasonable method is to use delete-cluster cross-validation as described in Section 20.19.

20.24 Multiple Regressors

Suppose $X \in \mathbb{R}^d$ is vector-valued and continuously distributed. A multivariate series approximation can be obtained as follows. Construct a set of basis transformations for each variable separately. Take their tensor cross-products. Use these as regressors. For example, a $p$th-order polynomial is

$$m_K(x) = \beta_0 + \sum_{j_1=1}^{p}\cdots\sum_{j_d=1}^{p} x_1^{j_1}\cdots x_d^{j_d}\,\beta_{j_1,\dots,j_d}.$$

This includes all powers and cross-products. The coefficient vector has dimension $K = 1 + p^d$.

The inclusion of cross-products greatly increases the number of coefficients relative to the univariate case. Consequently series applications with multiple regressors typically require large sample sizes.
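A sketch of the tensor-product construction, following the display above literally (intercept plus all products with each exponent between 1 and p; the helper name is hypothetical):

```r
# Full tensor-product polynomial basis for d regressors.
tensor_basis <- function(X, p) {                       # X: n x d matrix
  d   <- ncol(X)
  idx <- as.matrix(expand.grid(rep(list(1:p), d)))     # all (j1, ..., jd) in {1, ..., p}^d
  terms <- sapply(seq_len(nrow(idx)), function(r)
    apply(sweep(X, 2, idx[r, ], `^`), 1, prod))        # x1^j1 * ... * xd^jd
  cbind(1, terms)                                      # intercept plus p^d interaction terms
}
# With d = 2 and p = 3 this already gives K = 1 + 3^2 = 10 columns,
# illustrating how quickly K grows with the dimension d.
```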

20.25 Additively Separable Models

As discussed in the previous section, when $X \in \mathbb{R}^d$ a full series expansion requires a large number of coefficients, which means that estimation precision will be low unless the sample size is quite large. A common simplification is to treat the regression function $m(x)$ as additively separable in the individual regressors. This means that

$$m(x) = m_1(x_1) + m_2(x_2) + \cdots + m_d(x_d).$$

We then apply series expansions (polynomials or splines) separately for each component mj(xj). Essentially, this is the same as the expansions discussed in the previous section but omitting the interaction terms.
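An additively separable fit then amounts to including a separate series in each regressor with no interactions; a sketch with hypothetical names (data frame dat with y, x1, x2, x3; order 4 per component):

```r
# Additively separable series regression: one polynomial per regressor, no interactions.
fit <- lm(y ~ poly(x1, 4) + poly(x2, 4) + poly(x3, 4), data = dat)   # 1 + d*p = 13 coefficients
```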

The advantage of additive separability is the reduction in dimensionality. While an unconstrained $p$th-order polynomial has $1 + p^d$ coefficients, an additively separable polynomial model has only $1 + dp$ coefficients. This is a major reduction.

The disadvantage of additive separability is that the interaction effects have been eliminated. This is a substantive restriction on m(x).

The decision to impose additive separability can be based on an economic model which suggests the absence of interaction effects, or can be a model selection decision similar to the selection of the number of series terms.

20.26 Nonparametric Instrumental Variables Regression

The basic nonparametric instrumental variables (NPIV) model takes the form

$$Y = m(X) + e, \qquad \mathbb{E}[e\mid Z] = 0 \tag{20.34}$$

where Y,X, and Z are real valued. Here, Z is an instrumental variable and X an endogenous regressor.

In recent years there have been many papers in the econometrics literature examining the NPIV model, exploring identification, estimation, and inference. Many of these papers are mathematically advanced. Two important and accessible contributions are Newey and Powell (2003) and Horowitz (2011). Here we describe some of the primary results.

A series estimator approximates the function m(x) with mK(x)=XK(x)βK as in (20.4). This leads to the series structural equation

$$Y = X_K'\beta_K + e_K$$

where $X_K = X_K(X)$. For example, if a polynomial basis is used then $X_K = \left(1, X, \dots, X^{K-1}\right)'$.

Since $X$ is endogenous so is the entire vector $X_K$. Thus we need at least $K$ instrumental variables. It is useful to consider the reduced form equation for $X$. A nonparametric specification is

$$X = g(Z) + u, \qquad \mathbb{E}[u\mid Z] = 0.$$

We can approximate $g(z)$ by the series expansion

$$g(z) \approx g_L(z) = Z_L(z)'\gamma_L$$

where $Z_L(z)$ is an $L\times 1$ vector of basis transformations and $\gamma_L$ is an $L\times 1$ coefficient vector. For example, if a polynomial basis is used then $Z_L(z) = \left(1, z, \dots, z^{L-1}\right)'$. Most of the literature for simplicity focuses on the case $L = K$, but this is not essential to the method.

If $L \ge K$ we can then use $Z_L = Z_L(Z)$ as instruments for $X_K$. The 2SLS estimator $\widehat\beta_{K,L}$ of $\beta_K$ is

$$\widehat\beta_{K,L} = \left(X_K'Z_L\left(Z_L'Z_L\right)^{-1}Z_L'X_K\right)^{-1}\left(X_K'Z_L\left(Z_L'Z_L\right)^{-1}Z_L'Y\right).$$

The estimator of m(x) is m^K(x)=XK(x)β^K,L. If L>K the linear GMM estimator can be similarly defined.
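A sketch of this estimator with polynomial bases and L = K (the function name npiv_2sls is hypothetical; y is the outcome, x the endogenous regressor, z the instrument):

```r
# Series 2SLS for the NPIV model with polynomial bases and L = K.
npiv_2sls <- function(y, x, z, K) {
  XK <- outer(x, 0:(K - 1), `^`)                   # (1, X, ..., X^{K-1})
  ZL <- outer(z, 0:(K - 1), `^`)                   # (1, Z, ..., Z^{K-1})
  PZ <- ZL %*% solve(crossprod(ZL), t(ZL))         # projection onto the instrument space
  solve(t(XK) %*% PZ %*% XK, t(XK) %*% PZ %*% y)   # betahat_{K,L}
}
# The structural estimate at x0 is then: outer(x0, 0:(K-1), `^`) %*% npiv_2sls(y, x, z, K)
```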

One way to think about the choice of instruments is to realize that we are actually estimating reduced form equations for each element of XK. The reduced form system is

$$X_K = \Gamma_K'Z_L + u_K, \qquad \Gamma_K = \mathbb{E}\left[Z_LZ_L'\right]^{-1}\mathbb{E}\left[Z_LX_K'\right]. \tag{20.36}$$

For example, suppose we use a polynomial basis with K=L=3. Then the reduced form system (ignoring intercepts) is

$$\begin{bmatrix}X\\ X^2\\ X^3\end{bmatrix} = \begin{bmatrix}\Gamma_{11}&\Gamma_{21}&\Gamma_{31}\\ \Gamma_{12}&\Gamma_{22}&\Gamma_{32}\\ \Gamma_{13}&\Gamma_{23}&\Gamma_{33}\end{bmatrix}\begin{bmatrix}Z\\ Z^2\\ Z^3\end{bmatrix} + \begin{bmatrix}u_1\\ u_2\\ u_3\end{bmatrix}.$$

This is modeling the conditional means of $X$, $X^2$, and $X^3$ as linear functions of $Z$, $Z^2$, and $Z^3$.

To understand whether the coefficient $\beta_K$ is identified it is useful to consider the simple reduced form equation $X = \gamma_0 + \gamma_1 Z + u$. Assume that $\gamma_1 \ne 0$ so that the equation is strongly identified and assume for simplicity that $u$ is independent of $Z$ with mean zero and variance $\sigma_u^2$. The identification properties of the reduced form are invariant to rescaling and recentering $X$ and $Z$, so without loss of generality we can set $\gamma_0 = 0$ and $\gamma_1 = 1$. Then we can calculate that the coefficient matrix in (20.36) is

$$\begin{bmatrix}\Gamma_{11}&\Gamma_{21}&\Gamma_{31}\\ \Gamma_{12}&\Gamma_{22}&\Gamma_{32}\\ \Gamma_{13}&\Gamma_{23}&\Gamma_{33}\end{bmatrix} = \begin{bmatrix}1&0&0\\ 0&1&0\\ 3\sigma_u^2&0&1\end{bmatrix}.$$

Notice that this is lower triangular and full rank. It turns out that this property holds for any values of K=L so the coefficient matrix in (20.36) is full rank for any choice of K=L. This means that identification of the coefficient βK is strong if the reduced form equation for X is strong. Thus to check the identification condition for βK it is sufficient to check the reduced form equation for X. A critically important caveat, however, as discussed in the following section, is that identification of βK does not mean that the structural function m(x) is identified.

A simple method for pointwise inference is to use conventional methods to estimate $V_{K,L} = \mathrm{var}\left[\widehat\beta_{K,L}\right]$ and then estimate $\mathrm{var}\left[\widehat m_K(x)\right]$ by $X_K(x)'\widehat V_{K,L}X_K(x)$ as in series regression. Bootstrap methods are typically advocated to achieve better coverage. See Horowitz (2011) for details. For state-of-the-art inference methods see Chen and Pouzo (2015) and Chen and Christensen (2018).

20.27 NPIV Identification

In the previous section we discussed identification of the pseudo-true coefficient $\beta_K$. In this section we discuss identification of the structural function $m(x)$. This is considerably more challenging.

To understand how the function m(x) is determined, apply the expectation operator E[Z=z] to (20.34). We find

E[YZ=z]=E[m(X)Z=z]

with the remainder equal to zero because E[eZ]=0. We can write this equation as

μ(z)=m(x)f(xz)dx

where μ(z)=E[YZ=z] is the CEF of Y given Z=z and f(xz) is the conditional density of X given Z. These two functions are identified 8 from the joint distribution of (Y,X,Z). This means that the unknown function m(x) is the solution to the integral equation (20.37). Conceptually, you can imagine estimating μ(z) and f(xz) using standard techniques and then finding the solution m(x). In essence, this is how m(x) is defined and is the nonparametric analog of the classical relationship between the structural and reduced forms.

Unfortunately the solution m(x) may not be unique, even in situations where a linear IV model would be strongly identified. This is related to what is known as the ill-posed inverse problem, which means that the solution m(x) is not necessarily a continuous function of $\mu(z)$. Identification requires restricting the class of allowable functions $f(x \mid z)$. This is analogous to the linear IV model, where identification requires restrictions on the reduced form equations. Specifying and understanding the needed restrictions is more subtle than in the linear case.
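Heuristically (this discretization is ours and purely illustrative), replacing the integral in (20.37) by a sum over grid points $x_1, \ldots, x_J$ and $z_1, \ldots, z_J$ with spacing $\Delta$ turns the problem into a linear system:

$$\mu(z_j) \approx \sum_{k=1}^{J} m(x_k)\, f(x_k \mid z_j)\,\Delta, \qquad j = 1, \ldots, J, \qquad \text{or} \qquad \mu = A\, m, \qquad A_{jk} = f(x_k \mid z_j)\,\Delta .$$

In this discrete analogy, non-identification corresponds to A being singular (so that Am = A(m + δ) for some δ ≠ 0), while ill-posedness corresponds to A being nearly singular, so that small perturbations in μ produce large changes in the recovered m.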

The function m(x) is identified if it is the unique solution to (20.37). Equivalently, m(x) is not identified if we can replace m(x) in (20.37) with m(x) + δ(x) for some non-trivial function δ(x) and equation (20.37) continues to hold. The latter occurs when

$$\int \delta(x)\, f(x \mid z)\, dx = 0 \tag{20.38}$$

for all z. Equivalently, m(x) is identified if (and only if) (20.38) holds only for the trivial function δ(x) = 0.

8 Technically, if E|Y| < ∞, the joint density of (Z,X) exists, and the marginal density of Z is positive.

Newey and Powell (2003) defined this fundamental condition as completeness.

Proposition 20.1 Completeness. m(x) is identified if (and only if) the completeness condition holds: (20.38) for all z implies δ(x)=0.

Completeness is a property of the reduced form conditional density f(xz). It is unaffected by the structural equation m(x). This is analogous to the linear IV model where identification is a property of the reduced form equations, not a property of the structural equation.

As we stated above, completeness may not be satisfied even if the reduced form relationship is strong. This is easiest to see with a constructed example 9. Suppose that the reduced form is X = Z + u, var[Z] = 1, u is independent of Z, and u is distributed U[−1,1]. This reduced form equation has R² = 0.75 so it is strong. The reduced form conditional density is $f(x \mid z) = 1/2$ on [z−1, z+1]. Consider δ(x) = sin(πx). We calculate that

$$\int \delta(x)\, f(x \mid z)\, dx = \frac{1}{2}\int_{z-1}^{z+1} \sin(\pi x)\, dx = 0$$

for every z, because sin(πx) has period 2 and integrates to zero over [−1,1]. This means that equation (20.37) also holds 10 for m(x) + sin(πx). Thus m(x) is not identified, despite the fact that the reduced form equation is strong.
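The failure of completeness in this example can be verified numerically. A short sketch (ours), using scipy quadrature:

    import numpy as np
    from scipy.integrate import quad

    def inner_product(z):
        # integral of delta(x) f(x|z) dx with delta(x) = sin(pi x) and f(x|z) = 1/2 on [z-1, z+1]
        value, _ = quad(lambda x: 0.5 * np.sin(np.pi * x), z - 1, z + 1)
        return value

    print([round(inner_product(z), 12) for z in (-1.3, 0.0, 0.7, 2.5)])
    # each integral is numerically zero: adding sin(pi x) to m(x) leaves (20.37) unchanged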

While identification fails for some conditional distributions, it does not fail for all. Andrews (2017) provides classes of distributions which satisfy the completeness condition and shows that these distribution classes are quite general.

What does this mean in practice? If completeness fails then the structural equation is not identified and cannot be consistently estimated. Furthermore, by analogy with the weak instruments literature, we expect that if the conditional distribution is close to incomplete then the structural equation will be poorly identified and our estimators will be imprecise. Since whether or not the conditional distribution is complete is unknown (and more difficult to assess than in the linear model) this is troubling for empirical research. Effectively, in any given application we do not know whether or not the structural function m(x) is identified.

A partial answer is provided by Freyberger (2017). He shows that the joint hypothesis of incompleteness and small asymptotic bias can be tested. By applying the test proposed in Freyberger (2017) a user can obtain evidence that their NPIV estimator is well-behaved in the sense of having low bias. Unlike Stock and Yogo (2005), however, Freyberger’s result does not address inference.

20.28 NPIV Convergence Rate

As described in Horowitz (2011), the convergence rate of m^K(x) for m(x) is

$$\left| \widehat m_K(x) - m(x) \right| = O_p\left( K^{-s} + K^{r}\left( \frac{K}{n} \right)^{1/2} \right) \tag{20.39}$$

where s is the smoothness 11 of m(x) and r is the smoothness of the joint density $f_{XZ}(x,z)$ of (X,Z). The first term $K^{-s}$ is the bias due to the approximation of m(x) by $m_K(x)$ and takes the same form as for series regression. The second term $K^{r}(K/n)^{1/2}$ is the standard deviation of $\widehat m_K(x)$. The component $(K/n)^{1/2}$ is the same as for series regression. The extra component $K^{r}$ is due to the ill-posed inverse problem (see the previous section).

9 This example was suggested by Joachim Freyberger.

10 In fact, (20.38) holds for m(x) + δ(x) for any function δ(x) which has period 2 and integrates to zero on [−1,1].

From the rate (20.39) we can calculate that the optimal number of series terms is $K \sim n^{1/(2r+2s+1)}$. Given this choice, the best possible convergence rate in (20.39) is $O_p\left(n^{-s/(2r+2s+1)}\right)$. For r > 0 these rates are slower than for series regression. In the case s = 2 they are $K \sim n^{1/(2r+5)}$ and $O_p\left(n^{-2/(2r+5)}\right)$, which are slower than the $K \sim n^{1/5}$ and $O_p\left(n^{-2/5}\right)$ rates obtained by series regression.

A very unusual aspect of the rate (20.39) is that smoothness of $f_{XZ}(x,z)$ adversely affects the convergence rate: larger r means a slower rate of convergence. The limiting case as r → ∞ (for example, joint normality of X and Z) results in a logarithmic convergence rate. This seems very strange. The reason is that when the density $f_{XZ}(x,z)$ is very smooth the data contain little information about the function m(x). This is not intuitive and requires a deeper mathematical treatment.

A practical implication of the convergence rate (20.39) is that the number of series terms K should be much smaller than for regression estimation. Estimation variance increases quickly as K increases. Therefore K should not be taken to be too large. In practice, however, it is unclear how to select the series order K as standard cross-validation methods do not apply.

20.29 Nonparametric vs Parametric Identification

One of the insights from the nonparametric identification literature is that it is important to understand which features of a model are nonparametrically identified, meaning which are identified without functional form assumptions, and which are only identified based on functional form assumptions. Since functional form assumptions are dubious in most economic applications the strong implication is that researchers should strive to work only with models which are nonparametrically identified.

Even if a model is determined to be nonparametrically identified a researcher may estimate a linear (or another simple parametric) model. This is valid because it can be viewed as an approximation to the nonparametric structure. If, however, the model is identified only under a parametric assumption, then it cannot be viewed as an approximation and it is unclear how to interpret the model more broadly.

For example, in the regression model Y = m(X) + e with $E[e \mid X] = 0$ the CEF is nonparametrically identified by Theorem 2.14. This means that researchers who estimate linear regressions (or other low-dimensional regressions) can interpret their estimated model as an approximation to the underlying CEF.

As another example, in the NPIV model where $E[e \mid Z] = 0$ the structural function m(x) is identified under the completeness condition. This means that researchers who estimate linear 2SLS regressions can interpret their estimated model as an approximation to m(x) (subject to the caveat that it is difficult to know if completeness holds).

But the analysis can also point out simple yet subtle mistakes. Take the simple IV model with one exogenous regressor X1 and one endogenous regressor X2

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + e, \qquad E[e \mid X_1] = 0 \tag{20.40}$$

with no additional instruments. Suppose that an enterprising researcher suggests using $X_1^2$ as an instrument for $X_2$, reasoning that the assumptions imply $E[X_1^2 e] = 0$ so that $X_1^2$ is a valid instrument.

11 The number of bounded derivatives.

The trouble is that the basic model is not nonparametrically identified. If we write (20.40) as a partially linear nonparametric IV problem

$$Y = m(X_1) + \beta_2 X_2 + e, \qquad E[e \mid X_1] = 0 \tag{20.41}$$

then we can see that this model is not identified. We need a valid excluded instrument Z. Since (20.41) is not identified, (20.40) cannot be viewed as a valid approximation. The apparent identification of (20.40) rests critically on the linearity in (20.40) being literally true, which is unknown.

The point of this example is that (20.40) should never be estimated by 2SLS using the instrument $X_1^2$ for $X_2$, fundamentally because the nonparametric model (20.41) is not identified.

Another way to describe the mistake is to observe that $X_1^2$ is a valid instrument in (20.40) only if its exclusion from the structural equation (20.40) is a valid restriction. Viewed in the context of (20.41) we can see that this is a functional form restriction. As stated above, identification based on functional form restrictions alone is highly undesirable because functional form assumptions are dubious.

20.30 Example: Angrist and Lavy (1999)

To illustrate nonparametric instrumental variables in practice we follow Horowitz (2011) by extending the empirical work reported in Angrist and Lavy (1999). Their paper is concerned with measuring the causal effect of the number of students in an elementary school classroom on academic achievement. They address this using a sample of 4067 Israeli 4th and 5th grade classrooms. The dependent variable is the classroom average score on an achievement test. Here we consider the reading score avgverb. The explanatory variables are the number of students in the classroom (classize), the number of students in the grade at the school (enrollment), and a school-level index of students’ socioeconomic status that the authors call percent disadvantaged. The variables enrollment and disadvantaged are treated as exogenous but classize is treated as endogenous because wealthier schools may be able to offer smaller class sizes.

The authors suggest the following instrumental variable for classize. Israeli regulations specify that class sizes must be capped at 40 students. This means that, if the regulation is followed exactly, classize is perfectly predictable from enrollment: a school with up to 40 students in the grade will have one classroom and a school with 41-80 students will have two classrooms. The precise prediction is that classize equals

$$p = \frac{\text{enrollment}}{1 + \lfloor (\text{enrollment} - 1)/40 \rfloor} \tag{20.42}$$

where ⌊a⌋ denotes the integer part of a. Angrist and Lavy use p as an instrumental variable for classize.
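A small sketch (ours) of this construction:

    import numpy as np

    def predicted_class_size(enrollment):
        # Rule-based prediction: enrollment divided evenly across the minimum number of
        # classrooms needed so that no classroom exceeds 40 students
        enrollment = np.asarray(enrollment, dtype=float)
        classrooms = 1 + np.floor((enrollment - 1) / 40)
        return enrollment / classrooms

    print(predicted_class_size([20, 40, 41, 80, 81]))   # [20.  40.  20.5 40.  27. ]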

They estimate several specifications. We focus on equation (6) from their Table VII, which specifies avgverb as a linear function of classize, disadvantaged, enrollment, grade4, and the interaction of classize and disadvantaged, where grade4 is a dummy indicator for 4th grade classrooms. The equation is estimated by instrumental variables, using p and p × disadvantaged as instruments. The observations are treated as clustered at the level of the school. Their estimates show a negative and statistically significant impact of class size on reading test scores.

We are interested in a nonparametric version of their equation. To keep the specification reasonably parsimonious yet flexible we use the following equation.

$$\begin{aligned}
\text{avgverb} &= \beta_1\left(\frac{\text{classize}}{40}\right) + \beta_2\left(\frac{\text{classize}}{40}\right)^2 + \beta_3\left(\frac{\text{classize}}{40}\right)^3 + \beta_4\left(\frac{\text{disadvantaged}}{14}\right) + \beta_5\left(\frac{\text{disadvantaged}}{14}\right)^2 \\
&\quad + \beta_6\left(\frac{\text{disadvantaged}}{14}\right)^3 + \beta_7\left(\frac{\text{classize}}{40}\right)\left(\frac{\text{disadvantaged}}{14}\right) + \beta_8\,\text{enrollment} + \beta_9\,\text{grade4} + \beta_{10} + e .
\end{aligned}$$

This is a cubic equation in classize and disadvantaged, with a single interaction term, and linear in enrollment and grade4. The cubic in disadvantaged was selected by a delete-cluster cross-validation regression without classize. The cubic in classize was selected to allow for a minimal degree of nonparametric flexibility without overparameterization. The variables classize and disadvantaged were scaled by 40 and 14, respectively, so that the regression is well conditioned. The scaling for classize was selected so that the variable essentially falls in [0,1] and the scaling for disadvantaged was selected so that its mean is 1.

Table 20.1: Nonparametric Instrumental Variable Regression for Reading Test Score

  Regressor                          Estimate   (s.e.)
  classize/40                          34.2     (33.4)
  (classize/40)^2                     -61.2     (53.0)
  (classize/40)^3                      29.0     (26.8)
  disadvantaged/14                    -12.4     (1.7)
  (disadvantaged/14)^2                  3.33    (0.54)
  (disadvantaged/14)^3                  0.377   (0.078)
  (classize/40)(disadvantaged/14)       0.81    (1.77)
  enrollment                            0.015   (0.007)
  grade4                               -1.96    (0.16)
  Intercept                            77.0     (6.9)

The equation is estimated by 2SLS using (p/40), (p/40)², (p/40)³, and (p/40) × (disadvantaged/14) as instruments for the four variables involving classize. The parameter estimates are reported in Table 20.1. The standard errors are clustered at the level of the school. Most of the individual coefficients do not have an interpretable meaning, except that the positive coefficient on enrollment shows that larger schools achieve slightly higher test scores, and the negative coefficient on grade4 shows that 4th grade students have somewhat lower test scores than 5th grade students.
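The following sketch (ours; the array names are illustrative and the 2SLS and cluster-robust variance steps are omitted) shows how the regressor and instrument matrices for this specification can be assembled:

    import numpy as np

    def design_matrices(classize, disadv, enrollment, grade4, p):
        # Regressors: cubics in classize/40 and disadvantaged/14, one interaction,
        # enrollment, the grade4 dummy, and an intercept
        c, d = classize / 40, disadv / 14
        X = np.column_stack([c, c**2, c**3, d, d**2, d**3, c * d,
                             enrollment, grade4, np.ones(len(c))])
        # Instruments: the four classize terms are replaced by the corresponding terms
        # in the predicted class size p/40; the exogenous regressors instrument themselves
        q = p / 40
        Z = np.column_stack([q, q**2, q**3, d, d**2, d**3, q * d,
                             enrollment, grade4, np.ones(len(q))])
        return X, Z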

To obtain a better interpretation of the results we display the estimated regression functions in Figure 20.8. Panel (a) displays the estimated effect of classize on reading test scores. Panel (b) displays the estimated effect of disadvantaged. In both figures the other variables are set at their sample means 12.

12 If they are set at other values it does not change the qualitative nature of the plots.

[Figure 20.8: Nonparametric Instrumental Variables Estimates of the Effect of Classize and Disadvantaged on Reading Test Scores. Panel (a): Effect of Classize. Panel (b): Effect of Percent Disadvantaged.]

In panel (a) we can see that increasing class size decreases the average test score. This is consistent with the results from the linear model estimated by Angrist and Lavy (1999). The estimated effect is remarkably close to linear.

In panel (b) we can see that increasing the percentage of disadvantaged students greatly decreases the average test score. This effect is substantially greater in magnitude than the effect of class size. The effect also appears to be nonlinear. The effect is precisely estimated with tight pointwise confidence bands.

We can also use the estimated model for hypothesis testing. The question addressed by Angrist and Lavy was whether or not class size has an effect on test scores. Within the nonparametric model estimated here the hypothesis of no effect corresponds to the linear restriction H0: β1 = β2 = β3 = β7 = 0. Examining the individual coefficient estimates and standard errors it is unclear whether the effect is significant, as none of these four coefficient estimates is statistically different from zero. The hypothesis is better assessed by a joint Wald test (using cluster-robust variance estimates). The statistic is 12.7, which has an asymptotic p-value of 0.013. This supports the hypothesis that class size has an effect on student performance, and panel (a) indicates that the effect is negative.

We can also use the model to quantify the impact of class size on test scores. Consider the impact of increasing a class from 20 to 40 students. In the above model the predicted impact on test scores is

$$\theta = \frac{1}{2}\beta_1 + \frac{3}{4}\beta_2 + \frac{7}{8}\beta_3 + \frac{1}{2}\beta_7 .$$

This is a linear function of the coefficients. The point estimate is $\widehat\theta = -2.96$ with a standard error of 1.21. (The point estimate is identical to the difference between the endpoints of the estimated function shown in panel (a).) This is a small but substantive impact.
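Because θ is linear in the coefficients, the point estimate and its standard error follow directly from the coefficient estimates and the cluster-robust covariance matrix estimate. A minimal sketch (ours, with illustrative names):

    import numpy as np

    def linear_functional(beta_hat, V_hat, a):
        # theta = a'beta with standard error sqrt(a'Va), for a fixed weight vector a
        a = np.asarray(a, dtype=float)
        return a @ beta_hat, np.sqrt(a @ V_hat @ a)

    # Weights for the impact of raising classize from 20 to 40, holding disadvantaged/14
    # at its mean of one: nonzero entries for beta_1, beta_2, beta_3, and beta_7,
    # in the order (beta_1, ..., beta_10) of the estimated equation
    a = np.array([1/2, 3/4, 7/8, 0, 0, 0, 1/2, 0, 0, 0])
    # theta_hat, se = linear_functional(beta_hat, V_hat, a)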

20.31 Technical Proofs*

Proof of Theorem 20.4. We provide a proof under the stronger assumption $\zeta_K^2 K / n \to 0$. (The proof presented by Belloni, Chernozhukov, Chetverikov, and Kato (2015) requires a more advanced treatment.) Let $\Vert A\Vert_F$ denote the Frobenius norm (see Section A.23), and write the $j^{\text{th}}$ element of $\widetilde X_{Ki}$ as $\widetilde X_{jKi}$. Using (A.18),

$$\left\Vert \widetilde Q_K - I_K \right\Vert^2 \le \left\Vert \widetilde Q_K - I_K \right\Vert_F^2 = \sum_{j=1}^{K}\sum_{\ell=1}^{K}\left( \frac{1}{n}\sum_{i=1}^{n}\left( \widetilde X_{jKi}\widetilde X_{\ell Ki} - E\left[ \widetilde X_{jKi}\widetilde X_{\ell Ki} \right] \right) \right)^{2} .$$

Then

$$\begin{aligned}
E\left\Vert \widetilde Q_K - I_K \right\Vert^2 &\le \sum_{j=1}^{K}\sum_{\ell=1}^{K} \operatorname{var}\left[ \frac{1}{n}\sum_{i=1}^{n} \widetilde X_{jKi}\widetilde X_{\ell Ki} \right] = \frac{1}{n}\sum_{j=1}^{K}\sum_{\ell=1}^{K} \operatorname{var}\left[ \widetilde X_{jKi}\widetilde X_{\ell Ki} \right] \\
&\le \frac{1}{n} E\left[ \sum_{j=1}^{K} \widetilde X_{jKi}^{2} \sum_{\ell=1}^{K} \widetilde X_{\ell Ki}^{2} \right] = \frac{1}{n} E\left[ \left( \widetilde X_{Ki}'\widetilde X_{Ki} \right)^{2} \right] \le \frac{\zeta_K^2}{n} E\left[ \widetilde X_{Ki}'\widetilde X_{Ki} \right] = \frac{\zeta_K^2 K}{n} \to 0
\end{aligned}$$

where the final lines use (20.17), $E[\widetilde X_{Ki}'\widetilde X_{Ki}] = K$, and $\zeta_K^2 K/n \to 0$. Markov's inequality implies (20.19).

Proof of Theorem 20.5. By the spectral decomposition we can write $\widetilde Q_K = H'\Lambda H$ where $H'H = I_K$ and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_K)$ contains the eigenvalues of $\widetilde Q_K$. Then

$$\left\Vert \widetilde Q_K - I_K \right\Vert = \left\Vert H'\left( \Lambda - I_K \right)H \right\Vert = \left\Vert \Lambda - I_K \right\Vert = \max_{j \le K}\left| \lambda_j - 1 \right| \overset{p}{\longrightarrow} 0$$

by Theorem 20.4. This implies $\min_{j \le K}\left| \lambda_j \right| \overset{p}{\longrightarrow} 1$, which is (20.21). Similarly,

$$\left\Vert \widetilde Q_K^{-1} - I_K \right\Vert = \left\Vert H'\left( \Lambda^{-1} - I_K \right)H \right\Vert = \left\Vert \Lambda^{-1} - I_K \right\Vert = \max_{j \le K}\left| \lambda_j^{-1} - 1 \right| \le \frac{\max_{j \le K}\left| 1 - \lambda_j \right|}{\min_{j \le K}\left| \lambda_j \right|} \overset{p}{\longrightarrow} 0 .$$

Proof of Theorem 20.6. Using (20.12) we can write

$$\widehat m_K(x) - m(x) = X_K(x)'\left( \widehat\beta_K - \beta_K \right) - r_K(x) .$$

Since $e_K = r_K + e$ is a projection error it satisfies $E[X_K e_K] = 0$. Since $e$ is a regression error it satisfies $E[X_K e] = 0$. We deduce $E[X_K r_K] = 0$. Hence $\int X_K(x) r_K(x) f(x)\,dx = E[X_K r_K] = 0$. Also observe that $\int X_K(x) X_K(x)'\,dF(x) = Q_K$ and $\int r_K(x)^2\,dF(x) = E[r_K^2] = \delta_K^2$. Then

$$\begin{aligned}
\operatorname{ISE}(K) &= \int \left( X_K(x)'\left( \widehat\beta_K - \beta_K \right) - r_K(x) \right)^2 dF(x) \\
&= \left( \widehat\beta_K - \beta_K \right)'\left( \int X_K(x)X_K(x)'\,dF(x) \right)\left( \widehat\beta_K - \beta_K \right) - 2\left( \widehat\beta_K - \beta_K \right)'\left( \int X_K(x) r_K(x)\,dF(x) \right) + \int r_K(x)^2\,dF(x) \\
&= \left( \widehat\beta_K - \beta_K \right)'Q_K\left( \widehat\beta_K - \beta_K \right) + \delta_K^2 .
\end{aligned}$$

We calculate that

$$\begin{aligned}
\left( \widehat\beta_K - \beta_K \right)'Q_K\left( \widehat\beta_K - \beta_K \right) &= \left( e_K' X_K \right)\left( X_K'X_K \right)^{-1} Q_K \left( X_K'X_K \right)^{-1}\left( X_K'e_K \right) \\
&= \left( e_K'\widetilde X_K \right)\left( \widetilde X_K'\widetilde X_K \right)^{-1}\left( \widetilde X_K'\widetilde X_K \right)^{-1}\left( \widetilde X_K'e_K \right) \\
&= n^{-2}\left( e_K'\widetilde X_K \right)\widetilde Q_K^{-1}\widetilde Q_K^{-1}\left( \widetilde X_K'e_K \right) \\
&\le \left( \lambda_{\max}\left( \widetilde Q_K^{-1} \right) \right)^2\left( n^{-2} e_K'\widetilde X_K\widetilde X_K'e_K \right) \\
&\le O_p(1)\left( n^{-2} e_K'X_K Q_K^{-1}X_K'e_K \right)
\end{aligned}$$

where $\widetilde X_K$ and $\widetilde Q_K$ are the orthogonalized regressors as defined in (20.18). The first inequality is the Quadratic Inequality (B.18), the second is (20.21).

Using the fact that the $X_{Ki}e_{Ki}$ are mean zero and uncorrelated across observations, (20.17), $E[e_K^2] \le E[Y^2] < \infty$, and Assumption 20.1.2,

$$E\left[ n^{-2} e_K'X_K Q_K^{-1}X_K'e_K \right] = n^{-1}E\left[ X_K'Q_K^{-1}X_K e_K^2 \right] \le \frac{\zeta_K^2}{n}E\left[ e_K^2 \right] = o(1) .$$

This shows that (20.45) is $o_p(1)$. Combined with (20.44) we find $\operatorname{ISE}(K) = o_p(1)$ as claimed.

Proof of Theorem 20.7. The assumption $\sigma^2(x) \le \bar\sigma^2$ implies that

$$E\left[ e_K^2 \mid X \right] = E\left[ \left( r_K + e \right)^2 \mid X \right] = r_K^2 + \sigma^2(X) \le r_K^2 + \bar\sigma^2 .$$

Thus (20.46) is bounded by

$$n^{-1}E\left[ X_K'Q_K^{-1}X_K r_K^2 \right] + n^{-1}E\left[ X_K'Q_K^{-1}X_K \right]\bar\sigma^2 \le \frac{\zeta_K^2}{n}E\left[ r_K^2 \right] + n^{-1}E\left[ \operatorname{tr}\left( Q_K^{-1}X_K X_K' \right) \right]\bar\sigma^2 = \frac{\zeta_K^2}{n}\delta_K^2 + n^{-1}\operatorname{tr}\left( I_K \right)\bar\sigma^2 = o\left( \delta_K^2 \right) + \frac{K}{n}\bar\sigma^2$$

where the inequality is Assumption 20.1.2. This implies that (20.45) is $o_p(\delta_K^2) + O_p(K/n)$. Combined with (20.44) we find $\operatorname{ISE}(K) = O_p\left( \delta_K^2 + K/n \right)$ as claimed.

Proof of Theorem 20.8. Using (20.12) and linearity

$$\theta = a(m) = a\left( X_K(x)'\beta_K \right) + a\left( r_K \right) = a_K'\beta_K + a\left( r_K \right) .$$

Thus

$$\sqrt{\frac{n}{V_K}}\left( \widehat\theta_K - \theta + a\left( r_K \right) \right) = \sqrt{\frac{n}{V_K}}\, a_K'\left( \widehat\beta_K - \beta_K \right) = \frac{1}{\sqrt{n V_K}}\, a_K'\widehat Q_K^{-1}X_K'e_K$$
$$= \frac{1}{\sqrt{n V_K}}\, a_K'Q_K^{-1}X_K'e \tag{20.47}$$
$$\; + \frac{1}{\sqrt{n V_K}}\, a_K'\left( \widehat Q_K^{-1} - Q_K^{-1} \right)X_K'e \tag{20.48}$$
$$\; + \frac{1}{\sqrt{n V_K}}\, a_K'\widehat Q_K^{-1}X_K'r_K \tag{20.49}$$

where we have used eK=e+rK. We take the terms in (20.47)-(20.49) separately. We show that (20.47) is asymptotically normal and (20.48)-(20.49) are asymptotically negligible.

First, take (20.47). We can write

$$\frac{1}{\sqrt{n V_K}}\, a_K'Q_K^{-1}X_K'e = \frac{1}{\sqrt n}\sum_{i=1}^{n}\frac{1}{\sqrt{V_K}}\, a_K'Q_K^{-1}X_{Ki}e_i .$$

Observe that the summands $a_K'Q_K^{-1}X_{Ki}e_i/\sqrt{V_K}$ are independent across $i$, mean zero, and have variance 1. We will apply Theorem 6.4, for which it is sufficient to verify Lindeberg's condition: for all $\epsilon > 0$,

$$E\left[ \frac{\left( a_K'Q_K^{-1}X_K e \right)^2}{V_K}\,\mathbf{1}\left\{ \left( a_K'Q_K^{-1}X_K e \right)^2 \ge V_K n \epsilon \right\} \right] \to 0 . \tag{20.51}$$

Pick $\eta > 0$. Set $B$ sufficiently large so that $E\left[ e^2 \mathbf{1}\left\{ e^2 > B \right\} \mid X \right] \le \sigma^2\eta$, which is feasible by Assumption 20.2.1. Pick $n$ sufficiently large so that $\zeta_K^2/n \le \epsilon\sigma^2/B$, which is feasible under Assumption 20.1.2.

By Assumption 20.2.2

$$V_K = E\left[ \left( a_K'Q_K^{-1}X_K \right)^2 e^2 \right] = E\left[ \left( a_K'Q_K^{-1}X_K \right)^2 \sigma^2(X) \right] \ge E\left[ \left( a_K'Q_K^{-1}X_K \right)^2 \right]\sigma^2 = a_K'Q_K^{-1}E\left[ X_K X_K' \right]Q_K^{-1}a_K\,\sigma^2 = a_K'Q_K^{-1}a_K\,\sigma^2 . \tag{20.52}$$

Then by the Schwarz Inequality, (20.17), (20.52), and $\zeta_K^2/n \le \epsilon\sigma^2/B$,

$$\frac{\left( a_K'Q_K^{-1}X_K \right)^2}{V_K} \le \frac{\left( a_K'Q_K^{-1}a_K \right)\left( X_K'Q_K^{-1}X_K \right)}{V_K} \le \frac{\zeta_K^2}{\sigma^2} \le \frac{\epsilon n}{B} .$$

Then the left-side of (20.51) is smaller than

$$E\left[ \frac{\left( a_K'Q_K^{-1}X_K \right)^2}{V_K}\, e^2\,\mathbf{1}\left\{ e^2 \ge B \right\} \right] = E\left[ \frac{\left( a_K'Q_K^{-1}X_K \right)^2}{V_K}\, E\left[ e^2\,\mathbf{1}\left\{ e^2 \ge B \right\} \mid X \right] \right] \le E\left[ \frac{\left( a_K'Q_K^{-1}X_K \right)^2}{V_K} \right]\sigma^2\eta = \frac{a_K'Q_K^{-1}a_K}{V_K}\,\sigma^2\eta \le \eta ,$$

the final inequality by (20.52). Since η is arbitrary this verifies (20.51) and we conclude

$$\frac{1}{\sqrt{n V_K}}\, a_K'Q_K^{-1}X_K'e \overset{d}{\longrightarrow} \mathrm N(0,1) .$$

Second, take (20.48). Assumption 20.2 implies $E[e^2 \mid X] \le \bar\sigma^2 < \infty$. Since $E[e \mid X] = 0$, applying $E[e^2 \mid X] \le \bar\sigma^2$, the Schwarz and Norm Inequalities, (20.52), and Theorems 20.4 and 20.5,

$$\begin{aligned}
E\left[ \left( \frac{1}{\sqrt{n V_K}}\, a_K'\left( \widehat Q_K^{-1} - Q_K^{-1} \right)X_K'e \right)^2 \,\middle|\, X \right]
&= \frac{1}{n V_K}\, a_K'\left( \widehat Q_K^{-1} - Q_K^{-1} \right)X_K'E\left[ ee' \mid X \right]X_K\left( \widehat Q_K^{-1} - Q_K^{-1} \right)a_K \\
&\le \frac{\bar\sigma^2}{V_K}\, a_K'\left( \widehat Q_K^{-1} - Q_K^{-1} \right)\widehat Q_K\left( \widehat Q_K^{-1} - Q_K^{-1} \right)a_K \\
&\le \bar\sigma^2\,\frac{a_K'Q_K^{-1}a_K}{V_K}\left\Vert Q_K^{1/2}\left( \widehat Q_K^{-1} - Q_K^{-1} \right)\widehat Q_K\left( \widehat Q_K^{-1} - Q_K^{-1} \right)Q_K^{1/2} \right\Vert \\
&= \bar\sigma^2\,\frac{a_K'Q_K^{-1}a_K}{V_K}\left\Vert \left( I_K - \widetilde Q_K \right)\left( \widetilde Q_K^{-1} - I_K \right) \right\Vert \\
&\le \frac{\bar\sigma^2}{\sigma^2}\left\Vert I_K - \widetilde Q_K \right\Vert\left\Vert \widetilde Q_K^{-1} - I_K \right\Vert = \frac{\bar\sigma^2}{\sigma^2}\, o_p(1) .
\end{aligned}$$

This establishes that (20.48) is op(1).

Third, take (20.49). By the Cauchy-Schwarz inequality, the Quadratic Inequality, (20.52), and (20.21),

$$\begin{aligned}
\left( \frac{1}{\sqrt{n V_K}}\, a_K'\widehat Q_K^{-1}X_K'r_K \right)^2
&\le \frac{a_K'Q_K^{-1}a_K}{n V_K}\, r_K'X_K\widehat Q_K^{-1}Q_K\widehat Q_K^{-1}X_K'r_K \\
&\le \frac{1}{\sigma^2}\left( \lambda_{\max}\left( \widetilde Q_K^{-1} \right) \right)^2\frac{1}{n}\, r_K'X_K Q_K^{-1}X_K'r_K \\
&\le O_p(1)\,\frac{1}{n}\, r_K'X_K Q_K^{-1}X_K'r_K . 
\end{aligned} \tag{20.54}$$

Observe that because the observations are independent, $E[X_K r_K] = 0$, $X_{Ki}'Q_K^{-1}X_{Ki} \le \zeta_K^2$, and $E[r_K^2] = \delta_K^2$,

$$E\left[ \frac{1}{n}\, r_K'X_K Q_K^{-1}X_K'r_K \right] = E\left[ \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} r_{Ki}X_{Ki}'Q_K^{-1}X_{Kj}r_{Kj} \right] = E\left[ X_K'Q_K^{-1}X_K r_K^2 \right] \le \zeta_K^2 E\left[ r_K^2 \right] = \zeta_K^2\delta_K^2 = o(1)$$

under Assumption 20.2.3. Thus $n^{-1}r_K'X_K Q_K^{-1}X_K'r_K = o_p(1)$, (20.54) is $o_p(1)$, and (20.49) is $o_p(1)$.

Together, we have shown that

$$\sqrt{\frac{n}{V_K}}\left( \widehat\theta_K - \theta + a\left( r_K \right) \right) \overset{d}{\longrightarrow} \mathrm N(0,1)$$

as claimed.

Proof of Theorem 20.10. It is sufficient to show that

$$\sqrt{n}\, V_K(x)^{-1/2}\, r_K(x) = o(1) . \tag{20.55}$$

Notice that by Assumption 20.2.2

$$\begin{aligned}
V_K(x) &= X_K(x)'Q_K^{-1}\Omega_K Q_K^{-1}X_K(x) = E\left[ \left( X_K(x)'Q_K^{-1}X_K \right)^2 e^2 \right] = E\left[ \left( X_K(x)'Q_K^{-1}X_K \right)^2 \sigma^2(X) \right] \\
&\ge E\left[ \left( X_K(x)'Q_K^{-1}X_K \right)^2 \right]\sigma^2 = X_K(x)'Q_K^{-1}E\left[ X_K X_K' \right]Q_K^{-1}X_K(x)\,\sigma^2 = X_K(x)'Q_K^{-1}X_K(x)\,\sigma^2 = \zeta_K(x)^2\sigma^2 .
\end{aligned} \tag{20.56}$$

Using the definitions of $\beta_K^*$, $r_K^*(x)$, and $\delta_K$ from Section 20.8, note that

$$r_K(x) = m(x) - X_K(x)'\beta_K = r_K^*(x) + X_K(x)'\left( \beta_K^* - \beta_K \right) .$$

By the Triangle Inequality, the definition (20.10), the Schwarz Inequality, and definition (20.15),

$$\left| r_K(x) \right| \le \left| r_K^*(x) \right| + \left| X_K(x)'\left( \beta_K^* - \beta_K \right) \right| \le \delta_K + \left| X_K(x)'Q_K^{-1}X_K(x) \right|^{1/2}\left| \left( \beta_K^* - \beta_K \right)'Q_K\left( \beta_K^* - \beta_K \right) \right|^{1/2} = \delta_K + \zeta_K(x)\left| \left( \beta_K^* - \beta_K \right)'Q_K\left( \beta_K^* - \beta_K \right) \right|^{1/2} .$$

The coefficients satisfy the relationship

$$\beta_K = E\left[ X_K X_K' \right]^{-1}E\left[ X_K m(X) \right] = \beta_K^* + E\left[ X_K X_K' \right]^{-1}E\left[ X_K r_K^* \right] .$$

Thus

$$\left( \beta_K - \beta_K^* \right)'Q_K\left( \beta_K - \beta_K^* \right) = E\left[ r_K^* X_K' \right]E\left[ X_K X_K' \right]^{-1}E\left[ X_K r_K^* \right] \le E\left[ r_K^{*2} \right] \le \delta_K^2 .$$

The first inequality is because $E\left[ r_K^* X_K' \right]E\left[ X_K X_K' \right]^{-1}E\left[ X_K r_K^* \right]$ is a projection. The second inequality follows from the definition (20.10). We deduce that

$$\left| r_K(x) \right| \le \left( 1 + \zeta_K(x) \right)\delta_K \le 2\,\zeta_K(x)\,\delta_K . \tag{20.57}$$

Equations (20.56), (20.57), and $n\delta_K^2 = o(1)$ together imply that

$$n\, V_K(x)^{-1} r_K(x)^2 \le 4\,\sigma^{-2}\, n\,\delta_K^2 = o(1)$$

which is (20.55), as required.

20.32 Exercises

Exercise 20.1 Take the estimated model

$$Y = 1 + 2X + 5\left( X - 1 \right)\mathbf{1}\left\{ X \ge 1 \right\} - 3\left( X - 2 \right)\mathbf{1}\left\{ X \ge 2 \right\} + e .$$

What is the estimated marginal effect of X on Y for X=3 ?

Exercise 20.2 Take the linear spline with three knots

$$m_K(x) = \beta_0 + \beta_1 x + \beta_2\left( x - \tau_1 \right)\mathbf{1}\left\{ x \ge \tau_1 \right\} + \beta_3\left( x - \tau_2 \right)\mathbf{1}\left\{ x \ge \tau_2 \right\} + \beta_4\left( x - \tau_3 \right)\mathbf{1}\left\{ x \ge \tau_3 \right\} .$$

Find the inequality restrictions on the coefficients βj so that mK(x) is non-decreasing.

Exercise 20.3 Take the linear spline from the previous question. Find the inequality restrictions on the coefficients βj so that mK(x) is concave.

Exercise 20.4 Take the quadratic spline with three knots

$$m_K(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3\left( x - \tau_1 \right)^2\mathbf{1}\left\{ x \ge \tau_1 \right\} + \beta_4\left( x - \tau_2 \right)^2\mathbf{1}\left\{ x \ge \tau_2 \right\} + \beta_5\left( x - \tau_3 \right)^2\mathbf{1}\left\{ x \ge \tau_3 \right\} .$$

Find the inequality restrictions on the coefficients βj so that mK(x) is concave.

Exercise 20.5 Consider spline estimation with one knot τ. Explain why the knot τ must be within the sample support of X. [Explain what happens if you estimate the regression with the knot placed outside the support of X]

Exercise 20.6 You estimate the polynomial regression model:

$$\widehat m_K(x) = \widehat\beta_0 + \widehat\beta_1 x + \widehat\beta_2 x^2 + \cdots + \widehat\beta_p x^p .$$

You are interested in the regression derivative $m'(x)$ at $x$.

(a) Write out the estimator $\widehat m_K'(x)$ of $m'(x)$.

(b) Is $\widehat m_K'(x)$ a linear function of the coefficient estimates?

(c) Use Theorem 20.8 to obtain the asymptotic distribution of $\widehat m_K'(x)$.

(d) Show how to construct standard errors and confidence intervals for $\widehat m_K'(x)$.

Exercise 20.7 Does rescaling Y or X (multiplying by a constant) affect the CV(K) function? The K which minimizes it?

Exercise 20.8 Take the NPIV approximating equation (20.35) and error $e_K$.

(a) Does it satisfy $E[e_K \mid Z] = 0$?

(b) If $L = K$ can you define $\beta_K$ so that $E[Z_K e_K] = 0$?

(c) If $L > K$ does $E[Z_L e_K] = 0$?

Exercise 20.9 Take the cps09mar dataset (full sample).

(a) Estimate a 6th order polynomial regression of log(wage) on experience. To reduce ill-conditioning, first rescale experience to lie in the interval [0,1] before estimating the regression.

(b) Plot the estimated regression function along with 95% pointwise confidence intervals.

(c) Interpret the findings. How do you interpret the estimated function for experience levels above 65?

Exercise 20.10 Continuing the previous exercise, compute the cross-validation function (or alternatively the AIC) for polynomial orders 1 through 8.

(a) Which order minimizes the function?

(b) Plot the estimated regression function along with 95% pointwise confidence intervals.

Exercise 20.11 Take the cps09mar dataset (full sample).

(a) Estimate a 6th order polynomial regression of log(wage) on education. To reduce ill-conditioning, first rescale education to lie in the interval [0,1].

(b) Plot the estimated regression function along with 95% pointwise confidence intervals.

Exercise 20.12 Continuing the previous exercise, compute the cross-validation function (or alternatively the AIC) for polynomial orders 1 through 8.

(a) Which order minimizes the function?

(b) Plot the estimated regression function along with 95% pointwise confidence intervals.

Exercise 20.13 Take the cps09mar dataset (full sample).

(a) Estimate quadratic spline regressions of log(wage) on experience. Estimate four models: (1) no knots (a quadratic); (2) one knot at 20 years; (3) two knots at 20 and 40; (4) four knots at 10, 20, 30, & 40. Plot the four estimates. Interpret your findings.

(b) Compare the four spline models using either cross-validation or AIC. Which is the preferred specification?

(c) For your selected specification plot the estimated regression function along with 95% pointwise confidence intervals. Interpret your findings.

(d) If you also estimated a polynomial specification, do you prefer the polynomial or the quadratic spline estimates?

Exercise 20.14 Take the cps09mar dataset (full sample).

(a) Estimate quadratic spline regressions of log(wage) on education. Estimate four models: (1) no knots (a quadratic); (2) one knot at 10 years; (3) three knots at 5, 10, and 15; (4) four knots at 4, 8, 12, & 16. Plot the four estimates. Interpret your findings.

(b) Compare the four spline models using either cross-validation or AIC. Which is the preferred specification?

(c) For your selected specification plot the estimated regression function along with 95% pointwise confidence intervals. Interpret your findings.

(d) If you also estimated a polynomial specification, do you prefer the polynomial or the quadratic spline estimates?

Exercise 20.15 The RR2010 dataset is from Reinhart and Rogoff (2010). It contains observations on annual U.S. GDP growth rates, inflation rates, and the debt/gdp ratio for the long time span 1791-2009. The paper made the strong claim that GDP growth slows as debt/gdp increases, and in particular that this relationship is nonlinear, with debt negatively affecting growth for debt ratios exceeding 90%. Their full dataset includes 44 countries; our extract includes only the United States. Let $Y_t$ denote GDP growth and let $D_t$ denote debt/gdp. We will estimate the partially linear specification

$$Y_t = \alpha Y_{t-1} + m\left( D_{t-1} \right) + e_t$$

using a linear spline for m(D).

(a) Estimate (1) a linear model; (2) a linear spline with one knot at $D_{t-1} = 60$; (3) a linear spline with two knots at 40 and 80. Plot the three estimates.

(b) For the model with one knot, plot the estimate with 95% confidence intervals.

(c) Compare the three models using either cross-validation or AIC. Which is the preferred specification?

(d) Interpret the findings.

Exercise 20.16 Take the DDK2011 dataset (full sample). Use a quadratic spline to estimate the regression of testscore on percentile.

(a) Estimate five models: (1) no knots (a quadratic); (2) one knot at 50; (3) two knots at 33 and 66; (4) three knots at 25, 50, & 75; (5) four knots at 20, 40, 60, & 80. Plot the five estimates. Interpret your findings.

(b) Select a model. Consider using leave-one-cluster-out CV.

(c) For your selected specification plot the estimated regression function along with 95% pointwise confidence intervals. [Use cluster-robust standard errors.] Interpret your findings.

Exercise 20.17 The CHJ2004 dataset is from Cox, Hansen and Jimenez (2004). As described in Section 20.6 it contains a sample of 8684 urban Filipino households. This paper studied the crowding-out impact of a family's income on non-governmental transfers. Estimate an analog of Figure 20.2(b) using polynomial regression. Regress transfers on a high-order polynomial in income, and possibly a set of regression controls. Ideally, select the polynomial order by cross-validation. You will need to rescale the variable income before taking polynomial powers. Plot the estimated function along with 95% pointwise confidence intervals. Comment on the similarities and differences with Figure 20.2(b). For the regression controls consider the following options: (a) Include no additional controls; (b) Follow the original paper and Figure 20.2(b) by including the variables 12-26 listed in the data description file; (c) Make a different selection, possibly based on cross-validation.

Exercise 20.18 The AL1999 dataset is from Angrist and Lavy (1999). It contains 4067 observations on classroom test scores and explanatory variables including those described in Section 20.30. In Section 20.30 we report a nonparametric instrumental variables regression of reading test scores (avgverb) on classize, disadvantaged, enrollment, and a dummy for grade=4, using the Angrist-Lavy variable (20.42) as an instrument. Repeat the analysis but instead of reading test scores use math test scores (avgmath) as the dependent variable. Comment on the similarities and differences with the results for reading test scores.