8  Restricted Estimation

8.1 Introduction

In the linear projection model

$$Y = X'\beta + e, \qquad E[Xe] = 0,$$

a common task is to impose a constraint on the coefficient vector β. For example, partitioning X= (X1,X2) and β=(β1,β2) a typical constraint is an exclusion restriction of the form β2=0. In this case the constrained model is

$$Y = X_1'\beta_1 + e, \qquad E[Xe] = 0.$$

At first glance this appears the same as the linear projection model but there is one important difference: the error e is uncorrelated with the entire regressor vector X=(X1,X2) not just the included regressor X1.

In general, a set of q linear constraints on β takes the form

$$R'\beta = c$$

where R is k × q with rank(R) = q < k, and c is q × 1. The assumption that R is full rank means that the constraints are linearly independent (there are no redundant or contradictory constraints). We define the restricted parameter space B as the set of values of β which satisfy (8.1), that is

$$B = \{\beta : R'\beta = c\}.$$

Sometimes we will call (8.1) a constraint and sometimes a restriction. They are the same thing. Similarly sometimes we will call estimators which satisfy (8.1) constrained estimators and sometimes restricted estimators. They mean the same thing.

The constraint β2=0 discussed above is a special case of the constraint (8.1) with

$$R = \begin{pmatrix} 0 \\ I_{k_2} \end{pmatrix},$$

a selector matrix, and c=0. Another common restriction is that a set of coefficients sum to a known constant, i.e. β1+β2=1. For example, this constraint arises in a constant-return-to-scale production function. Other common restrictions include the equality of coefficients β1=β2, and equal and offsetting coefficients β1=−β2.
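To make the notation concrete, the (R, c) pairs corresponding to these examples (assuming k = 2 and a single restriction, q = 1) can be written as

$$\begin{aligned}
\beta_2 = 0 &: \quad R' = \begin{pmatrix} 0 & 1 \end{pmatrix}, \quad c = 0, \\
\beta_1 + \beta_2 = 1 &: \quad R' = \begin{pmatrix} 1 & 1 \end{pmatrix}, \quad c = 1, \\
\beta_1 = \beta_2 &: \quad R' = \begin{pmatrix} 1 & -1 \end{pmatrix}, \quad c = 0, \\
\beta_1 = -\beta_2 &: \quad R' = \begin{pmatrix} 1 & 1 \end{pmatrix}, \quad c = 0.
\end{aligned}$$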

A typical reason to impose a constraint is that we believe (or have information) that the constraint is true. By imposing the constraint we hope to improve estimation efficiency. The goal is to obtain consistent estimates with reduced variance relative to the unconstrained estimator.

The questions then arise: How should we estimate the coefficient vector β imposing the linear restriction (8.1)? If we impose such constraints what is the sampling distribution of the resulting estimator? How should we calculate standard errors? These are the questions explored in this chapter.

8.2 Constrained Least Squares

An intuitively appealing method to estimate a constrained linear projection is to minimize the least squares criterion subject to the constraint R'β = c.

The constrained least squares estimator is

$$\tilde\beta_{\mathrm{cls}} = \underset{R'\beta = c}{\operatorname{argmin}}\; \mathrm{SSE}(\beta)$$

where

$$\mathrm{SSE}(\beta) = \sum_{i=1}^n \left(Y_i - X_i'\beta\right)^2 = Y'Y - 2Y'X\beta + \beta'X'X\beta.$$

The estimator β~cls minimizes the sum of squared errors over all β ∈ B, or equivalently such that the restriction (8.1) holds. We call β~cls the constrained least squares (CLS) estimator. We use the convention of a tilde "~" rather than a hat "^" to indicate that β~cls is a restricted estimator in contrast to the unrestricted least squares estimator β^, and we write it as β~cls to be clear that the estimation method is CLS.

One method to find the solution to (8.3) is the technique of Lagrange multipliers. The problem (8.3) is equivalent to finding the critical points of the Lagrangian

$$\mathcal{L}(\beta,\lambda) = \frac{1}{2}\mathrm{SSE}(\beta) + \lambda'\left(R'\beta - c\right)$$

over (β,λ) where λ is a q×1 vector of Lagrange multipliers. The solution is a saddlepoint. The Lagrangian is minimized over β while maximized over λ. The first-order conditions for the solution of (8.5) are

$$\frac{\partial}{\partial\beta}\mathcal{L}\left(\tilde\beta_{\mathrm{cls}},\tilde\lambda_{\mathrm{cls}}\right) = -X'Y + X'X\tilde\beta_{\mathrm{cls}} + R\tilde\lambda_{\mathrm{cls}} = 0$$

and

$$\frac{\partial}{\partial\lambda}\mathcal{L}\left(\tilde\beta_{\mathrm{cls}},\tilde\lambda_{\mathrm{cls}}\right) = R'\tilde\beta_{\mathrm{cls}} - c = 0.$$

Premultiplying (8.6) by R'(X'X)^{-1} we obtain

$$-R'\hat\beta + R'\tilde\beta_{\mathrm{cls}} + R'(X'X)^{-1}R\,\tilde\lambda_{\mathrm{cls}} = 0$$

where β^ = (X'X)^{-1}X'Y is the unrestricted least squares estimator. Imposing R'β~cls − c = 0 from (8.7) and solving for λ~cls we find

$$\tilde\lambda_{\mathrm{cls}} = \left[R'(X'X)^{-1}R\right]^{-1}\left(R'\hat\beta - c\right).$$

Notice that (X'X)^{-1} > 0 and R full rank imply that R'(X'X)^{-1}R > 0 and is hence invertible. (See Section A.10.) Substituting this expression into (8.6) and solving for β~cls we find the solution to the constrained minimization problem (8.3)

$$\tilde\beta_{\mathrm{cls}} = \hat\beta_{\mathrm{ols}} - (X'X)^{-1}R\left[R'(X'X)^{-1}R\right]^{-1}\left(R'\hat\beta_{\mathrm{ols}} - c\right).$$

(See Exercise 8.5 to verify that (8.8) satisfies (8.1).)

This is a general formula for the CLS estimator. It also can be written as

$$\tilde\beta_{\mathrm{cls}} = \hat\beta_{\mathrm{ols}} - \hat{Q}_{XX}^{-1}R\left(R'\hat{Q}_{XX}^{-1}R\right)^{-1}\left(R'\hat\beta_{\mathrm{ols}} - c\right).$$

The CLS residuals are e~i = Yi − Xi'β~cls and are written in vector notation as e~.

To illustrate we generated a random sample of 100 observations for the variables (Y,X1,X2) and calculated the sum of squared errors function for the regression of Y on X1 and X2. Figure 8.1 displays contour plots of the sum of squared errors function. The center of the contour plots is the least squares minimizer β^ols =(0.33,0.26). Suppose it is desired to estimate the coefficients subject to the constraint β1+β2=1. This constraint is displayed in the figure by the straight line. The constrained least squares estimator is the point on this straight line which yields the smallest sum of squared errors. This is the point which intersects with the lowest contour plot. The solution is the point where a contour plot is tangent to the constraint line and is marked as β~cls=(0.52,0.48).

Figure 8.1: Constrained Least Squares Criterion

In Stata constrained least squares is implemented using the cnsreg command.
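As a complement, here is a minimal R sketch of the CLS formula (8.8); the function name cls_fit and the simulated data are illustrative and not part of the original text.

```r
# Constrained least squares via formula (8.8): a minimal sketch (illustrative, not Hansen's code).
# X: n x k regressor matrix, y: n-vector, R: k x q constraint matrix, c: q-vector, imposing R'beta = c.
cls_fit <- function(y, X, R, c) {
  XX_inv   <- solve(crossprod(X))          # (X'X)^{-1}
  beta_ols <- XX_inv %*% crossprod(X, y)   # unrestricted OLS
  A        <- XX_inv %*% R %*% solve(t(R) %*% XX_inv %*% R)
  beta_cls <- beta_ols - A %*% (t(R) %*% beta_ols - c)
  drop(beta_cls)
}

# Example: impose that the two slope coefficients sum to one.
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 0.6 * x1 + 0.4 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                     # columns: intercept, x1, x2
R  <- matrix(c(0, 1, 1), ncol = 1)         # R'beta = slope on x1 + slope on x2
cls_fit(y, X, R, c = 1)
```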

8.3 Exclusion Restriction

While (8.8) is a general formula for CLS, in most cases the estimator can be found by applying least squares to a reparameterized equation. To illustrate let us return to the first example presented at the beginning of the chapter - a simple exclusion restriction. Recall that the unconstrained model is

$$Y = X_1'\beta_1 + X_2'\beta_2 + e,$$

the exclusion restriction is β2=0, and the constrained equation is

$$Y = X_1'\beta_1 + e.$$

In this setting the CLS estimator is OLS of Y on X1. (See Exercise 8.1.) We can write this as

$$\tilde\beta_1 = \left(\sum_{i=1}^n X_{1i}X_{1i}'\right)^{-1}\left(\sum_{i=1}^n X_{1i}Y_i\right).$$

The CLS estimator of the entire vector β=(β1,β2) is

$$\tilde\beta = \begin{pmatrix} \tilde\beta_1 \\ 0 \end{pmatrix}.$$

It is not immediately obvious but (8.8) and (8.13) are algebraically identical. To see this the first component of (8.8) with (8.2) is

$$\tilde\beta_1 = \begin{pmatrix} I_{k_1} & 0 \end{pmatrix}\left[\hat\beta - \hat{Q}_{XX}^{-1}\begin{pmatrix} 0 \\ I_{k_2} \end{pmatrix}\left[\begin{pmatrix} 0 & I_{k_2} \end{pmatrix}\hat{Q}_{XX}^{-1}\begin{pmatrix} 0 \\ I_{k_2} \end{pmatrix}\right]^{-1}\begin{pmatrix} 0 & I_{k_2} \end{pmatrix}\hat\beta\right].$$

Using (3.39) this equals

$$\begin{aligned}
\tilde\beta_1 &= \hat\beta_1 - \hat{Q}^{12}\left(\hat{Q}^{22}\right)^{-1}\hat\beta_2 \\
&= \hat\beta_1 + \hat{Q}_{11\cdot 2}^{-1}\hat{Q}_{12}\hat{Q}_{22}^{-1}\hat{Q}_{22\cdot 1}\hat\beta_2 \\
&= \hat{Q}_{11\cdot 2}^{-1}\left(\hat{Q}_{1Y} - \hat{Q}_{12}\hat{Q}_{22}^{-1}\hat{Q}_{2Y}\right) + \hat{Q}_{11\cdot 2}^{-1}\hat{Q}_{12}\hat{Q}_{22}^{-1}\hat{Q}_{22\cdot 1}\hat{Q}_{22\cdot 1}^{-1}\left(\hat{Q}_{2Y} - \hat{Q}_{21}\hat{Q}_{11}^{-1}\hat{Q}_{1Y}\right) \\
&= \hat{Q}_{11\cdot 2}^{-1}\left(\hat{Q}_{1Y} - \hat{Q}_{12}\hat{Q}_{22}^{-1}\hat{Q}_{21}\hat{Q}_{11}^{-1}\hat{Q}_{1Y}\right) \\
&= \hat{Q}_{11\cdot 2}^{-1}\left(\hat{Q}_{11} - \hat{Q}_{12}\hat{Q}_{22}^{-1}\hat{Q}_{21}\right)\hat{Q}_{11}^{-1}\hat{Q}_{1Y} \\
&= \hat{Q}_{11}^{-1}\hat{Q}_{1Y}
\end{aligned}$$

which is (8.13) as originally claimed.
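This equivalence is easy to check numerically; the following R sketch (simulated data, illustrative only) computes (8.8) with the selector matrix (8.2) and compares it with least squares of Y on X1.

```r
# Numerical check that the general CLS formula (8.8) with the selector matrix (8.2)
# reproduces OLS of Y on X1 alone (simulated data; illustrative only).
set.seed(42)
n  <- 200
X1 <- cbind(1, rnorm(n))          # k1 = 2 included regressors
X2 <- matrix(rnorm(n * 2), n, 2)  # k2 = 2 excluded regressors
y  <- X1 %*% c(1, 0.5) + rnorm(n)

X  <- cbind(X1, X2)
k1 <- ncol(X1); k2 <- ncol(X2)
R  <- rbind(matrix(0, k1, k2), diag(k2))   # selector: R'beta = beta2

XX_inv   <- solve(crossprod(X))
beta_ols <- XX_inv %*% crossprod(X, y)
beta_cls <- beta_ols - XX_inv %*% R %*%
  solve(t(R) %*% XX_inv %*% R, t(R) %*% beta_ols)              # c = 0

cbind(cls = beta_cls[1:k1], short_ols = coef(lm(y ~ X1 - 1)))  # identical columns
```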

8.4 Finite Sample Properties

In this section we explore some of the properties of the CLS estimator in the linear regression model

$$Y = X'\beta + e, \qquad E[e \mid X] = 0.$$

First, it is useful to write the estimator and the residuals as linear functions of the error vector. These are algebraic relationships and do not rely on the linear regression assumptions.

Theorem 8.1 The CLS estimator satisfies

  1. R'β^ − c = R'(X'X)^{-1}X'e

  2. β~cls − β = ((X'X)^{-1}X' − AX')e

  3. e~ = (I_n − P + XAX')e

  4. I_n − P + XAX' is symmetric and idempotent

  5. tr(I_n − P + XAX') = n − k + q

where P = X(X'X)^{-1}X' and A = (X'X)^{-1}R[R'(X'X)^{-1}R]^{-1}R'(X'X)^{-1}.

For a proof see Exercise 8.6.

Given the linearity of Theorem 8.1.2 it is not hard to show that the CLS estimator is unbiased for β.

Theorem 8.2 In the linear regression model (8.14)-(8.15) under (8.1), E[β~cls | X] = β.

For a proof see Exercise 8.7.

We can also calculate the covariance matrix of β~cls . First, for simplicity take the case of conditional homoskedasticity.

Theorem 8.3 In the homoskedastic linear regression model (8.14)-(8.15) with E[e² | X] = σ², under (8.1),

$$V^0_{\tilde\beta} = \operatorname{var}\left[\tilde\beta_{\mathrm{cls}} \mid X\right] = \left((X'X)^{-1} - (X'X)^{-1}R\left[R'(X'X)^{-1}R\right]^{-1}R'(X'X)^{-1}\right)\sigma^2.$$

For a proof see Exercise 8.8.

We use the Vβ~0 notation to emphasize that this is the covariance matrix under the assumption of conditional homoskedasticity.

For inference we need an estimate of Vβ~0. A natural estimator is

$$\hat{V}^0_{\tilde\beta} = \left((X'X)^{-1} - (X'X)^{-1}R\left[R'(X'X)^{-1}R\right]^{-1}R'(X'X)^{-1}\right)s^2_{\mathrm{cls}}$$

where

$$s^2_{\mathrm{cls}} = \frac{1}{n-k+q}\sum_{i=1}^n \tilde{e}_i^2$$

is a bias-corrected estimator of σ². Standard errors for the components of β are then found by taking the square roots of the diagonal elements of V^β~0, for example

$$s(\tilde\beta_j) = \sqrt{\left[\hat{V}^0_{\tilde\beta}\right]_{jj}}.$$

The estimator (8.16) has the property that it is unbiased for σ2 under conditional homoskedasticity. To see this, using the properties of Theorem 8.1,

$$(n-k+q)\,s^2_{\mathrm{cls}} = \tilde{e}'\tilde{e} = e'(I_n - P + XAX')(I_n - P + XAX')e = e'(I_n - P + XAX')e.$$

We defer the remainder of the proof to Exercise 8.9.

Theorem 8.4 In the homoskedastic linear regression model (8.14)-(8.15) with E[e² | X] = σ², under (8.1), E[s²cls | X] = σ² and E[V^β~0 | X] = V⁰β~.
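For reference, here is a minimal R sketch of the homoskedastic variance estimator and standard errors described above (illustrative; the function name and interface are assumptions, not the textbook's code).

```r
# Homoskedastic CLS covariance matrix and standard errors: a minimal sketch.
# y: n-vector, X: n x k regressor matrix, R: k x q constraint matrix, c: q-vector.
cls_vcov0 <- function(y, X, R, c) {
  n <- nrow(X); k <- ncol(X); q <- ncol(R)
  XX_inv   <- solve(crossprod(X))
  beta_ols <- XX_inv %*% crossprod(X, y)
  A        <- XX_inv %*% R %*% solve(t(R) %*% XX_inv %*% R)
  beta_cls <- beta_ols - A %*% (t(R) %*% beta_ols - c)
  e_tilde  <- y - X %*% beta_cls
  s2_cls   <- sum(e_tilde^2) / (n - k + q)               # bias-corrected error variance
  V0       <- (XX_inv - A %*% t(R) %*% XX_inv) * s2_cls  # Theorem 8.3 with sigma^2 replaced by s2_cls
  list(beta = drop(beta_cls), V0 = V0, se = sqrt(diag(V0)))
}
```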

Now consider the distributional properties in the normal regression model Y = X'β + e with e | X ~ N(0, σ²). By the linearity of Theorem 8.1.2, conditional on X, β~cls − β is normal. Given Theorems 8.2 and 8.3 we deduce that β~cls ~ N(β, V⁰β~).

Similarly, from Theorem 8.1.3 we know e~ = (I_n − P + XAX')e is linear in e so is also conditionally normal. Furthermore, since (I_n − P + XAX')(X(X'X)^{-1} − XA) = 0, e~ and β~cls are uncorrelated and thus independent. Thus s²cls and β~cls are independent.

From (8.17) and the fact that I_n − P + XAX' is idempotent with rank n − k + q it follows that

$$s^2_{\mathrm{cls}} \sim \sigma^2\,\chi^2_{n-k+q}/(n-k+q).$$

It follows that the t-statistic has the exact distribution

$$T = \frac{\tilde\beta_j - \beta_j}{s(\tilde\beta_j)} \sim \frac{\mathrm{N}(0,1)}{\sqrt{\chi^2_{n-k+q}/(n-k+q)}} \sim t_{n-k+q},$$

a student t distribution with n − k + q degrees of freedom.

The relevance of this calculation is that the "degrees of freedom" for CLS regression equal n − k + q rather than n − k as in OLS. Essentially the model has k − q free parameters instead of k. Another way of thinking about this is that estimation of a model with k coefficients and q restrictions is equivalent to estimation with k − q coefficients.

We summarize the properties of the normal regression model.

Theorem 8.5 In the normal linear regression model (8.14)-(8.15) with constraint (8.1),

$$\tilde\beta_{\mathrm{cls}} \sim \mathrm{N}\left(\beta, V^0_{\tilde\beta}\right), \qquad \frac{(n-k+q)\,s^2_{\mathrm{cls}}}{\sigma^2} \sim \chi^2_{n-k+q}, \qquad T \sim t_{n-k+q}.$$

An interesting relationship is that in the homoskedastic regression model

$$\begin{aligned}
\operatorname{cov}\left(\hat\beta_{\mathrm{ols}} - \tilde\beta_{\mathrm{cls}},\, \tilde\beta_{\mathrm{cls}} \mid X\right) &= E\left[\left(\hat\beta_{\mathrm{ols}} - \tilde\beta_{\mathrm{cls}}\right)\left(\tilde\beta_{\mathrm{cls}} - \beta\right)' \mid X\right] \\
&= E\left[AX'ee'\left(X(X'X)^{-1} - XA\right) \mid X\right] \\
&= AX'\left(X(X'X)^{-1} - XA\right)\sigma^2 \\
&= 0.
\end{aligned}$$

This means that β^ols − β~cls and β~cls are conditionally uncorrelated and hence independent. A corollary is

$$\operatorname{cov}\left(\hat\beta_{\mathrm{ols}},\, \tilde\beta_{\mathrm{cls}} \mid X\right) = \operatorname{var}\left[\tilde\beta_{\mathrm{cls}} \mid X\right].$$

A second corollary is

$$\operatorname{var}\left[\hat\beta_{\mathrm{ols}} - \tilde\beta_{\mathrm{cls}} \mid X\right] = \operatorname{var}\left[\hat\beta_{\mathrm{ols}} \mid X\right] - \operatorname{var}\left[\tilde\beta_{\mathrm{cls}} \mid X\right] = (X'X)^{-1}R\left[R'(X'X)^{-1}R\right]^{-1}R'(X'X)^{-1}\sigma^2.$$

This also shows that the difference between the CLS and OLS variances matrices equals

$$\operatorname{var}\left[\hat\beta_{\mathrm{ols}} \mid X\right] - \operatorname{var}\left[\tilde\beta_{\mathrm{cls}} \mid X\right] = (X'X)^{-1}R\left[R'(X'X)^{-1}R\right]^{-1}R'(X'X)^{-1}\sigma^2 \geq 0,$$

the final inequality meaning positive semi-definite. It follows that var[β^ols | X] ≥ var[β~cls | X] in the positive semi-definite sense, and thus CLS is more efficient than OLS. Both estimators are unbiased (in the linear regression model) and CLS has a smaller covariance matrix (in the linear homoskedastic regression model).

The relationship (8.18) is rather interesting and will appear again. The expression says that the variance of the difference between the estimators is equal to the difference between the variances. This is rather special. It occurs generically when we are comparing an efficient and an inefficient estimator. We call (8.18) the Hausman Equality as it was first pointed out in econometrics by Hausman (1978).

8.5 Minimum Distance

The previous section explored the finite sample distribution theory under the assumptions of the linear regression model, homoskedastic regression model, and normal regression model. We now return to the general projection model where we do not impose linearity, homoskedasticity, nor normality. We are interested in the question: Can we do better than CLS in this setting?

A minimum distance estimator tries to find a parameter value satisfying the constraint which is as close as possible to the unconstrained estimator. Let β^ be the unconstrained least squares estimator, and for some k×k positive definite weight matrix W^ define the quadratic criterion function

$$J(\beta) = n\left(\hat\beta - \beta\right)'\hat{W}\left(\hat\beta - \beta\right).$$

This is a (squared) weighted Euclidean distance between β^ and β. The criterion J(β) is small if β is close to β^, and is minimized at zero only if β = β^. A minimum distance estimator β~md for β minimizes J(β) subject to the constraint (8.1), that is,

$$\tilde\beta_{\mathrm{md}} = \underset{R'\beta = c}{\operatorname{argmin}}\; J(\beta).$$

The CLS estimator is the special case when W^=Q^XX and we write this criterion function as

$$J^0(\beta) = n\left(\hat\beta - \beta\right)'\hat{Q}_{XX}\left(\hat\beta - \beta\right).$$

To see the equality of CLS and minimum distance, rewrite the least squares criterion as follows. Substitute the unconstrained least squares fitted equation Yi = Xi'β^ + e^i into SSE(β) to obtain

$$\begin{aligned}
\mathrm{SSE}(\beta) &= \sum_{i=1}^n\left(Y_i - X_i'\beta\right)^2 \\
&= \sum_{i=1}^n\left(X_i'\hat\beta + \hat{e}_i - X_i'\beta\right)^2 \\
&= \sum_{i=1}^n \hat{e}_i^2 + \left(\hat\beta - \beta\right)'\left(\sum_{i=1}^n X_iX_i'\right)\left(\hat\beta - \beta\right) \\
&= n\hat\sigma^2 + J^0(\beta)
\end{aligned}$$

where the third equality uses the fact that $\sum_{i=1}^n X_i\hat{e}_i = 0$, and the last line uses $\sum_{i=1}^n X_iX_i' = n\hat{Q}_{XX}$. The expression (8.21) only depends on β through J⁰(β). Thus minimization of SSE(β) and J⁰(β) are equivalent, and hence β~md = β~cls when W^ = Q^XX.

We can solve for β~md explicitly by the method of Lagrange multipliers. The Lagrangian is

$$\mathcal{L}(\beta,\lambda) = \frac{1}{2}J\left(\beta,\hat{W}\right) + \lambda'\left(R'\beta - c\right).$$

The solution to the pair of first order conditions is

$$\begin{aligned}
\tilde\lambda_{\mathrm{md}} &= n\left(R'\hat{W}^{-1}R\right)^{-1}\left(R'\hat\beta - c\right) \\
\tilde\beta_{\mathrm{md}} &= \hat\beta - \hat{W}^{-1}R\left(R'\hat{W}^{-1}R\right)^{-1}\left(R'\hat\beta - c\right).
\end{aligned}$$

(See Exercise 8.10.) Comparing (8.23) with (8.9) we can see that β~md specializes to β~cls when we set W^ = Q^XX.

An obvious question is which weight matrix W^ is best. We will address this question after we derive the asymptotic distribution for a general weight matrix.
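For later reference, the minimum distance solution (8.23) is a single line of matrix algebra; the following R sketch (function name illustrative) computes β~md for a given weight matrix.

```r
# Minimum distance estimator (8.23) for a general weight matrix W: a minimal sketch.
# beta_hat: unrestricted OLS estimate (k-vector), W: k x k positive definite weight matrix,
# R: k x q constraint matrix, c: q-vector, imposing R'beta = c.
md_fit <- function(beta_hat, W, R, c) {
  W_inv <- solve(W)
  drop(beta_hat - W_inv %*% R %*% solve(t(R) %*% W_inv %*% R, t(R) %*% beta_hat - c))
}
# Setting W = crossprod(X) / n (that is, Q_hat_XX) reproduces the CLS estimator (8.9).
```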

8.6 Asymptotic Distribution

We first show that the class of minimum distance estimators are consistent for the population parameters when the constraints are valid.

Assumption 8.1 R'β = c where R is k×q with rank(R) = q.

Assumption 8.2 W^ →p W > 0.

Theorem 8.6 Consistency Under Assumptions 7.1, 8.1, and 8.2, β~md →p β as n → ∞.

For a proof see Exercise 8.11.

Theorem 8.6 shows that consistency holds for any weight matrix with a positive definite limit so includes the CLS estimator.

Similarly, the constrained estimators are asymptotically normally distributed.

Theorem 8.7 Asymptotic Normality Under Assumptions 7.2, 8.1, and 8.2,

$$\sqrt{n}\left(\tilde\beta_{\mathrm{md}} - \beta\right) \xrightarrow{d} \mathrm{N}\left(0, V_\beta(W)\right)$$

as n → ∞, where

$$\begin{aligned}
V_\beta(W) = V_\beta &- W^{-1}R\left(R'W^{-1}R\right)^{-1}R'V_\beta - V_\beta R\left(R'W^{-1}R\right)^{-1}R'W^{-1} \\
&+ W^{-1}R\left(R'W^{-1}R\right)^{-1}R'V_\beta R\left(R'W^{-1}R\right)^{-1}R'W^{-1}
\end{aligned}$$

and $V_\beta = Q_{XX}^{-1}\Omega Q_{XX}^{-1}$.

For a proof see Exercise 8.12.

Theorem 8.7 shows that the minimum distance estimator is asymptotically normal for all positive definite weight matrices. The asymptotic variance depends on W. The theorem includes the CLS estimator as a special case by setting W=QXX.

Theorem 8.8 Asymptotic Distribution of CLS Estimator Under Assumptions 7.2 and 8.1, as n → ∞,

$$\sqrt{n}\left(\tilde\beta_{\mathrm{cls}} - \beta\right) \xrightarrow{d} \mathrm{N}\left(0, V_{\mathrm{cls}}\right)$$

where

$$\begin{aligned}
V_{\mathrm{cls}} = V_\beta &- Q_{XX}^{-1}R\left(R'Q_{XX}^{-1}R\right)^{-1}R'V_\beta - V_\beta R\left(R'Q_{XX}^{-1}R\right)^{-1}R'Q_{XX}^{-1} \\
&+ Q_{XX}^{-1}R\left(R'Q_{XX}^{-1}R\right)^{-1}R'V_\beta R\left(R'Q_{XX}^{-1}R\right)^{-1}R'Q_{XX}^{-1}
\end{aligned}$$

For a proof see Exercise 8.13.

8.7 Variance Estimation and Standard Errors

Earlier we introduced the covariance matrix estimator under the assumption of conditional homoskedasticity. We now introduce an estimator which does not impose homoskedasticity.

The asymptotic covariance matrix Vcls may be estimated by replacing Vβ with a consistent estimator such as V^β. A more efficient estimator can be obtained by using the restricted coefficient estimator, which we now show. Given the constrained least squares residuals e~i = Yi − Xi'β~cls we can estimate the matrix Ω = E[XX'e²] by

$$\tilde\Omega = \frac{1}{n-k+q}\sum_{i=1}^n X_iX_i'\tilde{e}_i^2.$$

Notice that we have used an adjusted degrees of freedom. This is an ad hoc adjustment designed to mimic that used for estimation of the error variance σ2. The moment estimator of Vβ is

$$\tilde{V}_\beta = \hat{Q}_{XX}^{-1}\tilde\Omega\hat{Q}_{XX}^{-1}$$

and that for Vcls is

$$\begin{aligned}
\tilde{V}_{\mathrm{cls}} = \tilde{V}_\beta &- \hat{Q}_{XX}^{-1}R\left(R'\hat{Q}_{XX}^{-1}R\right)^{-1}R'\tilde{V}_\beta - \tilde{V}_\beta R\left(R'\hat{Q}_{XX}^{-1}R\right)^{-1}R'\hat{Q}_{XX}^{-1} \\
&+ \hat{Q}_{XX}^{-1}R\left(R'\hat{Q}_{XX}^{-1}R\right)^{-1}R'\tilde{V}_\beta R\left(R'\hat{Q}_{XX}^{-1}R\right)^{-1}R'\hat{Q}_{XX}^{-1}.
\end{aligned}$$

We can calculate standard errors for any linear combination h'β~cls such that h does not lie in the range space of R. A standard error for h'β~cls is

$$s\left(h'\tilde\beta_{\mathrm{cls}}\right) = \left(n^{-1}h'\tilde{V}_{\mathrm{cls}}h\right)^{1/2}.$$
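A minimal R sketch of the robust variance matrix and standard error formulas above (illustrative, not the textbook's code):

```r
# Heteroskedasticity-robust covariance matrix for CLS and a standard error for h'beta_cls.
cls_vcov_robust <- function(y, X, R, c, h) {
  n <- nrow(X); k <- ncol(X); q <- ncol(R)
  Qxx_inv  <- solve(crossprod(X) / n)
  beta_ols <- Qxx_inv %*% crossprod(X, y) / n
  G        <- Qxx_inv %*% R %*% solve(t(R) %*% Qxx_inv %*% R)   # Qxx^{-1} R (R' Qxx^{-1} R)^{-1}
  beta_cls <- beta_ols - G %*% (t(R) %*% beta_ols - c)
  e_tilde  <- drop(y - X %*% beta_cls)
  Omega    <- crossprod(X * e_tilde) / (n - k + q)               # adjusted degrees of freedom
  V_beta   <- Qxx_inv %*% Omega %*% Qxx_inv
  V_cls    <- V_beta - G %*% t(R) %*% V_beta - V_beta %*% R %*% t(G) +
              G %*% t(R) %*% V_beta %*% R %*% t(G)
  list(beta = drop(beta_cls), V_cls = V_cls,
       se_h = sqrt(drop(t(h) %*% V_cls %*% h) / n))              # standard error for h'beta_cls
}
```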

8.8 Efficient Minimum Distance Estimator

Theorem 8.7 shows that minimum distance estimators, which include CLS as a special case, are asymptotically normal with an asymptotic covariance matrix which depends on the weight matrix W. The asymptotically optimal weight matrix is the one which minimizes the asymptotic variance Vβ(W). This turns out to be W = Vβ^{-1}, as is shown in Theorem 8.9 below. Since Vβ is unknown this weight matrix cannot be used for a feasible estimator, but we can replace it with a consistent estimator V^β and the asymptotic distribution (and efficiency) are unchanged. We call the minimum distance estimator with W^ = (V^β)^{-1} the efficient minimum distance estimator. It takes the form

$$\tilde\beta_{\mathrm{emd}} = \hat\beta - \hat{V}_\beta R\left(R'\hat{V}_\beta R\right)^{-1}\left(R'\hat\beta - c\right).$$

The asymptotic distribution of (8.25) can be deduced from Theorem 8.7. (See Exercises 8.14 and 8.15, and the proof in Section 8.16.)

Theorem 8.9 Efficient Minimum Distance Estimator Under Assumptions 7.2 and 8.1,

$$\sqrt{n}\left(\tilde\beta_{\mathrm{emd}} - \beta\right) \xrightarrow{d} \mathrm{N}\left(0, V_{\beta,\mathrm{emd}}\right)$$

as n → ∞, where

$$V_{\beta,\mathrm{emd}} = V_\beta - V_\beta R\left(R'V_\beta R\right)^{-1}R'V_\beta.$$

Since

$$V_{\beta,\mathrm{emd}} \leq V_\beta,$$

the estimator (8.25) has lower asymptotic variance than the unrestricted estimator. Furthermore, for any W,

$$V_{\beta,\mathrm{emd}} \leq V_\beta(W),$$

so (8.25) is asymptotically efficient in the class of minimum distance estimators.

Theorem 8.9 shows that the minimum distance estimator with the smallest asymptotic variance is (8.25). One implication is that the constrained least squares estimator is generally inefficient. The interesting exception is the case of conditional homoskedasticity in which case the optimal weight matrix is W=(Vβ0)1 so in this case CLS is an efficient minimum distance estimator. Otherwise when the error is conditionally heteroskedastic there are asymptotic efficiency gains by using minimum distance rather than least squares.

The fact that CLS is generally inefficient is counter-intuitive and requires some reflection. Standard intuition suggests to apply the same estimation method (least squares) to the unconstrained and constrained models and this is the common empirical practice. But Theorem 8.9 shows that this is inefficient. Why? The reason is that the least squares estimator does not make use of the regressor X2. It ignores the information E[X2e]=0. This information is relevant when the error is heteroskedastic and the excluded regressors are correlated with the included regressors.

Inequality (8.27) shows that the efficient minimum distance estimator β~emd  has a smaller asymptotic variance than the unrestricted least squares estimator β^. This means that efficient estimation is attained by imposing correct restrictions when we use the minimum distance method.

8.9 Exclusion Restriction Revisited

We return to the example of estimation with a simple exclusion restriction. The model is

$$Y = X_1'\beta_1 + X_2'\beta_2 + e$$

with the exclusion restriction β2=0. We have introduced three estimators of β1. The first is unconstrained least squares applied to (8.10), which can be written as $\hat\beta_1 = \hat{Q}_{11\cdot 2}^{-1}\hat{Q}_{1Y\cdot 2}$. From Theorem 7.25 and equation (7.14) its asymptotic variance is

$$\operatorname{avar}\left[\hat\beta_1\right] = Q_{11\cdot 2}^{-1}\left(\Omega_{11} - Q_{12}Q_{22}^{-1}\Omega_{21} - \Omega_{12}Q_{22}^{-1}Q_{21} + Q_{12}Q_{22}^{-1}\Omega_{22}Q_{22}^{-1}Q_{21}\right)Q_{11\cdot 2}^{-1}.$$

The second estimator of β1 is CLS, which can be written as $\tilde\beta_1 = \hat{Q}_{11}^{-1}\hat{Q}_{1Y}$. Its asymptotic variance can be deduced from Theorem 8.8, but it is simpler to apply the CLT directly to show that

$$\operatorname{avar}\left[\tilde\beta_1\right] = Q_{11}^{-1}\Omega_{11}Q_{11}^{-1}.$$

The third estimator of β1 is efficient minimum distance. Applying (8.25), it equals

$$\bar\beta_1 = \hat\beta_1 - \hat{V}_{12}\hat{V}_{22}^{-1}\hat\beta_2$$

where we have partitioned

$$\hat{V}_\beta = \begin{bmatrix} \hat{V}_{11} & \hat{V}_{12} \\ \hat{V}_{21} & \hat{V}_{22} \end{bmatrix}.$$

From Theorem 8.9 its asymptotic variance is

$$\operatorname{avar}\left[\bar\beta_1\right] = V_{11} - V_{12}V_{22}^{-1}V_{21}.$$

See Exercise 8.16 to verify equations (8.29), (8.30), and (8.31).

In general the three estimators are different and they have different asymptotic variances. It is instructive to compare the variances to assess whether or not the constrained estimator is more efficient than the unconstrained estimator.

First, assume conditional homoskedasticity. In this case the two covariance matrices simplify to $\operatorname{avar}[\hat\beta_1] = \sigma^2 Q_{11\cdot 2}^{-1}$ and $\operatorname{avar}[\tilde\beta_1] = \sigma^2 Q_{11}^{-1}$. If Q12 = 0 (so X1 and X2 are uncorrelated) then these two variance matrices are equal and the two estimators have equal asymptotic efficiency. Otherwise, since $Q_{12}Q_{22}^{-1}Q_{21} \geq 0$, then $Q_{11} \geq Q_{11} - Q_{12}Q_{22}^{-1}Q_{21}$ and consequently

$$Q_{11}^{-1}\sigma^2 \leq \left(Q_{11} - Q_{12}Q_{22}^{-1}Q_{21}\right)^{-1}\sigma^2.$$

This means that under conditional homoskedasticity β~1 has a lower asymptotic covariance matrix than β^1. Therefore in this context constrained least squares is more efficient than unconstrained least squares. This is consistent with our intuition that imposing a correct restriction (excluding an irrelevant regressor) improves estimation efficiency.

However, in the general case of conditional heteroskedasticity this ranking is not guaranteed. In fact what is really amazing is that the variance ranking can be reversed. The CLS estimator can have a larger asymptotic variance than the unconstrained least squares estimator.

To see this let’s use the simple heteroskedastic example from Section 7.4. In that example, Q11 = Q22 = 1, Q12 = 1/2, Ω11 = Ω22 = 1, and Ω12 = 7/8. We can calculate (see Exercise 8.17) that $Q_{11\cdot 2} = \tfrac{3}{4}$ and

$$\operatorname{avar}\left[\hat\beta_1\right] = \tfrac{2}{3}, \qquad \operatorname{avar}\left[\tilde\beta_1\right] = 1, \qquad \operatorname{avar}\left[\bar\beta_1\right] = \tfrac{5}{8}.$$

Thus the CLS estimator β~1 has a larger variance than the unrestricted least squares estimator β^1 ! The minimum distance estimator has the smallest variance of the three, as expected.
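These values can be verified directly from (8.29)-(8.31); here is a short R check (illustrative):

```r
# Check of the variance calculations in the example above (Section 7.4 values).
Q  <- matrix(c(1, 1/2, 1/2, 1), 2, 2)       # Q_XX
Om <- matrix(c(1, 7/8, 7/8, 1), 2, 2)       # Omega = E[XX'e^2]
V  <- solve(Q) %*% Om %*% solve(Q)          # V_beta = Q^{-1} Omega Q^{-1}

avar_ols <- V[1, 1]                         # unconstrained OLS, (8.29): 2/3
avar_cls <- Om[1, 1] / Q[1, 1]^2            # CLS, (8.30): 1
avar_emd <- V[1, 1] - V[1, 2]^2 / V[2, 2]   # efficient minimum distance, (8.31): 5/8
c(avar_ols, avar_cls, avar_emd)
```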

What we have found is that when the estimation method is least squares, deleting the irrelevant variable X2 can actually increase estimation variance, or equivalently, adding an irrelevant variable can decrease the estimation variance. To repeat this unexpected finding, we have shown that it is possible for least squares applied to the short regression (8.11) to be less efficient for estimation of β1 than least squares applied to the long regression (8.10) even though the constraint β2=0 is valid! This result is strongly counter-intuitive. It seems to contradict our initial motivation for pursuing constrained estimation - to improve estimation efficiency.

It turns out that a more refined answer is appropriate. Constrained estimation is desirable but not necessarily CLS. While least squares is asymptotically efficient for estimation of the unconstrained projection model it is not an efficient estimator of the constrained projection model.

8.10 Variance and Standard Error Estimation

We have discussed covariance matrix estimation for CLS but not yet for the EMD estimator.

The asymptotic covariance matrix (8.26) may be estimated by replacing Vβ with a consistent estimator. It is best to construct the variance estimate using β~emd. The EMD residuals are e~i = Yi − Xi'β~emd. Using these we can estimate the matrix Ω = E[XX'e²] by

$$\tilde\Omega = \frac{1}{n-k+q}\sum_{i=1}^n X_iX_i'\tilde{e}_i^2.$$

Following the formula for CLS we recommend an adjusted degrees of freedom. Given Ω~ the moment estimator of Vβ is $\tilde{V}_\beta = \hat{Q}_{XX}^{-1}\tilde\Omega\hat{Q}_{XX}^{-1}$. Given this, we construct the variance estimator

$$\tilde{V}_{\beta,\mathrm{emd}} = \tilde{V}_\beta - \tilde{V}_\beta R\left(R'\tilde{V}_\beta R\right)^{-1}R'\tilde{V}_\beta.$$

A standard error for h'β~emd is then

$$s\left(h'\tilde\beta_{\mathrm{emd}}\right) = \left(n^{-1}h'\tilde{V}_{\beta,\mathrm{emd}}h\right)^{1/2}.$$
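A minimal R sketch of the EMD estimator (8.25) together with the variance estimator and standard errors above (illustrative, not the textbook's code):

```r
# Efficient minimum distance estimator with robust variance and standard errors: a sketch.
emd_fit <- function(y, X, R, c) {
  n <- nrow(X); k <- ncol(X); q <- ncol(R)
  Qxx_inv  <- solve(crossprod(X) / n)
  beta_ols <- drop(Qxx_inv %*% crossprod(X, y) / n)
  e_hat    <- drop(y - X %*% beta_ols)
  V_hat    <- Qxx_inv %*% (crossprod(X * e_hat) / n) %*% Qxx_inv      # V_hat_beta (weight matrix)
  beta_emd <- beta_ols - drop(V_hat %*% R %*%
                solve(t(R) %*% V_hat %*% R, t(R) %*% beta_ols - c))   # (8.25)

  e_tilde  <- drop(y - X %*% beta_emd)
  Omega_t  <- crossprod(X * e_tilde) / (n - k + q)                    # adjusted degrees of freedom
  V_tilde  <- Qxx_inv %*% Omega_t %*% Qxx_inv
  V_emd    <- V_tilde - V_tilde %*% R %*% solve(t(R) %*% V_tilde %*% R, t(R) %*% V_tilde)
  list(beta = beta_emd, V_emd = V_emd, se = sqrt(diag(V_emd) / n))
}
```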

8.11 Hausman Equality

From (8.25) we have

$$\sqrt{n}\left(\hat\beta_{\mathrm{ols}} - \tilde\beta_{\mathrm{emd}}\right) = \hat{V}_\beta R\left(R'\hat{V}_\beta R\right)^{-1}\sqrt{n}\left(R'\hat\beta_{\mathrm{ols}} - c\right) \xrightarrow{d} \mathrm{N}\left(0,\, V_\beta R\left(R'V_\beta R\right)^{-1}R'V_\beta\right).$$

It follows that the asymptotic variances of the estimators satisfy the relationship

$$\operatorname{avar}\left[\hat\beta_{\mathrm{ols}} - \tilde\beta_{\mathrm{emd}}\right] = \operatorname{avar}\left[\hat\beta_{\mathrm{ols}}\right] - \operatorname{avar}\left[\tilde\beta_{\mathrm{emd}}\right].$$

We call (8.37) the Hausman Equality: the asymptotic variance of the difference between an efficient and another estimator is the difference in the asymptotic variances.

8.12 Example: Mankiw, Romer and Weil (1992)

We illustrate the methods by replicating some of the estimates reported in a well-known paper by Mankiw, Romer, and Weil (1992). The paper investigates the implications of the Solow growth model using cross-country regressions. A key equation in their paper regresses the change between 1960 and 1985 in log GDP per capita on (1) log GDP in 1960, (2) the log of the ratio of aggregate investment to GDP, (3) the log of the sum of the population growth rate n, the technological growth rate g, and the rate of depreciation δ, and (4) the log of the percentage of the working-age population that is in secondary school (School), the latter a proxy for human-capital accumulation.

Table 8.1: Estimates of Solow Growth Model

                    β^ols      β~cls      β~emd
  log GDP 1960      −0.29      −0.30      −0.30
                    (0.05)     (0.05)     (0.05)
  log(I/GDP)         0.52       0.50       0.46
                    (0.11)     (0.09)     (0.08)
  log(n+g+δ)        −0.51      −0.74      −0.71
                    (0.24)     (0.08)     (0.07)
  log(School)        0.23       0.24       0.25
                    (0.07)     (0.07)     (0.06)
  Intercept          3.02       2.46       2.48
                    (0.74)     (0.44)     (0.44)

Standard errors are heteroskedasticity-consistent.

The data is available on the textbook webpage in the file MRW1992.

The sample is 98 non-oil-producing countries and the data was reported in the published paper. As g and δ were unknown the authors set g+δ=0.05. We report least squares estimates in the first column of Table 8.1. The estimates are consistent with the Solow theory due to the positive coefficients on investment and human capital and the negative coefficient for population growth. The estimates are also consistent with the convergence hypothesis (that income levels tend towards a common mean over time) as the coefficient on initial GDP is negative.

The authors show that in the Solow model the 2nd,3rd and 4th coefficients sum to zero. They reestimated the equation imposing this constraint. We present constrained least squares estimates in the second column of Table 8.1 and efficient minimum distance estimates in the third column. Most of the coefficients and standard errors only exhibit small changes by imposing the constraint. The one exception is the coefficient on log population growth which increases in magnitude and its standard error decreases substantially. The differences between the CLS and EMD estimates are modest.

We now present Stata, R and MATLAB code which implements these estimates.

You may notice that the Stata code has a section which uses the Mata matrix programming language. This is used because Stata does not implement the efficient minimum distance estimator, so needs to be separately programmed. As illustrated here, the Mata language allows a Stata user to implement methods using commands which are quite similar to MATLAB.
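Since the original code listings are not reproduced here, the following hedged R sketch indicates how the Table 8.1 estimators can be computed. It assumes the outcome y (the 1960-1985 change in log GDP per capita) and a regressor matrix X with columns log GDP 1960, log(I/GDP), log(n+g+δ), log(School), and an intercept have already been constructed from the MRW1992 file; the object names are placeholders.

```r
# Sketch of the Table 8.1 estimators in R (illustrative; this is not the chapter's own code).
n <- length(y); k <- ncol(X)

Qxx_inv  <- solve(crossprod(X) / n)
beta_ols <- drop(Qxx_inv %*% crossprod(X, y) / n)
e_hat    <- drop(y - X %*% beta_ols)
V_hat    <- Qxx_inv %*% (crossprod(X * e_hat) / n) %*% Qxx_inv   # robust V_hat_beta

R <- matrix(c(0, 1, 1, 1, 0), ncol = 1)   # 2nd, 3rd, 4th coefficients sum to zero (c = 0)

beta_cls <- beta_ols - drop(Qxx_inv %*% R %*%
              solve(t(R) %*% Qxx_inv %*% R, t(R) %*% beta_ols))   # CLS, (8.9)
beta_emd <- beta_ols - drop(V_hat %*% R %*%
              solve(t(R) %*% V_hat %*% R, t(R) %*% beta_ols))     # EMD, (8.25)
cbind(ols = beta_ols, cls = beta_cls, emd = beta_emd)
```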

8.13 Misspecification

What are the consequences for a constrained estimator β~ if the constraint (8.1) is incorrect? To be specific suppose that the truth is

$$R'\beta = c^*$$

where c* is not necessarily equal to c.

This situation is a generalization of the analysis of “omitted variable bias” from Section 2.24 where we found that the short regression (e.g. (8.12)) is estimating a different projection coefficient than the long regression (e.g. (8.10)).

One answer is to apply formula (8.23) to find that

$$\tilde\beta_{\mathrm{md}} \xrightarrow{p} \beta^*_{\mathrm{md}} = \beta - W^{-1}R\left(R'W^{-1}R\right)^{-1}\left(c^* - c\right).$$

The second term, W^{-1}R(R'W^{-1}R)^{-1}(c* − c), shows that imposing an incorrect constraint leads to inconsistency - an asymptotic bias. We can call the limiting value β*md the minimum-distance projection coefficient or the pseudo-true value implied by the restriction.

However, we can say more.

For example, we can describe some characteristics of the approximating projections. The CLS estimator projection coefficient has the representation

$$\beta^*_{\mathrm{cls}} = \underset{R'\beta = c}{\operatorname{argmin}}\; E\left[\left(Y - X'\beta\right)^2\right],$$

the best linear predictor subject to the constraint (8.1). The minimum distance estimator converges in probability to

$$\beta^*_{\mathrm{md}} = \underset{R'\beta = c}{\operatorname{argmin}}\; \left(\beta - \beta_0\right)'W\left(\beta - \beta_0\right)$$

where β0 is the true coefficient. That is, β*md is the coefficient vector satisfying (8.1) closest to the true value in the weighted Euclidean norm. These calculations show that the constrained estimators are still reasonable in the sense that they produce good approximations to the true coefficient conditional on being required to satisfy the constraint.

We can also show that β~md has an asymptotic normal distribution. The trick is to define the pseudo-true value

$$\beta^*_n = \beta - \hat{W}^{-1}R\left(R'\hat{W}^{-1}R\right)^{-1}\left(c^* - c\right).$$

(Note that (8.38) and (8.39) are different!) Then

$$\begin{aligned}
\sqrt{n}\left(\tilde\beta_{\mathrm{md}} - \beta^*_n\right) &= \sqrt{n}\left(\hat\beta - \beta\right) - \hat{W}^{-1}R\left(R'\hat{W}^{-1}R\right)^{-1}\sqrt{n}\left(R'\hat\beta - c^*\right) \\
&= \left(I_k - \hat{W}^{-1}R\left(R'\hat{W}^{-1}R\right)^{-1}R'\right)\sqrt{n}\left(\hat\beta - \beta\right) \\
&\xrightarrow{d} \left(I_k - W^{-1}R\left(R'W^{-1}R\right)^{-1}R'\right)\mathrm{N}\left(0, V_\beta\right) \\
&= \mathrm{N}\left(0, V_\beta(W)\right).
\end{aligned}$$

In particular

$$\sqrt{n}\left(\tilde\beta_{\mathrm{emd}} - \beta^*_n\right) \xrightarrow{d} \mathrm{N}\left(0, V_{\beta,\mathrm{emd}}\right).$$

This means that even when the constraint (8.1) is misspecified the conventional covariance matrix estimator (8.35) and standard errors (8.36) are appropriate measures of the sampling variance, though the distributions are centered at the pseudo-true values (projections) βn* rather than β. The fact that the estimators are biased is an unavoidable consequence of misspecification.

An alternative approach to the asymptotic distribution theory under misspecification uses the concept of local alternatives. It is a technical device which might seem a bit artificial but it is a powerful method to derive useful distributional approximations in a wide variety of contexts. The idea is to index the true coefficient βn by n via the relationship

$$R'\beta_n = c + \delta n^{-1/2}$$

for some δ ∈ R^q. Equation (8.41) specifies that βn violates (8.1) and thus the constraint is misspecified. However, the constraint is "close" to correct as the difference R'βn − c = δn^{-1/2} is "small" in the sense that it decreases with the sample size n. We call (8.41) local misspecification.

The asymptotic theory is derived as n → ∞ under the sequence of probability distributions with the coefficients βn. The way to think about this is that the true value of the parameter is βn and it is "close" to satisfying (8.1). The reason why the deviation is proportional to n^{-1/2} is because this is the only choice under which the localizing parameter δ appears in the asymptotic distribution but does not dominate it. The best way to see this is to work through the asymptotic approximation.

Since βn is the true coefficient value, then Y = X'βn + e and we have the standard representation for the unconstrained estimator, namely

$$\sqrt{n}\left(\hat\beta - \beta_n\right) = \left(\frac{1}{n}\sum_{i=1}^n X_iX_i'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n X_ie_i\right) \xrightarrow{d} \mathrm{N}\left(0, V_\beta\right).$$

There is no difference under fixed (classical) or local asymptotics since the right-hand-side is independent of the coefficient βn.

A difference arises for the constrained estimator. Using (8.41), c = R'βn − δn^{-1/2} so

$$R'\hat\beta - c = R'\left(\hat\beta - \beta_n\right) + \delta n^{-1/2}$$

and

$$\begin{aligned}
\tilde\beta_{\mathrm{md}} &= \hat\beta - \hat{W}^{-1}R\left(R'\hat{W}^{-1}R\right)^{-1}\left(R'\hat\beta - c\right) \\
&= \hat\beta - \hat{W}^{-1}R\left(R'\hat{W}^{-1}R\right)^{-1}R'\left(\hat\beta - \beta_n\right) - \hat{W}^{-1}R\left(R'\hat{W}^{-1}R\right)^{-1}\delta n^{-1/2}.
\end{aligned}$$

It follows that

$$\sqrt{n}\left(\tilde\beta_{\mathrm{md}} - \beta_n\right) = \left(I_k - \hat{W}^{-1}R\left(R'\hat{W}^{-1}R\right)^{-1}R'\right)\sqrt{n}\left(\hat\beta - \beta_n\right) - \hat{W}^{-1}R\left(R'\hat{W}^{-1}R\right)^{-1}\delta.$$

The first term is asymptotically normal (from (8.42)). The second term converges in probability to a constant. This is because the n^{-1/2} local scaling in (8.41) is exactly balanced by the √n scaling of the estimator. No alternative rate would have produced this result.

Consequently we find that the asymptotic distribution equals

$$\sqrt{n}\left(\tilde\beta_{\mathrm{md}} - \beta_n\right) \xrightarrow{d} \mathrm{N}\left(0, V_\beta(W)\right) - W^{-1}R\left(R'W^{-1}R\right)^{-1}\delta = \mathrm{N}\left(-\delta^*, V_\beta(W)\right)$$

where δ* = W^{-1}R(R'W^{-1}R)^{-1}δ.

The asymptotic distribution (8.43) is an approximation of the sampling distribution of the restricted estimator under misspecification. The distribution (8.43) contains an asymptotic bias component, −δ*. The approximation is not fundamentally different from (8.40) - they both have the same asymptotic variances and both reflect the bias due to misspecification. The difference is that (8.40) puts the bias on the left side of the convergence arrow while (8.43) has the bias on the right side. There is no substantive difference between the two. However, (8.43) is more convenient for some purposes such as the analysis of the power of tests as we will explore in the next chapter.

8.14 Nonlinear Constraints

In some cases it is desirable to impose nonlinear constraints on the parameter vector β. They can be written as

$$r(\beta) = 0$$

where r : R^k → R^q. This includes the linear constraints (8.1) as a special case. An example of (8.44) which cannot be written as (8.1) is β1β2 = 1, which is (8.44) with r(β) = β1β2 − 1.

The constrained least squares and minimum distance estimators of β subject to (8.44) solve the minimization problems

$$\tilde\beta_{\mathrm{cls}} = \underset{r(\beta)=0}{\operatorname{argmin}}\; \mathrm{SSE}(\beta), \qquad \tilde\beta_{\mathrm{md}} = \underset{r(\beta)=0}{\operatorname{argmin}}\; J(\beta)$$

where SSE(β) and J(β) are defined in (8.4) and (8.19), respectively. The solutions solve the Lagrangians

$$\mathcal{L}(\beta,\lambda) = \frac{1}{2}\mathrm{SSE}(\beta) + \lambda'r(\beta)$$

or

$$\mathcal{L}(\beta,\lambda) = \frac{1}{2}J(\beta) + \lambda'r(\beta)$$

over (β,λ).

Computationally there is no general closed-form solution so they must be found numerically. Algorithms to numerically solve (8.45) and (8.46) are known as constrained optimization methods and are available in programming languages including MATLAB and R. See Chapter 12 of Probability and Statistics for Economists.
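As an illustration, the following R sketch computes the nonlinear-constrained least squares estimator (8.45) for the example constraint β1β2 = 1 using a simple quadratic-penalty method with base R's optim. This is only one of several possible approaches (dedicated constrained optimizers in packages such as alabama or nloptr are alternatives); the function names and simulated data are illustrative.

```r
# Nonlinear-constrained least squares via a quadratic-penalty method: a minimal sketch.
sse <- function(beta, y, X) sum((y - X %*% beta)^2)          # (8.4)
r   <- function(beta) beta[1] * beta[2] - 1                  # nonlinear constraint (8.44)

cls_nonlinear <- function(y, X, beta_start, penalties = 10^(1:6)) {
  beta <- beta_start
  for (rho in penalties) {                                   # increase the penalty gradually
    obj  <- function(b) sse(b, y, X) + rho * r(b)^2
    beta <- optim(beta, obj, method = "BFGS")$par
  }
  beta                                                       # approximately satisfies r(beta) = 0
}

# Example with simulated data satisfying beta1 * beta2 = 1.
set.seed(3)
n <- 200
X <- cbind(rnorm(n), rnorm(n))
y <- X %*% c(2, 0.5) + rnorm(n)
cls_nonlinear(y, X, beta_start = coef(lm(y ~ X - 1)))
```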

Assumption 8.3

  1. r(β)=0.

  2. r(β) is continuously differentiable at the true β.

  3. rank(R) = q, where R = (∂/∂β) r(β)'.

The asymptotic distribution is a simple generalization of the case of a linear constraint but the proof is more delicate.

Theorem 8.10 Under Assumptions 7.2, 8.2, and 8.3, for β~ = β~md and β~ = β~cls defined in (8.45) and (8.46),

$$\sqrt{n}\left(\tilde\beta - \beta\right) \xrightarrow{d} \mathrm{N}\left(0, V_\beta(W)\right)$$

as n → ∞, where Vβ(W) is defined in (8.24). For β~cls, W = QXX and Vβ(W) = Vcls as defined in Theorem 8.8. Vβ(W) is minimized with W = Vβ^{-1}, in which case the asymptotic variance is

$$V^*_\beta = V_\beta - V_\beta R\left(R'V_\beta R\right)^{-1}R'V_\beta.$$

The asymptotic covariance matrix for the efficient minimum distance estimator can be estimated by

$$\hat{V}^*_\beta = \hat{V}_\beta - \hat{V}_\beta\hat{R}\left(\hat{R}'\hat{V}_\beta\hat{R}\right)^{-1}\hat{R}'\hat{V}_\beta$$

where

$$\hat{R} = \frac{\partial}{\partial\beta}r\left(\tilde\beta_{\mathrm{md}}\right)'.$$

Standard errors for the elements of β~md are the square roots of the diagonal elements of V^β~ = n^{-1}V^*β.

8.15 Inequality Restrictions

Inequality constraints on the parameter vector β take the form

$$r(\beta) \geq 0$$

for some function r : R^k → R^q. The most common example is a non-negativity constraint β1 ≥ 0.

The constrained least squares and minimum distance estimators can be written as

$$\tilde\beta_{\mathrm{cls}} = \underset{r(\beta)\geq 0}{\operatorname{argmin}}\; \mathrm{SSE}(\beta)$$

and

$$\tilde\beta_{\mathrm{md}} = \underset{r(\beta)\geq 0}{\operatorname{argmin}}\; J(\beta).$$

Except in special cases the constrained estimators do not have simple algebraic solutions. An important exception is when there is a single non-negativity constraint, e.g. β1 ≥ 0 with q = 1. In this case the constrained estimator can be found by the following approach. Compute the unconstrained estimator β^. If β^1 ≥ 0 then β~ = β^. Otherwise if β^1 < 0 then impose β1 = 0 (eliminate the regressor X1) and re-estimate. This method yields the constrained least squares estimator. While this method works when there is a single non-negativity constraint, it does not immediately generalize to other contexts.
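The approach just described translates directly into code; a minimal R sketch (illustrative):

```r
# Single non-negativity constraint on the first coefficient: estimate unconstrained,
# and if the first coefficient is negative, set it to zero and re-estimate without X1.
cls_nonneg1 <- function(y, X) {
  beta_hat <- solve(crossprod(X), crossprod(X, y))                 # unconstrained OLS
  if (beta_hat[1] >= 0) return(drop(beta_hat))
  X_rest    <- X[, -1, drop = FALSE]
  beta_rest <- solve(crossprod(X_rest), crossprod(X_rest, y))
  c(0, drop(beta_rest))                                            # impose beta1 = 0
}
```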

The computation problems (8.50) and (8.51) are examples of quadratic programming. Quick computer algorithms are available in programming languages including MATLAB and R.

Inference on inequality-constrained estimators is unfortunately quite challenging. The conventional asymptotic theory gives rise to the following dichotomy. If the true parameter satisfies the strict inequality r(β) > 0 then asymptotically the estimator is not subject to the constraint and the inequality-constrained estimator has an asymptotic distribution equal to the unconstrained case. However if the true parameter is on the boundary, e.g., r(β) = 0, then the estimator has a truncated structure. This is easiest to see in the one-dimensional case. If we have an estimator β^ which satisfies √n(β^ − β) →d Z ~ N(0, Vβ) and β = 0, then the constrained estimator β~ = max[β^, 0] will have the asymptotic distribution √n β~ →d max[Z, 0], a "half-normal" distribution.

8.16 Technical Proofs*

Proof of Theorem 8.9, equation (8.28) Let R⊥ be a full rank k×(k−q) matrix satisfying R⊥'VβR = 0 and then set C = [R, R⊥], which is full rank and invertible. Then we can calculate that

$$C'V_{\beta,\mathrm{emd}}C = \begin{bmatrix} R'V_{\beta,\mathrm{emd}}R & R'V_{\beta,\mathrm{emd}}R_\perp \\ R_\perp'V_{\beta,\mathrm{emd}}R & R_\perp'V_{\beta,\mathrm{emd}}R_\perp \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & R_\perp'V_\beta R_\perp \end{bmatrix}$$

and

$$C'V_\beta(W)C = \begin{bmatrix} R'V_\beta(W)R & R'V_\beta(W)R_\perp \\ R_\perp'V_\beta(W)R & R_\perp'V_\beta(W)R_\perp \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & R_\perp'V_\beta R_\perp + R_\perp'W^{-1}R\left(R'W^{-1}R\right)^{-1}R'V_\beta R\left(R'W^{-1}R\right)^{-1}R'W^{-1}R_\perp \end{bmatrix}.$$

Thus

$$C'\left(V_\beta(W) - V_{\beta,\mathrm{emd}}\right)C = C'V_\beta(W)C - C'V_{\beta,\mathrm{emd}}C = \begin{bmatrix} 0 & 0 \\ 0 & R_\perp'W^{-1}R\left(R'W^{-1}R\right)^{-1}R'V_\beta R\left(R'W^{-1}R\right)^{-1}R'W^{-1}R_\perp \end{bmatrix} \geq 0.$$

Since C is invertible it follows that Vβ(W) − Vβ,emd ≥ 0, which is (8.28).

Proof of Theorem 8.10 We show the result for the minimum distance estimator β~ = β~md as the proof for the constrained least squares estimator is similar. For simplicity we assume that the constrained estimator is consistent, β~ →p β. This can be shown with more effort, but requires a deeper treatment than appropriate for this textbook.

For each element rj(β) of the q-vector r(β), by the mean value theorem there exists a βj* on the line segment joining β~ and β such that

$$r_j(\tilde\beta) = r_j(\beta) + \frac{\partial}{\partial\beta}r_j\left(\beta_j^*\right)'\left(\tilde\beta - \beta\right).$$

Let R_n be the k × q matrix

$$R_n = \begin{bmatrix} \dfrac{\partial}{\partial\beta}r_1\left(\beta_1^*\right) & \dfrac{\partial}{\partial\beta}r_2\left(\beta_2^*\right) & \cdots & \dfrac{\partial}{\partial\beta}r_q\left(\beta_q^*\right) \end{bmatrix}.$$

Since β~ →p β it follows that βj* →p β, and by the CMT, R_n →p R. Stacking the equations (8.52), we obtain

$$r(\tilde\beta) = r(\beta) + R_n'\left(\tilde\beta - \beta\right).$$

Since r(β~) = 0 by construction and r(β) = 0 by Assumption 8.3.1 this implies

$$0 = R_n'\left(\tilde\beta - \beta\right).$$

The first-order condition for (8.47) is W^(β^ − β~) = R^λ~ where R^ is defined in (8.48). Premultiplying by R_n'W^{-1}, inverting, and using (8.53), we find

$$\tilde\lambda = \left(R_n'\hat{W}^{-1}\hat{R}\right)^{-1}R_n'\left(\hat\beta - \tilde\beta\right) = \left(R_n'\hat{W}^{-1}\hat{R}\right)^{-1}R_n'\left(\hat\beta - \beta\right).$$

Thus

$$\tilde\beta - \beta = \left(I_k - \hat{W}^{-1}\hat{R}\left(R_n'\hat{W}^{-1}\hat{R}\right)^{-1}R_n'\right)\left(\hat\beta - \beta\right).$$

From Theorem 7.3 and Theorem 7.6 we find

$$\begin{aligned}
\sqrt{n}\left(\tilde\beta - \beta\right) &= \left(I_k - \hat{W}^{-1}\hat{R}\left(R_n'\hat{W}^{-1}\hat{R}\right)^{-1}R_n'\right)\sqrt{n}\left(\hat\beta - \beta\right) \\
&\xrightarrow{d} \left(I_k - W^{-1}R\left(R'W^{-1}R\right)^{-1}R'\right)\mathrm{N}\left(0, V_\beta\right) = \mathrm{N}\left(0, V_\beta(W)\right).
\end{aligned}$$

8.17 Exercises

Exercise 8.1 In the model Y=X1β1+X2β2+e, show directly from definition (8.3) that the CLS estimator of β=(β1,β2) subject to the constraint that β2=0 is the OLS regression of Y on X1.

Exercise 8.2 In the model Y = X1'β1 + X2'β2 + e, show directly from definition (8.3) that the CLS estimator of β = (β1, β2) subject to the constraint β1 = c (where c is some given vector) is OLS of Y − X1'c on X2.

Exercise 8.3 In the model Y=X1β1+X2β2+e, with β1 and β2 each k×1, find the CLS estimator of β=(β1,β2) subject to the constraint that β1=β2.

Exercise 8.4 In the linear projection model Y=α+Xβ+e consider the restriction β=0.

  1. Find the CLS estimator of α under the restriction β=0.

  2. Find an expression for the efficient minimum distance estimator of α under the restriction β=0.

Exercise 8.5 Verify that for β~cls defined in (8.8) that Rβ~cls=c.

Exercise 8.6 Prove Theorem 8.1.

Exercise 8.7 Prove Theorem 8.2, that is, E[β~cls | X] = β, under the assumptions of the linear regression model and (8.1). (Hint: Use Theorem 8.1.)

Exercise 8.8 Prove Theorem 8.3.

Exercise 8.9 Prove Theorem 8.4. That is, show E[scls2X]=σ2 under the assumptions of the homoskedastic regression model and (8.1).

Exercise 8.10 Verify (8.22), (8.23), and that the minimum distance estimator β~md with W^=Q^XX equals the CLS estimator.

Exercise 8.11 Prove Theorem 8.6.

Exercise 8.12 Prove Theorem 8.7.

Exercise 8.13 Prove Theorem 8.8. (Hint: Use that CLS is a special case of Theorem 8.7.)

Exercise 8.14 Verify that (8.26) is Vβ(W) with W=Vβ1.

Exercise 8.15 Prove (8.27). Hint: Use (8.26).

Exercise 8.16 Verify (8.29), (8.30) and (8.31).

Exercise 8.17 Verify (8.32), (8.33), and (8.34).

Exercise 8.18 Suppose you have two independent samples each with n observations which satisfy the models Y1=X1β1+e1 with E[X1e1]=0 and Y2=X2β2+e2 with E[X2e2]=0 where β1 and β2 are both k×1. You estimate β1 and β2 by OLS on each sample, with consistent asymptotic covariance matrix estimators V^β1 and V^β2. Consider efficient minimum distance estimation under the restriction β1=β2.

  1. Find the estimator β~ of β=β1=β2.

  2. Find the asymptotic distribution of β~.

  3. How would you approach the problem if the sample sizes are different, say n1 and n2 ?

Exercise 8.19 Use the cps09mar dataset and the subsample of white male Hispanics.

  1. Estimate the regression

$$\widehat{\log(wage)} = \beta_1\, education + \beta_2\, experience + \beta_3\, experience^2/100 + \beta_4\, married_1 + \beta_5\, married_2 + \beta_6\, married_3 + \beta_7\, widowed + \beta_8\, divorced + \beta_9\, separated + \beta_{10}$$

where married 1, married 2, and married 3 are the first three marital codes listed in Section 3.22.

  2. Estimate the equation by CLS imposing the constraints β4=β7 and β8=β9. Report the estimates and standard errors.

  3. Estimate the equation using efficient minimum distance imposing the same constraints. Report the estimates and standard errors.

  4. Under what constraint on the coefficients is the wage equation non-decreasing in experience for experience up to 50?

  5. Estimate the equation imposing β4=β7, β8=β9, and the inequality from part (d).

Exercise 8.20 Take the model

$$\begin{aligned}
Y &= m(X) + e \\
m(x) &= \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p \\
E[X^j e] &= 0, \qquad j = 0,\ldots,p \\
g(x) &= \frac{d}{dx}m(x)
\end{aligned}$$

with i.i.d. observations (Yi,Xi),i=1,,n. The order of the polynomial p is known.

  1. How should we interpret the function m(x) given the projection assumption? How should we interpret g(x) ? (Briefly)

  2. Describe an estimator g^(x) of g(x).

  3. Find the asymptotic distribution of √n(g^(x) − g(x)) as n → ∞.

  4. Show how to construct an asymptotic 95% confidence interval for g(x) (for a single x ).

  5. Assume p=2. Describe how to estimate g(x) imposing the constraint that m(x) is concave.

  6. Assume p=2. Describe how to estimate g(x) imposing the constraint that m(u) is increasing on the region u[xL,xU]

Exercise 8.21 Take the linear model with restrictions Y=Xβ+e with E[Xe]=0 and Rβ=c. Consider three estimators for β :

  • β^ the unconstrained least squares estimator

  • β~ the constrained least squares estimator

  • β¯ the constrained efficient minimum distance estimator

For the three estimators define the residuals e^i = Yi − Xi'β^, e~i = Yi − Xi'β~, e¯i = Yi − Xi'β¯, and variance estimators $\hat\sigma^2 = n^{-1}\sum_{i=1}^n \hat{e}_i^2$, $\tilde\sigma^2 = n^{-1}\sum_{i=1}^n \tilde{e}_i^2$, and $\bar\sigma^2 = n^{-1}\sum_{i=1}^n \bar{e}_i^2$.

  1. As β¯ is the most efficient estimator and β^ the least, do you expect σ¯2<σ~2<σ^2 in large samples?

  2. Consider the statistic

$$T_n = \frac{1}{\hat\sigma^2}\sum_{i=1}^n\left(\hat{e}_i - \tilde{e}_i\right)^2.$$

Find the asymptotic distribution for Tn when Rβ=c is true.

  1. Does the result of the previous question simplify when the error ei is homoskedastic?

Exercise 8.22 Take the linear model Y=X1β1+X2β2+e with E[Xe]=0. Consider the restriction β1β2=2.

  1. Find an explicit expression for the CLS estimator β~=(β~1,β~2) of β=(β1,β2) under the restriction. Your answer should be specific to the restriction. It should not be a generic formula for an abstract general restriction.

  2. Derive the asymptotic distribution of β~1 under the assumption that the restriction is true.