22  M-Estimators

22.1 Introduction

So far in this textbook we have primarily focused on estimators which have explicit algebraic expressions. However, many econometric estimators need to be calculated by numerical methods. These estimators are collectively described as nonlinear. Many fall in a broad class known as m-estimators. In this part of the textbook we describe a number of m-estimators in wide use in econometrics. They have a common structure which allows for a unified treatment of estimation and inference.

An m-estimator is defined as a minimizer of a sample average

$$\widehat{\theta} = \underset{\theta \in \Theta}{\operatorname{argmin}}\; S_n(\theta), \qquad S_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\rho(Y_i, X_i, \theta)$$

where $\rho(Y,X,\theta)$ is some function of $(Y,X)$ and a parameter $\theta \in \Theta$. The function $S_n(\theta)$ is called the criterion function or objective function. For notational simplicity set $\rho_i(\theta) = \rho(Y_i, X_i, \theta)$.

This includes maximum likelihood when $\rho_i(\theta)$ is the negative log-density function. “m-estimators” are a broader class; the prefix “m” stands for “maximum likelihood-type”.
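To fix ideas, here is a minimal numerical sketch of the definition, assuming Python with numpy and scipy available (this is an illustration, not part of the textbook's own computations): the criterion $S_n(\theta)$ is formed as a sample average of $\rho$ and handed to a generic optimizer. Using the least-squares choice of $\rho$ lets the numerical minimizer be checked against the closed-form OLS estimator.

```python
# Minimal sketch (not from the textbook): an m-estimator minimizes the sample average
# S_n(theta) = (1/n) sum_i rho(Y_i, X_i, theta). Assumes numpy and scipy are installed.
import numpy as np
from scipy.optimize import minimize

def m_estimate(rho, Y, X, theta_start):
    """Numerically minimize S_n(theta) for a user-supplied rho(y, x, theta)."""
    def S_n(theta):
        return np.mean([rho(y, x, theta) for y, x in zip(Y, X)])
    return minimize(S_n, theta_start, method="Nelder-Mead").x

# Check on the least-squares criterion, which has a closed-form solution to compare with.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

theta_hat = m_estimate(lambda y, x, t: (y - x @ t) ** 2, Y, X, np.zeros(2))
theta_ols = np.linalg.solve(X.T @ X, X.T @ Y)   # closed-form OLS
print(theta_hat, theta_ols)                      # the two should nearly coincide
```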

The issues we focus on in this chapter are: (1) identification; (2) estimation; (3) consistency; (4) asymptotic distribution; and (5) covariance matrix estimation.

22.2 Examples

There are many m-estimators in common econometric usage. Some examples include the following.

  1. Ordinary Least Squares: $\rho_i(\theta) = (Y_i - X_i'\theta)^2$.

  2. Nonlinear Least Squares: $\rho_i(\theta) = (Y_i - m(X_i,\theta))^2$ (Chapter 23).

  3. Least Absolute Deviations: $\rho_i(\theta) = |Y_i - X_i'\theta|$ (Chapter 24).

  4. Quantile Regression: $\rho_i(\theta) = (Y_i - X_i'\theta)\left(\tau - \mathbf{1}\{Y_i - X_i'\theta < 0\}\right)$ (Chapter 24).

  5. Maximum Likelihood: $\rho_i(\theta) = -\log f(Y_i \mid X_i, \theta)$. The final category - Maximum Likelihood Estimation - includes many estimators as special cases. This includes many standard estimators of limited-dependent-variable models (Chapters 25-27). To illustrate, the probit model for a binary dependent variable is

$$\mathbb{P}[Y = 1 \mid X] = \Phi(X'\theta)$$

where $\Phi(u)$ is the normal cumulative distribution function. We will study probit estimation in detail in Chapter 25. The negative log-density function is

$$\rho_i(\theta) = -Y_i \log\left(\Phi(X_i'\theta)\right) - (1 - Y_i)\log\left(1 - \Phi(X_i'\theta)\right).$$
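To see where this expression comes from, note that conditional on $X$ the variable $Y$ is Bernoulli with success probability $\Phi(X'\theta)$, so its conditional probability mass function is

$$f(y \mid X, \theta) = \Phi(X'\theta)^{y}\left(1 - \Phi(X'\theta)\right)^{1-y}, \qquad y \in \{0,1\},$$

and evaluating $-\log f(Y_i \mid X_i, \theta)$ gives the displayed formula.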

Not all nonlinear estimators are m-estimators. Examples include method of moments, GMM, and minimum distance.

22.3 Identification and Estimation

A parameter vector θ is identified if it is uniquely determined by the probability distribution of the observations. This is a property of the probability distribution, not of the estimator.

However, when discussing a specific estimator it is common to describe identification in terms of the criterion function. Assume $\mathbb{E}|\rho(Y,X,\theta)| < \infty$. Define

$$S(\theta) = \mathbb{E}[S_n(\theta)] = \mathbb{E}[\rho(Y,X,\theta)]$$

and its population minimizer

$$\theta_0 = \underset{\theta \in \Theta}{\operatorname{argmin}}\; S(\theta).$$

We say that $\theta$ is identified (or point identified) by $S(\theta)$ if the minimizer $\theta_0$ is unique.

In nonlinear models it is difficult to provide general conditions under which a parameter is identified. Identification needs to be examined on a model-by-model basis.

An m-estimator $\widehat{\theta}$ by definition minimizes $S_n(\theta)$. When there is no explicit algebraic expression for the solution the minimization is done numerically. Such numerical methods are reviewed in Chapter 12 of Probability and Statistics for Economists.

We illustrate using the probit model of the previous section. We use the CPS dataset with $Y$ equal to an indicator that the individual is married¹ and set the regressors equal to years of education, age, and age squared. We obtain the following estimates.

¹ We define married $=1$ if marital equals 1, 2, or 3.

Standard error calculation will be discussed in Section 22.8. In this application we see that the probability of marriage is increasing in years of education and is an increasing yet concave function of age.
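While the CPS estimates themselves require the dataset, the mechanics of the numerical minimization can be illustrated on simulated data. The following Python sketch (an illustration under an assumed design, not the textbook's actual CPS computation) minimizes the probit criterion of Section 22.2 with a quasi-Newton routine.

```python
# Sketch of numerical probit m-estimation on simulated data. The design below is an
# assumption for illustration only; the textbook's application uses the CPS dataset.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])       # intercept + one regressor
theta_true = np.array([0.5, 1.0])
Y = (rng.uniform(size=n) < norm.cdf(X @ theta_true)).astype(float)

def S_n(theta):
    """Sample criterion: average probit negative log-density rho_i(theta)."""
    p = np.clip(norm.cdf(X @ theta), 1e-10, 1 - 1e-10)       # guard against log(0)
    return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))

theta_hat = minimize(S_n, np.zeros(2), method="BFGS").x
print(theta_hat)   # close to theta_true in large samples
```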

22.4 Consistency

It seems reasonable to expect that if a parameter is identified then we should be able to estimate the parameter consistently. For linear estimators we demonstrated consistency by applying the WLLN to the explicit algebraic expressions for the estimators. This is not possible for nonlinear estimators because they do not have explicit algebraic expressions.

Instead, what is available to us is that an m-estimator minimizes the criterion function $S_n(\theta)$, which is itself a sample average. For any given $\theta$ the WLLN shows that $S_n(\theta) \xrightarrow{p} S(\theta)$. It is intuitive that the minimizer of $S_n(\theta)$ (the m-estimator $\widehat{\theta}$) will converge in probability to the minimizer of $S(\theta)$ (the parameter $\theta_0$). However, the WLLN by itself is not sufficient to make this extension.

Figure 22.1: Non-Uniform vs. Uniform Convergence. (a) Non-Uniform Convergence; (b) Uniform Convergence.

To see the problem examine Figure 22.1(a). This displays a sequence of functions $S_n(\theta)$ (the dashed lines) for three values of $n$. What is illustrated is that for each $\theta$ the function $S_n(\theta)$ converges towards the limit function $S(\theta)$. However, for each $n$ the function $S_n(\theta)$ has a severe dip in the right-hand region. The result is that the sample minimizer $\widehat{\theta}_n$ converges to the right limit of the parameter space. In contrast, the minimizer $\theta_0$ of the limit criterion $S(\theta)$ is in the interior of the parameter space. What we observe is that $S_n(\theta)$ converges to $S(\theta)$ for each $\theta$ but the minimizer $\widehat{\theta}_n$ does not converge to $\theta_0$.

A sufficient condition to exclude this pathological behavior is uniform convergence: uniformity over the parameter space $\Theta$. As we show in Theorem 22.1, uniform convergence in probability of $S_n(\theta)$ to $S(\theta)$ is sufficient to establish that the m-estimator $\widehat{\theta}$ is consistent for $\theta_0$.

Definition 22.1 $S_n(\theta)$ converges in probability to $S(\theta)$ uniformly over $\theta \in \Theta$ if

$$\sup_{\theta \in \Theta}\left|S_n(\theta) - S(\theta)\right| \xrightarrow{p} 0$$

as $n \to \infty$.

Uniform convergence excludes erratic wiggles in $S_n(\theta)$ uniformly across $\theta$ and $n$ (e.g., what occurs in Figure 22.1(a)). The idea is illustrated in Figure 22.1(b). The heavy solid line is the function $S(\theta)$. The dashed lines are $S(\theta) + \varepsilon$ and $S(\theta) - \varepsilon$. The thin solid line is the sample criterion $S_n(\theta)$. The figure illustrates a situation where the sample criterion satisfies $\sup_{\theta \in \Theta}|S_n(\theta) - S(\theta)| < \varepsilon$. The sample criterion as displayed weaves up and down but stays within $\varepsilon$ of $S(\theta)$. Uniform convergence holds if the event shown in Figure 22.1(b) holds with high probability for $n$ sufficiently large, for any arbitrarily small $\varepsilon$.
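In simple models where $S(\theta)$ is available in closed form, the sup-norm distance in Definition 22.1 can be approximated numerically on a grid. A small sketch, assuming the squared-error criterion $\rho(y,\theta) = (y-\theta)^2$ with $Y \sim \mathrm{N}(0,1)$, so that $S(\theta) = 1 + \theta^2$:

```python
# Sketch: approximate sup_theta |S_n(theta) - S(theta)| on a grid for the squared-error
# criterion rho(y, theta) = (y - theta)^2 with Y ~ N(0, 1), where S(theta) = 1 + theta^2.
import numpy as np

rng = np.random.default_rng(0)
theta_grid = np.linspace(-2.0, 2.0, 401)

for n in [50, 500, 5000]:
    Y = rng.normal(size=n)
    S_n = np.array([np.mean((Y - t) ** 2) for t in theta_grid])
    S = 1.0 + theta_grid ** 2
    print(n, np.abs(S_n - S).max())    # the sup-distance shrinks as n grows
```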

Theorem 22.1 $\widehat{\theta} \xrightarrow{p} \theta_0$ as $n \to \infty$ if

  1. $S_n(\theta)$ converges in probability to $S(\theta)$ uniformly over $\theta \in \Theta$.

  2. $\theta_0$ uniquely minimizes $S(\theta)$ in the sense that for all $\epsilon > 0$,

$$\inf_{\theta : \|\theta - \theta_0\| \geq \epsilon} S(\theta) > S(\theta_0).$$

Theorem 22.1 shows that an m-estimator is consistent for its population parameter. There are only two conditions. First, the criterion function converges uniformly in probability to its expected value, and second, the minimizer $\theta_0$ is unique. The assumption excludes the possibility that $\lim_{j \to \infty} S(\theta_j) = S(\theta_0)$ for some sequence $\theta_j \in \Theta$ not converging to $\theta_0$.

The proof of Theorem 22.1 is provided in Section 22.9.

22.5 Uniform Law of Large Numbers

The uniform convergence of Definition 22.1 is a high-level assumption. In this section we provide lower level sufficient conditions.

Theorem 22.2 Uniform Law of Large Numbers (ULLN) Assume

  1. $(Y_i, X_i)$ are i.i.d.

  2. $\rho(Y,X,\theta)$ is continuous in $\theta \in \Theta$ with probability one.

  3. $|\rho(Y,X,\theta)| \leq G(Y,X)$ where $\mathbb{E}[G(Y,X)] < \infty$.

  4. $\Theta$ is compact.

Then $\sup_{\theta \in \Theta}\left|S_n(\theta) - S(\theta)\right| \xrightarrow{p} 0$.

Theorem 22.2 is established in Theorem 18.2 of Probability and Statistics for Economists.

Assumption 2 holds if $\rho(y,x,\theta)$ is continuous in $\theta$, or if the discontinuities occur at points of zero probability. This allows for most relevant applications in econometrics. Theorem 18.2 of Probability and Statistics for Economists also provides conditions based on finite bracketing or covering numbers which allow for more generality. Assumption 3 is a slight strengthening of the finite-expectation condition $\mathbb{E}|\rho(Y,X,\theta)| < \infty$. The function $G(Y,X)$ is called an envelope. The ULLN extends to time series and clustered samples. See B. E. Hansen and S. Lee (2019) for clustered samples.

Combining Theorems 22.1 and 22.2 we obtain a set of conditions for consistent estimation.

Theorem 22.3 $\widehat{\theta} \xrightarrow{p} \theta_0$ as $n \to \infty$ if

  1. $(Y_i, X_i)$ are i.i.d.

  2. $\rho(Y,X,\theta)$ is continuous in $\theta \in \Theta$ with probability one.

  3. $|\rho(Y,X,\theta)| \leq G(Y,X)$ where $\mathbb{E}[G(Y,X)] < \infty$.

  4. $\Theta$ is compact.

  5. $\theta_0$ uniquely minimizes $S(\theta)$.

22.6 Asymptotic Distribution

We now establish an asymptotic distribution theory. We start with an informal demonstration, present a general result under high-level conditions, and then discuss the assumptions and conditions. Define

$$\psi(Y,X,\theta) = \frac{\partial}{\partial\theta}\rho(Y,X,\theta), \qquad \overline{\psi}_n(\theta) = \frac{\partial}{\partial\theta}S_n(\theta), \qquad \psi(\theta) = \frac{\partial}{\partial\theta}S(\theta).$$

Also define $\psi_i(\theta) = \psi(Y_i, X_i, \theta)$ and $\psi_i = \psi_i(\theta_0)$.
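A quick numerical check of the definition of $\psi$ can be useful when deriving scores by hand. The sketch below compares the analytic derivative of the nonlinear least squares criterion with $m(x,\theta) = \exp(x'\theta)$ (this choice of $m$ and the test point are assumptions made only for the example) against a finite-difference approximation.

```python
# Sketch: verify an analytic score psi = d rho / d theta against a finite-difference
# derivative, using the NLS criterion with m(x, theta) = exp(x'theta) as an example
# (this choice of m and the test point below are assumptions for illustration).
import numpy as np

def rho(y, x, theta):
    return (y - np.exp(x @ theta)) ** 2

def psi(y, x, theta):
    m = np.exp(x @ theta)
    return -2.0 * (y - m) * m * x       # chain rule: d rho / d theta

y, x, theta = 1.5, np.array([1.0, 0.3]), np.array([0.2, -0.4])
h = 1e-6
numeric = np.array([(rho(y, x, theta + h * e) - rho(y, x, theta - h * e)) / (2 * h)
                    for e in np.eye(2)])
print(psi(y, x, theta), numeric)        # should agree to several decimal places
```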

Since the m-estimator $\widehat{\theta}$ minimizes $S_n(\theta)$ it satisfies² the first-order condition $0 = \overline{\psi}_n(\widehat{\theta})$. Expand the right-hand side as a first-order Taylor expansion about $\theta_0$. This is valid when $\widehat{\theta}$ is in a neighborhood of $\theta_0$, which holds for $n$ sufficiently large by Theorem 22.1. This yields

$$0 = \overline{\psi}_n(\widehat{\theta}) \simeq \overline{\psi}_n(\theta_0) + \frac{\partial^2}{\partial\theta\,\partial\theta'} S_n(\theta_0)\left(\widehat{\theta} - \theta_0\right). \tag{22.1}$$

Rewriting, we obtain

$$\sqrt{n}\left(\widehat{\theta} - \theta_0\right) \simeq -\left(\frac{\partial^2}{\partial\theta\,\partial\theta'} S_n(\theta_0)\right)^{-1}\left(\sqrt{n}\,\overline{\psi}_n(\theta_0)\right).$$

Consider the two components. First, by the WLLN

$$\frac{\partial^2}{\partial\theta\,\partial\theta'} S_n(\theta_0) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2}{\partial\theta\,\partial\theta'}\rho(Y_i, X_i, \theta_0) \xrightarrow{p} \mathbb{E}\left[\frac{\partial^2}{\partial\theta\,\partial\theta'}\rho(Y,X,\theta_0)\right] \overset{\mathrm{def}}{=} Q.$$

² If $\widehat{\theta}$ is an interior solution. Since $\widehat{\theta}$ is consistent this occurs with probability approaching one if $\theta_0$ is in the interior of the parameter space $\Theta$.

Second,

$$\sqrt{n}\,\overline{\psi}_n(\theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_i. \tag{22.2}$$

Since $\theta_0$ minimizes $S(\theta) = \mathbb{E}[\rho_i(\theta)]$ it satisfies the first-order condition

$$0 = \psi(\theta_0) = \mathbb{E}[\psi(Y,X,\theta_0)]. \tag{22.3}$$

Thus the summands in (22.2) are mean zero. Applying a CLT, this sum converges in distribution to $\mathrm{N}(0,\Omega)$ where $\Omega = \mathbb{E}[\psi_i\psi_i']$. We deduce that

$$\sqrt{n}\left(\widehat{\theta} - \theta_0\right) \xrightarrow{d} -Q^{-1}\,\mathrm{N}(0,\Omega) = \mathrm{N}\left(0, Q^{-1}\Omega Q^{-1}\right).$$

The technical hurdle to make this derivation rigorous is justifying the Taylor expansion (22.1). This can be done through smoothness of the second derivative of $\rho_i(\theta_0)$. An alternative (more advanced) argument based on empirical process theory uses weaker assumptions. Set

$$Q(\theta) = \frac{\partial^2}{\partial\theta\,\partial\theta'} S(\theta), \qquad Q = Q(\theta_0).$$

Let $\mathcal{N}$ be some neighborhood of $\theta_0$.

Theorem 22.4 Assume the conditions of Theorem 22.1 hold, plus

  1. $\mathbb{E}\left\|\psi(Y,X,\theta_0)\right\|^2 < \infty$.

  2. $Q > 0$.

  3. $Q(\theta)$ is continuous in $\theta \in \mathcal{N}$.

  4. For all $\theta_1, \theta_2 \in \mathcal{N}$, $\left\|\psi(Y,X,\theta_1) - \psi(Y,X,\theta_2)\right\| \leq B(Y,X)\left\|\theta_1 - \theta_2\right\|$ where $\mathbb{E}\left[B(Y,X)^2\right] < \infty$.

  5. $\theta_0$ is in the interior of $\Theta$.

Then as $n \to \infty$, $\sqrt{n}\left(\widehat{\theta} - \theta_0\right) \xrightarrow{d} \mathrm{N}(0, V)$ where $V = Q^{-1}\Omega Q^{-1}$.

The proof of Theorem 22.4 is presented in Section 22.9.

In some cases the asymptotic covariance matrix simplifies. The leading case is correctly specified maximum likelihood estimation, where $Q = \Omega$ so $V = Q^{-1} = \Omega^{-1}$.

Assumption 1 states that the scores $\psi(Y,X,\theta_0)$ have a finite second moment. This is necessary in order to apply the CLT. Assumption 2 is a full-rank condition and is related to identification. A sufficient condition for Assumption 3 is that the scores $\psi(Y,X,\theta)$ are continuously differentiable, but this is not necessary. Assumption 3 is broader, allowing for discontinuous $\psi(Y,X,\theta)$ so long as its expectation is continuous and differentiable. Assumption 4 states that $\psi(Y,X,\theta)$ is Lipschitz-continuous for $\theta$ near $\theta_0$. Assumption 5 is required in order to justify the application of the mean-value expansion.

22.7 Asymptotic Distribution Under Broader Conditions*

Assumption 4 in Theorem 22.4 requires that $\psi(Y,X,\theta)$ is Lipschitz-continuous. While this holds in most applications, it is violated in some important applications including quantile regression. In such cases we can appeal to alternative regularity conditions. These are more flexible, but less intuitive.

The following result is a simple generalization of Lipschitz-continuity.

Theorem 22.5 The results of Theorem 22.4 hold if Assumption 4 is replaced with the following condition: For all $\delta > 0$ and all $\theta_1 \in \mathcal{N}$,

$$\left(\mathbb{E}\left[\sup_{\|\theta - \theta_1\| < \delta}\left\|\psi(Y,X,\theta) - \psi(Y,X,\theta_1)\right\|^2\right]\right)^{1/2} \leq C\delta^{\psi} \tag{22.4}$$

for some $C < \infty$ and $0 < \psi < \infty$.

See Theorem 18.5 of Probability and Statistics for Economists or Theorem 5 of Andrews (1994).

The bound (22.4) holds for many examples with discontinuous $\psi(Y,X,\theta)$ when the discontinuities occur with zero probability.

We next present a set of flexible results.

Theorem 22.6 The results of Theorem 22.4 hold if Assumption 4 is replaced with the following. First, for $\theta \in \mathcal{N}$, $\left\|\psi(Y,X,\theta)\right\| \leq G(Y,X)$ with $\mathbb{E}\left[G(Y,X)^2\right] < \infty$. Second, one of the following holds.

  1. $\psi(y,x,\theta)$ is Lipschitz-continuous.

  2. $\psi(y,x,\theta) = h(\theta'\psi(x))$ where $h(u)$ has finite total variation.

  3. $\psi(y,x,\theta)$ is a combination of functions of the form in parts 1 and 2 obtained by addition, multiplication, minimum, maximum, and composition.

  4. $\psi(y,x,\theta)$ is a Vapnik-Červonenkis (VC) class.

See Theorem 18.6 of Probability and Statistics for Economists or Theorems 2 and 3 of Andrews (1994).

The function $h$ in part 2 allows for discontinuous functions, including the indicator and sign functions. Part 3 shows that combinations of smooth (Lipschitz) functions and discontinuous functions satisfying the condition of part 2 are allowed. This covers many relevant applications, including quantile regression. Part 4 states a general condition, that $\psi(y,x,\theta)$ is a VC class. As we will not be using this property in this textbook we will not discuss it further, but refer the interested reader to any textbook on empirical processes.

Theorems 22.5 and 22.6 provide alternative conditions on $\psi(y,x,\theta)$ (other than Lipschitz-continuity) which can be used to establish asymptotic normality of an m-estimator.

22.8 Covariance Matrix Estimation

The standard estimator for $V$ takes the sandwich form. We estimate $\Omega$ by

$$\widehat{\Omega} = \frac{1}{n}\sum_{i=1}^{n}\widehat{\psi}_i\widehat{\psi}_i'$$

where $\widehat{\psi}_i = \frac{\partial}{\partial\theta}\rho_i(\widehat{\theta})$. When $\rho_i(\theta)$ is twice differentiable an estimator of $Q$ is

$$\widehat{Q} = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2}{\partial\theta\,\partial\theta'}\rho_i(\widehat{\theta}).$$

When $\rho_i(\theta)$ is not second differentiable then estimators of $Q$ are constructed on a case-by-case basis.

Given $\widehat{\Omega}$ and $\widehat{Q}$ an estimator for $V$ is

$$\widehat{V} = \widehat{Q}^{-1}\widehat{\Omega}\widehat{Q}^{-1}. \tag{22.5}$$

It is possible to adjust $\widehat{V}$ by multiplying by a degree-of-freedom scaling such as $n/(n-k)$ where $k = \dim(\theta)$. There is no formal guidance.

For maximum likelihood estimators the standard covariance matrix estimator is $\widehat{V} = \widehat{Q}^{-1}$. This choice is not robust to misspecification. Therefore it is recommended to use the robust version (22.5), for example by using the “, r” option in Stata. This is unfortunately not uniformly done in practice.
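As an illustration, the following self-contained Python sketch computes $\widehat{\Omega}$, $\widehat{Q}$, $\widehat{V}$, and standard errors for a simulated probit m-estimator. The simulated design, the analytic score formula, and the use of a finite-difference Hessian for $\widehat{Q}$ are all choices made for this sketch, not part of the text.

```python
# Sketch: sandwich covariance V_hat = Q_hat^{-1} Omega_hat Q_hat^{-1} and standard errors
# for a simulated probit m-estimator. The design, the analytic score formula, and the
# finite-difference Hessian for Q_hat are all choices made for this illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = (rng.uniform(size=n) < norm.cdf(X @ np.array([0.5, 1.0]))).astype(float)

def S_n(theta):                                      # average probit negative log-density
    p = np.clip(norm.cdf(X @ theta), 1e-10, 1 - 1e-10)
    return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))

theta_hat = minimize(S_n, np.zeros(2), method="BFGS").x

# Omega_hat = (1/n) sum psi_i psi_i', using the analytic probit score.
idx = X @ theta_hat
p = np.clip(norm.cdf(idx), 1e-10, 1 - 1e-10)
psi_hat = X * (norm.pdf(idx) * (p - Y) / (p * (1 - p)))[:, None]
Omega_hat = psi_hat.T @ psi_hat / n

# Q_hat: central-difference Hessian of S_n at theta_hat (the average Hessian of rho_i).
k, h = len(theta_hat), 1e-4
Q_hat = np.zeros((k, k))
for i in range(k):
    for j in range(k):
        ei, ej = np.eye(k)[i] * h, np.eye(k)[j] * h
        Q_hat[i, j] = (S_n(theta_hat + ei + ej) - S_n(theta_hat + ei - ej)
                       - S_n(theta_hat - ei + ej) + S_n(theta_hat - ei - ej)) / (4 * h * h)

V_hat = np.linalg.solve(Q_hat, Omega_hat) @ np.linalg.inv(Q_hat)
se = np.sqrt(np.diag(V_hat / n))                     # standard errors
print(theta_hat, se)
```

Since this simulated probit model is correctly specified, the information equality implies $Q \approx \Omega$, so $\widehat{Q}^{-1}$ and the sandwich $\widehat{V}$ should be close in this sketch.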

For clustered and time-series observations the estimator $\widehat{Q}$ is unaltered but the estimator $\widehat{\Omega}$ changes. For clustered samples it is

$$\widehat{\Omega} = \frac{1}{n}\sum_{g=1}^{G}\left(\sum_{\ell=1}^{n_g}\widehat{\psi}_{\ell g}\right)\left(\sum_{\ell=1}^{n_g}\widehat{\psi}_{\ell g}\right)'.$$
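A sketch of this clustered version, assuming an $n \times k$ matrix `psi_hat` of estimated scores and a length-$n$ array of cluster identifiers are already in hand:

```python
# Sketch: clustered Omega_hat from an n x k matrix psi_hat of estimated scores and a
# length-n array of cluster identifiers (both assumed to be already computed).
import numpy as np

def clustered_omega(psi_hat, cluster_ids):
    n, k = psi_hat.shape
    Omega = np.zeros((k, k))
    for g in np.unique(cluster_ids):
        s_g = psi_hat[cluster_ids == g].sum(axis=0)   # sum of scores within cluster g
        Omega += np.outer(s_g, s_g)
    return Omega / n
```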

For time-series data the estimator $\widehat{\Omega}$ is unaltered if the scores $\psi_i$ are serially uncorrelated (which occurs when a model is dynamically correctly specified). Otherwise a Newey-West covariance matrix estimator can be used, and equals

$$\widehat{\Omega} = \sum_{\ell=-M}^{M}\left(1 - \frac{|\ell|}{M+1}\right)\frac{1}{n}\sum_{1 \leq t-\ell \leq n}\widehat{\psi}_{t-\ell}\widehat{\psi}_t'.$$
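Similarly, a sketch of the Newey-West version, assuming a $T \times k$ matrix of time-ordered estimated scores and a chosen bandwidth $M$:

```python
# Sketch: Newey-West Omega_hat with bandwidth M from a T x k matrix of time-ordered
# estimated scores psi_hat (assumed to be already computed).
import numpy as np

def newey_west_omega(psi_hat, M):
    T, _ = psi_hat.shape
    Omega = psi_hat.T @ psi_hat / T                      # lag-0 term
    for lag in range(1, M + 1):
        Gamma = psi_hat[lag:].T @ psi_hat[:-lag] / T     # (1/T) sum_t psi_t psi_{t-lag}'
        Omega += (1 - lag / (M + 1)) * (Gamma + Gamma.T) # Bartlett weight, +/- lags
    return Omega
```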

Standard errors for the parameter estimates are formed by taking the square roots of the diagonal elements of $n^{-1}\widehat{V}$.

22.9 Technical Proofs*

Proof of Theorem 22.1 The proof proceeds in two steps. First, we show that $S(\widehat{\theta}) \xrightarrow{p} S(\theta_0)$. Second, we show that this implies $\widehat{\theta} \xrightarrow{p} \theta_0$.

Since $\theta_0$ minimizes $S(\theta)$, $S(\theta_0) \leq S(\widehat{\theta})$. Hence

$$0 \leq S(\widehat{\theta}) - S(\theta_0) = \left(S(\widehat{\theta}) - S_n(\widehat{\theta})\right) + \left(S_n(\theta_0) - S(\theta_0)\right) + \left(S_n(\widehat{\theta}) - S_n(\theta_0)\right) \leq 2\sup_{\theta \in \Theta}\left|S_n(\theta) - S(\theta)\right| \xrightarrow{p} 0.$$

The second inequality uses the fact that $\widehat{\theta}$ minimizes $S_n(\theta)$ so $S_n(\widehat{\theta}) \leq S_n(\theta_0)$, and replaces the other two pairwise comparisons by the supremum. The final convergence is the assumed uniform convergence in probability.

Figure 22.2: Consistency of M-Estimator

The preceding argument is illustrated in Figure 22.2. The figure displays the expected criterion $S(\theta)$ with the solid line, and the sample criterion $S_n(\theta)$ is displayed with the dashed line. The distances between the two functions at the true value $\theta_0$ and the estimator $\widehat{\theta}$ are marked by the two dash-dotted lines. The sum of these two lengths is greater than the vertical distance between $S(\widehat{\theta})$ and $S(\theta_0)$ because the latter distance equals the sum of the two dash-dotted lines plus the vertical height of the thick section of the dashed line (between $S_n(\theta_0)$ and $S_n(\widehat{\theta})$), which is positive because $S_n(\widehat{\theta}) \leq S_n(\theta_0)$. The lengths of the dash-dotted lines converge to zero under the assumption of uniform convergence. Hence $S(\widehat{\theta})$ converges to $S(\theta_0)$. This completes the first step.

In the second step of the proof we show $\widehat{\theta} \xrightarrow{p} \theta_0$. Fix $\epsilon > 0$. The unique minimum assumption implies there is a $\delta > 0$ such that $\|\theta - \theta_0\| > \epsilon$ implies $S(\theta) - S(\theta_0) \geq \delta$. This means that $\|\widehat{\theta} - \theta_0\| > \epsilon$ implies $S(\widehat{\theta}) - S(\theta_0) \geq \delta$. Hence

$$\mathbb{P}\left[\|\widehat{\theta} - \theta_0\| > \epsilon\right] \leq \mathbb{P}\left[S(\widehat{\theta}) - S(\theta_0) \geq \delta\right].$$

The right-hand side converges to zero because $S(\widehat{\theta}) \xrightarrow{p} S(\theta_0)$. Thus the left-hand side converges to zero as well. Since $\epsilon$ is arbitrary this implies that $\widehat{\theta} \xrightarrow{p} \theta_0$ as stated.

To illustrate, again examine Figure 22.2. We see $S(\widehat{\theta})$ marked on the graph of $S(\theta)$. Since $S(\widehat{\theta})$ converges to $S(\theta_0)$ this means that $S(\widehat{\theta})$ slides down the graph of $S(\theta)$ towards the minimum. The only way for $\widehat{\theta}$ to not converge to $\theta_0$ would be if the function $S(\theta)$ were flat at the minimum. This is excluded by the assumption of a unique minimum.

Proof of Theorem 22.4 Expanding the population first-order condition $0 = \psi(\theta_0)$ around $\theta = \widehat{\theta}$ using the mean value theorem we find

$$0 = \psi(\widehat{\theta}) + Q(\theta_n)\left(\theta_0 - \widehat{\theta}\right)$$

where $\theta_n$ is intermediate³ between $\theta_0$ and $\widehat{\theta}$. Solving, we find

$$\sqrt{n}\left(\widehat{\theta} - \theta_0\right) = Q(\theta_n)^{-1}\sqrt{n}\,\psi(\widehat{\theta}).$$

The assumption that $\psi(\theta)$ is continuously differentiable means that $Q(\theta)$ is continuous in $\mathcal{N}$. Since $\theta_n$ is intermediate between $\theta_0$ and $\widehat{\theta}$, and the latter converges in probability to $\theta_0$, it follows that $\theta_n$ converges in probability to $\theta_0$ as well. Thus by the continuous mapping theorem $Q(\theta_n) \xrightarrow{p} Q(\theta_0) = Q$.

We next examine the asymptotic distribution of $\sqrt{n}\,\psi(\widehat{\theta})$. Define

$$v_n(\theta) = \sqrt{n}\left(\overline{\psi}_n(\theta) - \psi(\theta)\right).$$

An implication of the sample first-order condition $\overline{\psi}_n(\widehat{\theta}) = 0$ is

$$\sqrt{n}\,\psi(\widehat{\theta}) = \sqrt{n}\left(\psi(\widehat{\theta}) - \overline{\psi}_n(\widehat{\theta})\right) = -v_n(\widehat{\theta}) = -v_n(\theta_0) + r_n$$

where $r_n = v_n(\theta_0) - v_n(\widehat{\theta})$.

Since $\psi_i$ is mean zero (see (22.3)) and has a finite covariance matrix $\Omega$ by assumption, it satisfies the multivariate central limit theorem. Thus

$$v_n(\theta_0) = \sqrt{n}\,\overline{\psi}_n(\theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_i \xrightarrow{d} \mathrm{N}(0,\Omega).$$

The final step is to show that $r_n = o_p(1)$. Pick any $\eta > 0$ and $\epsilon > 0$. As shown by Theorem 18.5 of Probability and Statistics for Economists, Assumption 4 implies that $v_n(\theta)$ is asymptotically equicontinuous, which means that (see Definition 18.7 in Probability and Statistics for Economists) given $\epsilon$ and $\eta$ there is a $\delta > 0$ such that

$$\limsup_{n\to\infty}\mathbb{P}\left[\sup_{\|\theta - \theta_0\| \leq \delta}\left\|v_n(\theta_0) - v_n(\theta)\right\| > \eta\right] \leq \epsilon. \tag{22.6}$$

Theorem 22.1 implies that $\widehat{\theta} \xrightarrow{p} \theta_0$, or

$$\limsup_{n\to\infty}\mathbb{P}\left[\|\widehat{\theta} - \theta_0\| > \delta\right] \leq \epsilon. \tag{22.7}$$

We calculate that

$$\begin{aligned}\limsup_{n\to\infty}\mathbb{P}\left[\|r_n\| > \eta\right] &\leq \limsup_{n\to\infty}\mathbb{P}\left[\left\|v_n(\theta_0) - v_n(\widehat{\theta})\right\| > \eta,\ \|\widehat{\theta} - \theta_0\| \leq \delta\right] + \limsup_{n\to\infty}\mathbb{P}\left[\|\widehat{\theta} - \theta_0\| > \delta\right] \\ &\leq \limsup_{n\to\infty}\mathbb{P}\left[\sup_{\|\theta - \theta_0\| \leq \delta}\left\|v_n(\theta_0) - v_n(\theta)\right\| > \eta\right] + \epsilon \\ &\leq 2\epsilon.\end{aligned}$$

The second inequality is (22.7) and the final inequality is (22.6). Since $\eta$ and $\epsilon$ are arbitrary we deduce that $r_n = o_p(1)$. We conclude that

$$\sqrt{n}\,\psi(\widehat{\theta}) = -v_n(\theta_0) + r_n \xrightarrow{d} \mathrm{N}(0,\Omega).$$

Together, we have shown that

$$\sqrt{n}\left(\widehat{\theta} - \theta_0\right) = Q(\theta_n)^{-1}\sqrt{n}\,\psi(\widehat{\theta}) \xrightarrow{d} Q^{-1}\,\mathrm{N}(0,\Omega) = \mathrm{N}\left(0, Q^{-1}\Omega Q^{-1}\right)$$

as claimed.

³ Technically, since $\psi(\widehat{\theta})$ is a vector, the expansion is done separately for each element of the vector so the intermediate value varies by the rows of $Q(\theta_n)$. This doesn't affect the conclusion.

22.10 Exercises

Exercise 22.1 Take the model $Y = X'\theta + e$ where $e$ is independent of $X$ and has known density function $f(e)$ which is continuously differentiable.

  1. Show that the conditional density of $Y$ given $X = x$ is $f(y - x'\theta)$.

  2. Find the functions $\rho(Y,X,\theta)$ and $\psi(Y,X,\theta)$.

  3. Calculate the asymptotic covariance matrix.

Exercise 22.2 Take the model $Y = X'\theta + e$. Consider the m-estimator of $\theta$ with $\rho(Y,X,\theta) = g(Y - X'\theta)$ where $g(u)$ is a known function.

  1. Find the functions $\rho(Y,X,\theta)$ and $\psi(Y,X,\theta)$.

  2. Calculate the asymptotic covariance matrix.

Exercise 22.3 For the estimator described in Exercise 22.2 set $g(u) = \frac{1}{4}u^4$.

  1. Sketch $g(u)$. Is $g(u)$ continuous? Differentiable? Second differentiable?

  2. Find the functions $\rho(Y,X,\theta)$ and $\psi(Y,X,\theta)$.

  3. Calculate the asymptotic covariance matrix.

Exercise 22.4 For the estimator described in Exercise 22.2 set $g(u) = 1 - \cos(u)$.

  1. Sketch $g(u)$. Is $g(u)$ continuous? Differentiable? Second differentiable?

  2. Find the functions $\rho(Y,X,\theta)$ and $\psi(Y,X,\theta)$.

  3. Calculate the asymptotic covariance matrix.