29 Machine Learning
29.1 Introduction
This chapter reviews machine learning methods for econometrics. This is a large and growing topic so our treatment is selective. This chapter briefly covers ridge regression, Lasso, elastic net, regression trees, bagging, random forests, ensembling, Lasso IV, double-selection/post-regularization, and double/debiased machine learning.
A classic reference is Hastie, Tibshirani, and Friedman (2008). Introductory textbooks include James, Witten, Hastie, and Tibshirani (2013) and Efron and Hastie (2017). For a theoretical treatment see Bühlmann and van de Geer (2011). For reviews of machine learning in econometrics see Belloni, Chernozhukov, and Hansen (2014a), Mullainathan and Spiess (2017), Athey and Imbens (2019), and Belloni, Chernozhukov, Chetverikov, Hansen, and Kato (2021).
29.2 Big Data, High Dimensionality, and Machine Learning
Three inter-related concepts are “big data”, “high dimensionality”, and “machine learning”.
Big data is typically used to describe datasets which are unusually large and/or complex relative to traditional applications. The definition of “large” varies across discipline and time, but typically refers to datasets with millions of observations. These datasets can arise in economics from household census data, government administrative records, and supermarket scanner data. Some challenges associated with big data are storage, transmission, and computation.
High dimensionality is typically used to describe datasets with an unusually large number of variables. Again the definition of “large” varies across applications, but typically refers to hundreds or thousands of variables. In the theoretical literature “high dimensionality” refers specifically to the context where the number of variables is large relative to, and possibly exceeds, the sample size.
Machine Learning is typically used to describe a set of algorithmic approaches to statistical learning. The methods are primarily focused on point prediction in settings with unknown structure. Machine learning methods generally allow for large sample sizes, large number of variables, and unknown structural form. The early literature was algorithmic with no associated statistical theory. This was followed by a statistical literature examining the properties of machine learning methods, mostly providing convergence rates under sparsity assumptions. Only recently has the literature expanded to include inference.
Machine learning embraces a large and diverse set of tools for a variety of settings, including supervised learning (prediction rules for an outcome given a set of regressors) and unsupervised learning (uncovering structure, such as clusters, in data without a designated outcome).
Machine learning arose from the computer science literature and thereby adopted a distinct set of labels to describe familiar concepts. For example, it speaks of “training” rather than “estimation” and “features” rather than “regressors”. In this chapter, however, we will use standard econometric language and terminology.
For econometrics, machine learning can be thought of as “highly nonparametric”. Suppose we are interested in estimating the conditional mean $m(x) = \mathbb{E}[Y \mid X = x]$ without imposing a parametric functional form. Machine learning methods can be viewed as nonparametric estimators designed to remain tractable when the dimension of $X$ is large and the structure of $m(x)$ is unknown.
Connections between nonparametric estimation, model selection, and machine learning methods arise in tuning parameter selection by cross-validation and evaluation by out-of-sample prediction accuracy. These issues are taken seriously in machine learning applications; frequently with multiple levels of hold-out samples.
28.3 High Dimensional Regression
We are familiar with the linear regression model $Y = X'\beta + e$ with a $p \times 1$ coefficient vector $\beta$. High-dimensional regression concerns the case where $p$ is very large, potentially larger than the sample size $n$.
It may seem shocking to contemplate an application with more regressors than observations. But the situation arises in a number of contexts. First, in our discussion of series regression (Chapter 20) we described how a regression function can be approximated by an infinite series expansion in basis transformations of the underlying regressors. Expressed as a linear model this implies a regression model with an infinite number of regressors. Practical models (as discussed in that chapter) use a moderate number of regressors because this provides a balance between bias and variance. These latter models, however, are not the true conditional mean (which has an infinite number of regressors) but rather a low-dimensional best linear approximation. Second, many economic applications involve a large number of binary, discrete, and categorical variables. A saturated regression model converts all discrete and categorical variables into binary variables and includes all interactions. Such manipulations can result in thousands of regressors. For example, ten binary variables fully interacted yield 1024 regressors. Twenty binary variables fully interacted yield over one million regressors. Third, many contemporary “big” datasets contain thousands of potential regressors. Many of the variables may be low-information but it is difficult to know a priori which are relevant and which are irrelevant.
When $p > n$ the columns of the $n \times p$ regressor matrix are linearly dependent, so the least squares estimator is not uniquely defined. Estimation requires some form of regularization, shrinkage, or variable selection; the methods discussed in this chapter are designed for this setting.
29.4 p-norms
For discussion of ridge and Lasso regression we will be making extensive use of the 1-norm and 2-norm, so it is useful to review the definition of the general p-norm. For a vector $a = (a_1, \ldots, a_m)'$ and $p \geq 1$ the p-norm is
$$\|a\|_p = \left( \sum_{j=1}^m |a_j|^p \right)^{1/p}.$$
Important special cases include the 1-norm
$$\|a\|_1 = \sum_{j=1}^m |a_j|,$$
the 2-norm
$$\|a\|_2 = \left( \sum_{j=1}^m a_j^2 \right)^{1/2},$$
and the sup-norm
$$\|a\|_\infty = \max_{1 \leq j \leq m} |a_j|.$$
We also define the “0-norm”
$$\|a\|_0 = \sum_{j=1}^m \mathbb{1}\{a_j \neq 0\},$$
the number of non-zero elements of $a$. This is only heuristically labeled as a “norm”.

The p-norm satisfies the following additivity property. If $a = (a_1', a_2')'$ then $\|a\|_p^p = \|a_1\|_p^p + \|a_2\|_p^p$ for finite $p$.

The following inequalities are useful. The Hölder inequality states that for $p, q \geq 1$ satisfying $1/p + 1/q = 1$,
$$|a'b| \leq \|a\|_p \|b\|_q. \tag{29.1}$$
The case $p = q = 2$ is the Cauchy-Schwarz inequality. Setting $p = 1$ and $q = \infty$ yields
$$|a'b| \leq \|a\|_1 \|b\|_\infty. \tag{29.2}$$
The Minkowski inequality states that for $p \geq 1$,
$$\|a + b\|_p \leq \|a\|_p + \|b\|_p. \tag{29.3}$$
The p-norms are weakly decreasing in $p$: for $1 \leq p \leq q \leq \infty$, $\|a\|_q \leq \|a\|_p$.

Applying Hölder's (29.1) we also have the inequality
$$\|a\|_1 \leq \sqrt{m}\, \|a\|_2. \tag{29.4}$$
29.5 Ridge Regression
Ridge regression is a shrinkage-type estimator with similar but distinct properties from the James-Stein estimator (see Section 28.20). There are two competing motivations for ridge regression. The traditional motivation is to reduce the degree of collinearity among the regressors. The modern motivation (though in mathematics it pre-dates the “traditional” motivation) is regularization of high-dimensional and ill-posed inverse problems. We discuss both in turn.
As discussed in the previous section, when $p > n$ the matrix $X'X$ is singular, so the least squares estimator is not uniquely defined. More generally, when the regressors are highly correlated, $X'X$ is near-singular (ill-conditioned) and the least squares coefficient estimates are numerically unreliable. A classical solution is the ridge regression estimator
$$\widehat{\beta}_{\mathrm{ridge}} = \left( X'X + \lambda I_p \right)^{-1} X'Y,$$
where $\lambda > 0$ is called the ridge parameter. The ridge parameter $\lambda$ controls the degree of shrinkage: larger values of $\lambda$ shrink the coefficient estimates further towards zero.

To see how the modification solves the singularity problem, observe that the eigenvalues of $X'X + \lambda I_p$ equal $\lambda_j + \lambda$, where $\lambda_j \geq 0$ are the eigenvalues of $X'X$. The matrix $X'X + \lambda I_p$ therefore has strictly positive eigenvalues bounded below by $\lambda$, so it is invertible and the ridge estimator is well defined even when $p > n$ or the regressors are collinear.
The second motivation is based on penalization. When $X'X$ is singular the least squares criterion does not have a unique minimizer. One solution is to add a penalty on the magnitude of the coefficient vector, giving the penalized criterion
$$\mathrm{SSE}_2(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) + \lambda \|\beta\|_2^2.$$
The minimizer of $\mathrm{SSE}_2(\beta, \lambda)$ is a well-defined regularized estimator. The first order condition for minimization of $\mathrm{SSE}_2(\beta, \lambda)$ over $\beta$ is
$$-2 X'(Y - X\beta) + 2\lambda \beta = 0.$$
The solution is $\widehat{\beta}_{\mathrm{ridge}} = (X'X + \lambda I_p)^{-1} X'Y$, which is the ridge regression estimator.
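A minimal R sketch of this closed-form calculation, assuming a hypothetical regressor matrix `x` and outcome vector `y`:

```r
# Ridge regression via the closed-form solution (X'X + lambda*I)^{-1} X'Y
ridge <- function(x, y, lambda) {
  p <- ncol(x)
  solve(crossprod(x) + lambda * diag(p), crossprod(x, y))
}

# Simulated example with p > n, where least squares is infeasible
set.seed(1)
n <- 50; p <- 60
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - x[, 2] + rnorm(n)
beta_ridge <- ridge(x, y, lambda = 5)
```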
Minimization subject to a penalty has a dual representation as constrained minimization. The latter is
$$\min_{\beta}\ (Y - X\beta)'(Y - X\beta) \quad \text{subject to} \quad \|\beta\|_2^2 \leq \tau$$
for some $\tau > 0$. The constrained problem has the Lagrangian
$$\min_{\beta}\ (Y - X\beta)'(Y - X\beta) + \lambda \left( \|\beta\|_2^2 - \tau \right),$$
where $\lambda$ is the Lagrange multiplier on the constraint. The first order conditions are identical to those of the penalized problem, so the two problems have the same solution, with a one-to-one correspondence between $\lambda$ and $\tau$. The practical difference between the penalization and constraint problems is that in the first you specify the ridge parameter $\lambda$ while in the second you specify the constraint level $\tau$.
To find
Figure 29.1: Ridge Regression Dual Minimization Solution
To visualize the constraint problem see Figure
and minimize the SSE subject to the penalty
The solution is
where
This allows some coefficients to be penalized more (or less) than other coefficients. This added flexibility comes at the cost of selecting the ridge parameters
The most popular method to select the ridge parameter $\lambda$ is cross-validation. The CV-selected ridge parameter is the value $\widehat{\lambda}$ which minimizes the cross-validation criterion $\mathrm{CV}(\lambda)$. In practice it may be tricky to minimize $\mathrm{CV}(\lambda)$, as the function can be quite flat near its minimum, so it is common to evaluate the criterion over a grid of values for $\lambda$.
As for least squares there is a simple formula to calculate the CV criterion for ridge regression which greatly speeds computation.
Theorem 29.1 The leave-one-out ridge regression prediction errors are
$$\widetilde{e}_i = \left( 1 - X_i' \left( X'X + \lambda I_p \right)^{-1} X_i \right)^{-1} \widehat{e}_i,$$
where $\widehat{e}_i = Y_i - X_i'\widehat{\beta}_{\mathrm{ridge}}$ are the ridge regression residuals.
For a proof see Exercise 29.1.
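This shortcut makes the CV criterion cheap to evaluate over a grid. A minimal R sketch, reusing the hypothetical `x` and `y` objects above:

```r
# Leave-one-out CV for ridge via e_i / (1 - h_i(lambda)),
# with h_i(lambda) = x_i'(X'X + lambda*I)^{-1} x_i
ridge_cv <- function(x, y, lambda) {
  p <- ncol(x)
  A_inv <- solve(crossprod(x) + lambda * diag(p))
  h <- rowSums((x %*% A_inv) * x)                 # ridge leverage values
  e <- y - x %*% (A_inv %*% crossprod(x, y))      # ridge residuals
  mean((e / (1 - h))^2)                           # CV criterion
}

lambdas <- seq(0.1, 60, by = 0.1)
cv <- sapply(lambdas, function(l) ridge_cv(x, y, l))
lambda_cv <- lambdas[which.min(cv)]               # CV-selected ridge parameter
```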
An alternative method for selection of $\lambda$ is minimization of a Mallows-type criterion, which adds to the ridge sum of squared residuals a penalty proportional to the effective degrees of freedom of the ridge fit (see Exercise 29.2).
An important caveat is that the ridge regression estimator is not invariant to rescaling the regressors nor other linear transformations. Therefore it is common to apply ridge regression after applying standardizing transformations to the regressors.
Ridge regression can be implemented in R, for example with the glmnet package, which fits ridge regression when its elastic net mixing parameter is set to zero.
29.6 Statistical Properties of Ridge Regression
Under the assumptions of the linear regression model it is straightforward to calculate the exact bias and variance of the ridge regression estimator. Take the linear regression model $Y = X'\beta + e$ with $\mathbb{E}[e \mid X] = 0$. The conditional bias of the ridge estimator (for fixed $\lambda$) is
$$\mathbb{E}\left[ \widehat{\beta}_{\mathrm{ridge}} \right] - \beta = -\lambda \left( X'X + \lambda I_p \right)^{-1} \beta. \tag{29.8}$$
Under random sampling its covariance matrix (conditional on the regressors) is
$$V_{\widehat{\beta}} = \left( X'X + \lambda I_p \right)^{-1} \left( \sum_{i=1}^n X_i X_i' \sigma_i^2 \right) \left( X'X + \lambda I_p \right)^{-1}, \tag{29.9}$$
where $\sigma_i^2 = \mathbb{E}\left[ e_i^2 \mid X_i \right]$.
We can measure estimation efficiency by the mean squared error (MSE) matrix $\mathrm{mse}\left[\widehat{\beta}\right] = \mathbb{E}\left[ \left(\widehat{\beta} - \beta\right) \left(\widehat{\beta} - \beta\right)' \right]$, which equals the covariance matrix plus the outer product of the bias.
Define
Theorem 29.2 In the linear regression model, if
For a proof see Section
Theorem
where
with
Given that the ridge estimator is explicitly biased there are natural concerns about how to interpret standard errors calculated from these covariance matrix estimators. Confidence intervals calculated the usual way will have deficient coverage due to the bias. One answer is to interpret the ridge estimator
29.7 Illustrating Ridge Regression
Figure 29.2: Least Squares and Ridge Regression Estimates of the Return to Experience. (a) Cross-Validation Function; (b) Estimates of Return to Experience.

To illustrate ridge regression we use the CPS dataset with the sample of Asian men with a college education (16 years of education or more) to estimate the experience profile. We consider a fifth-order polynomial in experience for the conditional mean of log wages. We start by standardizing the regressors. We first center experience at its mean, create powers up to order five, and then standardize each to have mean zero and variance one. We estimate the polynomial regression by least squares and by ridge regression, the latter shrinking the five coefficients on experience but not the intercept.
We calculate the ridge parameter by cross-validation. The cross-validation function is displayed in Figure 29.2(a) over the interval [0,60]. Since we have standardized the regressors to have zero mean and unit variance the ridge parameter is scaled comparably with sample size, which in this application is
Figure 29.2(b) displays the estimated experience profiles. Least squares is displayed by dashes and ridge regression by the solid line. The ridge regression estimate is smoother and more compelling. The grey shaded region are
29.8 Lasso
In the previous section we learned that ridge regression minimizes the sum of squared errors plus a 2-norm penalty on the coefficient vector. Model selection (e.g. Mallows) minimizes the sum of squared errors plus the 0-norm penalty (the number of non-zero coefficients). An intermediate case uses the 1-norm penalty. This was proposed by Tibshirani (1996) and is known as the Lasso (for Least Absolute Shrinkage and Selection Operator). The least squares criterion with a 1-norm penalty is
$$\mathrm{SSE}_1(\beta, \lambda) = \sum_{i=1}^n \left( Y_i - X_i'\beta \right)^2 + \lambda \|\beta\|_1.$$
The Lasso estimator is its minimizer
$$\widehat{\beta}_{\mathrm{Lasso}} = \underset{\beta}{\operatorname{argmin}}\ \mathrm{SSE}_1(\beta, \lambda).$$
Except for special cases the solution must be found numerically. Fortunately, computational algorithms are surprisingly simple and fast. An important property is that when $\lambda$ is sufficiently large, some of the coefficient estimates are exactly zero. Thus the Lasso simultaneously shrinks the coefficient estimates towards zero and performs variable selection.
The Lasso minimization problem has the dual constrained minimization problem
$$\min_{\beta} \sum_{i=1}^n \left( Y_i - X_i'\beta \right)^2 \quad \text{subject to} \quad \|\beta\|_1 \leq \tau.$$
To see that the two problems are the same observe that the constrained minimization problem has the Lagrangian
$$\min_{\beta} \sum_{i=1}^n \left( Y_i - X_i'\beta \right)^2 + \lambda \left( \|\beta\|_1 - \tau \right),$$
which has first order conditions
$$-2 \sum_{i=1}^n X_{ji} \left( Y_i - X_i'\beta \right) + \lambda\, \mathrm{sgn}(\beta_j) = 0$$
for each non-zero coefficient $\beta_j$.
This is the same as those for minimization of the penalized criterion. Thus the solutions are identical.
Figure 29.3: Lasso Dual Minimization Solution
The constraint set
The Lasso path is drawn with the dashed line. This is the sequence of solutions obtained as the constraint set is varied. The solution path has the property that it is a straight line from the least squares estimator to the
It is instructive to compare Figures 29.1 and 29.3. The constraint set for ridge regression is a smooth ball, so the solution generically has all coefficients non-zero; the constraint set for the Lasso has flat faces and corners, so the solution frequently occurs at a corner where some coefficients equal exactly zero.
One case where we can explicitly calculate the Lasso estimates is when the regressors are orthogonal, e.g., $X'X = I_p$. In this case the minimization problem separates coefficient by coefficient, which has the explicit solution
$$\widehat{\beta}_j = \mathrm{sgn}\left( \widehat{\beta}_{\mathrm{ols},j} \right) \left( \left| \widehat{\beta}_{\mathrm{ols},j} \right| - \lambda/2 \right)_+,$$
where $\widehat{\beta}_{\mathrm{ols},j}$ is the least squares estimate and $(a)_+ = \max(a, 0)$. This shows that the Lasso estimate is a continuous transformation of the least squares estimate. For small values of the least squares estimate the Lasso estimate is set to zero. For all other values the Lasso estimate moves the least squares estimate towards zero by $\lambda/2$.
Figure 29.4: Transformations of Least Squares Estimates by Selection, Ridge, and Lasso. (a) Selection and Ridge; (b) Lasso.
It is instructive to contrast this behavior with ridge regression and selection estimation. When $X'X = I_p$ the ridge estimator equals $\widehat{\beta}_{\mathrm{ols},j}/(1+\lambda)$, a smooth shrinkage of the least squares estimate towards zero, while a selection estimator equals the least squares estimate when its magnitude exceeds a threshold and equals zero otherwise. These transformations, together with the Lasso transformation, are displayed in Figure 29.4.
The Lasso and ridge estimators are continuous functions while the selection estimator is a discontinuous function. The Lasso and selection estimators are thresholding functions, meaning that the function equals zero for a region about the origin. Thresholding estimators are selection estimators because they equal zero when the least squares estimator is sufficiently small. The Lasso function is a “soft thresholding” rule as it is a continuous function with bounded first derivative. The selection estimator is a “hard thresholding” rule as it is discontinuous. Hard thresholding rules tend to have high variance due to the discontinuous transformation. Consequently, we expect the Lasso to have reduced variance relative to selection estimators, permitting overall lower MSE.
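The three rules can be written as simple functions of the least squares estimate. The following R sketch plots them for orthogonal regressors; the scaling of the threshold parameter is illustrative rather than the exact scaling used in the figure:

```r
# Shrinkage/selection rules applied to a least squares estimate b
# (orthogonal regressors; threshold scaling is illustrative)
ridge_rule  <- function(b, lam) b / (1 + lam)                    # smooth shrinkage
lasso_rule  <- function(b, lam) sign(b) * pmax(abs(b) - lam, 0)  # soft threshold
select_rule <- function(b, lam) b * (abs(b) > lam)               # hard threshold

b <- seq(-3, 3, by = 0.01)
plot(b, select_rule(b, 1), type = "l", lty = 3, ylab = "transformed estimate")
lines(b, ridge_rule(b, 1), lty = 2)
lines(b, lasso_rule(b, 1), lty = 1)
```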
As for ridge regression, Lasso is not invariant to the scaling of the regressors. If you rescale a regressor then the penalty has a different meaning. Consequently, it is important to scale the regressors appropriately before applying Lasso. It is conventional to scale all the variables to have mean zero and unit variance.
Lasso is also not invariant to rotations of the regressors. For example, Lasso applied to the pair $(X_1, X_2)$ generally differs from Lasso applied to the rotated pair $\left( (X_1 + X_2)/\sqrt{2},\ (X_1 - X_2)/\sqrt{2} \right)$, even though the two contain the same information. This contrasts with least squares, which is invariant to nonsingular rotations of the regressors.
Applications of Lasso estimation in economics are growing. Belloni, Chernozhukov, and Hansen (2014) illustrate the method using three applications: (1) the effect of eminent domain on housing prices in an instrumental variables framework; (2) a re-examination of the effect of abortion on crime using the framework of Donohue and Levitt (2001); (3) a re-examination of the effect of democracy on growth using the framework of Acemoglu, Johnson, and Robinson (2001). Mullainathan and Spiess (2017) illustrate machine learning using a prediction model for housing prices based on housing characteristics. Oster (2018) uses household scanner data to measure the effect of a diabetes diagnosis on food purchases.
29.9 Lasso Penalty Selection
Critically important for Lasso estimation is the penalty parameter $\lambda$, which controls the degree of shrinkage and the number of selected variables. It is common in the statistics literature to see the coefficient estimates plotted as a function of $\lambda$; this plot is known as the Lasso path. The most common selection method is minimization of the K-fold cross validation criterion (see Section 28.9). Leave-one-out CV is not typically used as it is computationally expensive. Many programs set the default number of folds to $K = 10$.
K-fold cross validation is an estimator of out-of-sample mean squared forecast error. Therefore penalty selection by minimization of the K-fold criterion is aimed to select models with good forecast accuracy, but not necessarily for other purposes such as accurate inference.
Conventionally, the value of $\lambda$ which minimizes the K-fold criterion is selected. K-fold cross validation is implemented by first randomly dividing the observations into $K$ groups (folds) of roughly equal size. For each fold, the Lasso estimates are computed using the observations outside that fold, and squared prediction errors are computed for the observations within the fold. The CV criterion is the average of the squared prediction errors across all observations.
Asymptotic consistency of CV selection for Lasso estimation has been demonstrated by Chetverikov, Liao, and Chernozhukov (2021).
29.10 Lasso Computation
The constraint representation of Lasso is minimization of a quadratic subject to linear inequality constraints. This can be implemented by standard quadratic programming which is computationally simple. For evaluation of the cross-validation function, however, it is useful to compute the entire Lasso path. For this a computationally appropriate method is the modified LARS algorithm. (LARS stands for least angle regression.)
The LARS algorithm produces a path of coefficients starting at the origin and ending at least squares. The sequence corresponds to the sequence of solutions of the constrained problem as the constraint level $\tau$ increases from zero to the unconstrained (least squares) value.
1. Start with all coefficients equal to zero.
2. Find the regressor $X_j$ most correlated with $Y$.
3. Increase $\beta_j$ in the direction of the sign of its correlation with $Y$, computing the residuals along the way. Stop when some other regressor $X_k$ has the same correlation with the residual as $X_j$.
4. Increase $(\beta_j, \beta_k)$ in their joint least squares direction until some other regressor has the same correlation with the residual.
5. If a non-zero coefficient hits zero, drop it from the active set of variables and recompute the joint least squares direction.
6. Repeat until all predictors are in the model.
This algorithm produces the Lasso path. The equality between the two is not immediately apparent. The demonstration is tedious so is not shown here.
The most popular computational implementation for Lasso is the R glmnet command in the glmnet package. Penalty selection by K-fold cross validation is implemented by the cv.glmnet command.

In Stata, Lasso is available with the command lasso. By default it selects the penalty by minimizing the K-fold cross validation criterion with ten folds.
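A minimal sketch of the R workflow, again assuming hypothetical `x` and `y` objects:

```r
library(glmnet)

fit <- glmnet(x, y, alpha = 1)                     # Lasso path (alpha = 0 gives ridge)
cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)   # 10-fold cross-validation
cvfit$lambda.min                                   # CV-minimizing penalty
b <- coef(cvfit, s = "lambda.min")                 # coefficients at that penalty
yhat <- predict(cvfit, newx = x, s = "lambda.min") # fitted values
```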
29.11 Asymptotic Theory for the Lasso
The current distribution theory of Lasso estimation is challenging and mostly focused on convergence rates. The results are derived under sparsity or approximate sparsity conditions, the former restricting the number of non-zero coefficients, and the second restricting how a sparse model can approximate a general parameterization. In this section we provide a basic convergence rate for the Lasso estimator
The model is the high-dimensional projection framework:
where
We provide a bound on the regression fit (29.12), the 1-norm fit
The regression fit (29.12) is similar to measures of fit we have used before, including the integrated squared error (20.22) in series regression and the regression fit
When
Assumption 29.1 Restricted Eigenvalue Condition (REC) With probability approaching 1 as
To gain some understanding of what the REC means, notice that if the minimum (29.13) is taken without restriction over
We provide a rate for the Lasso estimator under the assumption of normal errors. Theorem 29.3 Suppose model (29.11) holds with
Then there is
and
For a proof see Section
Theorem
We stated earlier in this section that we had assumed that the coefficient vector
The key to establishing Theorem
A theory which allows for non-normal heteroskedastic errors has been developed by Belloni, Chen, Chernozhukov, and Hansen (2012). These authors examine an alternative Lasso estimator which adds regressor-specific weights to the penalty function, with weights equalling the square roots of
29.12 Approximate Sparsity
The theory of the previous section used the strong assumption that the true regression is sparse: only a subset of the coefficients are non-zero, and the convergence rate depends on the cardinality of the non-zero coefficients. However, as shown by Belloni, Chen, Chernozhukov, and Hansen (2012), strict sparsity is not required. Instead, convergence rates similar to those in Theorem 29.3 can be obtained under an approximate sparsity condition.
Once again take the high-dimensional regression model (29.11) but do not assume that
with associated approximation error
Assumption 29.2 Approximate Sparsity. For some
Assumption
Belloni, Chen, Chernozhukov, and Hansen (2012) show that convergence results similar to Theorem 29.3 hold under the approximate sparsity condition of Assumption 29.2. The convergence rates are slowed and depend on the approximation exponent
The approximate sparsity condition fails when the regressors cannot be easily ordered. Suppose, for example, that
29.13 Elastic Net
The difference between Lasso and ridge regression is that the Lasso uses the 1-norm penalty while ridge uses the 2-norm penalty. Since both procedures have advantages it seems reasonable that further improvements may be obtained by a compromise. Taking a weighted average of the penalties we obtain the Elastic Net criterion
$$\mathrm{SSE}(\beta, \lambda, \alpha) = \sum_{i=1}^n \left( Y_i - X_i'\beta \right)^2 + \lambda \left( \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right)$$
with weight $\alpha \in [0, 1]$. The cases $\alpha = 1$ and $\alpha = 0$ correspond to the Lasso and ridge regression, respectively. Typically the parameters $(\lambda, \alpha)$ are selected by joint minimization of the K-fold cross-validation criterion.
Elastic net can be implemented in R with the glmnet command. In Stata use elasticnet or the downloadable package lassopack.
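In glmnet the weight between the two penalties is the alpha argument (its exact parameterization of the penalty differs slightly in scaling from the formula above). A minimal sketch of joint selection over a small grid of weights, with the hypothetical `x` and `y` objects:

```r
library(glmnet)

alphas <- c(0.1, 0.25, 0.5, 0.75, 0.9)
foldid <- sample(rep(1:10, length.out = length(y)))  # common folds across alpha
cvs <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a, foldid = foldid))
best <- which.min(sapply(cvs, function(m) min(m$cvm)))
alphas[best]            # selected weight between 1-norm and 2-norm penalties
cvs[[best]]$lambda.min  # selected penalty at that weight
```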
29.14 Post-Lasso
The Lasso estimator simultaneously selects variables and shrinks the non-zero coefficient estimates towards zero. The shrinkage induces bias in the coefficient estimates. A popular remedy is to use the Lasso only for variable selection and then re-estimate the coefficients by least squares; this is called the post-Lasso estimator. The procedure takes two steps. First, estimate the model by Lasso. Second, apply least squares to the regressors whose Lasso coefficient estimates are non-zero, discarding the regressors whose estimates are zero.
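A minimal R sketch of the two steps, with the hypothetical `x` and `y` objects:

```r
library(glmnet)

# Step 1: Lasso with a CV-selected penalty
cvfit <- cv.glmnet(x, y, alpha = 1)
b_lasso <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop the intercept
selected <- which(b_lasso != 0)

# Step 2: least squares on the selected regressors only (post-Lasso)
post <- lm(y ~ x[, selected, drop = FALSE])
coef(post)
```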
The post-Lasso is a hard thresholding or post-model-selection estimator. Indeed, when the regressors are orthogonal the post-Lasso estimator precisely equals a selection estimator, transforming the least squares coefficient estimates using the hard threshold function displayed in Figure 29.4(a). Consequently, the post-Lasso estimator inherits the statistical properties of PMS estimators (see Sections
29.15 Regression Trees
Regression trees were introduced by Breiman, Friedman, Olshen, and Stone (1984), and are also known by the acronym CART for Classification and Regression Trees. A regression tree is a nonparametric regression using a large number of step functions. The idea is that with a sufficiently large number of split points, a step function can approximate any function. Regression trees may be especially useful when there is a combination of continuous and discrete regressors, in which case traditional kernel and series methods are challenging to implement.
Regression trees can be thought of as a
The goal is to estimate the conditional expectation $m(x) = \mathbb{E}[Y \mid X = x]$.
The literature on regression trees has developed some colorful language to describe the tools based on the metaphor of a living tree.
1. A subsample is a branch.
2. Terminal branches are nodes or leaves.
3. Increasing the number of branches is growing a tree.
4. Decreasing the number of branches is pruning a tree.
The basic algorithm starts with a single branch. Grow a large tree by sequentially splitting the branches. Then prune back using an information criterion. The goal of the growth stage is to develop a rich data-determined tree which has small estimation bias. Pruning back is an application of backward stepwise regression with the goal of reducing over-parameterization and estimation variance.
The regression tree algorithm makes extensive use of the regression sample split algorithm. This is a simplified version of threshold regression (Section 23.7). The method uses NLLS to estimate the model
$$Y = \mu_1 \mathbb{1}\left\{ X_j \leq c \right\} + \mu_2 \mathbb{1}\left\{ X_j > c \right\} + e,$$
with the index $j$ (the splitting variable) and the split point $c$ selected by minimization of the sum of squared errors.
The basic growth algorithm is as follows. The observations are $(Y_i, X_i)$, $i = 1, \ldots, n$.
1. Select a minimum node size $N_{\min}$ (say 5). This is the minimal number of observations on each leaf.
2. Sequentially apply regression sample splits:
(a) Apply the regression sample split algorithm to split each branch into two sub-branches, each with size at least $N_{\min}$.
(b) On each sub-branch, take the sample mean of $Y_i$ for the observations on the sub-branch. This is the estimator of the regression function on the sub-branch. The residuals on the sub-branch are the deviations of $Y_i$ from this mean.
(c) Select the branch whose split most reduces the sum of squared errors.
(d) Split this branch into two branches. Make no other split.
3. Repeat (a)-(d) until each branch cannot be further split. The terminal (unsplit) branches are the leaves.
After the growth algorithm has been run, the estimated regression is a multi-dimensional step function with a large number of branches and leaves.
The basic pruning algorithm is as follows.
1. Define a Mallows-type information criterion which equals the tree's sum of squared residuals plus a penalty parameter $\lambda$ multiplied by the number of leaves.
2. Use backward stepwise regression to reduce the number of leaves:
(a) Identify the leaf whose removal most decreases the information criterion.
(b) Prune (remove) this leaf.
(c) If there is no leaf whose removal decreases the criterion then stop pruning. Otherwise, repeat (a)-(c).

The penalty parameter $\lambda$ controls the amount of pruning and is typically selected by cross-validation.
The advantage of regression trees is that they provide a highly flexible nonparametric approximation. Their main use is prediction. One disadvantage of regression trees is that the results are difficult to interpret as there are no regression coefficients. Another disadvantage is that the fitted regression is a discontinuous step function, which can be a crude approximation to a smooth conditional mean.
The sampling distribution of regression trees is difficult to derive, in part because of the strong correlation between the placement of the sample splits and the estimated means. This is similar to the problems associated with post-model-selection. (See Sections
Regression tree algorithms are implemented in the R package rpart.
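A minimal R sketch of the grow-then-prune procedure with rpart, assuming a hypothetical data frame `dat` containing the outcome `y` and the regressors. rpart prunes by cost-complexity with cross-validation, which plays the role of the Mallows-type criterion described above:

```r
library(rpart)

# Grow a deep tree: small cp allows many splits, minbucket sets the leaf size
tree <- rpart(y ~ ., data = dat, method = "anova",
              control = rpart.control(cp = 0.001, minbucket = 5))

# Prune back using the complexity value minimizing cross-validated error
cp_best <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = cp_best)

fitted <- predict(pruned, newdata = dat)   # multi-dimensional step function
```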
29.16 Bagging
Bagging (bootstrap aggregating) was introduced by Breiman (1996) as a method to reduce the variance of a predictor. We focus here on its use for estimation of a conditional expectation. The basic idea is simple. You generate a large number of bootstrap samples, compute the estimator on each, and take the average. The average is the bagging estimator.
Bagging is believed to be useful when the CEF estimator has low bias but high variance. This occurs for hard thresholding estimators such as regression trees, model selection, and post-Lasso. Bagging is a smoothing operation which reduces variance. The resulting bagging estimator can have lower MSE as a result. Bagging is believed to be less useful for estimators with high bias, as bagging may exaggerate the bias.
We first describe the estimation algorithm. Let $\widehat{m}(x)$ be an estimator of the conditional expectation $m(x) = \mathbb{E}[Y \mid X = x]$, such as a regression tree. Draw $B$ nonparametric bootstrap samples, compute the estimator $\widehat{m}_b^*(x)$ on each, and take the average
$$\overline{m}(x) = \frac{1}{B} \sum_{b=1}^B \widehat{m}_b^*(x).$$
This is the bagging estimator of $m(x)$. As $B \to \infty$ the bagging estimator converges to the bootstrap expectation $\mathbb{E}^*\left[ \widehat{m}^*(x) \right]$, so the choice of $B$ affects only simulation accuracy.
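A minimal R sketch of bagged regression trees, reusing the hypothetical `dat` data frame from the previous section:

```r
library(rpart)

# Bagging: average B regression trees, each grown on a bootstrap sample
bag_predict <- function(dat, newdata, B = 200) {
  n <- nrow(dat)
  preds <- replicate(B, {
    idx <- sample.int(n, n, replace = TRUE)          # nonparametric bootstrap draw
    tree <- rpart(y ~ ., data = dat[idx, ],
                  control = rpart.control(cp = 0.001, minbucket = 5))
    predict(tree, newdata = newdata)
  })
  rowMeans(preds)                                    # bagging average
}

mbar <- bag_predict(dat, newdata = dat)
```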
To understand the bagging process we use an example from Bühlmann and Yu (2002). As in Section
Suppose that the bagging estimator is constructed using the parametric bootstrap
Figure 29.5: Bagging and Selection. (a) Selection and Bagging Transformations; (b) MSE of Selection and Bagging Estimators.
Bühlmann and Yu (2002) argue that smooth transformations generally have lower variance than hard threshold transformations, and thus that bagging can reduce the variance and MSE of hard threshold estimators.
The most common application of bagging is to regression trees. Trees have a similar structure to our example selection estimator
Wager, Hastie, and Efron (2014) propose estimators of the variance of the bagging estimator, including an infinitesimal jackknife estimator.
This variance estimator is based on Efron (2014).
While Breiman’s proposal and most applications of bagging are implemented using the nonparametric bootstrap, an alternative is to use subsampling. A subsampling estimator is based on sampling without replacement rather than with replacement as done in the conventional bootstrap. Samples of size $m < n$ are drawn without replacement and the estimator is averaged across the subsamples. Distribution theory is often more tractable for subsampled estimators than for bootstrap averages.
29.17 Random Forests
Random forests, introduced by Breiman (2001), are a modification of bagged regression trees. The modification is designed to reduce estimation variance. Random forests are popular in machine learning applications and have effectively displaced simple regression trees.
Consider the procedure of applying bagging to regression trees. Since bootstrap samples are similar to one another the estimated bootstrap regression trees will also be similar to one another, particularly in the sense that they tend to have the splits based on the same variables. This means that conditional on the sample the bootstrap regression trees are positively correlated. This correlation means that the variance of the bootstrap average remains high even when the number of bootstrap replications $B$ is large. Random forests reduce this correlation by adding randomization to the tree-growing step: only a random subset of the regressors is considered as candidates for each split.
The basic random forest algorithm is as follows. The recommended defaults are taken from the description in Hastie, Tibshirani, and Friedman (2008).
1. Pick a minimum leaf size $N_{\min}$ (default 5), a minimal split fraction $\alpha$, and a sampling number $m$ (default $p/3$, the number of variables considered at each split).
2. For $b = 1, \ldots, B$:
(a) Draw a nonparametric bootstrap sample.
(b) Grow a regression tree on the bootstrap sample using the following steps:
i. Select $m$ variables at random from the $p$ regressors.
ii. Among these $m$ variables, pick the one which produces the best regression split, where each split subsample has at least $N_{\min}$ observations and at least a fraction $\alpha$ of the observations in the branch.
iii. Split the bootstrap sample accordingly.
iv. Stop when each leaf has between $N_{\min}$ and $2N_{\min} - 1$ observations.
(c) Set $\widehat{m}_b^*(x)$ as the sample mean of $Y_i$ on each leaf of the bootstrap tree.
3. The random forest estimator is the average of the bootstrap trees, $\widehat{m}_{\mathrm{rf}}(x) = B^{-1} \sum_{b=1}^B \widehat{m}_b^*(x)$.
Using randomization to reduce the number of variables considered at each split from $p$ to $m$ de-correlates the bootstrap regression trees, which reduces the variance of the bootstrap average relative to simple bagging.
The infinitesimal jackknife (29.18) can be used for variance and standard error estimation, as discussed in Wager, Hastie, and Efron (2014).
While random forests are popular in applications, a distributional theory has been slow to develop. Some of the more recent results have made progress by focusing on random forests generated by subsampling rather than bootstrap (see the discussion at the end of the previous section).
A variant proposed by Wager and Athey (2018) is to use honest trees (see the discussion at the end of Section 29.15) to remove the dependence between the sample splits and the sample means.
Consistency and asymptotic normality have been established by Wager and Athey (2018). They assume that the conditional expectation and variance are Lipschitz-continuous in $x$, along with other regularity conditions, and show that the random forest estimator (based on subsampling and honest trees) satisfies
$$\frac{\widehat{m}(x) - m(x)}{\sigma_n(x)} \xrightarrow{d} \mathrm{N}(0, 1)$$
for some variance sequence $\sigma_n(x) \to 0$ as $n \to \infty$.

Furthermore, Wager and Athey (2018) assert (but do not provide a proof) that the variance $\sigma_n^2(x)$ can be consistently estimated by the infinitesimal jackknife.
The standard computational implementation of random forests is the R randomForest command.
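A minimal R sketch, assuming the hypothetical regressor matrix `x` and outcome `y`; mtry is the number of variables sampled at each split and nodesize the minimum leaf size:

```r
library(randomForest)

p <- ncol(x)
rf <- randomForest(x = x, y = y,
                   ntree = 1000,                  # number of bootstrap trees
                   mtry = max(floor(p / 3), 1),   # variables sampled per split
                   nodesize = 5)                  # minimum leaf size
mhat <- predict(rf, newdata = x)                  # random forest estimates
```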
29.18 Ensembling
Ensembling is the term used in machine learning for model averaging across machine learning algorithms. Ensembling is popular in applied machine learning.
We discussed model averaging in Sections 28.26-28.31. Ensembling for machine learning can use many of the same methods. One popular method, known as stacking, is the same as the Jackknife Model Averaging method discussed in Section 28.29. It selects the model averaging weights by minimizing a cross-validation criterion, subject to the constraints that the weights are non-negative and sum to one.
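A minimal R sketch of the weight calculation, assuming a hypothetical matrix `P` whose columns hold cross-validated predictions from each learner and the outcome vector `y`; the constrained least squares problem is solved with the quadprog package:

```r
library(quadprog)

# Stacking weights: minimize ||y - P w||^2 subject to w >= 0 and sum(w) = 1
stack_weights <- function(P, y) {
  M <- ncol(P)
  Dmat <- crossprod(P) + 1e-8 * diag(M)   # small ridge for numerical stability
  dvec <- as.vector(crossprod(P, y))
  Amat <- cbind(rep(1, M), diag(M))       # column 1: sum-to-one; rest: w_j >= 0
  bvec <- c(1, rep(0, M))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}

# w <- stack_weights(P, y)   # ensemble prediction is then P %*% w
```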
Unfortunately, the theoretical literature concerning ensembling is thin. Much of the advice concerning specific methods is based on empirical performance.
29.19 Lasso IV
Belloni, Chen, Chernozhukov, and Hansen (2012) propose Lasso for estimation of the reduced form of an instrumental variables regression.
The model is linear IV
$$Y = X'\beta + e, \qquad \mathbb{E}[Z e] = 0,$$
where $X$ is a vector of endogenous regressors and $Z$ is a high-dimensional vector of instruments. The reduced form equations for the endogenous regressors are
$$X = \Gamma' Z + u.$$
The idea is to estimate the reduced form by Lasso and use the fitted values as instruments for $X$.
The paper discusses alternative formulations. One is obtained by split-sample estimation as in Angrist and Krueger (1995) (see Section 12.14). Divide the sample randomly into two independent halves, say $A$ and $B$. Use sample $A$ to estimate the reduced form by Lasso, construct fitted values (instruments) for the observations in sample $B$, and apply IV estimation on sample $B$.

We can reverse the procedure: use $B$ to estimate the reduced form and $A$ for IV estimation, and then average the two estimators. This is the cross-fit split-sample estimator.

Using the asymptotic theory for Lasso estimation the authors show that these estimators are asymptotically equivalent to estimation using the infeasible instrument $\mathbb{E}[X \mid Z]$.
Theorem 29.4 Under the Assumptions listed in Theorem 3 of Belloni, Chen, Chernozhukov, and Hansen (2012), including
then
where
For a sketch of the proof see Section
Equation (29.19) requires that the reduced form coefficient
For Lasso SSIV, equation (29.21) replaces (29.19). This rate condition is weaker, allowing
Belloni, Chen, Chernozhukov, and Hansen (2012) extend Theorem
An important disadvantage of the split-sample and cross-fit estimators is that they depend on the random sorting of the observations into the two samples. Two researchers applying the same method to the same data will obtain different estimates.
IV Lasso can be implemented in Stata using the downloadable package ivlasso.
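A minimal R sketch of the split-sample construction with a single endogenous regressor, assuming hypothetical vectors `y` (outcome) and `d` (endogenous regressor) and a high-dimensional instrument matrix `z`; the intercept is omitted for simplicity:

```r
library(glmnet)

n <- length(y)
inA <- sample(rep(c(TRUE, FALSE), length.out = n))   # random split into halves A, B

# Reduced form estimated by Lasso on half A
fs <- cv.glmnet(z[inA, ], d[inA], alpha = 1)

# Constructed instrument and IV estimate on half B
dhat_B <- as.numeric(predict(fs, newx = z[!inA, ], s = "lambda.min"))
beta_ssiv <- sum(dhat_B * y[!inA]) / sum(dhat_B * d[!inA])

# The cross-fit version repeats the calculation with the roles of the
# halves reversed and averages the two estimates.
```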
29.20 Double Selection Lasso
Post-estimation inference is difficult with most machine learning estimators. For example, consider the post-Lasso estimator (least squares applied to the regressors selected by the Lasso). This is a post-model-selection (PMS) estimator, as discussed in Sections
Consider the linear model
$$Y = \theta D + X'\beta + e, \tag{29.22}$$
where $D$ is a scalar regressor of interest (for example, a policy or treatment variable) and $X$ is a high-dimensional vector of control variables. The goal is inference on the coefficient $\theta$.

Suppose you estimate model (29.22) by post-Lasso, only penalizing the coefficients $\beta$ on the controls and not the coefficient $\theta$ on $D$. The Lasso step may drop controls which are correlated with $D$ and have small but non-zero coefficients. The resulting omitted variables bias can distort inference, so confidence intervals constructed from this estimator can have poor coverage.

Belloni, Chernozhukov, and Hansen (2014b) deduce that improved coverage accuracy can be achieved if the variable selection explicitly accounts for the relationship between $D$ and the controls. Write this relationship as the auxiliary regression
$$D = X'\gamma + u. \tag{29.23}$$
Substituting (29.23) into (29.22) we obtain a reduced form for $Y$:
$$Y = X'\pi + v, \tag{29.24}$$
where $\pi = \beta + \theta\gamma$ and $v = e + \theta u$.
The double-selection estimator as recommended by Belloni, Chernozhukov, and Hansen (2014b) is:
1. Estimate (29.23) by Lasso. Let $X_1$ denote the controls with non-zero coefficient estimates.
2. Estimate (29.24) by Lasso. Let $X_2$ denote the controls with non-zero coefficient estimates.
3. Let $X_3$ be the union of the variables in $X_1$ and $X_2$.
4. Regress $Y$ on $(D, X_3)$ to obtain the double-selection coefficient estimate $\widehat{\theta}$.
5. Calculate a conventional (heteroskedastic) standard error for $\widehat{\theta}$.
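A minimal R sketch of these steps, assuming hypothetical objects `y` (outcome), `d` (the regressor of interest), and `x` (the high-dimensional control matrix); penalties are chosen by cross-validation for simplicity, whereas the authors recommend a plug-in penalty rule:

```r
library(glmnet)
library(sandwich)
library(lmtest)

# Lasso selections from the d-on-x and y-on-x regressions
sel_d <- which(as.numeric(coef(cv.glmnet(x, d, alpha = 1), s = "lambda.min"))[-1] != 0)
sel_y <- which(as.numeric(coef(cv.glmnet(x, y, alpha = 1), s = "lambda.min"))[-1] != 0)
sel <- union(sel_d, sel_y)                     # union of selected controls

ds <- lm(y ~ d + x[, sel, drop = FALSE])       # OLS with d plus selected controls
coeftest(ds, vcov = vcovHC(ds, type = "HC1"))["d", ]   # robust inference on d
```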
Belloni, Chernozhukov, and Hansen (2014b) show that when both (29.22) and (29.23) satisfy an approximate sparsity structure (so that the regressions are well approximated by a finite set of regressors) then the double-selection estimator $\widehat{\theta}$ is asymptotically normal, and confidence intervals constructed with conventional standard errors have asymptotically correct coverage.
It should be emphasized that this distributional claim is asymptotic; finite sample inferences remain distorted from nominal levels. Furthermore, the result rests on the adequacy of the approximate sparsity assumption for both the structural equation (29.22) and the auxiliary regression (29.23).
The primary advantage of the double-selection estimator is its simplicity and clear intuitive structure.
In Stata, the double-selection Lasso estimator can be computed by the dsregress command or with the pdslasso add-on package. Double-selection is available in
29.21 Post-Regularization Lasso
A potential improvement on double-selection Lasso is the post-regularization Lasso estimator of Chernozhukov, Hansen, and Spindler (2015), which is labeled as partialing-out Lasso in the Stata manual. The estimator is essentially the same as Robinson (1988) for the partially linear model (see Section 19.24) but estimated by Lasso rather than kernel regression.
We first transform the structural equation (29.22) to eliminate the high-dimensional component. Take the expected value of (29.22) conditional on $X$ and subtract it from (29.22). This yields
$$Y - \mathbb{E}[Y \mid X] = \theta \left( D - \mathbb{E}[D \mid X] \right) + e.$$
Notice that this eliminates the high-dimensional regressor $X$ from the equation. If the conditional expectations $\mathbb{E}[Y \mid X]$ and $\mathbb{E}[D \mid X]$ were known, $\theta$ could be estimated by a simple regression of $Y - \mathbb{E}[Y \mid X]$ on $D - \mathbb{E}[D \mid X]$. They are unknown, but under (29.23) and (29.24) they equal $X'\gamma$ and $X'\pi$, which can be estimated by Lasso.
The estimator recommended by Chernozhukov, Hansen, and Spindler (2015) is:
1. Estimate (29.23) by Lasso or post-Lasso (with its own penalty parameter). Let $\widehat{\gamma}$ be the coefficient estimate and $\widehat{u} = D - X'\widehat{\gamma}$ the residual.
2. Estimate (29.24) by Lasso or post-Lasso (with its own penalty parameter). Let $\widehat{\pi}$ be the coefficient estimate and $\widehat{v} = Y - X'\widehat{\pi}$ the residual.
3. Let $\widehat{\theta}$ be the OLS coefficient from the regression of $\widehat{v}$ on $\widehat{u}$.
4. Calculate a conventional (heteroskedastic) standard error for $\widehat{\theta}$.
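A minimal R sketch of the partialing-out steps with the same hypothetical `y`, `d`, `x` objects (again using cross-validated penalties for simplicity):

```r
library(glmnet)

# Residualize d and y with respect to the high-dimensional controls
uhat <- d - as.numeric(predict(cv.glmnet(x, d, alpha = 1), newx = x, s = "lambda.min"))
vhat <- y - as.numeric(predict(cv.glmnet(x, y, alpha = 1), newx = x, s = "lambda.min"))

# Post-regularization estimate: OLS of the y-residual on the d-residual
pr <- lm(vhat ~ uhat)
coef(pr)["uhat"]   # estimate of the coefficient of interest
# (use a heteroskedasticity-robust standard error for inference, as in the text)
```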
Chernozhukov, Hansen, and Spindler (2015) introduce the following insight to understand why the partialing-out estimator is robust to first-step selection errors while the naive plug-in estimator is not. The naive estimator of $\theta$ is based on the moment condition
$$\mathbb{E}\left[ D \left( Y - \theta D - X'\beta \right) \right] = 0.$$
Its sensitivity with respect to the nuisance parameter $\beta$ is
$$\frac{\partial}{\partial \beta'} \mathbb{E}\left[ D \left( Y - \theta D - X'\beta \right) \right] = -\mathbb{E}\left[ D X' \right],$$
which is non-zero when $D$ and $X$ are correlated. Estimation error in the Lasso estimate of $\beta$ therefore feeds directly into the estimate of $\theta$.

In contrast, the moment condition for the partialing-out estimator is
$$\mathbb{E}\left[ \left( D - X'\gamma \right) \left( Y - X'\pi - \theta \left( D - X'\gamma \right) \right) \right] = 0.$$
Its sensitivity with respect to the nuisance parameters $(\gamma, \pi)$ is zero. This equals zero because $\mathbb{E}[X u] = 0$ and $\mathbb{E}[X e] = 0$: the moment condition is first-order insensitive (orthogonal) to estimation error in the nuisance parameters.
These insights are formalized in the following distribution theory.
Theorem 29.5 Suppose model (29.22)-(29.23) holds and Assumption
Then
Furthermore, the standard variance estimator for
For a proof see Section
In order to provide a simple proof, Theorem
Theorem
The partialing-out Lasso estimator is available with the poregress command in Stata (implemented with post-Lasso estimation only), or with the pdslasso add-on package. Partialing-out Lasso is available in
29.22 Double/Debiased Machine Learning
The most recent contribution to inference methods for model (29.22) is the Double/Debiased machine learning (DML) estimator of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018). Our description will focus on linear regression estimated by Lasso, though their treatment is considerably more general. This estimation method has received considerable attention among econometricians in recent years and is considered the state-of-the-art estimation method.
The DML estimator extends the post-regularization estimator of the previous section by adding sample splitting, similarly to the split-sample IV estimator (see Section 29.19). The authors argue that this reduces the dependence between the estimation stages and can improve performance.
As presented in the previous section, the post-regularization estimator first estimates the coefficients of (29.23) and (29.24) by Lasso on the full sample and then uses the full-sample residuals in the second-step regression. The DML estimator instead uses cross-fitting: the residuals for each observation are constructed from nuisance regressions estimated on a subsample which excludes that observation. The algorithm is as follows.
1. Randomly partition the sample into $K$ folds of roughly equal size $n/K$.
2. For each fold $k = 1, \ldots, K$:
(a) Use all observations except those in fold $k$ to estimate the coefficients $\gamma$ and $\pi$ in (29.23) and (29.24) by Lasso or post-Lasso. Write these leave-fold-out estimates as $\widehat{\gamma}_{(-k)}$ and $\widehat{\pi}_{(-k)}$.
(b) For the observations in fold $k$ set $\widehat{u}_i = D_i - X_i'\widehat{\gamma}_{(-k)}$ and $\widehat{v}_i = Y_i - X_i'\widehat{\pi}_{(-k)}$. These are the residuals for fold $k$ constructed from the leave-fold-out estimates.
3. Set $\widehat{\theta}$ equal to the OLS coefficient from the regression of $\widehat{v}_i$ on $\widehat{u}_i$ using all $n$ observations.
4. Construct a conventional (heteroskedastic) standard error for $\widehat{\theta}$.
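A minimal R sketch of the cross-fitting loop with K = 5 folds, again with the hypothetical `y`, `d`, `x` objects:

```r
library(glmnet)

K <- 5
n <- length(y)
fold <- sample(rep(1:K, length.out = n))     # random fold assignment
uhat <- vhat <- numeric(n)                   # cross-fit residuals

for (k in 1:K) {
  ink <- fold == k
  # Nuisance regressions estimated without fold k (leave-fold-out)
  fd <- cv.glmnet(x[!ink, ], d[!ink], alpha = 1)
  fy <- cv.glmnet(x[!ink, ], y[!ink], alpha = 1)
  # Residualize the observations in fold k using the leave-fold-out fits
  uhat[ink] <- d[ink] - as.numeric(predict(fd, newx = x[ink, , drop = FALSE], s = "lambda.min"))
  vhat[ink] <- y[ink] - as.numeric(predict(fy, newx = x[ink, , drop = FALSE], s = "lambda.min"))
}

theta_dml <- sum(uhat * vhat) / sum(uhat^2)  # DML2: pooled regression of vhat on uhat
```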
The authors call this estimator DML2. A variant, DML1, computes the coefficient estimate separately within each fold and then averages the $K$ fold-specific estimates.
The estimator requires the selection of the number of folds $K$; common choices in practice are $K = 5$ and $K = 10$.
Theorem 29.6 Under the assumptions of Theorem 29.5,
Furthermore, the standard variance estimator for
Theorem
The authors argue that the DML estimator has improved sampling performance due to an improved rate of convergence of certain error terms. If we examine the proof of Theorem 29.5, one of the error bounds is (29.44), which shows that
Under sample splitting, however, we have an improved rate of convergence. The components
The advantage of the DML estimator over the post-regularization estimator is that the sample splitting eliminates the dependence between the two estimation steps, thereby reducing post-model-selection bias. The procedure has several disadvantages, however. First, the estimator is random due to the sample splitting. Two researchers with the same data set but making different random splits will obtain two distinct estimates. This arbitrariness is unsettling. This randomness can be reduced by using a larger value of $K$, at the cost of additional computation.
At the beginning of this section the DML estimator was described as the “state-of-the-art”. This field is rapidly developing so this specific estimator may be soon eclipsed by a further iteration.
In Stata, the DML estimator is available with the xporegress command. By default it implements the DML2 estimator with cross-fitting over ten folds.
29.23 Technical Proofs*
Proof of Theorem 29.2 Combining (29.8) and (29.9) we find that
The MSE of the least squares estimator is
Their difference is
where
The right-hand-side of (29.28) is positive definite if
which is strictly positive when
Proof of Theorem 29.3 Define
By Boole’s inequality (B.24), (29.29), Jensen’s inequality,
Since
holds with arbitrarily large probability. The remainder of the proof is algebraic, based on manipulations of the estimation criterion function, conditional on the event (29.31).
Since
Writing out the left side, dividing by
The second inequality is Hölder’s (29.2) and the third holds by (29.31).
Partition
the second inequality using the fact
An implication of (29.32) is
This is the only (but key) point in the proof where Assumption
Together with (29.32), (29.33) implies
The second inequality is (29.4). The third is
which is (29.17) with
This implies (29.15) with
Equation (29.34) also implies
Using (29.4) and (29.17)
Hence
which is (29.16) with
Proof of Theorem 29.4 We provide a sketch of the proof. We start with Lasso IV. First, consider the idealized estimator
For simplicity assume that
and
Similar to (29.30), under sufficient regularity conditions
By the Schwarz inequality and (29.37)
the final inequality under (29.19). This establishes (29.35).
By the Hölder inequality (29.2), (29.38), and (29.39),
the final inequality under (29.19). This establishes (29.36).
Now consider Lasso SSIV. The steps are essentially the same except for (29.40). For this we use the fact that
the final bounds by (29.37) and (29.21). Thus
Proof of Theorem 29.5 The idealized estimator
which has the stated asymptotic distribution. The Theorem therefore holds if replacement of
The denominator equals
The numerator equals
The terms on the right side beyond the first are asymptotically negligible because
by Theorem
by the Schwarz inequality and the above results, and
by Hölder’s (29.2), Theorem 29.3, (29.39), and Assumption (29.26). Similarly
Together we have shown that in (29.41), the replacement of
29.24 Exercises
Exercise 29.1 Prove Theorem 29.1. Hint: The proof is similar to that of Theorem 3.7.
Exercise 29.2 Show that (29.7) is the Mallows criterion for ridge regression. For a definition of the Mallows criterion see Section
Exercise 29.3 Derive the conditional bias (29.8) and variance (29.9) of the ridge regression estimator.
Exercise 29.4 Show that the ridge regression estimator can be computed as least squares applied to an augmented data set. Take the original data
Exercise 29.5 Which estimator produces a higher regression
Exercise 29.6 Does ridge regression require that the columns of
Exercise 29.7 Repeat the previous question for Lasso regression. Show that the Lasso coefficient estimates
Exercise 29.8 You have the continuous variables
Exercise 29.9 Take the cpsmar09 dataset and the subsample of Asian women
Exercise 29.10 Repeat the above exercise using the subsample of Hispanic men