28 Model Selection, Stein Shrinkage, and Model Averaging
28.1 Introduction
The chapter reviews model selection, James-Stein shrinkage, and model averaging.
Model selection is a tool for selecting one model (or estimator) out of a set of models. Different model selection methods are distinguished by the criteria used to rank and compare models.
Model averaging is a generalization of model selection. Models and estimators are averaged using data-dependent weights.
James-Stein shrinkage modifies classical estimators by shrinking them towards a reasonable target. Shrinkage can reduce mean squared error.
Two excellent monographs on model selection and averaging are Burnham and Anderson (1998) and Claeskens and Hjort (2008). James-Stein shrinkage theory is thoroughly covered in Lehmann and Casella (1998). See also Wasserman (2006) and Efron (2010).
28.2 Model Selection
In the course of an applied project an economist will routinely estimate multiple models. Indeed, most applied papers include tables displaying the results from different specifications. The question arises: Which model is best? Which should be used in practice? How can we select the best choice? This is the question of model selection.
Take, for example, a wage regression. Suppose we want a model which includes education, experience, region, and marital status. How should we proceed? Should we estimate a simple linear model plus a quadratic in experience? Should education enter linearly, as a simple spline as in Figure 2.6(a), or with separate dummies for each education level? Should marital status enter as a simple dummy (married or not) or should we allow for all recorded categories? Should interactions be included? Which? How many? Taken together, we need to select the specific regressors to include in the regression model.
Model “selection” may be mis-named. It would be more appropriate to call the issue “estimator selection”. When we examine a table containing the results from multiple regressions we are comparing multiple estimators of the same regression. One estimator may include fewer variables than another; that is a restricted estimator. One may be estimated by least squares and another by 2SLS. Another could be nonparametric. The underlying model is the same; the difference is the estimator. Regardless, the literature has adopted the term “model selection” and we will adhere to this convention. To gain some basic understanding it may be helpful to start with a stylized example. Suppose that we have a
The calculations simplify by setting
For our two estimators we calculate that
(See Exercise 28.1) The WMSE of
The comparison between (28.1) and (28.2) is a basic bias-variance trade-off. The estimator
Selection based on WMSE suggests that we should ideally select the estimator
so is biased. An unbiased estimator is
This is an overly simplistic setting but it highlights the fundamental ingredients of criterion-based model selection. Comparing the MSE of different estimators typically involves a trade-off between bias and variance, with more complicated models exhibiting less bias but increased estimation variance. The actual trade-off is unknown because the bias depends on the unknown true parameters. The bias, however, can be estimated, giving rise to empirical estimates of the MSE and empirical model selection rules.
A large number of model selection criteria have been proposed. We list here those most frequently used in applied econometrics.
We first list selection criteria for the linear regression model estimated by least squares, where $\hat{\sigma}_m^2 = n^{-1}\mathrm{SSE}_m$ is the estimated error variance and $K_m$ is the number of estimated coefficients in model $m$.

Bayesian Information Criterion: $\mathrm{BIC}_m = n \log \hat{\sigma}_m^2 + K_m \log n$.

Akaike Information Criterion: $\mathrm{AIC}_m = n \log \hat{\sigma}_m^2 + 2 K_m$.

Cross-Validation: $\mathrm{CV}_m = \sum_{i=1}^n \tilde{e}_{i,m}^2$, where $\tilde{e}_{i,m}$ are the leave-one-out prediction errors.

We next list two commonly-used selection criteria for likelihood-based estimation. Let $\ell_n(\hat{\theta}_m)$ denote the maximized log-likelihood of model $m$ with $K_m$ estimated parameters.

Bayesian Information Criterion: $\mathrm{BIC}_m = -2 \ell_n(\hat{\theta}_m) + K_m \log n$.

Akaike Information Criterion: $\mathrm{AIC}_m = -2 \ell_n(\hat{\theta}_m) + 2 K_m$.
In the following sections we derive and discuss these and other model selection criteria.
28.3 Bayesian Information Criterion
The Bayesian Information Criterion (BIC), also known as the Schwarz Criterion, was introduced by Schwarz (1978). It is appropriate for parametric models estimated by maximum likelihood and is used to select the model with the highest approximate probability of being the true model.
Let
The marginal density
Schwarz (1978) established the following approximation.
Theorem 28.1 Schwarz. If the model
where the
A heuristic proof for normal linear regression is given in Section 28.32. A “diffuse” prior is one which distributes weight uniformly over the parameter space.
Schwarz’s theorem shows that the marginal likelihood approximately equals the maximized likelihood multiplied by an adjustment depending on the number of estimated parameters and the sample size. The approximation (28.6) is commonly called the Bayesian Information Criterion or BIC. The BIC is a penalized log-likelihood.
where
Since
Now suppose that we have two models
Bayes selection picks the model with highest probability. Thus if
Finding the model with highest marginal likelihood is the same as finding the model with lowest value of
The above discussion concerned two models but applies to any number of models. BIC selection picks the model with the smallest BIC. For implementation you simply estimate each model, calculate its BIC, and compare.
The BIC may be obtained in Stata by using the command estimates stats after an estimated model.
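As a concrete illustration, the following R sketch compares candidate regressions by BIC. The data frame `dat` and the variable names (`wage`, `education`, `experience`, `married`) are hypothetical, and all models must be fit on the same observations for the comparison to be meaningful.

```r
# Hypothetical example: BIC comparison of candidate wage regressions.
# All models must be estimated on the same sample.
models <- list(
  m1 = lm(log(wage) ~ education + experience, data = dat),
  m2 = lm(log(wage) ~ education + experience + I(experience^2), data = dat),
  m3 = lm(log(wage) ~ education * married + experience + I(experience^2), data = dat)
)
bic <- sapply(models, BIC)      # BIC = -2 * log-likelihood + K * log(n)
bic
names(models)[which.min(bic)]   # BIC selection: the model with the smallest BIC
```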
28.4 Akaike Information Criterion for Regression
The Akaike Information Criterion (AIC) was introduced by Akaike (1973). It is used to select the model whose estimated density is closest to the true density. It is designed for parametric models estimated by maximum likelihood.
Let
To measure the distance between the two densities
Notice that
Thus
This is random as it depends on the estimator
The first term in (28.8) does not depend on the model. So minimization of expected KLIC distance is minimization of the second term. Multiplied by 2 (similarly to the BIC) this is
The expectation is over the random estimator
An alternative interpretation is to notice that the integral in (28.9) is an expectation over
where
To gain further understanding we consider the simple case of the normal linear regression model with
The expected value at the true parameter values is
To simplify the calculations, we add the assumption that the variance
Theorem 28.2 Suppose
The proof is given in Section 28.32.
Expression (28.13) shows the converse story. It shows that the sample log-likelihood function is smaller than the idealized value
Combining these expressions we can suggest an unbiased estimator for
Interestingly the AIC takes a similar form to the BIC. Both the AIC and BIC are penalized log likelihoods, and both penalties are proportional to the number of estimated parameters
Selecting a model by the AIC is equivalent to calculating the AIC for each model and selecting the model with the lowest AIC.
Theorem 28.3 Under the assumptions of Theorem 28.2,
One of the interesting features of these results is that they are exact: there is no approximation, and they do not require that the true error is normally distributed. The critical assumption is conditional homoskedasticity. If homoskedasticity fails then the AIC loses its validity.
The AIC may be obtained in Stata by using the command estimates stats after an estimated model.
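For regression models the AIC can equivalently be computed from the least squares residuals. The sketch below, continuing the hypothetical `models` list from the BIC example, uses the form $n \log \hat{\sigma}^2 + 2K$; this differs from R's `AIC()` only by constants that are common to all models fit on the same sample, so the rankings agree.

```r
# Regression-form AIC: n*log(SSE/n) + 2K, applied to the hypothetical models above.
aic_reg <- function(fit) {
  n <- length(resid(fit))
  K <- length(coef(fit))
  n * log(sum(resid(fit)^2) / n) + 2 * K
}
sapply(models, aic_reg)   # same ordering as sapply(models, AIC)
```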
28.5 Akaike Information Criterion for Likelihood
For the general likelihood context Akaike proposed the criterion (28.7). Here,
As for regression, AIC selection is performed by estimating a set of models, calculating AIC for each, and selecting the model with the smallest AIC.
The advantages of the AIC are that it is simple to calculate, easy to implement, and straightforward to interpret. It is intuitive as it is a simple penalized likelihood.
The disadvantage is that its simplicity may be deceptive. The proof shows that the criterion is based on a quadratic approximation to the log likelihood and an asymptotic chi-square approximation to the classical Wald statistic. When these conditions fail the AIC may not be accurate. For example, if the model is an approximate (quasi) likelihood rather than a true likelihood then the failure of the information matrix equality implies that the classical Wald statistic is not asymptotically chi-square. In this case the accuracy of the AIC fails. Another problem is that many nonlinear models have parameter regions where parametric identification fails. In these models the quadratic approximation to the log-likelihood can be poor.
The following is an analog of Theorem 28.3.
Theorem 28.4 Under standard regularity conditions for maximum likelihood estimation, plus the assumption that certain statistics (identified in the proof) are uniformly integrable,
A sketch of the proof is given in Section 28.32.
This result shows that the AIC is, in general, a reasonable estimator of the KLIC fit of an estimated parametric model. The theorem holds broadly for maximum likelihood estimation and thus the AIC can be used in a wide variety of contexts.
28.6 Mallows Criterion
The Mallows Criterion was proposed by Mallows (1973) and is often called the $C_p$ criterion.
Take the homoskedastic regression framework
Write the first equation in vector notation for the
Mallows (1973) proposed the criterion
where
The Mallows criterion can be used similarly to the AIC. A set of regression models is estimated and the criterion
Mallows designed the criterion
This is the expected squared difference between the estimated and true regressions evaluated at the observations.
An alternative motivation for
The best possible (infeasible) value of this quantity is
The difference is the prediction accuracy of the estimator:
which equals Mallows’ measure of fit. Thus
We stated that the Mallows criterion is an unbiased estimator of
Theorem 28.5 If
so the adjusted Mallows criterion
The proof is given in Section 28.32.
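A minimal sketch of Mallows selection, continuing the hypothetical example above: one common convention estimates the error variance from the largest candidate model (here assumed to be `m3`) and adds the penalty $2\tilde{\sigma}^2 K_m$ to each model's sum of squared errors.

```r
# Mallows criterion: SSE_m + 2 * sigma2_tilde * K_m, with sigma2_tilde from the
# largest model (an assumption about which model is trusted for the variance).
sigma2_tilde <- summary(models$m3)$sigma^2
mallows <- sapply(models, function(fit) {
  sum(resid(fit)^2) + 2 * sigma2_tilde * length(coef(fit))
})
names(models)[which.min(mallows)]   # Mallows-selected model
```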
28.7 Hold-Out Criterion
Dividing the sample into two parts, one for estimation and the second for evaluation, creates a simple device for model evaluation and selection. This procedure is often labelled hold-out evaluation. In the recent machine learning literature the data division is typically described as a training sample and a test sample.
The sample is typically divided randomly so that the estimation (training) sample has
Take the standard case of a bipartite division where
Combining this coefficient with the evaluation sample we calculate the prediction errors
In Section
We can see that
When
The estimated MSFE
The hold-out method has two disadvantages. First, if our goal is estimation using the full sample, our desired estimate is
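The basic procedure is easy to sketch in R on the hypothetical wage data: split the sample at random into training and evaluation halves, fit each candidate model on the training half, and use the evaluation half to estimate the MSFE.

```r
# Hold-out (training/test) evaluation sketch on the hypothetical data frame `dat`.
set.seed(1)
idx   <- sample(nrow(dat), floor(nrow(dat) / 2))   # evaluation (test) half
train <- dat[-idx, ]; test <- dat[idx, ]
forms <- list(
  m1 = log(wage) ~ education + experience,
  m2 = log(wage) ~ education + experience + I(experience^2)
)
msfe <- sapply(forms, function(f) {
  fit <- lm(f, data = train)                            # estimate on training sample
  e   <- log(test$wage) - predict(fit, newdata = test)  # out-of-sample prediction errors
  mean(e^2)                                             # estimated MSFE
})
msfe
```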
28.8 Cross-Validation Criterion
In applied statistics and machine learning the default method for model selection and tuning parameter selection is cross-validation. We have introduced some of these concepts throughout the textbook; here we review and unify them. Cross-validation is closely related to the hold-out criterion introduced in the previous section.
In Section
The estimator CV is called the cross-validation (CV) criterion. It is a natural generalization of the hold-out criterion and eliminates the two disadvantages described in the previous section. First, the CV criterion is an unbiased estimator of MSFE
In least squares estimation the CV criterion has a simple computational implementation. Theorem 3.7 shows that the leave-one-out least squares estimator (3.42) equals
where
where the second equality is from Theorem 3.7. Consequently the CV criterion is
Recall as well that in our study of nonparametric regression (Section 19.12) we defined the cross-validation criterion for kernel regression as the weighted average of the squared prediction errors
Theorem
In Section
Implementation of CV model selection is the same as for the other criteria. A set of regression models is estimated and the CV criterion is calculated for each. The model with the smallest value of CV is the CV-selected model.
The CV method is also much broader in concept and potential application. It applies to any estimation method so long as a “leave one out” error can be calculated. It can also be applied to other loss functions beyond squared error loss. For example, a cross-validation estimate of absolute loss is
Computationally and conceptually it is straightforward to select models by minimizing such criteria. However, the properties of applying CV to general criteria are not known.
Stata does not have a standard command to calculate the CV criterion for regression models.
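Although there is no built-in Stata command, the leave-one-out CV criterion for least squares is easy to compute in R using the shortcut described above: the leave-one-out prediction error is the residual divided by one minus the leverage value. A sketch, again using the hypothetical `models` list:

```r
# Leave-one-out CV for least squares without n re-estimations,
# using e_i / (1 - h_ii) as the leave-one-out prediction error.
cv_loo <- function(fit) {
  e_tilde <- resid(fit) / (1 - hatvalues(fit))
  mean(e_tilde^2)          # CV criterion, reported here as an average
}
sapply(models, cv_loo)     # smallest value indicates the CV-selected model
```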
28.9 K-Fold Cross-Validation
There are two deficiencies with the CV criterion which can be alleviated by the closely related K-fold cross-validation criterion. The first deficiency is that CV calculation can be computationally costly when sample sizes are very large or the estimation method is other than least squares. For estimators other than least squares it may be necessary to calculate the estimator $n$ separate times, once for each leave-one-out sample.
An alternative is to split the sample into $K$ groups (folds).
The method works by the following steps. This description is for estimation of a regression model.

1. Randomly sort the observations.
2. Split the observations into $K$ folds of (roughly) equal size. Let $\mathcal{N}_k$ denote the set of observations in fold $k$.
3. For $k = 1, \dots, K$:
   - Exclude fold $k$ from the dataset. This produces a sample with the remaining observations.
   - Calculate the estimator on this sample.
   - Calculate the prediction errors for the observations in fold $k$.
4. Calculate the K-fold criterion as the sample average of the squared prediction errors.
If
A useful feature of
This is similar to a clustered variance formula, where the folds are treated as clusters. The standard error
One disadvantage of K-fold cross-validation is that CV can be sensitive to the initial random sorting of the observations, leading to partially arbitrary results. This problem can be reduced by a technique called repeated CV, which repeats the K-fold CV algorithm over multiple random sorts of the observations and averages the resulting criteria.
CV model selection is typically implemented by selecting the model with the smallest value of CV. An alternative implementation is known as the one standard error (1se) rule and selects the most parsimonious model whose value of CV is within one standard error of the minimum CV. The (informal) idea is that models whose value of CV is within one standard error of the minimum fit essentially equally well, so the most parsimonious among them is preferred.
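A minimal K-fold implementation for a single regression formula is sketched below (K = 10, hypothetical data as before); it also reports the standard error of the CV estimate across folds, which is what the one-standard-error rule uses.

```r
# K-fold cross-validation sketch with a fold-based standard error.
kfold_cv <- function(formula, data, K = 10, seed = 1) {
  set.seed(seed)
  fold <- sample(rep_len(1:K, nrow(data)))          # random fold assignment
  cv_k <- sapply(1:K, function(k) {
    fit <- lm(formula, data = data[fold != k, ])    # estimate excluding fold k
    yk  <- model.response(model.frame(formula, data[fold == k, ]))
    ek  <- yk - predict(fit, newdata = data[fold == k, ])
    mean(ek^2)                                      # fold-k prediction MSE
  })
  c(cv = mean(cv_k), se = sd(cv_k) / sqrt(K))       # CV estimate and its standard error
}
kfold_cv(log(wage) ~ education + experience + I(experience^2), dat)
```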
28.10 Many Selection Criteria are Similar
For the linear regression model many selection criteria have been introduced. However, many of these alternative criteria are quite similar to one another. In this section we review some of these connections. The following discussion is for the standard regression model
Shibata (1980) proposed the criterion
as an estimator of the MSFE. Recalling the Mallows criterion for regression (28.15) we see that Shibata =
Taking logarithms and using the approximation
Thus minimization of Shibata’s criterion and AIC are similar.
Akaike (1969) proposed the Final Prediction Error (FPE) criterion
Using the expansions
Craven and Wahba (1979) proposed Generalized Cross Validation
By the expansion
The above calculations show that the WMSE, AIC, Shibata, FPE, GCV, and Mallows criterion are all close approximations to one another when
The third-to-last line holds asymptotically by the WLLN. The following equality holds under conditional homoskedasticity. The final approximation replaces
28.11 Relation with Likelihood Ratio Testing
Since the AIC and BIC are penalized log-likelihoods, AIC and BIC selection are related to likelihood ratio testing. Suppose we have two nested models
or
where
There are two useful practical implications. One is that when test statistics are reported in their
Another useful implication is in the case of considering a single coefficient (when
Similar comments apply to BIC selection though the effective critical values are different. For comparing models with coefficients
28.12 Consistent Selection
An important property of a model selection procedure is whether it selects a true model in large samples. We call such a procedure consistent.
To discuss this further we need to thoughtfully define what is a “true” model. The answer depends on the type of model.
When a model is a parametric density or distribution
In a semiparametric conditional moment condition model which states
In a semiparametric unconditional moment condition model
In a nonparametric model such as
A complication arises that there may be multiple true models. This cannot occur when models are strictly non-nested (meaning that there is no common element in both model classes) but strictly non-nested models are rare. Most models have non-trivial intersections. For example, the linear regression models
The most common type of intersecting models are nested. In regression this occurs when the two models are
In general, given a set of models
A model selection rule
Definition 28.1 A model selection rule is model selection consistent if
This states that the model selection rule selects a true model with probability tending to 1 as the sample size diverges.
A broad class of model selection methods satisfy this definition of consistency. To see this consider the class of information criteria
This includes AIC
Theorem 28.6 Under standard regularity conditions for maximum likelihood estimation, selection based on IC is model selection consistent if
The proof is given in Section 28.32.
This result covers AIC, BIC and testing-based selection. Thus all are model selection consistent.
A major limitation with this result is that the definition of model selection consistency is weak. A model may be true but over-parameterized. To understand the distinction consider the models
Definition 28.2 A model selection rule asymptotically over-selects if there are models
The definition states that over-selection occurs when two models are nested and the smaller model is true (so both models are true models but the smaller model is more parsimonious) if the larger model is asymptotically selected with positive probability.
Theorem 28.7 Under standard regularity conditions for maximum likelihood estimation, selection based on IC asymptotically over-selects if
The proof is given in Section 28.32.
This result includes both AIC and testing-based selection. Thus these procedures over-select. For example, if the models are
Following this line of reasoning, it is useful to draw a distinction between true and parsimonious models. We define the set of parsimonious models
Of the methods we have reviewed, only BIC selection is consistent for parsimonious models, as we now show.
Theorem 28.8 Under standard regularity conditions for maximum likelihood estimation, selection based on IC is consistent for parsimonious models if for all
as
The proof is given in Section 28.32.
The condition includes BIC because
Some economists have interpreted Theorem
28.13 Asymptotic Selection Optimality
Regressor selection by the AIC/Shibata/Mallows/CV class turns out to be asymptotically optimal with respect to out-of-sample prediction under quite broad conditions. This may appear to conflict with the results of the previous section, but it does not, as there is a critical difference between the goals of consistent model selection and accurate prediction.
Our analysis will be in the homoskedastic regression model conditioning on the regressor matrix
where
The
As in Section
though now we index
In any sample there is an optimal model
Model
Now consider model selection using the Mallows criterion for regression models
where we explicitly index by
Prediction accuracy using the Mallows-selected model is
We consider convergence in (28.18) in terms of the risk ratio because
Li (1987) established the asymptotic optimality (28.18). His result depends on the following conditions.
Assumption 28.1
1. The observations are independent and identically distributed.
2. The regression error has conditional mean zero.
3. The regression error is conditionally homoskedastic.
4. A conditional moment of the regression error is uniformly bounded.
5. The minimal risk among the candidate models, scaled by $n$, diverges as $n \to \infty$.
6. The estimated models are nested.
Assumptions 28.1.2 and 28.1.3 state that the true model is a conditionally homoskedastic regression. Assumption 28.1.4 is a technical condition, that a conditional moment of the error is uniformly bounded. Assumption 28.1.5 is subtle. It effectively states that there is no correctly specified finite-dimensional model. To see this, suppose that there is a
Theorem 28.9 Assumption
The proof is given in Section 28.32.
Theorem
Using a similar argument, Andrews (1991c) showed that selection by cross-validation satisfies the same asymptotic optimality condition without requiring conditional homoskedasticity. The treatment is a bit more technical so we do not review it here. This indicates an important advantage for cross-validation selection over the other methods.
28.14 Focused Information Criterion
Claeskens and Hjort (2003) introduced the Focused Information Criterion (FIC) as an estimator of the MSE of a scalar parameter. The criterion is appropriate in correctly-specified likelihood models when one of the estimated models nests the other models. Let
The class of models (sub-models) allowed are those defined by a set of differentiable restrictions
A key feature of the FIC is that it focuses on a real-valued parameter
Estimation accuracy is measured by the MSE of the estimator of the target parameter, which is the squared bias plus the variance:
It turns out to be convenient to normalize the MSE by that of the unrestricted estimator. We define this as the Focus
The Claeskens-Hjort FIC is an estimator of F. Specifically,
where
In a least squares regression
The FIC is used similarly to AIC. The FIC is calculated for each sub-model of interest and the model with the lowest value of FIC is selected.
The advantage of the FIC is that it is specifically targeted to minimize the MSE of the target parameter. The FIC is therefore appropriate when the goal is to estimate a specific target parameter. A disadvantage is that it does not necessarily produce a model with good estimates of the other parameters. For example, in a linear regression
Computationally it may be convenient to implement the FIC using an alternative formulation. Define the adjusted focus
This adds the same quantity to all models and therefore does not alter the minimizing model. Multiplication by
where
is an estimator of
This means that
The formula (28.19) also provides an intuitive understanding of the FIC. When we minimize FIC* we are minimizing the variance of the estimator of the target parameter
When selecting from amongst just two models, the FIC selects the restricted model if
Second take the restricted MLE. Under standard conditions
where
The plug-in estimator
It follows that an approximately unbiased estimator for
The FIC is obtained by replacing the unknown
28.15 Best Subset and Stepwise Regression
Suppose that we have a set of potential regressors
If
If the goal is to find the set of regressors which produces the smallest selection criterion it seems likely that we should be able to find an approximating set of regressors at much reduced computation cost. Some specific algorithms to implement this goal are called stepwise, stagewise, and least angle regression. None of these procedures actually achieve the goal of minimizing any specific selection criterion; rather they are viewed as useful computational approximations. There is also some potential confusion as different authors seem to use the same terms for somewhat different implementations. We use the terms here as described in Hastie, Tibshirani, and Friedman (2008).
In the following descriptions we use
Backward Stepwise Regression
1. Start with all regressors included in the “active set”.
2. For $j = K, K-1, \dots, 1$:
   - Estimate the regression of $Y$ on the active set.
   - Identify the regressor whose omission will have the smallest impact on the sum of squared errors.
   - Put this regressor in slot $j$ and delete it from the active set.
   - Calculate the selection criterion and store it in slot $j$.
3. The model with the smallest value of the criterion is the selected model.
Backward stepwise regression requires
Forward Stepwise Regression
1. Start with the null set as the “active set” and all regressors in the “inactive set”.
2. For $j = 1, \dots, K$:
   - Estimate the regression of $Y$ on the active set.
   - Identify the regressor in the inactive set whose inclusion will have the largest impact on the sum of squared errors.
   - Put this regressor in slot $j$ and move it from the inactive to the active set.
   - Calculate the selection criterion and store it in slot $j$.
3. The model with the smallest value of the criterion is the selected model.
A simplified version is to exit the loop when
There are combined algorithms which check both forward and backward movements at each step. The algorithms can also be implemented with the regressors organized in groups (so that all elements are either included or excluded at each step). There are also old-fashioned versions which use significance testing rather than selection criterion (this is generally not advised).
Stepwise regression based on old-fashioned significance testing can be implemented in Stata using the stepwise command. If attention is confined to models which include regressors one-at-a-time, AIC selection can be implemented by setting the significance level equal to
Stepwise regression can be implemented in R using the lars command.
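Base R's `step()` function provides a simple stepwise implementation using information criteria. The sketch below (on the hypothetical wage data) runs backward and forward passes with the AIC penalty `k = 2`, while `k = log(n)` would give BIC-based steps. This is only an approximate search in the spirit of the algorithms above, not an exhaustive best-subset search.

```r
# Stepwise selection by AIC using base R's step().
full <- lm(log(wage) ~ education + experience + I(experience^2) + married, data = dat)
null <- lm(log(wage) ~ 1, data = dat)
back <- step(full, direction = "backward", k = 2, trace = 0)
fwd  <- step(null, direction = "forward", scope = formula(full), k = 2, trace = 0)
```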
28.16 The MSE of Model Selection Estimators
Model selection can lead to estimators with poor sampling performance. In this section we show that the mean squared error of estimation is not necessarily improved, and can be considerably worsened, by model selection.
To keep things simple consider an estimator with an exact normal distribution and known covariance matrix. Normalizing the latter to the identity we consider the setting
and the class of model selection estimators
for some
We can explicitly calculate the MSE of
Theorem 28.10 If
where
The proof is given in Section 28.32.
The MSE is determined only by
We can see the following limiting cases. If
Using Theorem
In the plots you can see that the PMS estimators have lower MSE than the unselected estimator roughly for
Figure 28.1: MSE of Post-Model-Selection Estimators
The numerical calculations show that MSE is reduced by selection when
The results of this section may appear to contradict Theorem
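These MSE comparisons are easy to explore approximately by simulation. The sketch below draws $\hat{\theta} \sim \mathrm{N}(\theta, I_K)$, applies a post-model-selection rule that sets the estimate to zero unless the Wald statistic exceeds a critical value, and estimates the MSE by Monte Carlo; the choice $c = 2K$ mimics AIC-type selection and, like the parameter values, is an assumption of the illustration.

```r
# Monte Carlo sketch of the MSE of a post-model-selection (PMS) estimator.
pms_mse <- function(theta, c, reps = 20000) {
  K <- length(theta)
  mean(replicate(reps, {
    th_hat <- theta + rnorm(K)                             # theta_hat ~ N(theta, I_K)
    th_pms <- if (sum(th_hat^2) > c) th_hat else rep(0, K) # selection step
    sum((th_pms - theta)^2)                                # squared error loss
  }))
}
K <- 4
sapply(c(0, 1, 2, 4), function(a) pms_mse(rep(a / sqrt(K), K), c = 2 * K))
# Compare with the MSE of the unselected estimator, which equals K.
```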
28.17 Inference After Model Selection
Economists are typically interested in inferential questions such as hypothesis tests and confidence intervals. If an econometric model has been selected by a procedure such as AIC or CV what are the properties of statistical tests applied to the selected model?
To be concrete, consider the regression model
The more interesting and subtle question is the impact on inference concerning
We illustrate the issue numerically. Suppose that
Figure 28.2: Coverage Probability of Post-Model Selection
We display in Figure
This invariance breaks down for
The message from this display is that inference after model selection is problematic. Conventional inference procedures do not have conventional distributions and the distortions are potentially unbounded.
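A small Monte Carlo experiment makes the distortion concrete. The sketch below selects between a long and a short regression by AIC and then forms a conventional 95% interval for the coefficient on $x_1$ from the selected model; the data-generating values (correlation 0.7, sample size 100) are illustrative assumptions.

```r
# Coverage of a nominal 95% interval for the coefficient on x1 after AIC selection.
coverage_pms <- function(b2, n = 100, rho = 0.7, reps = 5000) {
  mean(replicate(reps, {
    x1 <- rnorm(n)
    x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
    y  <- x1 + b2 * x2 + rnorm(n)                          # true coefficient on x1 is 1
    long  <- lm(y ~ x1 + x2)
    short <- lm(y ~ x1)
    fit <- if (AIC(long) < AIC(short)) long else short     # AIC model selection
    ci  <- confint(fit, "x1", level = 0.95)
    ci[1] <= 1 && 1 <= ci[2]                               # does the interval cover?
  }))
}
sapply(c(0, 0.1, 0.2, 0.4), coverage_pms)                  # coverage as beta2 varies
```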
28.18 Empirical Illustration
We illustrate the model selection methods with an application. Take the CPS dataset and the subsample of Asian women which has
Table 28.1: Estimates of Return to Experience among Asian Women
 | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 | Model 8 | Model 9 |
---|---|---|---|---|---|---|---|---|---|
Return | |||||||||
s.e. | 7 | 8 | 7 | 11 | 11 | 11 | 17 | 18 | 17 |
BIC | 956 | 924 | 964 | 913 | 931 | 977 | 925 | 943 | |
AIC | 915 | 861 | 858 | 914 | 858 | 916 | 860 | 857 | |
CV | 405 | 387 | 386 | 405 | 385 | 406 | 387 | 386 | |
FIC | 86 | 48 | 53 | 58 | 34 | 86 | 71 | 68 | |
Education | College | Spline | Dummy | College | Spline | Dummy | College | Spline | Dummy |
Experience | 2 | 2 | 2 | 4 | 4 | 4 | 6 | 6 | 6 |
Terms for experience:
- Models 1-3 include experience and its square.
- Models 4-6 include powers of experience up to power 4.
- Models 7-9 include powers of experience up to power 6.

Terms for education:
- Models 1, 4, and 7 include a single dummy variable college indicating that years of education is 16 or higher.
- Models 2, 5, and 8 include a linear spline in education with a single knot at education = 9.
- Models 3, 6, and 9 include six dummy variables, for education equalling 12, 13, 14, 16, 18, and 20.
Table
The BIC picks a parsimonious model with the linear spline in education and a quadratic in experience. The AIC and CV select a less parsimonious model with the full dummy specification for education and a
When selecting a model using information criteria it is useful to examine several criteria. In applications, decisions should be made by a combination of judgment and the formal criteria. In this case the cross-validation criterion selects model 6, which has the estimate
28.19 Shrinkage Methods
Shrinkage methods are a broad class of estimators which reduce variance by moving an estimator
The simplest shrinkage estimator takes the form
variance
and weighted mean squared error (using the weight matrix
where
wmse
wmse if .wmse
is minimized by the shrinkage weight .The minimized WMSE is wmse
.
For the proof see Exercise 28.6.
Part 1 of the theorem shows that the shrinkage estimator has reduced WMSE for a range of values of the shrinkage weight
To construct the optimal shrinkage weight we need the unknown
Replacing
This is often called a Stein-Rule estimator.
This estimator has many appealing properties. It can be viewed as a smoothed selection estimator. The quantity
28.20 James-Stein Shrinkage Estimator
James and Stein (1961) made the following discovery.
Theorem 28.12 Assume that
If
then wmse wmse .The WMSE is minimized by setting
and equals
where
This result stunned the world of statistics. Part 1 shows that the shrinkage estimator has strictly smaller WMSE for all values of the parameters and thus dominates the original estimator. The latter is the MLE, so this result shows that the MLE is dominated and thus inadmissible. This was startling because it had previously been assumed that it would be impossible to find an estimator which dominates the MLE.
Theorem
The minimizing choice for the shrinkage coefficient
In practice
which is fully feasible as it does not depend on unknowns or tuning parameters. The substitution of
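A minimal R sketch of the feasible James-Stein estimator for $\hat{\theta} \sim \mathrm{N}(\theta, V)$ shrunk towards zero, with $V$ treated as known (or replaced by a consistent estimate). The shrinkage constant $K-2$ and the positive-part trimming follow the standard construction; the exact constant used in the text's feasible version may differ.

```r
# James-Stein shrinkage of theta_hat towards zero using D = theta_hat' V^{-1} theta_hat.
james_stein <- function(theta_hat, V, positive_part = TRUE) {
  K <- length(theta_hat)
  D <- drop(crossprod(theta_hat, solve(V, theta_hat)))
  w <- 1 - (K - 2) / D                 # shrinkage weight
  if (positive_part) w <- max(w, 0)    # positive-part trimming (see the later sections)
  w * theta_hat
}
theta_hat <- rnorm(6, mean = 0.3)      # illustrative draw with V = I_6
james_stein(theta_hat, diag(6))
```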
28.21 Interpretation of the Stein Effect
The James-Stein Theorem appears to conflict with classical statistical theory. The original estimator
Part of the answer is that classical theory has caveats. The Cramer-Rao Theorem, for example, restricts attention to unbiased estimators and thus precludes consideration of shrinkage estimators. The James-Stein estimator has reduced MSE, but is not Cramer-Rao efficient because it is biased. Therefore the James-Stein Theorem does not conflict with the Cramer-Rao Theorem. Rather, they are complementary results. On the one hand, the Cramer-Rao Theorem describes the best possible variance when unbiasedness is an important property for estimation. On the other hand, the James-Stein Theorem shows that if unbiasedness is not a critical property but instead MSE is important, then there are better estimators than the MLE.
The James-Stein Theorem may also appear to conflict with our results from Section
The MSE improvements achieved by the James-Stein estimator are greatest when
28.22 Positive Part Estimator
The simple James-Stein estimator has the odd property that it can “over-shrink”. When
where
where
The positive part estimator simultaneously performs “selection” as well as “shrinkage”. If
Consistent with our intuition the positive part estimator has uniformly lower WMSE than the unadjusted James-Stein estimator.
Theorem 28.13 Under the assumptions of Theorem
For a proof see Theorem
In Figure
In summary, the positive-part transformation is an important improvement over the unadjusted James-Stein estimator. It is more reasonable and reduces the mean squared error. The broader message is that imposing boundary conditions on shrinkage weights can improve estimation efficiency.
28.23 Shrinkage Towards Restrictions
The classical James-Stein estimator does not have direct use in applications because it is rare that we wish to shrink an entire parameter vector towards a specific point. Rather, it is more common to shrink a parameter vector towards a set of restrictions. Here are a few examples:
- Shrink a long regression towards a short regression.
- Shrink a regression towards an intercept-only model.
- Shrink the regression coefficients towards a set of restrictions.
- Shrink a set of estimates (or coefficients) towards their common mean.
- Shrink a set of estimates (or coefficients) towards a parametric model.
- Shrink a nonparametric series model towards a parametric model.

Figure 28.3: WMSE of James-Stein Estimator
The way to think generally about these applications is that the researcher wants to allow for generality with the large model but believes that the smaller model may be a useful approximation. A shrinkage estimator allows the data to smoothly select between these two options depending on the strength of information for the two specifications.
Let
The James-Stein estimator with positive-part trimming is
The function
Theorem 28.14 Under the assumptions of Theorem 28.12, if
The shrinkage estimator achieves uniformly smaller MSE if the number of restrictions is three or greater. The number of restrictions
In practice the covariance matrix
and
where
It is insightful to notice that
We can substitute for
In linear regression we have some very convenient simplifications available. In general,
where
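As a concrete illustration, the sketch below shrinks an unrestricted (long) wage regression towards a restricted (short) regression that drops three experience terms, continuing the hypothetical data set. It uses the homoskedastic covariance estimate from `vcov()` for simplicity, whereas a heteroskedasticity-robust estimate is generally preferable; the positive-part weight based on the distance statistic and the constant $q-2$ follow the James-Stein construction sketched earlier.

```r
# Stein-type shrinkage of a long regression towards a short regression (q = 3 restrictions).
long  <- lm(log(wage) ~ education + experience + I(experience^2) +
              I(experience^3) + I(experience^4), data = dat)
short <- lm(log(wage) ~ education + experience, data = dat)
b_u <- coef(long)
b_r <- setNames(rep(0, length(b_u)), names(b_u))
b_r[names(coef(short))] <- coef(short)      # restricted estimate: zeros on excluded terms
V <- vcov(long)                             # homoskedastic covariance (simplification)
d <- b_u - b_r
D <- drop(crossprod(d, solve(V, d)))        # distance statistic
q <- 3                                      # number of restrictions (need q > 2 for gains)
w <- max(1 - (q - 2) / D, 0)                # positive-part Stein weight
b_stein <- w * b_u + (1 - w) * b_r
```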
28.24 Group James-Stein
The James-Stein estimator can be applied to groups (blocks) of parameters. Suppose we have the parameter vector
where
The group James-Stein estimator separately shrinks each block of coefficients. The advantage relative to the classical James-Stein estimator is that this allows the shrinkage weight to vary across blocks. Some parameter blocks can use a large amount of shrinkage while others a minimal amount. Since the positive-part trimming is used the estimator simultaneously performs shrinkage and selection. Blocks with small effects will be shrunk to zero and eliminated. The disadvantage of the estimator is that the benefits of shrinkage may be reduced because the shrinkage dimension is reduced. The trade-off between these factors will depend on how much the optimal shrinkage weight varies across the parameter blocks.
The groups should be selected based on two criteria. First, they should be selected so that the groups separate variables by expected amount of shrinkage. Thus coefficients which are expected to be “large” relative to their estimation variance should be grouped together and coefficients which are expected to be “small” should be grouped together. This will allow the estimated shrinkage weights to vary according to the group. For example, a researcher may expect high-order coefficients in a polynomial regression to be small relative to their estimation variance. Hence it is appropriate to group the polynomial variables into “low order” and “high order”. Second, the groups should be selected so that the researcher’s loss (utility) is separable across groups of coefficients. This is because the optimality theory (given below) relies on the assumption that the loss is separable. To understand the implications of these recommendations consider a wage regression. Our interpretation of the education and experience coefficients are separable if we use them for separate purposes, such as for estimation of the return to education and the return to experience. In this case it is appropriate to separate the education and experience coefficients into different groups.
For an optimality theory we define weighted MSE with respect to the block-diagonal weight matrix
Theorem 28.15 Under the assumptions of Theorem 28.12, if WMSE is defined with respect to
The proof is a simple extension of the classical James-Stein theory. The block diagonal structure of
28.25 Empirical Illustrations
We illustrate James-Stein shrinkage with three empirical applications.
The first application is to the sample used in Section 28.18, the CPS dataset with the subsample of Asian women (
Figure 28.4: Shrinkage Illustrations. (a) Experience Profile. (b) Firm Effects.
The two least squares estimates are visually distinct. The
The second application is to the Invest1993 data set used in Chapter 17. This is a panel data set of annual observations on investment decisions by corporations. We focus on the firm-specific effects. These are of interest when studying firm heterogeneity and are important for firm-specific forecasting. Accurate estimation of firm effects is challenging when the number of time series observations per firm is small.
To keep the analysis focused we restrict attention to firms which are traded on either the NYSE or AMEX and to the last ten years of the sample (1982-1991). Since the regressors are lagged this means that there are at most nine time-series observations per firm. The sample has a total of
Due to the large number of estimated coefficients in the unrestricted model we use the homoskedastic weight matrix as a simplification. This allows the calculation of the shrinkage weight using the simple formula (28.28) for the statistic
The empirically-determined shrinkage weight is
To report our results we focus on the distribution of the firm-specific effects. For the fixed effects model these are the estimated fixed effects. For the industry-effect model these are the estimated industry dummy coefficients (for each firm). For the Stein Rule estimates they are a weighted average of the two. We estimate
You can see that the fixed-effects estimate of the firm-specific density is more dispersed while the Stein estimate is sharper and more peaked, indicating that the fixed effects estimator attributes more variation to firm-specific factors than does the Stein estimator. The Stein estimator pulls the fixed effects towards their common mean, adjusting for the randomness due to their estimation. Our expectation is that the Stein estimates, if used for an application such as firm-specific forecasting, will be more accurate because they will have reduced variance relative to the fixed effects estimates.
The third application uses the CPS dataset with the subsample of Black men (
The least squares estimate of the return to education by region is displayed in Figure 28.5(a). For simplicity we combine the omitted education group (less than 12 years education) as “11 years”. The estimates appear noisy due to the small samples. One feature which we can see is that the four lines track one another for years of education between 12 and 18. That is, they are roughly linear in years of education with the same slope but different intercepts.
To improve the precision of the estimates we shrink the four profiles towards Model 6. This means that we are shrinking the profiles not towards each other but towards the model with the same effect of education but region-specific intercepts. Again we use the HC1 covariance matrix estimate. The number of restrictions is 18. The empirically-determined shrinkage weight is
The Stein Rule estimates are displayed in Figure 28.5(b). The estimates are less noisy than panel (a) and it is easier to see the patterns. The four lines track each other and are approximately linear over 12 to 18 years of education. For 20 years of education the four lines disperse, which is likely due to small samples. In panel (b) it is easier to see the patterns across regions. It appears that the northeast region has the highest wages (conditional on education) while the west region has the lowest wages. This ranking is constant for nearly all levels of education.
Figure 28.5: Stein Rule Estimation of Education Profiles Across Regions. (a) Least Squares Estimates. (b) Stein Rule Estimates.

While the Stein Rule estimates shrink the nonparametric estimates towards the common-education-factor specification, the estimator does not impose that specification. The Stein Rule estimator has the ability to put near zero weight on the common-factor model. The fact that the estimates put
The message from these three applications is that the James-Stein shrinkage approach can be constructively used to reduce estimation variance in economic applications. These applications illustrate common forms of potential applications: Shrinkage of a flexible specification towards a simpler specification; Shrinkage of heterogeneous estimates towards homogeneous estimates; Shrinkage of fixed effects towards group dummy estimates. These three applications also employed moderately large sample sizes (
28.26 Model Averaging
Recall that the problem of model selection is how to select a single model from a general set of models. The James-Stein shrinkage estimator smooths between two nested models by taking a weighted average of two estimators. More generally we can take an average of an arbitrary number of estimators. These estimators are known as model averaging estimators. The key issue for estimation is how to select the averaging weights.
Suppose we have a set of
Corresponding to the set of models we introduce a set of weights
The probability simplex in
The simplex in
Figure 28.6: Probability Simplex in
Since the weights on the probability simplex sum to one, an alternative representation is to eliminate one weight by substitution. Thus we can set
Given a weight vector we define the averaging estimator
Selection estimators emerge as the special case where the weight vector
It is not absolutely necessary to restrict the weight vector of an averaging estimator to lie in the probability simplex
In Section
The optimal averaging weights, however, are unknown. A number of methods have been proposed for selection of the averaging weights.
One simple method is equal weighting. This is achieved by setting
The advantages of equal weighting are that it is simple, easy to motivate, and no randomness is introduced by estimation of the weights. The variance of the equal weighting estimator can be calculated because the weights are fixed. Another important advantage is that the estimator can be constructed in contexts where it is unknown how to construct empirical-based weights, for example when averaging models from completely different probability families. The disadvantages of equal weighting are that the method can be sensitive to the set of models considered, there is no guarantee that the estimator will perform better than the unrestricted estimator, and sample information is inefficiently used. In practice, equal weighting is best used in contexts where the set of models have been pre-screened so that all are considered “reasonable” models. From the standpoint of econometric methodology equal weighting is not a proper statistical method as it is an incomplete methodology.
Despite these concerns equal weighting can be constructively employed when summarizing information for a non-technical audience. The relevant context is when you have a small number of reasonable but distinct estimates typically made using different assumptions. The distinct estimates are presented to illustrate the range of possible results and the average taken to represent the “consensus” or “recommended” estimate.
As mentioned above, a number of methods have been proposed for selection of the averaging weights. In the following sections we outline four popular methods: Smoothed BIC, Smoothed AIC, Mallows averaging, and Jackknife averaging.
28.27 Smoothed BIC and AIC
Recall that Schwarz’s Theorem
This has been interpreted to mean that the model with the highest value of the right-hand side approximately has the highest marginal likelihood and is thus the model with the highest probability of being the true model.
There is another interpretation of Schwarz’s result. The marginal likelihood is approximately proportional to the probability that the model is true, conditional on the data. Schwarz’s Theorem implies that this is approximately
which is a simple exponential transformation of the BIC. Weighting by posterior probability can be achieved by setting model weights proportional to this transformation. These are known as BIC weights and produce the smoothed BIC estimator.
To describe the method completely, we have a set of models
The
Some properties of the BIC weights are as follows. They are non-negative so all models receive positive weight. Some models can receive weight arbitrarily close to zero and in practice many estimated models may receive BIC weight that is essentially zero. The model which is selected by BIC receives the greatest weight and models which have BIC values close to the minimum receive weights closest to the largest weight. Models whose
The Smoothed BIC (SBIC) estimator is
The SBIC estimator is a smoother function of the data than BIC selection as there are no discontinuous jumps across models.
An advantage of the smoothed BIC weights and estimator is that it can be used to combine models from different probability families. As for the BIC it is important that all models are estimated on the same sample. It is also important that the full formula is used for the BIC (no omission of constants) when combining models from different probability families.
Computationally it is better to implement smoothed BIC with what are called “BIC differences” rather than the actual values of the BIC, as the formula as written can produce numerical overflow problems. The difficulty is due to the exponentiation in the formula. This problem can be eliminated as follows. Let
denote the lowest BIC among the models and define the BIC differences
Then
Thus the weights are algebraically identical whether computed on
Because of the properties of the exponential, if
They do not provide a strong theoretical justification for this specific choice of transformation but it seems natural given the smoothed BIC formula and works well in simulations.
The algebraic properties of the AIC weights are similar to those of the BIC weights. All models receive positive weight though some receive weight which is arbitrarily close to zero. The model with the smallest AIC receives the greatest AIC weight, and models with similar AIC values receive similar AIC weights.
Computationally the AIC weights should be computed using AIC differences. Define
The AIC weights algebraically equal
As for the BIC weights
The Smoothed AIC (SAIC) estimator is
The SAIC estimator is a smoother function of the data than AIC selection.
Recall that both AIC selection and BIC selection are model selection consistent in the sense that as the sample size gets large the probability that the selected model is a true model is arbitrarily close to one. Furthermore, BIC is consistent for parsimonious models and AIC asymptotically over-selects.
These properties extend to SBIC and SAIC. In large samples SAIC and SBIC weights will concentrate exclusively on true models; the weight on incorrect models will asymptotically approach zero. However, SAIC will asymptotically spread weight across both parsimonious true models and overparameterized true models, while SBIC asymptotically concentrates weight only on parsimonious true models.
An interesting property of the smoothed estimators is the possibility of asymptotically spreading weight across equal-fitting parsimonious models. Suppose we have two non-nested models with the same number of parameters and the same KLIC value so they are equal approximations. In large samples both SBIC and SAIC will be weighted averages of the two estimators rather than simply selecting one of the two.
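Computing the smoothed weights is a one-liner once the criteria are in hand. The sketch below forms SBIC and SAIC weights from criterion differences (to avoid overflow), continuing the hypothetical `models` list; a smoothed estimate of a coefficient common to all models is then a weighted sum.

```r
# Smoothed BIC and AIC weights from criterion differences.
bic <- sapply(models, BIC)
aic <- sapply(models, AIC)
w_bic <- exp(-(bic - min(bic)) / 2); w_bic <- w_bic / sum(w_bic)
w_aic <- exp(-(aic - min(aic)) / 2); w_aic <- w_aic / sum(w_aic)
round(cbind(SBIC = w_bic, SAIC = w_aic), 3)
# Smoothed estimate of a coefficient appearing in every model, for example:
# sum(w_aic * sapply(models, function(m) coef(m)["education"]))
```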
28.28 Mallows Model Averaging
In linear regression the Mallows criterion (28.14) applies directly to the model averaging estimator (28.29). The homoskedastic regression model is
Suppose that there are
The model averaging estimator for fixed weights is
where
The model averaging residual is
The estimator
where
In the case of model selection the Mallows penalty is proportional to the number of estimated coefficients. In the model averaging case the Mallows penalty is the average number of estimated coefficients.
The Mallows-selected weight vector is that which minimizes the Mallows criterion. It equals
Computationally it is useful to observe that
The probability simplex
Figure
Figure 28.7: Mallows Weight Selection
Once the weights
In the special case of two nested models the Mallows criterion can be written as
where we assume
This is the same as the Stein Rule weight (28.27) with a slightly different shrinkage constant. Thus the Mallows averaging estimator for
Based on the latter observation, B. E. Hansen (2014) shows that the MMA estimator has lower WMSE than the unrestricted least squares estimator when the models are nested linear regressions, the errors are homoskedastic, and the models are separated by 4 coefficients or greater. The latter condition is analogous to the conditions for improvements in the Stein Rule theory.
B. E. Hansen (2007) showed that the MMA estimator asymptotically achieves the same MSE as the infeasible optimal best weighted average using the theory of Li (1987) under similar conditions. This shows that using model selection tools to select the averaging weights is asymptotically optimal for regression fitting and point forecasting.
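The weight selection is a small quadratic program. The sketch below uses the `quadprog` package (an external dependency) to minimize the Mallows criterion over the simplex, continuing the hypothetical `models` list, with the preliminary variance taken from the largest model and a tiny ridge added for numerical stability.

```r
# Mallows model averaging (MMA) weights by quadratic programming.
library(quadprog)
Fm     <- sapply(models, fitted)                      # n x M matrix of fitted values
Kv     <- sapply(models, function(m) length(coef(m))) # parameter counts
y      <- log(dat$wage)                               # assumes no missing observations
sigma2 <- summary(models$m3)$sigma^2                  # variance from the largest model
M      <- ncol(Fm)
Dmat <- crossprod(Fm) + diag(1e-8, M)                 # ridge for numerical stability
dvec <- drop(crossprod(Fm, y)) - sigma2 * Kv
Amat <- cbind(rep(1, M), diag(M))                     # sum(w) = 1 (equality), w >= 0
bvec <- c(1, rep(0, M))
w_mma <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
round(w_mma, 3)
```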
28.29 Jackknife (CV) Model Averaging
A disadvantage of Mallows selection is that the criterion is valid only when the errors are conditionally homoskedastic. In contrast, selection by cross-validation does not require homoskedasticity. Therefore it seems sensible to use cross-validation rather than Mallows to select the weight vectors. It turns out that this is a simple extension with excellent finite sample performance. In the Machine Learning literature this method is called stacking.
A fitted averaging regression (with fixed weights) can be written as
where
where
where
This means that the jackknife estimate of variance (or equivalently the cross-validation criterion) equals
which is a quadratic function of the weight vector. The cross-validation choice for weight vector is the minimizer
Given the weights the coefficient estimates (and any other parameter of interest) are found by taking weighted averages of the model estimates using the weight vector
The algebraic properties of the solution are similar to Mallows. Since (28.31) minimizes a quadratic function subject to a simplex constraint, solutions tend to be on edges and vertices which means that many (or most) models receive zero weight. Hence JMA weight selection simultaneously performs selection and shrinkage. The solution is found numerically by quadratic programming which is computationally simple and fast even when the number of models
B. E. Hansen and Racine (2012) showed that the JMA estimator is asymptotically equivalent to the infeasible optimal weighted average across least squares estimates based on a regression fit criterion. Their results hold under quite mild conditions including conditional heteroskedasticity. This result is similar to Andrews's (1991c) generalization of Li's (1987) result for model selection.
The implication of this theory is that JMA weight selection is computationally simple and has excellent sampling performance.
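Replacing fitted values with leave-one-out predictions turns the same quadratic program into jackknife (CV) weight selection. A sketch, reusing `Amat`, `bvec`, and the `models` list from the MMA example:

```r
# Jackknife (CV) model averaging: minimize w' E~' E~ w over the weight simplex,
# where E~ holds the leave-one-out prediction errors of each model.
E_tilde <- sapply(models, function(m) resid(m) / (1 - hatvalues(m)))
Dmat_j  <- crossprod(E_tilde) + diag(1e-8, ncol(E_tilde))
w_jma   <- solve.QP(Dmat_j, rep(0, ncol(E_tilde)), Amat, bvec, meq = 1)$solution
round(w_jma, 3)
```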
28.30 Granger-Ramanathan Averaging
A method similar to JMA based on hold-out samples was proposed for forecast combination by Granger and Ramanathan (1984), and has emerged as a popular method in the modern machine learning literature.
Randomly split the sample into two parts: an estimation and an evaluation sample. Using the estimation sample, estimate the
The least squares coefficients
Based on an informal argument Granger and Ramanathan (1984) recommended an unconstrained least squares regression to obtain the weights but this is not advised as this produces extremely erratic empirical weights, especially when
This Granger-Ramanathan approach is best suited for applications with a very large sample size where the efficiency loss from the hold-out sample split is not a concern.
28.31 Empirical Illustration
We illustrate the model averaging methods with the empirical application from Section 28.18, which reported wage regression estimates for the CPS sub-sample of Asian women focusing on the return to experience between 0 and 30 years.
Table
The results show that the methods put weight on somewhat different models. The SBIC puts nearly all weight on model 2. The SAIC puts nearly
The averaging estimators from the non-BIC methods are similar to one another but SBIC produces a much smaller estimate than the other methods.
Table 28.2: Model Averaging Weights and Estimates of Return to Experience among Asian Women
 | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 | Model 8 | Model 9 | Return |
---|---|---|---|---|---|---|---|---|---|---|
SBIC | ||||||||||
SAIC | ||||||||||
MMA | ||||||||||
JMA |
28.32 Technical Proofs*
Proof of Theorem 28.1 We establish the theorem under the simplifying assumptions of the normal linear regression model with a
Evaluated at the MLE
Using (8.21) we can write
For a diffuse prior
where the final equality is the multivariate normal integral. Rewriting and taking logs
This is the theorem.
Proof of Theorem 28.2 From (28.11)
Thus
This is (28.12). The final equality holds under the assumption of conditional homoskedasticity.
Evaluating (28.11) at
This has expectation
This is (28.13). The final equality holds under conditional homoskedasticity.
Proof of Theorem 28.4 The proof uses Taylor expansions similar to those used for the asymptotic distribution theory of the MLE in nonlinear models. We avoid technical details so this is not a full proof.
Write the model density as
where
Define the Hessian
The second equality holds because of the first-order condition for the MLE
where
and the Wald statistic on the left-side of (28.35) is uniformly integrable. The asymptotic convergence (28.35) holds for the MLE under standard regularity conditions (including correct specification).
Proof of Theorem 28.5 Using matrix notation we can write
Notice that this calculation relies on the assumption of conditional homoskedasticity.
Now consider the Mallows criterion. We find that
Taking expectations and using the assumptions of conditional homoskedasticity and
This is the result as stated.
Proof of Theorem 28.6 Take any two models
Model
where
Since this holds for any
Proof of Theorem 28.7 Take the setting as described in the proof of Theorem
Letting
because
Proof of Theorem
Since
Proof of Theorem 28.9 First, we examine
The choice of regressors affects
The next step is to show that
as
Pick
By a similar argument but using Whittle’s inequality (B.49),
Together these imply (28.36).
Finally we show that (28.36) implies (28.18). The argument is similar to the standard consistency proof for nonlinear estimators. (28.36) states that
The second inequality is (28.36). The following uses the facts that
Before providing the proof of Theorem
where
Theorem 28.16 The non-central chi-square density function (28.37) obeys the recursive relationship
The proof of Theorem
The second technical result is from Bock (1975, Theorems A&B).
Theorem 28.17 If
where
Proof of Theorem 28.17 To show (28.38) we first show that for
Assume
By expansion and Legendre’s duplication formula
Then (28.41) equals
where
Take the
which is (28.38). The final equality uses the fact that
Observe that
which is (28.39).
Proof of Theorem 28.10 By the quadratic structure we can calculate that
The third equality uses the two results from Theorem 28.17, setting
28.33 Exercises
Exercise 28.1 Verify equations (28.1)-(28.2).
Exercise 28.2 Find the Mallows criterion for the weighted least squares estimator of a linear regression
Exercise 28.3 Backward Stepwise Regression. Verify the claim that for the case of AIC selection, step (b) of the algorithm can be implemented by calculating the classical (homoskedastic) t-ratio for each active regressor and finding the regressor with the smallest absolute t-ratio.
Hint: Use the relationship between likelihood ratio and F statistics and the equality between
Exercise 28.4 Forward Stepwise Regression. Verify the claim that for the case of AIC selection, step (b) of the algorithm can be implemented by identifying the regressor in the inactive set with the greatest absolute correlation with the residual from step (a).
Hint: This is challenging. First show that the goal is to find the regressor which will most decrease SSE
Exercise 28.5 An economist estimates several models and reports a single selected specification, stating that “the other specifications had insignificant coefficients”. How should we interpret the reported parameter estimates and t-ratios?
Exercise 28.6 Verify Theorem 28.11, including (28.21), (28.22), and (28.23).
Exercise 28.7 Under the assumptions of Theorem 28.11, show that
Exercise 28.8 Prove Theorem
Extra challenge: Show under these assumptions that
Exercise 28.9 Suppose you have two unbiased estimators.

- Show that the optimal weighted average estimator equals
- Generalize to the case of several unbiased uncorrelated estimators.
- Interpret the formulae.

Exercise 28.10 You estimate several linear regressions by least squares and obtain the fitted values from each.

- Show that the Mallows averaging criterion is the same as
- Assume the models are nested, with the last model being the largest. If the previous criterion were minimized over the probability simplex but the penalty was omitted, what would be the solution? (What would be the minimizing weight vector?)
Exercise 28.11 You estimate
Exercise 28.12 Using the cps09mar dataset perform an analysis similar to that presented in Section