Introduction
Multivariate regression is a system of regression equations. It is used in reduced form models for instrumental variable estimation (Chapter 12), vector autoregressions (Chapter 15), demand systems (demand for multiple goods), and other contexts.
Multivariate regression is also known as a system of regression equations. Closely related is the method of Seemingly Unrelated Regressions (SUR), introduced in Section 11.7.
Most of the tools of single equation regression generalize to multivariate regression. A major difference is a new set of notation to handle matrix estimators.
Regression Systems
A univariate linear regression equation takes the form $Y = X'\beta + e$ where $Y$ is scalar and $X$ is a $k$-vector. Multivariate regression is a system of $m$ linear regressions
$$Y_j = X_j'\beta_j + e_j \qquad (11.1)$$
for $j = 1, \ldots, m$. Here we use the subscript $j$ to denote the dependent variable, not the individual. As an example, $Y_j$ could be expenditures by a household on good category $j$ (e.g., food, housing, transportation, clothing, recreation). The regressor vectors are $X_j$, and $e_j$ is an error. The coefficient vectors are $\beta_j$. The total number of coefficients is $\bar{k} = k_1 + \cdots + k_m$. The regressors can be common across $j$ or can vary across $j$. In the household expenditure example the regressors are typically common across $j$, and include variables such as household income, number and ages of family members, and demographic characteristics. The regression system specializes to univariate regression when $m = 1$.
Define the $m \times 1$ error vector $e = (e_1, \ldots, e_m)'$ and its $m \times m$ covariance matrix $\Sigma = E[ee']$. The diagonal elements of $\Sigma$ are the variances of the errors $e_j$ and the off-diagonals are the covariances across variables.
We can group the equations (11.1) into a single equation as follows. Let be the vector of dependent variables. Define the matrix of regressors
and the stacked coefficient vector
The regression equations can be jointly written as
This is a system of equations.
For observations the joint system can be written in matrix notation by stacking. Define
which are , and , respectively. The system can be written as .
In many applications the regressor vectors are common across the variables , so and . By this we mean that the same variables enter each equation with no exclusion restrictions. Several important simplifications occur in this context. One is that we can write (11.2) using the notation
where is . Another is that we can write the joint system of observations in the matrix notation where
Another convenient implication of common regressors is that we have the simplification
where is the Kronecker product (see Appendix A.21).
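As a point of reference, the following sketch writes out this simplification in conventional notation; the symbols ($\bar{X}_i$ for the stacked regressor matrix, $X_i$ for the common regressor vector) are assumptions chosen to match the verbal description above rather than copies of the text's own displays.

```latex
% Common-regressor case: each equation uses the same regressor vector X_i,
% so the m x (mk) stacked regressor matrix is a Kronecker product,
\bar{X}_i = I_m \otimes X_i' ,
% and the systems moment matrix factors accordingly:
\sum_{i=1}^{n} \bar{X}_i' \bar{X}_i
  = I_m \otimes \left( \sum_{i=1}^{n} X_i X_i' \right)
  = I_m \otimes \left( X'X \right).
```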
Least Squares Estimator
The equations (11.1) can be estimated by least squares. This takes the form
An estimator of is the stacked vector
We can alternatively write this estimator using the systems notation
To see this, observe that
and
Hence
as claimed. The residual vector for the observation is . The least squares estimator of the error covariance matrix is
In the case of common regressors, the least squares coefficients can be written as
and
In Stata, multivariate regression can be implemented using the mvreg command.
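As an illustration, here is a minimal numerical sketch of equation-by-equation least squares with common regressors and the residual covariance estimator. The function and variable names (multivariate_ols, y, x) and the use of numpy are illustrative assumptions, not part of the text.

```python
import numpy as np

def multivariate_ols(y, x):
    """Equation-by-equation least squares with common regressors.

    y : (n, m) array of dependent variables
    x : (n, k) array of regressors shared by all m equations
    Returns the (k, m) coefficient matrix and the (m, m) residual covariance.
    """
    n, m = y.shape
    # Each column of beta is the OLS coefficient vector for one equation.
    beta, *_ = np.linalg.lstsq(x, y, rcond=None)
    e = y - x @ beta                  # (n, m) matrix of residuals
    sigma = (e.T @ e) / n             # residual covariance (no degrees-of-freedom correction)
    return beta, sigma

# Example with simulated data.
rng = np.random.default_rng(0)
x = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = x @ rng.normal(size=(3, 2)) + rng.normal(size=(500, 2))
beta_hat, sigma_hat = multivariate_ols(y, x)
```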
Expectation and Variance of Systems Least Squares
We can calculate the finite-sample expectation and variance of under the conditional expectation assumption
where is the union of the regressors . Equation (11.7) is equivalent to , which means that the regression model is correctly specified.
We can center the estimator as
Taking conditional expectations we find . Consequently, systems least squares is unbiased under correct specification.
To compute the variance of the estimator, define the conditional covariance matrix of the errors of the observation which in general is a function of . If the observations are mutually independent then
Also, by independence across observations,
It follows that
When the regressors are common so that then the covariance matrix can be written as
If the errors are conditionally homoskedastic
then the covariance matrix simplifies to
If both simplifications (common regressors and conditional homoskedasticity) hold then we have the considerable simplification
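The standard form of this double simplification, written here as a sketch in conventional notation (the symbols $\Sigma$, $X$, and $\otimes$ are the usual ones and are not copied from the missing display), is:

```latex
% With common regressors and conditional homoskedasticity E[e_i e_i' | X_i] = Sigma,
% the exact conditional variance of the stacked least squares estimator is
\operatorname{var}\!\left( \widehat{\beta} \mid X \right)
  = \Sigma \otimes \left( X'X \right)^{-1}.
```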
Asymptotic Distribution
For an asymptotic distribution it is sufficient to consider the equation-by-equation projection model in which case
First, consider consistency. Since are the standard least squares estimators, they are consistent for the projection coefficients .
Second, consider the asymptotic distribution. Our single equation theory implies that the are asymptotically normal. But this theory does not provide a joint distribution of the across , which we now derive. Since the vector
is i.i.d. across and mean zero under (11.9), the central limit theorem implies
where
The matrix is the covariance matrix of the variables across equations. Under conditional homoskedasticity (11.8) the matrix simplifies to
(see Exercise 11.1). When the regressors are common it simplifies to
(see Exercise 11.2). Under both conditions (homoskedasticity and common regressors) it simplifies to
(see Exercise 11.3).
Applied to the centered and normalized estimator we obtain the asymptotic distribution.
Theorem 11.1 Under Assumption 7.2, where and
For a proof, see Exercise 11.4.
When the regressors are common the matrix simplifies as
(See Exercise 11.5).
If both the regressors are common and the errors are conditionally homoskedastic (11.8) then we have the simplification
(see Exercise 11.6).
Sometimes we are interested in parameters which are functions of the coefficients from multiple equations. In this case the least squares estimator of is . The asymptotic distribution of can be obtained from Theorem 11.1 by the delta method.
Theorem 11.2 Under Assumptions and where and
For a proof, see Exercise 11.7.
Theorem 11.2 is an example where multivariate regression is fundamentally distinct from univariate regression. Only by treating least squares as a joint estimator can we obtain a distributional theory for a function of coefficients from multiple equations. We can thereby construct standard errors, confidence intervals, and hypothesis tests.
Covariance Matrix Estimation
From the finite sample and asymptotic theory we can construct appropriate estimators for the variance of . In the general case we have
Under conditional homoskedasticity (11.8) an appropriate estimator is
When the regressors are common then these estimators equal
and , respectively.
Covariance matrix estimators for are found as
Theorem 11.3 Under Assumption 7.2, and
For a proof, see Exercise 11.8.
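For the general (heteroskedasticity-robust) case with common regressors, a minimal sketch of the sandwich covariance estimator for the stacked coefficient vector is given below; the Kronecker-product form relies on the common-regressor simplification and the names are illustrative.

```python
import numpy as np

def robust_covariance(y, x, beta):
    """Heteroskedasticity-robust covariance for the stacked coefficients.

    Sandwich form (common regressors):
        (I_m kron (X'X)^{-1}) * sum_i [ (e_i e_i') kron (x_i x_i') ] * (I_m kron (X'X)^{-1})
    where e_i is the m-vector of residuals for observation i.
    """
    n, m = y.shape
    k = x.shape[1]
    e = y - x @ beta                          # (n, m) residuals
    qxx_inv = np.linalg.inv(x.T @ x)
    meat = np.zeros((m * k, m * k))
    for i in range(n):
        meat += np.kron(np.outer(e[i], e[i]), np.outer(x[i], x[i]))
    bread = np.kron(np.eye(m), qxx_inv)
    return bread @ meat @ bread

# Here beta is the (k, m) coefficient matrix from equation-by-equation least squares,
# and the returned matrix is the covariance of the equation-stacked coefficient vector.
```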
Seemingly Unrelated Regression
Consider the systems regression model under the conditional expectation and homoskedasticity assumptions
Since the errors are correlated across equations we consider estimation by Generalized Least Squares (GLS). To derive the estimator, premultiply (11.15) by so that the transformed error vector is i.i.d. with covariance matrix . Then apply least squares and rearrange to find
(see Exercise 11.9). Another approach is to take the vector representation
and calculate that the equation error has variance . Premultiply the equation by so that the transformed error has covariance matrix and then apply least squares to find
(see Exercise 11.10). Expressions (11.16) and (11.17) are algebraically equivalent. To see the equivalence, observe that
and
Since is unknown it must be replaced by an estimator. Using from (11.5) we obtain a feasible GLS estimator.
This is the Seemingly Unrelated Regression (SUR) estimator as introduced by Zellner (1962).
The estimator can be updated by calculating the SUR residuals and the covariance matrix estimator . Substituted into (11.18) we obtain an iterated SUR estimator. This can be iterated until convergence.
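A minimal sketch of the feasible and iterated SUR estimator is given below. It allows the regressor list to differ across equations (otherwise SUR reduces to least squares, as discussed in the section on equivalence of SUR and least squares below); the list-of-arrays interface, the convergence tolerance, and the use of scipy are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import block_diag

def sur(y, x_list, max_iter=100, tol=1e-8):
    """Iterated SUR (feasible GLS) for an m-equation system.

    y      : (n, m) array of dependent variables
    x_list : list of m arrays; x_list[j] is (n, k_j), the regressors of equation j
    Returns the stacked coefficient vector and the estimated error covariance.
    """
    n, m = y.shape
    # Initialize with equation-by-equation OLS.
    beta = [np.linalg.lstsq(x_list[j], y[:, j], rcond=None)[0] for j in range(m)]
    sizes = [b.size for b in beta]
    for _ in range(max_iter):
        e = np.column_stack([y[:, j] - x_list[j] @ beta[j] for j in range(m)])
        sigma = (e.T @ e) / n                  # update the error covariance estimate
        w = np.linalg.inv(sigma)
        xtx = np.zeros((sum(sizes), sum(sizes)))
        xty = np.zeros(sum(sizes))
        for i in range(n):
            # Block-diagonal m x (k_1 + ... + k_m) regressor matrix for observation i.
            xbar = block_diag(*[x_list[j][i] for j in range(m)])
            xtx += xbar.T @ w @ xbar
            xty += xbar.T @ w @ y[i]
        new_stacked = np.linalg.solve(xtx, xty)
        old_stacked = np.concatenate(beta)
        beta = np.split(new_stacked, np.cumsum(sizes)[:-1])
        if np.max(np.abs(new_stacked - old_stacked)) < tol:
            break                              # iterate until convergence
    return np.concatenate(beta), sigma
```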
Under conditional homoskedasticity (11.8) we can derive its asymptotic distribution.
Theorem 11.4 Under Assumption and (11.8)
where .
For a proof, see Exercise 11.11.
Under these assumptions, SUR is more efficient than least squares.
Theorem 11.5 Under Assumption and (11.8)
and thus is asymptotically more efficient than . For a proof, see Exercise 11.12.
An appropriate estimator of the variance of is
Theorem 11.6 Under Assumption and (11.8) .
For a proof, see Exercise 11.13.
In Stata, the seemingly unrelated regressions estimator is implemented using the sureg command.

Equivalence of SUR and Least Squares
When the regressors are common across equations it turns out that the SUR estimator simplifies to least squares.
To see this, recall that common regressors imply that . Then
Thus
A model where regressors are not common across equations is nested within a model with the union of all regressors included in all equations. Thus the model with regressors common across equations is a fully unrestricted model, and a model where the regressors differ across equations is a restricted model. Thus the above result shows that the SUR estimator reduces to least squares in the absence of restrictions, but SUR can differ from least squares otherwise.
Another context where SUR=OLS is when the variance matrix is diagonal, . In this case from which you can calculate that . The intuition is that there is no difference in systems estimation when the equations are uncorrelated, which occurs when is diagonal.
Maximum Likelihood Estimator
Take the linear model under the assumption that the error is independent of the regressors and multivariate normally distributed. Thus with . In this case we can consider the maximum likelihood estimator (MLE) of the coefficients.
It is convenient to reparameterize the covariance matrix in terms of its inverse . With this reparameterization the conditional density of given equals
The log-likelihood function for the sample is
The maximum likelihood estimator maximizes the log-likelihood function. The first order conditions are
and
The second equation uses the matrix results and from Appendix A.20.
Solving and making the substitution we obtain
Notice that each equation refers to the other. Hence these are not closed-form expressions but can be solved via iteration. The solution is identical to the iterated SUR estimator. Thus the iterated SUR estimator is identical to MLE under normality.
Recall that the SUR estimator simplifies to OLS when the regressors are common across equations. The same occurs for the MLE. Thus when we find that and .
Restricted Estimation
In many multivariate regression applications it is desired to impose restrictions on the coefficients. In particular, cross-equation restrictions (for example, imposing Slutsky symmetry on a demand system) can be quite important and can only be imposed by a multivariate estimation method. Estimation subject to restrictions can be done by minimum distance, maximum likelihood, or the generalized method of moments.
Minimum distance is a straightforward application of the methods of Chapter 8 to the estimators presented in this chapter, as such methods apply to any asymptotically normal estimator.
Imposing restrictions on maximum likelihood is also straightforward. The likelihood is maximized subject to the imposed restrictions. One important example is explored in detail in the following section.
Generalized method of moments estimation of multivariate regression subject to restrictions will be explored in Section 13.18. This is a particularly simple and straightforward way to estimate restricted multivariate regression models and is our generally preferred approach.
Reduced Rank Regression
One context where systems estimation is important is when it is desired to impose or test restrictions across equations. Restricted systems are commonly estimated by maximum likelihood under normality. In this section we explore one important special case of restricted multivariate regression known as reduced rank regression. The model was originally proposed by Anderson (1951) and extended by Johansen (1995).
The unrestricted model is
where is is , and . We separate the regressors as and because the coefficient matrix will be restricted while will be unrestricted.
The matrix is full rank if
The reduced rank restriction is for some known .
The reduced rank restriction implies that we can write the coefficient matrix in the factored form where is and is . This representation is not unique as we can replace with and with for any invertible and the same relation holds. Identification therefore requires a normalization of the coefficients. A conventional normalization is for given .
Equivalently, the reduced rank restriction can be imposed by requiring that satisfy the restriction for some coefficient matrix . Since is full rank this requires that , hence is the orthogonal complement of . Note that is not unique as it can be replaced by for any invertible . Thus if is to be estimated it requires a normalization.
We discuss methods for estimation of , and . The standard approach is maximum likelihood under the assumption that . The log-likelihood function for the sample is
Anderson (1951) derived the MLE by imposing the constraint via the method of Lagrange multipliers. This turns out to be algebraically cumbersome.
Johansen (1995) instead proposed the following straightforward concentration method. Treating as if it were known, maximize the log-likelihood with respect to the other parameters. Substituting these estimators back in, we obtain the concentrated log-likelihood function with respect to . This can be maximized to find the MLE for . The other parameter estimators are then obtained by substitution. We now describe these steps in detail.
Given the likelihood is a normal multivariate regression in the variables and , so the MLE for and are least squares. In particular, using the Frisch-Waugh-Lovell residual regression formula we can write the estimators for and as
and
where and .
Substituting these estimators into the log-likelihood function we obtain the concentrated likelihood function, which is a function of only.
The third equality uses Theorem A.1.8. The MLE for is the maximizer of , or equivalently equals
which are the generalized eigenvectors of with respect to corresponding to the largest generalized eigenvalues. (Generalized eigenvalues and eigenvectors are discussed in Section A.14.) The estimator satisfies the normalization . Letting denote the eigenvectors of (11.20) we can also express .
This is computationally straightforward. In MATLAB, for example, the generalized eigenvalues and eigenvectors of one matrix with respect to another are found using the eig command.
Given , the MLE of the remaining parameters are found by least squares regression of on and . We now discuss the estimator of . It turns out that
the eigenvectors of with respect to associated with the largest eigenvalues.
By the dual eigenvalue relation (Theorem A.5), equations (11.20) and (11.21) have the same non-zero eigenvalues and the associated eigenvectors and satisfy the relationship
Letting this implies
The second equality holds because and . Since the eigenvectors satisfy the orthogonality property for , it follows that
Since we conclude that as desired.
The solution in (11.21) can be represented several ways. One which is computationally convenient is to observe that
where and is the residual matrix from the unrestricted multivariate least squares regression of on and . The first equality follows by the Frisch-Waugh-Lovell theorem. This shows that are the generalized eigenvectors of with respect to corresponding to the largest eigenvalues. In MATLAB, for example, these can be computed using the eig command.
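Since the computation reduces to a generalized eigenvalue problem, the following small sketch shows the analogous computation in Python with scipy; the matrix names a and b stand for the two moment matrices described above and are placeholders.

```python
import numpy as np
from scipy.linalg import eigh

def generalized_eigvecs(a, b, r, largest=True):
    """Generalized eigenvectors of a with respect to b (a symmetric, b positive definite).

    Solves a v = lambda * b v and returns, as columns, the r eigenvectors
    associated with the largest (or smallest) generalized eigenvalues.
    """
    eigvals, eigvecs = eigh(a, b)            # eigenvalues in ascending order
    order = np.argsort(eigvals)
    if largest:
        order = order[::-1]
    return eigvecs[:, order[:r]]
```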
Another representation is to write so that
We summarize our findings. Theorem 11.7 The MLE for the reduced rank model (11.19) under is given as follows. Let and be the residual matrices from multivariate regression of and on , respectively. Then , the generalized eigenvectors of with respect to corresponding to the largest eigenvalues and are obtained by the least squares regression
Let be the residual matrix from a multivariate regression of on and . Then equals the generalized eigenvectors of with respect to corresponding to the smallest eigenvalues. The maximized likelihood equals
An R package for reduced rank regression is “RRR”. I am unaware of a Stata command.
Principal Component Analysis
In Section we described the Duflo, Dupas, and Kremer (2011) dataset which is a sample of Kenyan first grade test scores. Following the authors we focused on the variable totalscore which is each student’s composite test score. If you examine the data file you will find other pieces of information about the students’ performance, including each student’s score on separate sections of the test, with the labels wordscore (word recognition), sentscore (sentence recognition), letterscore (letter recognition), spellscore (spelling), additions_score (addition), substractions_score (subtraction), multiplications_score (multiplication). The “total” score sums the scores from the individual sections. Perhaps there is more information in the section scores. How can we learn about this from the data?
Principal component analysis (PCA) addresses this issue by ordering linear combinations by their contribution to variance.
Definition 11.1 Let be a random vector.
The first principal component is where satisfies
The second principal component is where
In general, the principal component is where
The principal components of are linear combinations ranked by contribution to variance. By the properties of quadratic forms (Section A.15) the weight vectors are the eigenvectors of .
Theorem 11.8 The principal components of are , where is the eigenvector of associated with the ordered eigenvalue of .
Another way to see the PCA construction is as follows. Since is symmetric the spectral decomposition (Theorem A.3) states that where and are the eigenvectors and eigenvalues of . Since is positive semi-definite the eigenvalues are real, non-negative, and ordered . Let be the principal components of . By Theorem 11.8, . The covariance matrix of is
which is diagonal. This shows that and the principal components are mutually uncorrelated. The relative variance contribution of the principal component is .
Principal components are sensitive to the scaling of . Consequently, it is recommended to first scale each element of to have mean zero and unit variance. In this case is a correlation matrix.
The sample principal components are obtained by replacing the unknowns by sample estimators. Let be the sample covariance or correlation matrix and its ordered eigenvectors. The sample principal components are .
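A minimal numpy sketch of sample principal components computed from the correlation matrix follows; the function name and the ddof convention are illustrative choices.

```python
import numpy as np

def principal_components(data):
    """Sample principal components of an (n, k) data matrix.

    Each column is standardized to mean zero and unit variance, so the
    eigen-decomposition is applied to the sample correlation matrix.
    Returns (components, eigenvalues, weight_vectors).
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)             # k x k sample correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]               # sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = z @ eigvecs                        # j-th column is the j-th principal component
    return components, eigvals, eigvecs

# The proportion of variance explained by the j-th component is eigvals[j] / eigvals.sum().
```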
To illustrate we use the Duflo, Dupas, and Kremer (2011) dataset. In Table 11.1 we display the seven eigenvalues of the sample correlation matrix for the seven test scores described above. The seven eigenvalues sum to seven because we have applied PCA to the correlation matrix. The first eigenvalue is , implying that the first principal component explains of the variance of the seven test scores. The second eigenvalue is , implying that the second principal component explains of the variance. Together the first two components explain of the variance of the seven test scores.
Table 11.1: Eigenvalue Decomposition of Sample Correlation Matrix

| |Eigenvalue|Proportion|
|:--|----------|----------|
|1| | |
|2| | |
|3| | |
|4| | |
|5| | |
|6| | |
|7| | |

Table 11.2: Principal Component Weight Vectors

| |First Component|Second Component|
|:--|----------|----------|
|words| | |
|sentences| | |
|letters| | |
|spelling| | |
|addition| | |
|subtraction| | |
|multiplication| | |

In Table 11.2 we display the weight vectors (eigenvectors) for the first two principal components. The weights for the first component are all positive and similar in magnitude. This means that the first
principal component is similar to a simple average of the seven test scores. This is quite fascinating. This is consistent with our intuition that a simple average (e.g. the variable totalscore) captures most of the information contained in the seven test scores. The weights for the second component have a different pattern. The four literacy scores receive negative weight and the three math scores receive positive weight with similar magnitudes. This means that the second principal component is similar to the difference between a student’s math and verbal test scores. Taken together, the information in the first two principal components is equivalent to “average verbal” and “average math” test scores. What this shows is that of the variation in the seven section test scores can be explained by a simple average (e.g. totalscore), and can be explained by averages for the verbal and math halves of the test.
In Stata, principal components analysis can be implemented with the pca command. In R, use prcomp or princomp. All three can be applied to either covariance matrices (unscaled data) or correlation matrices (normalized data) but they have different default settings. The Stata pca command by default normalizes the observations. The R commands by default do not normalize the observations.
Factor Models
Closely related to principal components are factor models. These are statistical models which decompose random vectors into common factors and idiosyncratic errors. Factor models are popular throughout the social sciences. Consequently a variety of estimation methods have been developed. In the next few sections we focus on methods which are popular among economists.
Let be a random vector (for example the seven test scores described in the previous section). Assume that the elements of are scaled to have mean zero and unit variance.
A single factor model for is
where are factor loadings, is a common factor, and is a random error. The factor is individual-specific while the coefficient is common across individuals. The model (11.22) specifies that correlation between the elements of is due to the common factor . In the student test score example it is intuitive to think of as a student’s scholastic “aptitude”; in this case the vector describes how scholastic aptitude affects the seven subject scores.
A multiple factor model has factors. We write the model as
where is a matrix of factor loadings and is an vector of factors. In the student test score example possible factors could be “math aptitude”, “language skills”, “social skills”, “artistic ability”, “creativity”, etc. The factor loading matrix indicates the effect of each factor on each test score. The number of factors is taken as known. We discuss selection of later.
The error vector is assumed to be mean zero, uncorrelated with , and (under correct specification) to have mutually uncorrelated elements. We write its covariance matrix as . The factor vector can either be treated as random or as a regressor. In this section we treat as random; in the next we treat as regressors. The random factors are assumed mean zero and are normalized so that
The assumptions imply that the correlation matrix equals
The factor analysis literature often describes as the communality and the idiosyncratic error matrix as the uniqueness. The former is the portion of the variance which is explained by the factor model and the latter is the unexplained portion of the variance.
The model is often estimated by maximum likelihood. Under joint normality of the distribution of is . The parameters are and . The log-likelihood function of a random sample is
The MLE maximizes the log-likelihood. There is no algebraic solution so the estimator is found using numerical methods. Fortunately, computational algorithms are available in standard packages. A detailed description and analysis can be found in Anderson (2003, Chapter 14).
The form of the log-likelihood is intriguing. Notice that the log-likelihood is only a function of the observations through its correlation matrix , and only a function of the parameters through the population correlation matrix . The final term in (11.25) is a measure of the match between and . Together, we see that the Gaussian log-likelihood is essentially a measure of the fit of the model and sample correlation matrices. It is therefore not reliant on the normality assumption.
It is often of interest to estimate the factors . Given , the equation can be viewed as a regression with coefficient . Its least squares estimator is . The GLS estimator (taking into account the covariance matrix of ) is . This motivates the Bartlett scoring estimator
The idealized version satisfies
which is unbiased for and has variance . Thus the Bartlett scoring estimator is typically described as “unbiased” though this is actually a property of its idealized version .
There are other estimators used in applied factor analysis. However, there is little reason to consider estimators beyond the MLE of this section and the principal components estimator of the next section.
A second estimator for the factors can be constructed from the multivariate linear projection of on . This is where the coefficient matrix is . The coefficient matrix equals
the second equation using . The predicted value of is . This motivates the regression scoring estimator
The idealized version has conditional expectation and is thus biased for . Hence the regression scoring estimator is often described as “biased”. Some algebraic manipulations reveal that has MSE which is smaller (in a positive definite sense) than the MSE of the idealized Bartlett estimator .
Which estimator is preferred, Bartlett or regression scoring? The differences diminish when is large so the choice is most relevant for small to moderate . The regression scoring estimator has lower approximate MSE, meaning that it is a more precise estimator. Thus based on estimation precision this is our recommended choice.
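Given estimated loadings and idiosyncratic variances, the two scoring estimators have the standard closed forms sketched below (Bartlett: $(\Lambda'\Psi^{-1}\Lambda)^{-1}\Lambda'\Psi^{-1}Y$; regression: $\Lambda'(\Lambda\Lambda'+\Psi)^{-1}Y$); the argument names are illustrative.

```python
import numpy as np

def bartlett_scores(y, lam, psi):
    """Bartlett scoring: F_i = (L' Psi^{-1} L)^{-1} L' Psi^{-1} Y_i.

    y   : (n, k) standardized observations
    lam : (k, r) estimated factor loadings
    psi : (k,) estimated idiosyncratic variances (diagonal of Psi)
    """
    lp = lam.T / psi                                  # L' Psi^{-1}, shape (r, k)
    return y @ np.linalg.solve(lp @ lam, lp).T        # (n, r) factor estimates

def regression_scores(y, lam, psi):
    """Regression scoring: F_i = L' (L L' + Psi)^{-1} Y_i."""
    sigma = lam @ lam.T + np.diag(psi)                # model-implied covariance of Y
    return y @ np.linalg.solve(sigma, lam)            # (n, r) factor estimates
```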
The factor loadings and factors are not separately identified. To see this, notice that if you replace with and where is and orthonormal then the regression model is identical. Such replacements are called “rotations” in the factor analysis literature. Any orthogonal rotation of the factor loadings is an equally valid representation. The default MLE outputs are one specific rotation; others can be obtained by a variety of algorithms (which we do not review here). Consequently it is unwise to attribute meaning to the individual factor loading estimates.
Another important and tricky issue is selection of the number of factors . There is no clear guideline. One approach is to examine the principal component decomposition, look for a division between the "large" and "small" eigenvalues, and set equal to the number of "large" eigenvalues. Another approach is based on testing. As a by-product of the MLE (and standard package implementations) we obtain the LR test for the null hypothesis of factors against the alternative hypothesis of factors. If the LR test rejects (has a small p-value) this is evidence that the given may be too small.
In Stata, the MLE can be calculated with the factor, ml factors(r) command. The factor estimates and can be calculated by the predict command with either the bartlett or regression option, respectively. In R, the command factanal( , factors=r, rotation="none") calculates the MLE and also calculates the factor estimates and/or using the scores option.
Approximate Factor Models
The MLE of the previous section is a good choice for factor estimation when the number of variables is small and the factor model is believed to be correctly specified. In many economic applications of factor analysis, however, the number of variables is large. In such contexts the MLE can be computationally costly and/or unstable. Furthermore, it is typically not credible to believe that the model is correctly specified; rather, it is more reasonable to view the factor model as a useful approximation. In this section we explore an approach known as the approximate factor model with estimation by principal components. The estimation method is justified by an asymptotic framework where the number of variables increases with the sample size. The approximate factor model was introduced by Chamberlain and Rothschild (1983). It is the same as (11.23) but relaxes the assumption on the idiosyncratic error so that the covariance matrix is left unrestricted. In this context the Gaussian MLE of the previous section is misspecified.
Chamberlain and Rothschild (and the literature which followed) proposed estimation by least squares. The idea is to treat the factors as unknown regressors and simultaneously estimate the factors and factor loadings . We first describe the estimation method.
Let be a sample centered at sample means. The least squares criterion is
Let be the joint minimizers. As and are not separately identified a normalization is needed. For compatibility with the notation of the previous section we use .
We use a concentration argument to find the solution. As described in the previous section, each observation satisfies the multivariate equation . For fixed this is a set of equations with unknowns . The least squares solution is . Substituting this expression into the least squares criterion the concentrated least squares criterion for is
where is the sample covariance matrix. The least squares estimator minimizes this criterion. Let and be first eigenvalues and eigenvectors of . Using the normalization , from the extrema results of Section A.15 the minimizer of the least squares criterion is . More broadly any rotation of is valid. Consider . Recall the expression for the factors . We find that the estimated factors are
We calculate that
which is the desired normalization. This shows that the rotation produces factor estimates satisfying this normalization.
We have proven the following result.
Theorem 11.9 The least squares estimator of the factor model (11.23) under the normalization has the following solution:
Let and be the first eigenvalues and eigenvectors of the sample covariance matrix .
Theorem 11.9 shows that the least squares estimator is based on an eigenvalue decomposition of the sample covariance matrix. This is computationally stable even in high dimensions.
The factor estimates are the principal components scaled by the eigenvalues of . Specifically, the factor estimate is . Consequently many authors call this estimator the “principal component method”.
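A sketch of the principal-components (least squares) estimator follows. The scaling convention used here, loadings equal to the leading eigenvectors times the square roots of their eigenvalues and factors equal to eigenvalue-scaled principal components, is one standard normalization and is assumed rather than copied from the missing displays.

```python
import numpy as np

def pc_factor_estimates(y, r):
    """Principal-components estimation of an approximate factor model.

    y : (n, k) data matrix (columns are centered internally)
    r : number of factors
    Returns loadings (k, r) and factor estimates (n, r) under the assumed
    normalization Lambda = H_r D_r^{1/2}, F_i = D_r^{-1/2} H_r' Y_i.
    """
    yc = y - y.mean(axis=0)
    s = (yc.T @ yc) / yc.shape[0]                  # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(s)
    keep = np.argsort(eigvals)[::-1][:r]           # indices of the r largest eigenvalues
    d, h = eigvals[keep], eigvecs[:, keep]
    loadings = h * np.sqrt(d)                      # k x r factor loadings
    factors = (yc @ h) / np.sqrt(d)                # n x r eigenvalue-scaled principal components
    return loadings, factors
```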
Unfortunately, is inconsistent for if is fixed, as we now show. By the WLLN and CMT, and , the first eigenvectors of . When is diagonal, the eigenvectors of do not lie in the range space of except in the special case . Consequently the estimator is inconsistent.
This inconsistency should not be viewed as surprising. The sample has a total of observations and the model has a total of parameters. Since the number of estimated parameters is proportional to the sample size we should not expect estimator consistency.
As first recognized by Chamberlain and Rothschild, this deficiency diminishes as increases. Specifically, assume that as . One implication is that the number of observations increase at a rate faster than , while the number of parameters increase at a rate proportional to . Another implication is that as increases there is increasing information about the factors.
To make this precise we add the following assumption. Let and denote the smallest and largest eigenvalues of a positive semi-definite matrix .
Assumption 11.1 As $k \to \infty$:
1. $\lambda_{\max}(\Psi) \le B < \infty$.
2. $\lambda_{\min}(\Lambda'\Lambda) \to \infty$.
Assumption 11.1.1 bounds the covariance matrix of the idiosyncratic errors. When this is the same as bounding the individual variances. Effectively Assumption 11.1.1 means that while the elements of can be correlated they cannot have a correlation structure similar to that of a factor model. Assumption 11.1.2 requires the factor loading matrix to increase in magnitude as the number of variables increases. This is a fairly mild requirement. When the factor loadings are of similar magnitude across variables, . Conceptually, Assumption 11.1.2 requires additional variables to add information about the unobserved factors.
Assumption 11.1 implies that in the covariance matrix factorization the component dominates as increases. This means that for large the first eigenvectors of are equivalent to those of , which are in the range space of . This observation led Chamberlain and Rothschild (1983) to deduce that the principal components estimator is an asymptotic (large ) analog estimator for the factor loadings and factors. Bai (2003) demonstrated that the estimator is consistent as jointly. The conditions and proofs are technical so are not reviewed here.
Now consider the estimated factors
where for simplicity we ignore estimation error. Since and we can write this as
This shows that is an unbiased estimator for and has variance . Under Assumption 11.1, . Thus is consistent for as . Bai (2003) shows that this extends to the feasible estimator as .
In Stata, the least squares estimator and factors can be calculated with the factor, pcf factors(r) command followed by predict. In R, a feasible estimation approach is to calculate the factors directly by eigenvalue decomposition.
Factor Models with Additional Regressors
Consider the model
where and are is is is , and is .
The coefficients and can be estimated by a combination of factor regression (either MLE or principal components) and least squares. The key is the following two observations:
Given , the coefficient can be estimated by factor regression applied to .
Given the factors , the coefficients and can be estimated by multivariate least squares of on and .
Estimation iterates between these two steps. Start with a preliminary estimator of obtained by multivariate least squares of on . Then apply the above two steps and iterate until convergence.
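A sketch of this iteration is given below, assuming a model of the additive form $Y_i = B'X_i + \Lambda F_i + e_i$ (one reading of the display above), principal-components factor estimation via the pc_factor_estimates helper sketched in the previous section, and a fixed number of iterations in place of a convergence check.

```python
import numpy as np

def factor_model_with_regressors(y, x, r, n_iter=50):
    """Iterative estimation of an assumed model Y_i = B'X_i + Lambda F_i + e_i.

    y : (n, k) outcomes, x : (n, p) additional regressors, r : number of factors.
    Alternates between (i) factor estimation applied to Y - X B and
    (ii) multivariate least squares of Y on X and the estimated factors.
    """
    b = np.linalg.lstsq(x, y, rcond=None)[0]            # preliminary B from Y on X alone
    for _ in range(n_iter):
        # Step 1: factor regression on the partial residual.
        loadings, factors = pc_factor_estimates(y - x @ b, r)
        # Step 2: multivariate least squares of Y on (X, F).
        w = np.column_stack([x, factors])
        coef = np.linalg.lstsq(w, y, rcond=None)[0]
        b, loadings = coef[: x.shape[1]], coef[x.shape[1]:].T
    return b, loadings, factors
```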
Factor-Augmented Regression
In the previous sections we considered factor models which decompose a set of variables into common factors and idiosyncratic errors. In this section we consider factor-augmented regression, which uses such common factors as regressors for dimension reduction.
Suppose we have the variables where , and . In practice, may be large and the elements of may be highly correlated. The factor-augmented regression model is
The random variables are , and . The regression coefficients are and . The matrix contains the factor loadings.
This model specifies that the influence of on is through the common factors . The idea is that the variation in the regressors is mostly captured by the variation in the factors, so the influence of the regressors can be captured through these factors. This can be viewed as a dimension-reduction technique as we have reduced the -dimensional to the -dimensional . Interest typically focuses on the regressors and their coefficients . The factors are included in the regression as "controls" and their coefficients are typically of less interest. Since it is difficult to interpret the factors (only their range space is identified) it is generally prudent to avoid interpreting the coefficients . The model is typically estimated in multiple steps. First, the factor loadings and factors are estimated by factor regression. In the case of principal-components estimation the factor estimates are the scaled principal components . Second, is regressed on the estimated factors and the other regressors to obtain the estimator of and . This second-step estimator equals (for simplicity assume there is no )
Now let’s investigate its asymptotic behavior. As and so
Recall and . We calculate that
We find that the right-hand-side of (11.26) equals
which does not equal . Thus has a probability limit but is inconsistent for as .
This deficiency diminishes as . Indeed,
as . This implies . Hence, if we take the sequential asymptotic limit followed by , we find . This implies that the estimator is consistent. Bai (2003) demonstrated consistency under the more rigorous but technically challenging setting where jointly. The implication of this result is that factor augmented regression is consistent if both the sample size and dimension of are large.
For asymptotic normality of it turns out that we need to strengthen Assumption 11.1.2. The relevant condition is . This is similar to the condition that . This is technical but can be interpreted as meaning that is large relative to . Intuitively, this requires that the dimension of is larger than the sample size .
In Stata, estimation takes the following steps. First, the factor command is used to estimate the factor model. Either MLE or principal components estimation can be used. Second, the predict command is used to estimate the factors, either by Bartlett or regression scoring. Third, the factors are treated as regressors in the estimated regression. The unscaled principal components can equivalently be used if the coefficients are not reported. The coefficient estimates are unaffected by the choice of factor scaling.
Multivariate Normal*
Some interesting sampling results hold for matrix-valued normal variates. Let be an matrix whose rows are independent and distributed . We say that is multivariate matrix normal, and write , where is with each row equal to . The notation is due to the fact that
Definition 11.2 If then is distributed Wishart with degrees of freedom and covariance matrix , and is written as .
The Wishart is a multivariate generalization of the chi-square. If then .
The Wishart arises as the exact distribution of a sample covariance matrix in the normal sampling model. The bias-corrected estimator of is
Theorem 11.10 If are independent then .
The following manipulation is useful.
Theorem 11.11 If then for
To prove this, note that without loss of generality we can take and . Let be orthonormal with first row equal to , so that . Since the distributions of and are identical we can without loss of generality set . Partition where is , is , and they are independent. Then
where . The final distributional equality holds conditional on by the same argument in the proof of Theorem 5.7. Since this does not depend on it is the unconditional distribution as well. This establishes the stated result.
To test hypotheses about , a classical statistic is known as Hotelling's :
Theorem 11.12 If then
a scaled F distribution.
To prove this recall that is independent of . Apply Theorem 11.11 with . Conditional on and using the fact that ,
Since the two chi-square variables are independent, this is the stated result.
A very interesting property of this result is that the statistic is a multivariate quadratic form in normal random variables, yet it has the exact distribution.
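As an illustration, a sketch of the Hotelling statistic for testing that a mean vector equals a hypothesized value, together with its classical F calibration, is given below; the function name and interface are illustrative.

```python
import numpy as np
from scipy.stats import f

def hotelling_t2(x, mu0):
    """Hotelling's T^2 test of H0: E[X] = mu0 for an (n, m) sample x."""
    n, m = x.shape
    xbar = x.mean(axis=0)
    s = np.cov(x, rowvar=False)                     # bias-corrected covariance (divides by n-1)
    d = xbar - np.asarray(mu0)
    t2 = n * d @ np.linalg.solve(s, d)              # T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0)
    # Classical result: ((n - m) / (m (n - 1))) * T^2 ~ F(m, n - m) under normality.
    fstat = (n - m) / (m * (n - 1)) * t2
    pvalue = f.sf(fstat, m, n - m)
    return t2, fstat, pvalue
```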
Exercises
Exercise 11.1 Show (11.10) when the errors are conditionally homoskedastic (11.8).
Exercise 11.2 Show (11.11) when the regressors are common across equations .
Exercise 11.3 Show (11.12) when the regressors are common across equations and the errors are conditionally homoskedastic (11.8).
Exercise 11.4 Prove Theorem 11.1.
Exercise 11.5 Show (11.13) when the regressors are common across equations .
Exercise 11.6 Show (11.14) when the regressors are common across equations and the errors are conditionally homoskedastic (11.8).
Exercise 11.7 Prove Theorem 11.2.
Exercise 11.8 Prove Theorem 11.3.
Exercise 11.9 Show that (11.16) follows from the steps described.
Exercise 11.10 Show that (11.17) follows from the steps described.
Exercise 11.11 Prove Theorem 11.4.
Exercise 11.12 Prove Theorem 11.5.
Hint: First, show that it is sufficient to show that
Second, rewrite this equation using the transformations and , and then apply the matrix Cauchy-Schwarz inequality (B.33).
Exercise 11.13 Prove Theorem 11.6.
Exercise 11.14 Take the model
where is scalar, is a vector and is an vector. and are and is . The sample is with unobserved.
Consider the estimator for by OLS of on where is the OLS coefficient from the multivariate regression of on .
(a) Show that is consistent for .
(b) Find the asymptotic distribution as assuming that .
(c) Why is the assumption an important simplifying condition in part (b)?
(d) Using the result in (c) construct an appropriate asymptotic test for the hypothesis .
Exercise 11.15 The observations are i.i.d., . The dependent variables and are real-valued. The regressor is a -vector. The model is the two-equation system
(a) What are the appropriate estimators and for and ?
(b) Find the joint asymptotic distribution of and .
(c) Describe a test for .