In Chapter 5 we briefly introduced hypothesis testing in the context of the normal regression model. In this chapter we explore hypothesis testing in greater detail with a particular emphasis on asymptotic inference. For more detail on the foundations see Chapter 13 of Probability and Statistics for Economists.
Hypotheses
In Chapter 8 we discussed estimation subject to restrictions, including linear restrictions (8.1), nonlinear restrictions (8.44), and inequality restrictions (8.49). In this chapter we discuss tests of such restrictions.
Hypothesis tests attempt to assess whether there is evidence contrary to a proposed restriction. Let $\theta = r(\beta)$ be a parameter of interest where $r$ is some transformation of the coefficient vector $\beta$. For example, $\theta$ may be a single coefficient, e.g. $\theta = \beta_j$, the difference between two coefficients, e.g. $\theta = \beta_j - \beta_\ell$, or the ratio of two coefficients, e.g. $\theta = \beta_j / \beta_\ell$.
A point hypothesis concerning $\theta$ is a proposed restriction such as
$$\theta = \theta_0 \qquad (9.1)$$
where $\theta_0$ is a hypothesized (known) value.
More generally, letting $\Theta$ be the parameter space for $\theta$, a hypothesis is a restriction $\theta \in \Theta_0$ where $\Theta_0$ is a proper subset of $\Theta$. This specializes to (9.1) by setting $\Theta_0 = \{\theta_0\}$.
In this chapter we will focus exclusively on point hypotheses of the form (9.1) as they are the most common and relatively simple to handle.
The hypothesis to be tested is called the null hypothesis.
Definition 9.1 The null hypothesis $\mathbb{H}_0$ is the restriction $\theta = \theta_0$ or $\theta \in \Theta_0$.
We often write the null hypothesis as $\mathbb{H}_0 : \theta = \theta_0$ or $\mathbb{H}_0 : \theta \in \Theta_0$.
The complement of the null hypothesis (the collection of parameter values which do not satisfy the null hypothesis) is called the alternative hypothesis.
Definition 9.2 The alternative hypothesis $\mathbb{H}_1$ is the set $\{\theta : \theta \neq \theta_0\}$ or $\{\theta \in \Theta : \theta \notin \Theta_0\}$. We often write the alternative hypothesis as $\mathbb{H}_1 : \theta \neq \theta_0$ or $\mathbb{H}_1 : \theta \in \Theta_1$. For simplicity, we often refer to the hypotheses as “the null” and “the alternative”. Figure 9.1(a) illustrates the division of the parameter space into null and alternative hypotheses.

[Figure 9.1: Hypothesis Testing. Panel (a): Null and Alternative Hypotheses. Panel (b): Acceptance and Rejection Regions.]
In hypothesis testing, we assume that there is a true (but unknown) value of $\theta$ which either satisfies $\mathbb{H}_0$ or does not satisfy $\mathbb{H}_0$. The goal of hypothesis testing is to assess whether or not $\mathbb{H}_0$ is true by asking if $\mathbb{H}_0$ is consistent with the observed data.
To be specific, take our example of wage determination and consider the question: Does union membership affect wages? We can turn this into a hypothesis test by specifying the null as the restriction that a coefficient on union membership is zero in a wage regression. Consider, for example, the estimates reported in Table 4.1. The coefficients for “Male Union Member” and “Female Union Member” reported there are positive estimated wage premia. These are estimates, not the true values. The question is: Are the true coefficients zero? To answer this question the testing method asks: Are the observed estimates compatible with the hypothesis, in the sense that the deviation from the hypothesis can be reasonably explained by stochastic variation? Or are the observed estimates incompatible with the hypothesis, in the sense that the observed estimates would be highly unlikely if the hypothesis were true?
Acceptance and Rejection
A hypothesis test either accepts the null hypothesis or rejects the null hypothesis in favor of the alternative hypothesis. We can describe these two decisions as “Accept $\mathbb{H}_0$” and “Reject $\mathbb{H}_0$”. In the example given in the previous section the decision is either to accept the hypothesis that union membership does not affect wages, or to reject the hypothesis in favor of the alternative that union membership does affect wages.
The decision is based on the data and so is a mapping from the sample space to the decision set. This splits the sample space into two regions $S_0$ and $S_1$ such that if the observed sample falls into $S_0$ we accept $\mathbb{H}_0$, while if the sample falls into $S_1$ we reject $\mathbb{H}_0$. The set $S_0$ is called the acceptance region and the set $S_1$ the rejection or critical region.
It is convenient to express this mapping as a real-valued function called a test statistic
$$T = T(\text{data})$$
relative to a critical value $c$. The hypothesis test then consists of the decision rule:
1. Accept $\mathbb{H}_0$ if $T \le c$.
2. Reject $\mathbb{H}_0$ if $T > c$.
Figure 9.1(b) illustrates the division of the sample space into acceptance and rejection regions.
A test statistic $T$ should be designed so that small values are likely when $\mathbb{H}_0$ is true and large values are likely when $\mathbb{H}_1$ is true. There is a well developed statistical theory concerning the design of optimal tests. We will not review that theory here, but instead refer the reader to Lehmann and Romano (2005). In this chapter we will summarize the main approaches to the design of test statistics.
The most commonly used test statistic is the absolute value of the t-statistic
$$T = |T(\theta_0)| \qquad (9.2)$$
where
$$T(\theta) = \frac{\hat{\theta} - \theta}{s(\hat{\theta})} \qquad (9.3)$$
is the t-statistic from (7.33), $\hat{\theta}$ is a point estimator, and $s(\hat{\theta})$ its standard error. $T$ is an appropriate statistic when testing hypotheses on individual coefficients or real-valued parameters, and $\theta_0$ is the hypothesized value. Quite typically $\theta_0 = 0$, as interest focuses on whether or not a coefficient equals zero, but this is not the only possibility. For example, interest may focus on whether an elasticity equals 1, in which case we may wish to test $\mathbb{H}_0 : \theta = 1$.
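As a concrete illustration, the statistic takes only a few lines to compute. The text's numerical work uses MATLAB; the following Python sketch uses purely hypothetical numbers (none are from Table 4.1):

```python
# Sketch of the absolute t-statistic T = |theta_hat - theta_0| / s(theta_hat).
# All numerical inputs below are hypothetical illustrations.

def abs_t_stat(theta_hat, theta_0, se):
    """Absolute t-statistic for testing H0: theta = theta_0."""
    return abs(theta_hat - theta_0) / se

T_zero = abs_t_stat(0.10, 0.0, 0.04)   # testing H0: theta = 0
T_unit = abs_t_stat(0.95, 1.0, 0.04)   # testing H0: theta = 1 (e.g. a unit elasticity)
```

The same function handles both null values; only the hypothesized $\theta_0$ changes.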
Type I Error
A false rejection of the null hypothesis (rejecting $\mathbb{H}_0$ when $\mathbb{H}_0$ is true) is called a Type I error. The probability of a Type I error is
$$\Pr[\text{Reject } \mathbb{H}_0 \mid \mathbb{H}_0 \text{ true}] = \Pr[T > c \mid \mathbb{H}_0 \text{ true}] \qquad (9.4)$$
and is called the size of the test.
The uniform size of the test is the supremum of (9.4) across all data distributions which satisfy $\mathbb{H}_0$. A primary goal of test construction is to limit the incidence of Type I error by bounding the size of the test.
For the reasons discussed in Chapter 7, in typical econometric models the exact sampling distributions of estimators and test statistics are unknown and hence we cannot explicitly calculate (9.4). Instead, we typically rely on asymptotic approximations. Suppose that the test statistic has an asymptotic distribution under $\mathbb{H}_0$. That is, when $\mathbb{H}_0$ is true
$$T \xrightarrow{d} \xi \qquad (9.5)$$
as $n \to \infty$ for some continuously-distributed random variable $\xi$. This is not a substantive restriction as most conventional econometric tests satisfy (9.5). Let $G(u) = \Pr[\xi \le u]$ denote the distribution of $\xi$. We call $\xi$ (or $G$) the asymptotic null distribution. It is desirable to design test statistics $T$ whose asymptotic null distribution is known and does not depend on unknown parameters. In this case we say that $T$ is asymptotically pivotal.
For example, if the test statistic equals the absolute t-statistic from (9.2), then we know from Theorem 7.11 that if $\theta = \theta_0$ (that is, the null hypothesis holds), then $T = |T(\theta_0)| \xrightarrow{d} |Z|$ as $n \to \infty$, where $Z \sim \mathrm{N}(0,1)$. This means that $G(u) = \Pr[|Z| \le u] = 2\Phi(u) - 1$, the distribution of the absolute value of the standard normal as shown in (7.34). This distribution does not depend on unknowns and is pivotal.
We define the asymptotic size of the test as the asymptotic probability of a Type I error:
$$\lim_{n \to \infty} \Pr[T > c \mid \mathbb{H}_0 \text{ true}] = \Pr[\xi > c] = 1 - G(c).$$
We see that the asymptotic size of the test is a simple function of the asymptotic null distribution $G$ and the critical value $c$. For example, the asymptotic size of a test based on the absolute t-statistic with critical value $c$ is $2(1 - \Phi(c))$.
In the dominant approach to hypothesis testing the researcher pre-selects a significance level $\alpha \in (0,1)$ and then selects $c$ so that the asymptotic size is no larger than $\alpha$. When the asymptotic null distribution $G$ is pivotal we accomplish this by setting $c$ equal to the $1 - \alpha$ quantile of the distribution $G$. (If the distribution $G$ is not pivotal more complicated methods must be used.) We call $c$ the asymptotic critical value because it has been selected from the asymptotic null distribution. For example, since $2(1 - \Phi(1.96)) = 0.05$, the asymptotic 5% critical value for the absolute t-statistic is $c = 1.96$. Calculation of normal critical values is done numerically in statistical software. For example, in MATLAB the command is norminv(1-alpha/2).
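The critical-value calculation just described can be sketched numerically. The text uses MATLAB's norminv; scipy.stats.norm.ppf is the analogous quantile function:

```python
# Two-sided asymptotic critical value for the absolute t-statistic:
# c solves 2*(1 - Phi(c)) = alpha, i.e. c = Phi^{-1}(1 - alpha/2).
from scipy.stats import norm

alpha = 0.05
c = norm.ppf(1 - alpha / 2)    # approximately 1.96
size = 2 * (1 - norm.cdf(c))   # plugging c back in recovers alpha
```

The round-trip check in the last line confirms that the quantile and the size formula are inverses of each other.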
t tests
As we mentioned earlier, the most common test of the one-dimensional hypothesis $\mathbb{H}_0 : \theta = \theta_0$ against the alternative $\mathbb{H}_1 : \theta \neq \theta_0$ is based on the absolute value of the t-statistic (9.3). We now formally state its asymptotic null distribution, which is a simple application of Theorem 7.11.
Theorem 9.1 Under Assumptions 7.2, 7.3, and $\mathbb{H}_0 : \theta = \theta_0$, $T(\theta_0) \xrightarrow{d} Z \sim \mathrm{N}(0,1)$. For $c$ satisfying $\alpha = 2(1 - \Phi(c))$,
$$\Pr[\,|T(\theta_0)| > c \mid \mathbb{H}_0\,] \to \alpha$$
and the test “Reject $\mathbb{H}_0$ if $|T(\theta_0)| > c$” has asymptotic size $\alpha$.
Theorem 9.1 shows that asymptotic critical values can be taken from the normal distribution. As in our discussion of asymptotic confidence intervals (Section 7.13) the critical value could alternatively be taken from the student $t$ distribution, which would give the exact test in the normal regression model (Section 5.12). Indeed, student $t$ critical values are the default in packages such as Stata. Since the critical values from the student $t$ distribution are (slightly) larger than those from the normal distribution, student $t$ critical values slightly decrease the rejection probability of the test. In practical applications the difference is typically unimportant unless the sample size is quite small (in which case the asymptotic approximation should be questioned as well).
The alternative hypothesis $\theta \neq \theta_0$ is sometimes called a “two-sided” alternative. In contrast, sometimes we are interested in testing for one-sided alternatives such as $\mathbb{H}_1 : \theta > \theta_0$ or $\mathbb{H}_1 : \theta < \theta_0$. Tests of $\mathbb{H}_0 : \theta = \theta_0$ against $\mathbb{H}_1 : \theta > \theta_0$ are based on the signed t-statistic $T = T(\theta_0)$. The hypothesis $\mathbb{H}_0$ is rejected in favor of $\mathbb{H}_1$ if $T > c$ where $c$ satisfies $\alpha = 1 - \Phi(c)$. Negative values of $T$ are not taken as evidence against $\mathbb{H}_0$, as point estimates less than $\theta_0$ do not point to $\theta > \theta_0$. Since the critical values are taken from the single tail of the normal distribution they are smaller than for two-sided tests. Specifically, the asymptotic 5% critical value is $c = 1.645$. Thus, we reject $\mathbb{H}_0$ in favor of $\mathbb{H}_1 : \theta > \theta_0$ if $T > 1.645$.
Conversely, tests of $\mathbb{H}_0 : \theta = \theta_0$ against $\mathbb{H}_1 : \theta < \theta_0$ reject for negative t-statistics, e.g. if $T < -c$. Large positive values of $T$ are not evidence for $\mathbb{H}_1 : \theta < \theta_0$. An asymptotic 5% test rejects if $T < -1.645$.
There seems to be an ambiguity. Should we use the two-sided critical value 1.96 or the one-sided critical value 1.645? The answer is that in most cases the two-sided critical value is appropriate. We should use the one-sided critical values only when the parameter space is known to satisfy a one-sided restriction such as $\theta \ge \theta_0$. This is when the test of $\theta = \theta_0$ against $\theta > \theta_0$ makes sense. If the restriction $\theta \ge \theta_0$ is not known a priori then imposing this restriction to test $\theta = \theta_0$ against $\theta > \theta_0$ does not make sense. Since linear regression coefficients typically do not have a priori sign restrictions, the standard convention is to use two-sided critical values.
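The two critical values discussed above are both normal quantiles and can be verified numerically; a scipy sketch (the text's MATLAB equivalent is norminv):

```python
# One-sided versus two-sided 5% asymptotic critical values from N(0,1).
from scipy.stats import norm

alpha = 0.05
c_two = norm.ppf(1 - alpha / 2)   # about 1.96: reject if |T| > c_two
c_one = norm.ppf(1 - alpha)       # about 1.645: reject if T > c_one (upper one-sided)
```

The one-sided value is necessarily smaller, since it allocates all of $\alpha$ to a single tail.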
This may seem contrary to the way testing is presented in statistical textbooks which often focus on one-sided alternative hypotheses. The latter focus is primarily for pedagogy as the one-sided theoretical problem is cleaner and easier to understand.
Type II Error and Power
A false acceptance of the null hypothesis (accepting $\mathbb{H}_0$ when $\mathbb{H}_1$ is true) is called a Type II error. The rejection probability under the alternative hypothesis is called the power of the test, and equals 1 minus the probability of a Type II error:
$$\pi(\theta) = \Pr[\text{Reject } \mathbb{H}_0 \mid \mathbb{H}_1 \text{ true}] = \Pr[T > c \mid \mathbb{H}_1 \text{ true}].$$
We call $\pi(\theta)$ the power function, written as a function of $\theta$ to indicate its dependence on the true value of the parameter $\theta$.
In the dominant approach to hypothesis testing the goal of test construction is to have high power subject to the constraint that the size of the test is lower than the pre-specified significance level. Generally, the power of a test depends on the true value of the parameter , and for a well-behaved test the power is increasing both as moves away from the null hypothesis and as the sample size increases.
Given the two possible states of the world ($\mathbb{H}_0$ or $\mathbb{H}_1$) and the two possible decisions (Accept $\mathbb{H}_0$ or Reject $\mathbb{H}_0$) there are four possible pairings of states and decisions as depicted in Table 9.1.
Table 9.1: Hypothesis Testing Decisions

|                     | Accept $\mathbb{H}_0$ | Reject $\mathbb{H}_0$ |
|---------------------|-----------------------|-----------------------|
| $\mathbb{H}_0$ true | Correct Decision      | Type I Error          |
| $\mathbb{H}_1$ true | Type II Error         | Correct Decision      |
Given a test statistic $T$, increasing the critical value $c$ increases the acceptance region $S_0$ while decreasing the rejection region $S_1$. This decreases the likelihood of a Type I error (decreases the size) but increases the likelihood of a Type II error (decreases the power). Thus the choice of $c$ involves a trade-off between size and power. This is why the significance level of the test cannot be set arbitrarily small; otherwise the test will not have meaningful power.
It is important to consider the power of a test when interpreting hypothesis tests, as an overly narrow focus on size can lead to poor decisions. For example, it is easy to design a test which has perfect size yet has trivial power. Specifically, for any hypothesis we can use the following test: Generate a random variable $U \sim U[0,1]$ and reject $\mathbb{H}_0$ if $U < \alpha$. This test has exact size $\alpha$. Yet the test also has power precisely equal to $\alpha$. When the power of a test equals the size we say that the test has trivial power. Nothing is learned from such a test.
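The trivial-power example can be checked by simulation. In this sketch the randomized test's rejection frequency is close to $\alpha$ regardless of the data, so its power equals its size:

```python
# Monte Carlo check of the "trivial power" test: reject H0 whenever U < alpha,
# where U ~ U[0,1] is drawn independently of the data. Its rejection rate is
# alpha under both H0 and H1, so the test has size alpha and power alpha.
import random

random.seed(0)
alpha = 0.05
n_sim = 200_000
rejection_rate = sum(random.random() < alpha for _ in range(n_sim)) / n_sim
```

Since the data never enter the decision, the same rejection rate applies whether the null is true or false: nothing is learned.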
Statistical Significance
Testing requires a pre-selected choice of significance level $\alpha$, yet there is no objective scientific basis for the choice of $\alpha$. Nevertheless, the common practice is to set $\alpha = 0.05$ (5%). Alternative common values are $\alpha = 0.10$ and $\alpha = 0.01$. These choices are somewhat the by-product of traditional tables of critical values and statistical software.
The informal reasoning behind the 5% critical value is to ensure that Type I errors are relatively unlikely, so that the decision “Reject $\mathbb{H}_0$” has scientific strength, yet the test retains power against reasonable alternatives. The decision “Reject $\mathbb{H}_0$” means that the evidence is inconsistent with the null hypothesis in the sense that it is relatively unlikely (1 in 20) that data generated by the null hypothesis would yield the observed test result.
In contrast, the decision “Accept $\mathbb{H}_0$” is not a strong statement. It does not mean that the evidence supports $\mathbb{H}_0$, only that there is insufficient evidence to reject $\mathbb{H}_0$. Because of this it is more accurate to use the label “Do not Reject $\mathbb{H}_0$” instead of “Accept $\mathbb{H}_0$”.
When a test rejects $\mathbb{H}_0$ at the 5% significance level it is common to say that the statistic is statistically significant, and if the test accepts $\mathbb{H}_0$ it is common to say that the statistic is not statistically significant or that it is statistically insignificant. It is helpful to remember that this is simply a compact way of saying “Using the statistic $T$ the hypothesis $\mathbb{H}_0$ can [cannot] be rejected at the asymptotic 5% level.” Furthermore, when the null hypothesis $\mathbb{H}_0 : \theta = 0$ is rejected it is common to say that the coefficient $\theta$ is statistically significant, because the test has rejected the hypothesis that the coefficient is equal to zero.
Let us return to the example of the union wage premium as measured in Table 4.1. The absolute t-statistic for the coefficient on “Male Union Member” exceeds the asymptotic 5% critical value of 1.96, so we reject the hypothesis that union membership does not affect wages for men. In this case we can say that union membership is statistically significant for men. However, the absolute t-statistic for the coefficient on “Female Union Member” is less than 1.96, and therefore we do not reject the hypothesis that union membership does not affect wages for women. In this case we find that union membership for women is not statistically significant.
When a test accepts a null hypothesis (when a test is not statistically significant) a common misinterpretation is that this is evidence that the null hypothesis is true. This is incorrect. Failure to reject is by itself not evidence. Without an analysis of power we do not know the likelihood of making a Type II error and thus are uncertain. In our wage example it would be a mistake to write that “the regression finds that female union membership has no effect on wages”. This is an incorrect and most unfortunate interpretation. The test has failed to reject the hypothesis that the coefficient is zero but that does not mean that the coefficient is actually zero.
When a test rejects a null hypothesis (when a test is statistically significant) it is strong evidence against the hypothesis (because if the hypothesis were true then rejection is an unlikely event). Rejection should be taken as evidence against the null hypothesis. However, we can never conclude that the null hypothesis is indeed false as we cannot exclude the possibility that we are making a Type I error.
Perhaps more importantly, there is an important distinction between statistical and economic significance. If we correctly reject the hypothesis $\mathbb{H}_0 : \theta = 0$ it means that the true value of $\theta$ is non-zero. This includes the possibility that $\theta$ may be non-zero but close to zero in magnitude. This only makes sense if we interpret the parameters in the context of their relevant models. In our wage regression example we might consider wage effects of 1% or less as being “close to zero”. In a log wage regression this corresponds to a dummy variable with a coefficient less than 0.01. If the standard error is sufficiently small (less than 0.005) then a coefficient estimate of 0.01 will be statistically significant but not economically significant. This occurs frequently in applications with very large sample sizes where standard errors can be quite small.
The solution is to focus whenever possible on confidence intervals and the economic meaning of the coefficients. For example, if the coefficient estimate is 0.010 with a standard error of 0.002, then a 95% confidence interval would be $[0.006, 0.014]$, indicating that the true effect is likely between 0.6% and 1.4%, and hence is slightly positive but small. This is much more informative than the misleading statement “the effect is statistically positive”.
P-Values
Continuing with the wage regression estimates reported in Table 4.1, consider another question: Does marriage status affect wages? To test the hypothesis that marriage status has no effect on wages, we examine the t-statistics for the coefficients on “Married Male” and “Married Female” in Table 4.1, which are approximately 22 and 1.7, respectively. The first exceeds the asymptotic 5% critical value of 1.96, so we reject the hypothesis for men. The second is smaller than 1.96, so we fail to reject the hypothesis for women. Taking a second look at the statistics we see that the statistic for men (22) is exceptionally high while that for women (1.7) is only slightly below the critical value. Suppose that the t-statistic for women were slightly increased to 2.0. This is larger than the critical value, so would lead to the decision “Reject $\mathbb{H}_0$” rather than “Accept $\mathbb{H}_0$”. Should we really be making a different decision if the t-statistic is 2.0 rather than 1.7? The difference in values is small; shouldn’t the difference in the decision also be small? Thinking through these examples it seems unsatisfactory to simply report “Accept $\mathbb{H}_0$” or “Reject $\mathbb{H}_0$”. These two decisions do not summarize the evidence. Instead, the magnitude of the statistic $T$ suggests a “degree of evidence” against $\mathbb{H}_0$. How can we take this into account?
The answer is to report what is known as the asymptotic p-value
$$p = 1 - G(T).$$
Since the distribution function $G$ is monotonically increasing, the p-value is a monotonically decreasing function of $T$ and is an equivalent test statistic. Instead of rejecting at the significance level $\alpha$ if $T > c$, we can reject if $p < \alpha$. Thus it is sufficient to report $p$, and let the reader decide. In practice, the p-value is calculated numerically. For example, in MATLAB the command is 2*(1-normcdf(abs(t))).
It is instructive to interpret $p$ as the marginal significance level: the smallest value of $\alpha$ for which the test “rejects” the null hypothesis. That is, $p = 0.07$, say, means that the test rejects for all significance levels greater than 0.07, but fails to reject for significance levels less than 0.07.
Furthermore, the asymptotic p-value has a very convenient asymptotic null distribution. Since $T \xrightarrow{d} \xi$ under $\mathbb{H}_0$, then $p = 1 - G(T) \xrightarrow{d} 1 - G(\xi)$, which has the distribution
$$\Pr[1 - G(\xi) \le u] = \Pr[1 - u \le G(\xi)] = 1 - \Pr[\xi \le G^{-1}(1-u)] = 1 - G(G^{-1}(1-u)) = 1 - (1 - u) = u,$$
which is the uniform distribution on $[0,1]$. (This calculation assumes that $G(u)$ is strictly increasing, which is true for conventional asymptotic distributions such as the normal.) Thus $p \xrightarrow{d} U[0,1]$. This means that the “unusualness” of $p$ is easier to interpret than the “unusualness” of $T$.
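This uniformity is easy to check by simulation. A sketch, assuming the null t-statistic is exactly N(0,1):

```python
# Under H0 the t-statistic is approximately N(0,1), so the asymptotic p-value
# p = 2*(1 - Phi(|t|)) is approximately U[0,1]: about a fraction alpha of
# null p-values fall below any threshold alpha.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
t_null = rng.standard_normal(100_000)    # t-statistics generated under H0
p = 2 * (1 - norm.cdf(np.abs(t_null)))   # asymptotic p-values
share_below = np.mean(p < 0.05)          # close to 0.05
mean_p = float(np.mean(p))               # close to 0.5, the U[0,1] mean
```

A histogram of `p` would be approximately flat, which is exactly the $U[0,1]$ property derived above.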
An important caveat is that the p-value $p$ should not be interpreted as the probability that either hypothesis is true. A common mis-interpretation is that $p$ is the probability “that the null hypothesis is true.” This is incorrect. Rather, $p$ is the marginal significance level, a measure of the strength of information against the null hypothesis. For a t-statistic the p-value can be calculated either using the normal distribution or the student $t$ distribution, the latter presented in Section 5.12. p-values calculated using the student $t$ will be slightly larger, though the difference is small when the sample size is large.
Returning to our empirical example, for the test that the coefficient on “Married Male” is zero the p-value is essentially zero. This means that it would be nearly impossible to observe a t-statistic as large as 22 when the true value of the coefficient is zero. When presented with such evidence we can say that we “strongly reject” the null hypothesis, that the test is “highly significant”, or that “the test rejects at any conventional critical value”. In contrast, the p-value for the coefficient on “Married Female” is about 0.09. In this context it is typical to say that the test is “close to significant”, meaning that the p-value is larger than 0.05, but not too much larger.
A related but inferior empirical practice is to append asterisks to coefficient estimates or test statistics to indicate the level of significance. A common practice is to append a single asterisk (*) for an estimate or test statistic which exceeds the 10% critical value (i.e., is significant at the 10% level), a double asterisk (**) for a test which exceeds the 5% critical value, and a triple asterisk (***) for a test which exceeds the 1% critical value. Such a practice can be better than a table of raw test statistics as the asterisks permit a quick interpretation of significance. On the other hand, asterisks are inferior to p-values, which are also easy and quick to interpret. The goal is essentially the same, so it is wiser to report p-values whenever possible and avoid the use of asterisks.
Our recommendation is that the best empirical practice is to compute and report the asymptotic p-value rather than simply the test statistic $T$, the binary decision Accept/Reject, or appended asterisks. The p-value is a simple statistic, easy to interpret, and contains more information than the other choices.
We now summarize the main features of hypothesis testing.
1. Select a significance level $\alpha$.
2. Select a test statistic $T$ with asymptotic distribution $T \xrightarrow{d} \xi$ under $\mathbb{H}_0$.
3. Set the asymptotic critical value $c$ so that $1 - G(c) = \alpha$, where $G$ is the distribution function of $\xi$.
4. Calculate the asymptotic p-value $p = 1 - G(T)$.
5. Reject $\mathbb{H}_0$ if $T > c$, or equivalently $p < \alpha$.
6. Accept $\mathbb{H}_0$ if $T \le c$, or equivalently $p \ge \alpha$.
7. Report $p$ to summarize the evidence concerning $\mathbb{H}_0$ versus $\mathbb{H}_1$.
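The steps above can be collected into a single function. A sketch for the two-sided t-test, with scipy's normal cdf playing the role of $G$ and purely hypothetical inputs:

```python
# Complete asymptotic t-test recipe: statistic, critical value, p-value, decision.
from scipy.stats import norm

def t_test(theta_hat, theta_0, se, alpha=0.05):
    """Two-sided asymptotic t-test of H0: theta = theta_0."""
    T = abs(theta_hat - theta_0) / se   # step 2: test statistic
    c = norm.ppf(1 - alpha / 2)         # step 3: asymptotic critical value
    p = 2 * (1 - norm.cdf(T))           # step 4: asymptotic p-value
    return {"T": T, "crit": c, "p": p, "reject": bool(p < alpha)}

# Hypothetical estimate 0.10 with standard error 0.04:
result = t_test(0.10, 0.0, 0.04)
```

Reporting the `p` entry of the result, rather than only the `reject` flag, follows the recommendation above.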
t-ratios and the Abuse of Testing
Earlier we argued that good applied practice is to report coefficient estimates $\hat{\beta}$ and standard errors $s(\hat{\beta})$ for all coefficients of interest in estimated models. With $\hat{\beta}$ and $s(\hat{\beta})$ the reader can easily construct confidence intervals and t-statistics for hypotheses of interest.
Some applied papers (especially older ones) report t-ratios $t = \hat{\beta}/s(\hat{\beta})$ instead of standard errors. This is poor econometric practice. While the same information is being reported (you can back out standard errors by division, e.g. $s(\hat{\beta}) = \hat{\beta}/t$), standard errors are generally more helpful to readers than t-ratios. Standard errors help the reader focus on estimation precision and confidence intervals, while t-ratios focus attention on statistical significance. While statistical significance is important, it is less important than the parameter estimates themselves and their confidence intervals. The focus should be on the meaning of the parameter estimates, their magnitudes, and their interpretation, not on listing which variables have significant (e.g. non-zero) coefficients. In many modern applications sample sizes are very large, so standard errors can be very small. Consequently t-ratios can be large even if the coefficient estimates are economically small. In such contexts it may not be interesting to announce “The coefficient is non-zero!” Instead, what is interesting to announce is that “The coefficient estimate is economically interesting!”
In particular, some applied papers report coefficient estimates and t-ratios and limit their discussion of the results to describing which variables are “significant” (meaning that their t-ratios exceed 2) and the signs of the coefficient estimates. This is very poor empirical work and should be studiously avoided. It is also a recipe for banishment of your work to lower tier economics journals.
Fundamentally, the common t-ratio is a test for the hypothesis that a coefficient equals zero. This should be reported and discussed when this is an interesting economic hypothesis. But if this is not the case it is distracting.
One problem is that standard packages, such as Stata, by default report t-statistics and p-values for every estimated coefficient. While this can be useful (as a user doesn’t need to explicitly ask to test a desired coefficient) it can be misleading as it may unintentionally suggest that the entire list of t-statistics and p-values are important. Instead, a user should focus on tests of scientifically motivated hypotheses.
In general, when a coefficient $\theta$ is of interest it is constructive to focus on the point estimate, its standard error, and its confidence interval. The point estimate gives our “best guess” for the value. The standard error is a measure of precision. The confidence interval gives us the range of values consistent with the data. If the standard error is large then the point estimate is not a good summary of the value of $\theta$. The endpoints of the confidence interval describe the bounds on the likely possibilities. If the confidence interval embraces too broad a set of values for $\theta$ then the dataset is not sufficiently informative to render useful inferences about $\theta$. On the other hand, if the confidence interval is tight then the data have produced an accurate estimate and the focus should be on the value and interpretation of this estimate. In contrast, the statement “the t-ratio is highly significant” has little interpretive value.
The above discussion requires that the researcher knows what the coefficient means (in terms of the economic problem) and can interpret values and magnitudes, not just signs. This is critical for good applied econometric practice.
For example, consider the question about the effect of marriage status on mean log wages. We had found that the effect is “highly significant” for men and “close to significant” for women. Now, let’s construct asymptotic 95% confidence intervals for the coefficients. The interval for men shows that average wages for married men are substantially higher than for unmarried men, while the interval for women shows a difference of about 0-3%, which is small. These magnitudes are more informative than the results of the hypothesis tests.
Wald Tests
The t-test is appropriate when the null hypothesis is a real-valued restriction. More generally there may be multiple restrictions on the coefficient vector $\beta$. Suppose that we have $q > 1$ restrictions which can be written in the form (9.1), $\theta = r(\beta) = \theta_0 \in \mathbb{R}^q$. It is natural to estimate $\theta$ by the plug-in estimator $\hat{\theta} = r(\hat{\beta})$. To test $\mathbb{H}_0 : \theta = \theta_0$ against $\mathbb{H}_1 : \theta \neq \theta_0$ one approach is to measure the magnitude of the discrepancy $\hat{\theta} - \theta_0$. As this is a vector there is more than one measure of its length. One simple measure is the weighted quadratic form known as the Wald statistic. This is (7.37) evaluated at the null hypothesis:
$$W = (\hat{\theta} - \theta_0)' \hat{V}_{\hat{\theta}}^{-1} (\hat{\theta} - \theta_0) \qquad (9.6)$$
where $\hat{V}_{\hat{\theta}} = \hat{R}' \hat{V}_{\hat{\beta}} \hat{R}$ is an estimator of the variance of $\hat{\theta}$ and $\hat{R} = \frac{\partial}{\partial \beta} r(\hat{\beta})'$. Notice that we can write $W$ alternatively as
$$W = n (\hat{\theta} - \theta_0)' \hat{V}_{\theta}^{-1} (\hat{\theta} - \theta_0)$$
using the asymptotic variance estimator $\hat{V}_{\theta} = n \hat{V}_{\hat{\theta}}$, or we can write it directly as a function of $\hat{\beta}$ as
$$W = (r(\hat{\beta}) - \theta_0)' (\hat{R}' \hat{V}_{\hat{\beta}} \hat{R})^{-1} (r(\hat{\beta}) - \theta_0).$$
Also, when $r(\beta) = R'\beta$ is a linear function of $\beta$, the Wald statistic simplifies to
$$W = (R'\hat{\beta} - \theta_0)' (R' \hat{V}_{\hat{\beta}} R)^{-1} (R'\hat{\beta} - \theta_0).$$
The Wald statistic $W$ is a weighted Euclidean measure of the length of the vector $\hat{\theta} - \theta_0$. When $q = 1$ then $W = T^2$, the square of the t-statistic, so hypothesis tests based on $W$ and $|T|$ are equivalent. The Wald statistic (9.6) is a generalization of the t-statistic to the case of multiple restrictions. As the Wald statistic is symmetric in the argument $\hat{\theta} - \theta_0$ it treats positive and negative alternatives symmetrically. Thus the inherent alternative is always two-sided.
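Numerically, the Wald statistic is a quadratic form in the discrepancy vector. A sketch with a hypothetical two-dimensional restriction (all numbers invented for illustration):

```python
# Wald statistic W = (theta_hat - theta_0)' V_hat^{-1} (theta_hat - theta_0)
# for a hypothetical q = 2 restriction theta = (0, 0).
import numpy as np

theta_hat = np.array([0.10, 0.02])   # hypothetical plug-in estimates r(beta_hat)
theta_0 = np.zeros(2)                # hypothesized values under H0
V_hat = np.array([[4e-4, 1e-4],      # hypothetical covariance matrix of theta_hat
                  [1e-4, 4e-4]])

d = theta_hat - theta_0
# Solving the linear system avoids forming the explicit inverse of V_hat.
W = float(d @ np.linalg.solve(V_hat, d))
```

When $q = 1$ this reduces to $(\hat{\theta} - \theta_0)^2 / \hat{V}$, the square of the t-statistic.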
As shown in Theorem 7.13, when $\beta$ satisfies $r(\beta) = \theta_0$ then $W \xrightarrow{d} \chi^2_q$, a chi-square random variable with $q$ degrees of freedom. Let $G_q(u)$ denote the $\chi^2_q$ distribution function. For a given significance level $\alpha$ the asymptotic critical value $c$ satisfies $\alpha = 1 - G_q(c)$. For example, the 5% critical values for $q = 1$, $q = 2$, and $q = 3$ are 3.84, 5.99, and 7.81, respectively, and in general the level $\alpha$ critical value can be calculated in MATLAB as chi2inv(1-alpha,q). An asymptotic test rejects $\mathbb{H}_0$ in favor of $\mathbb{H}_1$ if $W > c$. As with t-tests, it is conventional to describe a Wald test as “significant” if $W$ exceeds the 5% asymptotic critical value.
Theorem 9.2 Under Assumptions 7.2, 7.3, 7.4, and $\mathbb{H}_0 : \theta = \theta_0$, then $W \xrightarrow{d} \chi^2_q$. For $c$ satisfying $\alpha = 1 - G_q(c)$,
$$\Pr[W > c \mid \mathbb{H}_0] \to \alpha$$
so the test “Reject $\mathbb{H}_0$ if $W > c$” has asymptotic size $\alpha$.
Notice that the asymptotic distribution in Theorem 9.2 depends solely on $q$, the number of restrictions being tested. It does not depend on $k$, the number of parameters estimated.
The asymptotic p-value for $W$ is $p = 1 - G_q(W)$, and this is particularly useful when testing multiple restrictions. For example, if you write that a Wald test on eight restrictions ($q = 8$) has the value $W = 11.2$ it is difficult for a reader to assess the magnitude of this statistic unless they have quick access to a statistical table or software. Instead, if you write that the p-value is 0.19 (as is the case for $W = 11.2$ and $q = 8$) then it is simple for a reader to interpret its magnitude as “insignificant”. To calculate the asymptotic p-value for a Wald statistic in MATLAB use the command 1-chi2cdf(w,q).
Some packages (including Stata) and papers report F versions of Wald statistics. For any Wald statistic $W$ which tests a $q$-dimensional restriction, the F version of the test is
$$F = W / q.$$
When $F$ is reported, it is conventional to use $F_{q,\,n-k}$ critical values and p-values rather than $\chi^2_q$ values. The connection between Wald and F statistics is demonstrated later, where we show that when Wald statistics are calculated using a homoskedastic covariance matrix then $F = W/q$ is identical to the F statistic of (5.19). While there is no formal justification for using the $F$ distribution for non-homoskedastic covariance matrices, the $F$ distribution provides continuity with the exact distribution theory under normality and is a bit more conservative than the $\chi^2_q$ distribution. (Furthermore, the difference is small when $n - k$ is moderately large.)
To implement a test of zero restrictions in Stata an easy method is to use the command test X1 X2 where X1 and X2 are the names of the variables whose coefficients are hypothesized to equal zero. The F version of the Wald statistic is reported, using the covariance matrix calculated by the method specified in the regression command. A p-value is reported, calculated using the $F$ distribution.
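The chi-square and F calculations described here can be sketched with scipy (the text's MATLAB equivalents are chi2inv and 1-chi2cdf; the values of W, n, and k below are hypothetical):

```python
# Chi-square critical value, chi-square p-value, and the F version F = W/q.
from scipy.stats import chi2, f

alpha, q = 0.05, 2
n, k = 1000, 10                    # hypothetical sample size and regressor count
c = chi2.ppf(1 - alpha, q)         # 5% chi-square critical value, about 5.99

W = 8.5                            # hypothetical Wald statistic
p_chi2 = 1 - chi2.cdf(W, q)        # chi-square p-value
F_stat = W / q                     # F version of the Wald statistic
p_F = 1 - f.cdf(F_stat, q, n - k)  # F p-value: slightly larger, more conservative
```

With $n - k$ this large the two p-values are very close, illustrating the remark that the difference is small when $n - k$ is moderately large.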
To illustrate, consider the empirical results presented in Table 4.1. The hypothesis “Union membership does not affect wages” is the joint restriction that the coefficients on both “Male Union Member” and “Female Union Member” are zero. We calculate the Wald statistic for this joint hypothesis and find a p-value far below 0.05. Thus we reject the null hypothesis in favor of the alternative that at least one of the coefficients is non-zero. This does not mean that both coefficients are non-zero, just that one of the two is non-zero. Therefore examining both the joint Wald statistic and the individual t-statistics is useful for interpretation.
As a second example from the same regression, take the hypothesis that marriage status has no effect on mean wages for women. This is the joint restriction that the coefficients on “Married Female” and “Formerly Married Female” are zero. The Wald statistic for this hypothesis has a p-value slightly smaller than 0.05. Such a p-value is typically called “marginally significant”.
The Wald statistic was proposed by Wald (1943).

Homoskedastic Wald Tests
If the error is known to be homoskedastic then it is appropriate to use the homoskedastic Wald statistic (7.38), which replaces $\hat{V}_{\hat{\beta}}$ with the homoskedastic estimator $\hat{V}^0_{\hat{\beta}} = (X'X)^{-1} s^2$. This statistic equals
$$W^0 = (\hat{\theta} - \theta_0)' (\hat{R}' \hat{V}^0_{\hat{\beta}} \hat{R})^{-1} (\hat{\theta} - \theta_0).$$
In the case of linear hypotheses $\mathbb{H}_0 : R'\beta = \theta_0$ we can write this as
$$W^0 = (R'\hat{\beta} - \theta_0)' \left(R'(X'X)^{-1}R\right)^{-1} (R'\hat{\beta} - \theta_0) / s^2.$$
We call $W^0$ a homoskedastic Wald statistic as it is appropriate when the errors are conditionally homoskedastic.
When $q = 1$ then $W^0 = T^2$, the square of the t-statistic, where the latter is computed with a homoskedastic standard error.
Theorem 9.3 Under Assumptions 7.2 and 7.3, $\mathbb{E}[e^2 \mid X] = \sigma^2 > 0$, and $\mathbb{H}_0 : \theta = \theta_0$, then $W^0 \xrightarrow{d} \chi^2_q$. For $c$ satisfying $\alpha = 1 - G_q(c)$,
$$\Pr[W^0 > c \mid \mathbb{H}_0] \to \alpha$$
so the test “Reject $\mathbb{H}_0$ if $W^0 > c$” has asymptotic size $\alpha$.
Criterion-Based Tests
The Wald statistic is based on the length of the vector : the discrepancy between the estimator and the hypothesized value . An alternative class of tests is based on the discrepancy between the criterion function minimized with and without the restriction.
Criterion-based testing applies when we have a criterion function, say with , which is minimized for estimation, and the goal is to test versus where . Minimizing the criterion function over and we obtain the unrestricted and restricted estimators
The criterion-based statistic for versus is proportional to
The criterion-based statistic is sometimes called a distance statistic, a minimum-distance statistic, or a likelihood-ratio-like statistic.
Since is a subset of , we have , and thus . The statistic measures the cost on the criterion of imposing the null restriction .
Minimum Distance Tests
The minimum distance test is based on the minimum distance criterion (8.19)
with the unrestricted least squares estimator. The restricted estimator minimizes (9.8) subject to . Observing that , the minimum distance statistic simplifies to
The efficient minimum distance estimator is obtained by setting in (9.8) and (9.9). The efficient minimum distance statistic for is therefore
Consider the class of linear hypotheses . In this case we know from (8.25) that the efficient minimum distance estimator subject to the constraint is
and thus
Substituting into (9.10) we find
which is the Wald statistic (9.6).
Thus for linear hypotheses , the efficient minimum distance statistic is identical to the Wald statistic (9.6). For nonlinear hypotheses, however, the Wald and minimum distance statistics are different.
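The equivalence for linear hypotheses can be checked numerically in a few lines. The Python sketch below uses random hypothetical values (with the sample-size scaling absorbed into the covariance matrix, which is harmless since it multiplies both statistics identically) to verify that the efficient minimum distance statistic and the Wald statistic coincide under a linear restriction.

```python
import numpy as np

rng = np.random.default_rng(1)
k, q = 3, 2                      # parameters and restrictions (hypothetical)
A = rng.standard_normal((k, k))
V = A @ A.T                      # a positive-definite covariance estimate
R = rng.standard_normal((k, q))  # linear hypothesis R'b = c
b = rng.standard_normal(k)       # unrestricted estimator
c = np.zeros(q)

# Efficient minimum distance estimator under R'b = c (the (8.25)-style formula):
b_md = b - V @ R @ np.linalg.solve(R.T @ V @ R, R.T @ b - c)

# Minimum distance statistic (b - b_md)' V^{-1} (b - b_md) ...
J = (b - b_md) @ np.linalg.solve(V, b - b_md)
# ... equals the Wald statistic (R'b - c)' (R'VR)^{-1} (R'b - c):
W = (R.T @ b - c) @ np.linalg.solve(R.T @ V @ R, R.T @ b - c)
```

The two statistics agree to numerical precision, which is exactly the algebraic identity derived above; for a nonlinear restriction they would not.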
Newey and West (1987a) established the asymptotic null distribution of .
Theorem 9.4 Under Assumptions , and .
Testing using the minimum distance statistic is similar to testing using the Wald statistic . Critical values and p-values are computed using the distribution. is rejected in favor of if exceeds the level critical value, which can be calculated in MATLAB as chi2inv. The asymptotic p-value is . In MATLAB, use the command .
We now demonstrate Theorem 9.4. The conditions of Theorem hold, because implies Assumption 8.1. From (8.54) with , we see that
where . Thus
as claimed.
Minimum Distance Tests Under Homoskedasticity
If we set in (9.8) we obtain the criterion (8.20)
A minimum distance statistic for is
Equation (8.21) showed that . So the minimizers of and are identical. Thus the constrained minimizer of is constrained least squares
and therefore
In the special case of linear hypotheses , the constrained least squares estimator subject to has the solution (8.9)
and solving we find
This is the homoskedastic Wald statistic (9.7). Thus for testing linear hypotheses, homoskedastic minimum distance and Wald statistics agree.
For nonlinear hypotheses they disagree, but have the same null asymptotic distribution.
Theorem 9.5 Under Assumptions and , and , then
F Tests
In Section we introduced the test for exclusion restrictions in the normal regression model. In this section we generalize this test to a broader set of restrictions. Let be a constrained parameter space which imposes restrictions on .
Let be the unrestricted least squares estimator and let be the associated estimator of . Let be the CLS estimator (9.11) satisfying and let be the associated estimator of . The statistic for testing is
We can alternatively write
where is the sum-of-squared errors.
This shows that is a criterion-based statistic. Using (8.21) we can also write , so the statistic is identical to the homoskedastic minimum distance statistic divided by the number of restrictions .
As we discussed in the previous section, in the special case of linear hypotheses . It follows that in this case . Thus for linear restrictions the statistic equals the homoskedastic Wald statistic divided by . It follows that they are equivalent tests for against .

Theorem 9.6 For tests of linear hypotheses , the statistic equals where is the homoskedastic Wald statistic. Thus under 7.2, , and , then .
When using an statistic it is conventional to use the distribution for critical values and p-values. Critical values are given in MATLAB by finv and p-values by . Alternatively, the distribution can be used, using chi2inv and , respectively. Using the distribution is a prudent small sample adjustment: it yields exact answers if the errors are normal, and otherwise slightly increases the critical values and p-values relative to the asymptotic approximation. Once again, if the sample size is small enough that the choice makes a difference then probably we shouldn’t be trusting the asymptotic approximation anyway!
An elegant feature of (9.12) and (9.13) is that they are directly computable from the standard output of two simple OLS regressions, as the sum of squared errors (or regression variance) is a typical printed output from statistical packages and is often reported in applied tables. Thus can be calculated by hand from standard reported statistics even if you don’t have the original data (or if you are sitting in a seminar and listening to a presentation!).
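This "two regressions" calculation can be sketched as follows. The Python snippet uses the standard formula F = ((SSE_restricted − SSE_unrestricted)/q) / (SSE_unrestricted/(n − k)); all the numerical inputs are hypothetical values of the kind one might read off two regression outputs.

```python
from scipy import stats

def f_statistic(sse_restricted, sse_unrestricted, q, n, k):
    """F statistic from the sums of squared errors of the restricted and
    unrestricted regressions: q restrictions, n observations, k regressors
    in the unrestricted model. Returns the statistic and its F p-value."""
    F = ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / (n - k))
    pval = stats.f.sf(F, q, n - k)
    return F, pval

# Hypothetical numbers read off two regression outputs:
F, pval = f_statistic(sse_restricted=120.0, sse_unrestricted=100.0,
                      q=2, n=104, k=4)
```

Nothing beyond the two sums of squared errors, the number of restrictions, and the degrees of freedom is needed, which is why the statistic can be computed from published tables alone.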
If you are presented with an statistic (or a Wald statistic, as you can just divide by ) but don’t have access to critical values, a useful rule of thumb is that for large the asymptotic critical value is decreasing as increases and is less than 2 for .
A word of warning: In many statistical packages when an OLS regression is estimated an “F-statistic” is automatically reported even though no hypothesis test was requested. What the package is reporting is an statistic of the hypothesis that all slope coefficients are zero. This was a popular statistic in the early days of econometric reporting when sample sizes were very small and researchers wanted to know if there was “any explanatory power” to their regression. This is rarely an issue today as sample sizes are typically sufficiently large that this statistic is nearly always highly significant. While there are special cases where this statistic is useful these cases are not typical. As a general rule there is no reason to report this statistic.
Hausman Tests
Hausman (1978) introduced a general idea about how to test a hypothesis . If you have two estimators, one which is efficient under but inconsistent under , and another which is consistent under , then construct a test as a quadratic form in the differences of the estimators. In the case of testing a hypothesis let denote the unconstrained least squares estimator and let denote the efficient minimum distance estimator which imposes . Both estimators are consistent under but is asymptotically efficient. Under is consistent for but is inconsistent. The difference has the asymptotic distribution
Let denote the Moore-Penrose generalized inverse. The Hausman statistic for is
All coefficients except the intercept. The matrix is idempotent, so its generalized inverse is itself. (See Section A.11.) It follows that
Thus the Hausman statistic is
In the context of linear restrictions, and so the statistic takes the form
which is precisely the Wald statistic. With nonlinear restrictions and can differ.
In either case we see that the asymptotic null distribution of the Hausman statistic is , so the appropriate test is to reject in favor of if where is a critical value taken from the distribution.
Theorem 9.7 For general hypotheses the Hausman test statistic is
Under Assumptions , and
Score Tests
Score tests are traditionally derived in likelihood analysis but can more generally be constructed from first-order conditions evaluated at restricted estimates. We focus on the likelihood derivation.
Given the log likelihood function , a restriction , and restricted estimators and , the score statistic for is defined as
The idea is that if the restriction is true then the restricted estimators should be close to the maximum of the log-likelihood where the derivative is zero. However if the restriction is false then the restricted estimators should be distant from the maximum and the derivative should be large. Hence small values of are expected under and large values under . Tests of reject for large values of .
We explore the score statistic in the context of the normal regression model and linear hypotheses . Recall that in the normal regression log-likelihood function is

The constrained MLE under linear hypotheses is constrained least squares
We can calculate that the derivative and Hessian are
Since we can further calculate that
Together we find that
This is identical to the homoskedastic Wald statistic with replaced by . We can also write as a monotonic transformation of the statistic, as
The test “Reject for large values of ” is identical to the test “Reject for large values of ” so they are identical tests. Since for the normal regression model the exact distribution of is known, it is better to use the statistic with p-values.
In more complicated settings a potential advantage of score tests is that they are calculated using the restricted parameter estimates rather than the unrestricted estimates . Thus when is relatively easy to calculate there can be a preference for score statistics. This is not a concern for linear restrictions.
More generally, score and score-like statistics can be constructed from first-order conditions evaluated at restricted parameter estimates. Also, when test statistics are constructed using covariance matrix estimators which are calculated using restricted parameter estimates (e.g. restricted residuals) then these are often described as score tests.
An example of the latter is the Wald-type statistic
where the covariance matrix estimate is calculated using the restricted residuals . This may be a good choice when and are high-dimensional as in this context there may be worry that the estimator is imprecise.
Problems with Tests of Nonlinear Hypotheses
While the and Wald tests work well when the hypothesis is a linear restriction on , they can work quite poorly when the restrictions are nonlinear. This can be seen by a simple example introduced by Lafontaine and White (1986). Take the model and consider the hypothesis . Let and be the sample mean and variance of . The standard Wald statistic to test is
Notice that is equivalent to the hypothesis for any positive integer . Letting , and noting , we find that the Wald statistic to test is
While the hypothesis is unaffected by the choice of , the statistic varies with . This is an unfortunate feature of the Wald statistic.
To demonstrate this effect, we have plotted in Figure the Wald statistic as a function of , setting . The increasing line is for the case . The decreasing line is for the case . It is easy to see that in each case there are values of for which the test statistic is significant relative to asymptotic critical values, while there are other values of for which the test statistic is insignificant. This is distressing because the choice of is arbitrary and irrelevant to the actual hypothesis.
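The dependence of the statistic on the exponent is easy to reproduce. The sketch below uses the delta-method form of the Wald statistic for the hypothesis that the parameter raised to the power s equals 1 (the gradient of θ^s is s·θ^(s−1)); the values of the estimate, the variance estimate, and the sample size are hypothetical.

```python
import numpy as np

def wald_s(theta_hat, v_hat, n, s):
    """Delta-method Wald statistic for H0: theta^s = 1.
    W_s = n (theta^s - 1)^2 / (s^2 theta^(2(s-1)) v)."""
    return n * (theta_hat**s - 1.0)**2 / (s**2 * theta_hat**(2 * (s - 1)) * v_hat)

# The hypothesis theta = 1 is the same for every s, yet the statistic is not.
# Hypothetical estimate below the null value, unit variance estimate, n = 100:
theta_hat, v_hat, n = 0.8, 1.0, 100
stats_by_s = {s: wald_s(theta_hat, v_hat, n, s) for s in (1, 2, 4, 8)}
```

With an estimate below 1 the statistic grows steadily as s increases, so the same data and the same hypothesis can appear insignificant or strongly significant depending on an arbitrary algebraic choice.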
Our first-order asymptotic theory is not useful to help pick , as under for any . This is a context where Monte Carlo simulation can be quite useful as a tool to study and compare the exact distributions of statistical procedures in finite samples. The method uses random simulation to create artificial datasets to which we apply the statistical tools of interest. This produces random draws from the statistic’s sampling distribution. Through repetition, features of this distribution can be calculated.
In the present context of the Wald statistic, one feature of importance is the Type I error of the test using the asymptotic critical value - the probability of a false rejection, . Given the simplicity of the model this probability depends only on , and . In Table we report the results of a Monte Carlo simulation where we vary these three parameters. The value of is varied from 1 to , is varied among 20, 100, and 500, and is varied among 1 and 3. The table reports the simulation estimate of the Type I error probability from 50,000 random samples. Each row of the table corresponds to a different value of - and thus corresponds to a particular choice of test statistic. The second through seventh columns contain the Type I error probabilities for different combinations of and . These probabilities are calculated as the percentage of the 50,000 simulated Wald statistics which are larger than 3.84. The null hypothesis is true so these probabilities are Type I error.
To interpret the table remember that the ideal Type I error probability is with deviations indicating distortion. Type I error rates between and are considered reasonable. Error rates above are considered excessive. Rates above are unacceptable. When comparing statistical procedures we compare the rates row by row, looking for tests for which rejection rates are close to and rarely fall outside of the range. For this particular example the only test which meets this criterion is the conventional test. Any other leads to a test with unacceptable Type I error probabilities.
In Table you can also see the impact of variation in sample size. In each case the Type I error probability improves towards as the sample size increases. There is, however, no magic choice of for which all tests perform uniformly well. Test performance deteriorates as increases which is not surprising given the dependence of on as shown in Figure 9.2.

Figure 9.2: Wald Statistic as a Function of
In this example it is not surprising that the choice yields the best test statistic. Other choices are arbitrary and would not be used in practice. While this is clear in this particular example, in other examples natural choices are not obvious and the best choices may be counter-intuitive.
This point can be illustrated through an example based on Gregory and Veall (1985). Take the model
and the hypothesis where is a known constant. Equivalently, define so the hypothesis can be stated as .
Let be the least squares estimator of , let be an estimator of the covariance matrix for and set . Define
Table 9.2: Type I Error Probability of Asymptotic Test

Rejection frequencies from 50,000 simulated random samples.
so that the standard error for is . In this case a t-statistic for is
An alternative statistic can be constructed through reformulating the null hypothesis as
A t-statistic based on this formulation of the hypothesis is
where
To compare and we perform another simple Monte Carlo simulation. We let and be mutually independent variables, be an independent draw with , and normalize and . This leaves as a free parameter along with sample size . We vary among , , and and among 100 and 500 .
The one-sided Type I error probabilities and are calculated from 50,000 simulated samples. The results are presented in Table 9.3. Ideally, the entries in the table should be . However, the rejection rates for the statistic diverge greatly from this value, especially for small values of . The left tail probabilities greatly exceed , while the right tail probabilities are close to zero in most cases. In contrast, the rejection rates for the statistic are invariant to the value of and equal for both sample sizes. The implication of Table is that the two t-ratios have dramatically different sampling behavior.
The common message from both examples is that Wald statistics are sensitive to the algebraic formulation of the null hypothesis.

Table 9.3: Type I Error Probability of Asymptotic 5% t-tests

Rejection frequencies from 50,000 simulated random samples.
A simple solution is to use the minimum distance statistic which equals with in the first example, and in the second example. The minimum distance statistic is invariant to the algebraic formulation of the null hypothesis so is immune to this problem. Whenever possible, the Wald statistic should not be used to test nonlinear hypotheses.
Theoretical investigations of these issues include Park and Phillips (1988) and Dufour (1997).
Monte Carlo Simulation
In Section we introduced the method of Monte Carlo simulation to illustrate the small sample problems with tests of nonlinear hypotheses. In this section we describe the method in more detail.
Recall, our data consist of observations which are random draws from a population distribution . Let be a parameter and let be a statistic of interest, for example an estimator or a t-statistic . The exact distribution of is
While the asymptotic distribution of might be known, the exact (finite sample) distribution is generally unknown.
Monte Carlo simulation uses numerical simulation to compute for selected choices of . This is useful to investigate the performance of the statistic in reasonable situations and sample sizes. The basic idea is that for any given the distribution function can be calculated numerically through simulation. The name Monte Carlo derives from the Mediterranean gambling resort where games of chance are played.
The method of Monte Carlo is simple to describe. The researcher chooses (the distribution of the pseudo data) and the sample size . A “true” value of is implied by this choice, or equivalently the value is selected directly by the researcher which implies restrictions on .
Then the following experiment is conducted by computer simulation:
1. independent random pairs , are drawn from the distribution using the computer’s random number generator.
2. The statistic is calculated on this pseudo data.
For step 1, computer packages have built-in random number procedures including and . From these most random variables can be constructed. (For example, a chi-square can be generated by sums of squares of normals.) For step 2, it is important that the statistic be evaluated at the “true” value of corresponding to the choice of .
The above experiment creates one random draw from the distribution . This is one observation from an unknown distribution. Clearly, from one observation very little can be said. So the researcher repeats the experiment times where is a large number. Typically, we set . We will discuss this choice later.
Notationally, let the experiment result in the draw . These results are stored. After all experiments have been calculated these results constitute a random sample of size from the distribution of .
From a random sample we can estimate any feature of interest using (typically) a method of moments estimator. We now describe some specific examples.
Suppose we are interested in the bias, mean-squared error (MSE), and/or variance of the distribution of . We then set , run the above experiment, and calculate
Suppose we are interested in the Type I error associated with an asymptotic 5% two-sided t-test. We would then set and calculate
the percentage of the simulated t-ratios which exceed the asymptotic critical value.
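A minimal end-to-end version of this calculation can be written in a few lines. The Python sketch below estimates the Type I error of the asymptotic 5% two-sided t-test for a zero mean, using standard normal pseudo data (so the null is true by construction); the sample size and number of replications are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_type1_error(n, B=10_000, crit=1.96):
    """Monte Carlo estimate of the Type I error of an asymptotic 5%
    two-sided t-test of H0: mean = 0, with standard normal data (H0 true)."""
    rejections = 0
    for _ in range(B):
        y = rng.standard_normal(n)                  # pseudo data under H0
        t = np.sqrt(n) * y.mean() / y.std(ddof=1)   # t-ratio for the mean
        rejections += abs(t) > crit                 # count false rejections
    return rejections / B

p_hat = mc_type1_error(n=50)
```

The resulting rejection frequency is an estimate of the exact finite-sample Type I error; it should be near, but not exactly equal to, the nominal 5%, both because of simulation noise and because the asymptotic critical value is only an approximation at this sample size.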
Suppose we are interested in the and quantile of or . We then compute the and sample quantiles of the sample . For details on quantile estimation see Section of Probability and Statistics for Economists.
The typical purpose of a Monte Carlo simulation is to investigate the performance of a statistical procedure in realistic settings. Generally, the performance will depend on and . In many cases an estimator or test may perform wonderfully for some values and poorly for others. It is therefore useful to conduct a variety of experiments for a selection of choices of and .
As discussed above the researcher must select the number of experiments . Often this is called the number of replications. Quite simply, a larger results in more precise estimates of the features of interest of but requires more computational time. In practice, therefore, the choice of is often guided by the computational demands of the statistical procedure. Since the results of a Monte Carlo experiment are estimates computed from a random sample of size it is straightforward to calculate standard errors for any quantity of interest. If the standard error is too large to make a reliable inference then will have to be increased. A useful rule-of-thumb is to set whenever possible.
In particular, it is simple to make inferences about rejection probabilities from statistical tests, such as the percentage estimate reported in (9.15). The random variable is i.i.d. Bernoulli, equalling 1 with probability . The average (9.15) is therefore an unbiased estimator of with standard error . As is unknown, this may be approximated by replacing with or with an hypothesized value. For example, if we are assessing an asymptotic test, then we can set . Hence, standard errors for , and 5000, are, respectively, , and .

Most papers in econometric methods and some empirical papers include the results of Monte Carlo simulations to illustrate the performance of their methods. When extending existing results it is good practice to start by replicating existing (published) results. This may not be exactly possible in the case of simulation results as they are inherently random. For example, suppose a paper investigates a statistical test and reports a simulated rejection probability of based on a simulation with replications. Suppose you attempt to replicate this result and find a rejection probability of (again using simulation replications). Should you conclude that you have failed in your attempt? Absolutely not! Under the hypothesis that both simulations are identical you have two independent estimates, and , of a common probability . The asymptotic (as ) distribution of their difference is , so a standard error for is , using the estimate . Since the t-ratio is not statistically significant it is incorrect to reject the null hypothesis that the two simulations are identical. The difference between the results and is consistent with random variation.
What should be done? The first mistake was to copy the previous paper’s choice of . Instead, suppose you set and now obtain . Then and a standard error is . Still we cannot reject the hypothesis that the two simulations are identical. Even though the estimates ( and ) appear to be quite different, the difficulty is that the original simulation used a very small number of replications so the reported estimate is quite imprecise. In this case it is appropriate to conclude that your results “replicate” the previous study as there is no statistical evidence to reject the hypothesis that they are equivalent.
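Both calculations — the standard error of a simulated rejection frequency, sqrt(p(1−p)/B), and the t-ratio for comparing two independent simulation estimates — can be sketched directly. The numbers passed in below are hypothetical.

```python
import math

def mc_standard_error(p, B):
    """Standard error of a simulated rejection frequency p based on B
    i.i.d. Bernoulli replications: sqrt(p(1-p)/B)."""
    return math.sqrt(p * (1.0 - p) / B)

# Precision of a nominal 5% rejection rate at several replication counts:
ses = {B: mc_standard_error(0.05, B) for B in (100, 1000, 5000)}

def diff_t_ratio(p1, p2, B):
    """t-ratio for the difference of two independent simulation estimates
    of a common probability, each based on B replications."""
    p = 0.5 * (p1 + p2)                      # pooled estimate of p
    return (p1 - p2) / math.sqrt(2.0 * p * (1.0 - p) / B)

t = diff_t_ratio(0.08, 0.04, 100)            # hypothetical replication check
```

Even rejection rates that look quite different (here 8% versus 4%) produce an insignificant t-ratio when each is based on only 100 replications, which is the point made in the text.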
Most journals have policies requiring authors to make available their data sets and computer programs required for empirical results. Most do not have similar policies regarding simulations. Nevertheless, it is good professional practice to make your simulations available. The best practice is to post your simulation code on your webpage. This invites others to build on and use your results, leading to possible collaboration, citation, and/or advancement.
Confidence Intervals by Test Inversion
There is a close relationship between hypothesis tests and confidence intervals. We observed in Section that the standard asymptotic confidence interval for a parameter is
That is, we can describe as “The point estimate plus or minus 2 standard errors” or “The set of parameter values not rejected by a two-sided t-test.” The second definition, known as test statistic inversion, is a general method for finding confidence intervals, and typically produces confidence intervals with excellent properties.
Given a test statistic and critical value , the acceptance region “Accept if ” is identical to the confidence interval . Since the regions are identical the probability of coverage equals the probability of correct acceptance Accept which is exactly 1 minus the Type I error probability. Thus inverting a test with good Type I error probabilities yields a confidence interval with good coverage probabilities.
Now suppose that the parameter of interest is a nonlinear function of the coefficient vector . In this case the standard confidence interval for is the set as in (9.16) where is the point estimator and is the delta method standard error. This confidence interval is inverting the t-test based on the nonlinear hypothesis . The trouble is that in Section we learned that there is no unique t-statistic for tests of nonlinear hypotheses and that the choice of parameterization matters greatly. For example, if then the coverage probability of the standard interval (9.16) is 1 minus the probability of the Type I error, which as shown in Table can be far from the nominal .
In this example a good solution is the same as discussed in Section - to rewrite the hypothesis as a linear restriction. The hypothesis is the same as . The t-statistic for this restriction is
where
and is the covariance matrix for . A 95% confidence interval for is the set of values of such that . Since is a nonlinear function of one method to find the confidence set is grid search over .
For example, in the wage equation
the highest expected wage occurs at experience . From Table we have the point estimate and we can calculate the standard error for a 95% confidence interval . However, if we instead invert the linear form of the test we numerically find the interval which is much larger. From the evidence presented in Section we know the first interval can be quite inaccurate and the second interval is greatly preferred.
Multiple Tests and Bonferroni Corrections
In most applications economists examine a large number of estimates, test statistics, and p-values. What does it mean (or does it mean anything) if one statistic appears to be “significant” after examining a large number of statistics? This is known as the problem of multiple testing or multiple comparisons.
To be specific, suppose we examine a set of coefficients, standard errors and t-ratios, and consider the “significance” of each statistic. Based on conventional reasoning, for each coefficient we would reject the hypothesis that the coefficient is zero with asymptotic size if the absolute t-statistic exceeds the critical value of the normal distribution, or equivalently if the -value for the t-statistic is smaller than . If we observe that one of the statistics is “significant” based on this criterion, that means that one of the p-values is smaller than , or equivalently, that the smallest p-value is smaller than . We can then rephrase the question: Under the joint hypothesis that a set of hypotheses are all true, what is the probability that the smallest -value is smaller than ? In general, we cannot provide a precise answer to this question, but the Bonferroni correction bounds this probability by . The Bonferroni method furthermore suggests that if we want the familywise error probability (the probability that one of the tests falsely rejects) to be bounded below , then an appropriate rule is to reject only if the smallest p-value is smaller than . Equivalently, the Bonferroni familywise -value is .
Formally, suppose we have hypotheses . For each we have a test and associated pvalue with the property that when is true . We then observe that among the tests, one of the is “significant” if . This event can be written as
Boole’s inequality states that for any events . Thus
as stated. This demonstrates that the asymptotic familywise rejection probability is at most times the individual rejection probability.
Furthermore,
This demonstrates that the asymptotic familywise rejection probability can be controlled (bounded below ) if each individual test is subjected to the stricter standard that a p-value must be smaller than to be labeled as “significant”.
To illustrate, suppose we have two coefficient estimates with individual p-values and . Based on a conventional level the standard individual tests would suggest that the first coefficient estimate is “significant” but not the second. A Bonferroni 5% test, however, does not reject as it would require that the smallest p-value be smaller than , which is not the case in this example. Alternatively, the Bonferroni familywise -value is , which is not significant at the level.
In contrast, if the two p-values were and , then the Bonferroni familywise p-value would be , which is significant at the level.
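The Bonferroni rule is a one-line calculation. The sketch below implements the familywise p-value min(k · min p-value, 1) and the corresponding rejection rule; the p-values fed in are hypothetical, chosen only to illustrate the two situations discussed above.

```python
def bonferroni(pvalues, alpha=0.05):
    """Bonferroni familywise test: reject the joint null if the smallest
    p-value is below alpha/k; the familywise p-value is min(k*min(p), 1)."""
    k = len(pvalues)
    p_min = min(pvalues)
    familywise_p = min(k * p_min, 1.0)
    return familywise_p, p_min < alpha / k

# Hypothetical pairs of individual p-values:
p1, reject1 = bonferroni([0.04, 0.15])   # individually "significant", but
                                         # familywise p = 0.08: not rejected
p2, reject2 = bonferroni([0.01, 0.15])   # familywise p = 0.02: rejected
```

The first pair shows how an individually "significant" estimate can fail the familywise standard; the second pair passes it because the smallest p-value is below alpha divided by the number of tests.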
Power and Test Consistency
The power of a test is the probability of rejecting when is true.
For simplicity suppose that is i.i.d. with known, consider the t-statistic , and tests of against . We reject if . Note that
and has an exact distribution. This is because is centered at the true mean , while the test statistic is centered at the (false) hypothesized mean of 0 .
The power of the test is
This function is monotonically increasing in and , and decreasing in and .
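Under the assumption (consistent with the monotonicity just described) that this is the one-sided test which rejects when the t-ratio exceeds a normal critical value, the power function is one minus the standard normal CDF evaluated at the critical value minus the standardized mean shift. A Python sketch with hypothetical parameter values:

```python
from math import sqrt
from scipy.stats import norm

def power_one_sided(theta, sigma, n, alpha=0.05):
    """Power of the asymptotic one-sided t-test that rejects for t > c:
    pi(theta) = 1 - Phi(c - sqrt(n) * theta / sigma), c = Phi^{-1}(1-alpha)."""
    c = norm.ppf(1.0 - alpha)
    return norm.sf(c - sqrt(n) * theta / sigma)

# Power is increasing in theta and n, and decreasing in sigma:
p_n25 = power_one_sided(theta=0.2, sigma=1.0, n=25)
p_n100 = power_one_sided(theta=0.2, sigma=1.0, n=100)
```

At the null value the function returns exactly the size of the test, and it rises toward 1 as either the deviation or the sample size grows, which is the test-consistency property discussed next.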
Notice that for any and the power increases to 1 as . This means that for the test will reject with probability approaching 1 as the sample size gets large. We call this property test consistency.
Definition 9.3 A test of is consistent against fixed alternatives if for all Reject as .
For tests of the form “Reject if ”, a sufficient condition for test consistency is that the statistic diverges to positive infinity with probability one for all .

Definition 9.4 We say that as if for all , as . Similarly, we say that as if for all , as .
In general, t-tests and Wald tests are consistent against fixed alternatives. Take a t-statistic for a test of where is a known value and . Note that
The first term on the right-hand side converges in distribution to . The second term on the right-hand side equals zero if , converges in probability to if , and converges in probability to if . Thus the two-sided t-test is consistent against , and one-sided t-tests are consistent against the alternatives for which they are designed.
Theorem 9.8 Under Assumptions 7.2, 7.3, and 7.4, for and , then . For any the test “Reject if ” is consistent against fixed alternatives.
The Wald statistic for against is . Under , . Thus . Hence under . Again, this implies that Wald tests are consistent.
Theorem 9.9 Under Assumptions 7.2, 7.3, and 7.4, for , then . For any the test “Reject if ” is consistent against fixed alternatives.
Asymptotic Local Power
Consistency is a good property for a test, but it does not provide a tool to calculate test power. To approximate the power function we need a distributional approximation.
The standard asymptotic method for power analysis uses what are called local alternatives. This is similar to our analysis of restriction estimation under misspecification (Section 8.13). The technique is to index the parameter by sample size so that the asymptotic distribution of the statistic is continuous in a localizing parameter. In this section we consider t-tests on real-valued parameters and in the next section Wald tests. Specifically, we consider parameter vectors which are indexed by sample size and satisfy the real-valued relationship
where the scalar is called a localizing parameter. We index and by sample size to indicate their dependence on . The way to think of (9.17) is that the true values of the parameters are and . The parameter is close to the hypothesized value , with deviation .
The specification (9.17) states that for any fixed approaches as gets large. Thus is “close” or “local” to . The concept of a localizing sequence (9.17) might seem odd since in the actual world the sample size cannot mechanically affect the value of the parameter. Thus (9.17) should not be interpreted literally. Instead, it should be interpreted as a technical device which allows the asymptotic distribution to be continuous in the alternative hypothesis.
To evaluate the asymptotic distribution of the test statistic we start by examining the scaled estimator centered at the hypothesized value . Breaking it into a term centered at the true value and a remainder we find
where the second equality is (9.17). The first term is asymptotically normal:
where . Therefore
This asymptotic distribution depends continuously on the localizing parameter .
Applied to the statistic we find
where . This generalizes Theorem (which assumes is true) to allow for local alternatives of the form (9.17).
Consider a t-test of $\mathbb{H}_0: \theta = \theta_0$ against the one-sided alternative $\mathbb{H}_1: \theta > \theta_0$ which rejects $\mathbb{H}_0$ for $T > z_\alpha$ where $\Phi(z_\alpha) = 1 - \alpha$. The asymptotic local power of this test is the limit (as the sample size diverges) of the rejection probability under the local alternative (9.17):
$$\lim_{n \to \infty} \mathbb{P}\left[T > z_\alpha\right] = \mathbb{P}\left[Z + \delta > z_\alpha\right] = 1 - \Phi\left(z_\alpha - \delta\right) = \Phi\left(\delta - z_\alpha\right) \stackrel{\text{def}}{=} \pi(\delta).$$
We call $\pi(\delta)$ the asymptotic local power function.
In Figure 9.3(a) we plot the local power function $\pi(\delta)$ as a function of $\delta$ for tests of asymptotic size $\alpha = 0.05$ and $\alpha = 0.01$. $\delta = 0$ corresponds to the null hypothesis so $\pi(0) = \alpha$. The power functions are monotonically increasing in $\delta$. Note that the power is lower than $\alpha$ for $\delta < 0$ due to the one-sided nature of the test.
We can see that the power functions are ranked by $\alpha$: the test with $\alpha = 0.05$ has higher power than the test with $\alpha = 0.01$. This is the inherent trade-off between size and power. Decreasing size induces a decrease in power, and conversely.
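The formula $\pi(\delta) = \Phi(\delta - z_\alpha)$ is straightforward to evaluate numerically. Below is a minimal sketch (the function name is our own; `scipy.stats.norm` supplies $\Phi$ and its quantile function):

```python
from scipy.stats import norm

def local_power_one_sided(delta, alpha):
    """Asymptotic local power of the one-sided t-test: P(Z + delta > z_alpha)."""
    z = norm.ppf(1 - alpha)        # critical value z_alpha
    return norm.cdf(delta - z)     # = Phi(delta - z_alpha)

for alpha in (0.05, 0.01):
    print(alpha, round(local_power_one_sided(1.0, alpha), 2))
# pi(0) = alpha, and at delta = 1 the power is 0.26 (5% test) and 0.09 (1% test)
```

Evaluating this function on a grid of $\delta$ values reproduces power curves of the type plotted in Figure 9.3(a).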

Figure 9.3: Asymptotic Local Power Function. (a) One-Sided t Test; (b) Vector Case.
The coefficient $\delta$ can be interpreted as the parameter deviation measured as a multiple of the standard error $s(\hat\theta)$. To see this, recall that $s(\hat\theta) = \sqrt{\hat{V}_\theta / n} \approx \sqrt{V_\theta / n}$ and then note that
$$\delta = \frac{h}{\sqrt{V_\theta}} \approx \frac{n^{-1/2} h}{s(\hat\theta)} = \frac{\theta_n - \theta_0}{s(\hat\theta)}.$$
Thus $\delta$ approximately equals the deviation $\theta_n - \theta_0$ expressed as a multiple of the standard error $s(\hat\theta)$. Thus as we examine Figure 9.3(a) we can interpret the power function at $\delta = 1$ (e.g. 26% for a 5% size test) as the power when the parameter $\theta_n$ is one standard error above the hypothesized value. For example, using the standard error for the coefficient on "Married Female" from Table 4.1, $\delta = 1$ corresponds to a wage premium for married females of one standard error in magnitude. Our calculations show that the asymptotic power of a one-sided 5% size test against this alternative is about 26%.
The difference between power functions can be measured either vertically or horizontally. For example, in Figure 9.3(a) there is a vertical dashed line at $\delta = 1$, showing that the asymptotic local power function equals 0.26 for $\alpha = 0.05$ and 0.09 for $\alpha = 0.01$. This is the difference in power across tests of differing size, holding fixed the parameter in the alternative.
A horizontal comparison can also be illuminating. To illustrate, in Figure 9.3(a) there is a horizontal dashed line at 50% power. 50% power is a useful benchmark as it is the point where the test has equal odds of rejection and acceptance. The dashed line crosses the two power curves at $\delta = 1.65$ and $\delta = 2.33$. This means that the parameter $\theta$ must be at least 1.65 standard errors above the hypothesized value for a one-sided 5% size test to have 50% (approximate) power, and at least 2.33 standard errors for a one-sided 1% size test.
The ratio of these values (e.g. $2.33 / 1.65 = 1.41$) measures the relative parameter magnitude needed to achieve the same power. (Thus, for a 1% size test to achieve 50% power, the parameter must be 41% larger than for a 5% size test.) Even more interesting, the square of this ratio (e.g. $1.41^2 = 2.0$) is the increase in sample size needed to achieve the same power under fixed parameters. That is, to achieve 50% power, a 1% size test needs twice as many observations as a 5% size test. This interpretation follows from the following informal argument. By the definition of $\delta$ and (9.17), $\delta = h / \sqrt{V_\theta} = \sqrt{n}\left(\theta_n - \theta_0\right) / \sqrt{V_\theta}$. Thus holding $\theta_n - \theta_0$ and $V_\theta$ fixed, $\delta^2$ is proportional to $n$.
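The horizontal comparison is pure arithmetic: the $\delta$ needed for 50% power solves $\Phi(\delta - z_\alpha) = 0.5$, i.e. $\delta = z_\alpha$. A quick sketch (variable names are ours):

```python
from scipy.stats import norm

delta_05 = norm.ppf(0.95)   # delta needed for 50% power with a 5% test (about 1.65)
delta_01 = norm.ppf(0.99)   # delta needed for 50% power with a 1% test (about 2.33)

print(round(delta_01 / delta_05, 2))         # relative parameter magnitude: 1.41
print(round((delta_01 / delta_05) ** 2, 1))  # relative sample size: 2.0
```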
The analysis of a two-sided t test is similar. (9.18) implies that
$$|T| = \left|\frac{\sqrt{n}\left(\hat\theta - \theta_0\right)}{\sqrt{\hat{V}_\theta}}\right| \xrightarrow{d} \left|Z + \delta\right|$$
and thus the local power of a two-sided t test is
$$\lim_{n \to \infty} \mathbb{P}\left[|T| > z_{\alpha/2}\right] = \mathbb{P}\left[\left|Z + \delta\right| > z_{\alpha/2}\right] = \Phi\left(\delta - z_{\alpha/2}\right) + \Phi\left(-\delta - z_{\alpha/2}\right)$$
which is monotonically increasing in $|\delta|$.
Theorem 9.10 Under Assumptions 7.2, 7.3, 7.4, and $\theta_n = r(\beta_n) = \theta_0 + n^{-1/2} h$, then
$$T \xrightarrow{d} Z + \delta$$
where $Z \sim \mathrm{N}(0, 1)$ and $\delta = h / \sqrt{V_\theta}$. For $z$ such that $\Phi(z) = 1 - \alpha$,
$$\mathbb{P}\left[T > z\right] \to \Phi(\delta - z).$$
Furthermore, for $z$ such that $\Phi(z) = 1 - \alpha / 2$,
$$\mathbb{P}\left[|T| > z\right] \to \Phi(\delta - z) + \Phi(-\delta - z).$$
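The two-sided limit in Theorem 9.10 translates directly into the power formula $\Phi(\delta - z_{\alpha/2}) + \Phi(-\delta - z_{\alpha/2})$, implemented in this short sketch (the function name is our own):

```python
from scipy.stats import norm

def local_power_two_sided(delta, alpha):
    """Asymptotic local power of the two-sided t-test: P(|Z + delta| > z_{alpha/2})."""
    z = norm.ppf(1 - alpha / 2)                        # critical value z_{alpha/2}
    return norm.cdf(delta - z) + norm.cdf(-delta - z)

print(round(local_power_two_sided(1.96, 0.05), 2))     # 0.5 when delta equals z_{alpha/2}
```

Note the function is symmetric in $\delta$: the two-sided test has the same local power against deviations of either sign, unlike the one-sided test.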
Asymptotic Local Power, Vector Case
In this section we extend the local power analysis of the previous section to the case of vector-valued alternatives. We generalize (9.17) to vector-valued $\theta$. The local parameterization is
$$\theta_n = r(\beta_n) = \theta_0 + n^{-1/2} h \qquad (9.19)$$
where $h$ is $q \times 1$.
Under (9.19),
$$\sqrt{n}\left(\hat\theta - \theta_0\right) = \sqrt{n}\left(\hat\theta - \theta_n\right) + h \xrightarrow{d} Z_h \sim \mathrm{N}(h, V_\theta),$$
a normal random vector with mean $h$ and covariance matrix $V_\theta$.
Applied to the Wald statistic we find
$$W = n\left(\hat\theta - \theta_0\right)' \hat{V}_\theta^{-1}\left(\hat\theta - \theta_0\right) \xrightarrow{d} Z_h' V_\theta^{-1} Z_h \sim \chi_q^2(\lambda) \qquad (9.20)$$
where $\lambda = h' V_\theta^{-1} h$ and $\chi_q^2(\lambda)$ is a non-central chi-square random variable with non-centrality parameter $\lambda$. (Theorem 5.3.6.)
The convergence (9.20) shows that under the local alternatives (9.19), $W \xrightarrow{d} \chi_q^2(\lambda)$. This generalizes the null asymptotic distribution, which obtains as the special case $\lambda = 0$. We can use this result to obtain a continuous asymptotic approximation to the power function. For any significance level $\alpha$ set the asymptotic critical value $c_\alpha$ so that $\mathbb{P}\left[\chi_q^2 > c_\alpha\right] = \alpha$. Then as $n \to \infty$,
$$\mathbb{P}\left[W > c_\alpha\right] \to \mathbb{P}\left[\chi_q^2(\lambda) > c_\alpha\right] \stackrel{\text{def}}{=} \pi(\lambda).$$
The asymptotic local power function $\pi(\lambda)$ depends only on $\alpha$, $q$, and $\lambda$.
Theorem 9.11 Under Assumptions 7.2, 7.3, 7.4, and $\theta_n = r(\beta_n) = \theta_0 + n^{-1/2} h$, then $W \xrightarrow{d} \chi_q^2(\lambda)$ where $\lambda = h' V_\theta^{-1} h$. Furthermore, for $c_\alpha$ such that $\mathbb{P}\left[\chi_q^2 > c_\alpha\right] = \alpha$, $\mathbb{P}\left[W > c_\alpha\right] \to \mathbb{P}\left[\chi_q^2(\lambda) > c_\alpha\right]$.
Figure 9.3(b) plots $\pi(\lambda)$ as a function of $\lambda$ for $q = 1$, $q = 2$, and $q = 3$, with $\alpha = 0.05$. The asymptotic power functions are monotonically increasing in $\lambda$ and asymptote to one.
Figure 9.3(b) also shows the power loss for a fixed non-centrality parameter $\lambda$ as the dimensionality of the test increases. The power curves shift to the right as $q$ increases, resulting in a decrease in power. This is illustrated by the dashed line at 50% power. The dashed line crosses the three power curves at $\lambda = 3.85$ ($q = 1$), $\lambda = 4.96$ ($q = 2$), and $\lambda = 5.77$ ($q = 3$). The ratios of these values correspond to the relative sample sizes needed to obtain the same power. Thus increasing the dimension of the test from $q = 1$ to $q = 2$ requires a 29% increase in sample size, and an increase from $q = 1$ to $q = 3$ requires a 50% increase in sample size, to maintain 50% power.
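The power function $\pi(\lambda) = \mathbb{P}[\chi_q^2(\lambda) > c_\alpha]$ can be evaluated with scipy's non-central chi-square distribution (`ncx2`). The sketch below (our own function name, not from the text) illustrates the power loss as $q$ grows for fixed $\lambda$:

```python
from scipy.stats import chi2, ncx2

def wald_local_power(lam, q, alpha=0.05):
    """Asymptotic local power of the Wald test: P(chi2_q(lam) > c_alpha)."""
    c = chi2.ppf(1 - alpha, df=q)    # asymptotic critical value c_alpha
    return ncx2.sf(c, df=q, nc=lam)  # survival function of the noncentral chi2

for q in (1, 2, 3):
    print(q, round(wald_local_power(4.0, q), 2))
# for fixed noncentrality lambda, power falls as the test dimension q rises
```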
Exercises
Exercise 9.1 Prove that if an additional regressor $X_{k+1}$ is added to $X$, Theil's adjusted $\bar{R}^2$ increases if and only if $\left|T_{k+1}\right| > 1$, where $T_{k+1} = \hat\beta_{k+1} / s(\hat\beta_{k+1})$ is the t-ratio for $\hat\beta_{k+1}$ and $s(\hat\beta_{k+1})$ is the homoskedasticity-formula standard error.
Exercise 9.2 You have two independent samples and both with sample sizes which satisfy and , where and . Let and be the OLS estimators of and .
Find the asymptotic distribution of as .
Find an appropriate test statistic for .
Find the asymptotic distribution of this statistic under .
Exercise 9.3 Let be a t-statistic for versus . Since under , someone suggests the test “Reject if or , where is the quantile of and is the quantile of .
Show that the asymptotic size of the test is .
Is this a good test of versus ? Why or why not?
Exercise 9.4 Let be a Wald statistic for versus , where is . Since under , someone suggests the test “Reject if or , where is the quantile of and is the quantile of .
Show that the asymptotic size of the test is .
Is this a good test of versus ? Why or why not?
Exercise 9.5 Take the linear model with where both and are . Show how to test the hypotheses against .
Exercise 9.6 Suppose a researcher wants to know which of a set of 20 regressors has an effect on a variable testscore. He regresses testscore on the 20 regressors and reports the results. One of the 20 regressors (studytime) has a large t-ratio (about 2.5), while the other t-ratios are insignificant (smaller than 2 in absolute value). He argues that the data show that studytime is the key predictor for testscore. Do you agree with this conclusion? Is there a deficiency in his reasoning?
Exercise 9.7 Take the model with where is wages (dollars per hour) and is age. Describe how you would test the hypothesis that the expected wage for a 40-year-old worker is an hour.
Exercise 9.8 You want to test against in the model with . You read a paper which estimates the model
and reports a test of against . Is this related to the test you wanted to conduct?
Exercise 9.9 Suppose a researcher uses one dataset to test a specific hypothesis against and finds that he can reject . A second researcher gathers a similar but independent dataset, uses similar methods and finds that she cannot reject . How should we (as interested professionals) interpret these mixed results?
Exercise 9.10 In Exercise you showed that as for some . Let be an estimator of .
Using this result construct a t-statistic for against .
Using the Delta Method find the asymptotic distribution of .
Use the previous result to construct a t-statistic for against .
Are the null hypotheses in (a) and (c) the same or are they different? Are the tests in (a) and (c) the same or are they different? If they are different, describe a context in which the two tests would give contradictory results.
Exercise 9.11 Consider a regression such as Table 4.1 where both experience and its square are included. A researcher wants to test the hypothesis that experience does not affect mean wages and does this by computing the t-statistic for experience. Is this the correct approach? If not, what is the appropriate testing method?

Exercise 9.12 A researcher estimates a regression and computes a test of against and finds a p-value of , or "not significant". She says "I need more data. If I had a larger sample the test will have more power and then the test will reject." Is this interpretation correct?
Exercise 9.13 A common view is that “If the sample size is large enough, any hypothesis will be rejected.” What does this mean? Interpret and comment.
Exercise 9.14 Take the model with and parameter of interest with . Let be the least squares estimator and its variance estimator.
Write down , the asymptotic confidence interval for , in terms of , , and (the quantile of ).
Show that the decision “Reject if ” is an asymptotic test of .
Exercise 9.15 You are at a seminar where a colleague presents a simulation study of a test of a hypothesis with nominal size . Based on simulation replications under the estimated size is . Your colleague says: “Unfortunately the test over-rejects.”
Do you agree or disagree with your colleague? Explain. Hint: Use an asymptotic (large B) approximation.
Suppose the number of simulation replications were yet the estimated size is still . Does your answer change?
Exercise 9.16 Consider two alternative regression models
where and have at least some different regressors. (For example, (9.21) is a wage regression on geographic variables and (9.22) is a wage regression on personal appearance measurements.) You want to know if model (9.21) or model (9.22) fits the data better. Define and . You decide that the model with the smaller variance fits better (e.g., model (9.21) fits better if ). You decide to test for this by testing the hypothesis of equal fit against the alternative of unequal fit . For simplicity, suppose that and are observed.
Construct an estimator of .
Find the asymptotic distribution of as .
Find an estimator of the asymptotic variance of .
Propose a test of asymptotic size of against .
Suppose the test accepts . Briefly, what is your interpretation?

Exercise 9.17 You have two regressors and and estimate a regression with all quadratic terms included
One of your advisors asks: Can we exclude the variable from this regression?
How do you translate this question into a statistical test? When answering these questions, be specific, not general.
What is the relevant null and alternative hypotheses?
What is an appropriate test statistic?
What is the appropriate asymptotic distribution for the statistic?
What is the rule for acceptance/rejection of the null hypothesis?
Exercise 9.18 The observed data is and . An econometrician first estimates by least squares. The econometrician next regresses the residual on , which can be written as .
Define the population parameter being estimated in this second regression.
Find the probability limit for .
Suppose the econometrician constructs a Wald statistic for from the second regression, ignoring the two-stage estimation process. Write down the formula for .
Assume . Find the asymptotic distribution for under .
If will your answer to (d) change?
Exercise 9.19 An economist estimates by least squares and tests hypothesis against . Assume and . She obtains a Wald statistic . The sample size is .
What is the correct degrees of freedom for the distribution to evaluate the significance of the Wald statistic?
The Wald statistic is very small. Indeed, is it less than the quantile of the appropriate distribution? If so, should you reject ? Explain your reasoning.
Exercise 9.20 You are reading a paper, and it reports the results from two nested OLS regressions:
Some summary statistics are reported:
You are curious if the estimate is statistically different from the zero vector. Is there a way to determine an answer from this information? Do you have to make any assumptions (beyond the standard regularity conditions) to justify your answer?

Exercise 9.21 Take the model with . Describe how to test
against
Exercise 9.22 You have a random sample from the model with where is wages (dollars per hour) and is age. Describe how you would test the hypothesis that the expected wage for a 40-year-old worker is an hour.
Exercise 9.23 Let be a test statistic such that under . Since , an asymptotic test of rejects when . An econometrician is interested in the Type I error of this test when and the data structure is well specified. She performs the following Monte Carlo experiment.
samples of size are generated from a distribution satisfying .
On each sample, the test statistic is calculated.
She calculates .
The econometrician concludes that the test is oversized in this context: it rejects too frequently under the null hypothesis.
Is her conclusion correct, incorrect, or incomplete? Be specific in your answer.
Exercise 9.24 Do a Monte Carlo simulation. Take the model with where the parameter of interest is . Your data generating process (DGP) for the simulation is: is , is independent of , and . Set and . Generate independent samples with . On each, estimate the regression by least squares, calculate the covariance matrix using a standard (heteroskedasticity-robust) formula, and similarly estimate and its standard error. For each replication, store , and .
Does the value of matter? Explain why the described statistics are invariant to and thus setting is irrelevant.
From the 1000 replications estimate and . Discuss whether you see evidence that either estimator is biased or unbiased.
From the 1000 replications estimate and . What does asymptotic theory predict these probabilities should be in large samples? What do your simulation results indicate?
Exercise 9.25 The data set Invest1993 on the textbook website contains data on 1962 U.S. firms extracted from Compustat, assembled by Bronwyn Hall, and used in Hall and Hall (1993).
The variables we use in this exercise are in the table below. The flow variables are annual sums. The stock variables are beginning of year.
| Variable | Description |
| --- | --- |
| inva | Investment to Capital Ratio |
| vala | Total Market Value to Asset Ratio (Tobin's Q) |
| cfa | Cash Flow to Asset Ratio |
| debta | Long Term Debt to Asset Ratio |
Extract the sub-sample of observations for 1987. There should be 1028 observations. Estimate a linear regression of (investment to capital ratio) on the other variables. Calculate appropriate standard errors.
Calculate asymptotic confidence intervals for the coefficients.
This regression is related to Tobin’s theory of investment, which suggests that investment should be predicted solely by (Tobin’s ). This theory predicts that the coefficient on should be positive and the others should be zero. Test the joint hypothesis that the coefficients on cash flow and debt are zero. Test the hypothesis that the coefficient on is zero. Are the results consistent with the predictions of the theory?
Now try a nonlinear (quadratic) specification. Regress on . Test the joint hypothesis that the six interaction and quadratic coefficients are zero.
Exercise 9.26 In a 1963 paper Marc Nerlove analyzed a cost function for 145 American electric companies. Nerlove was interested in estimating a cost function: where the variables are listed in the table below. His data set Nerlove1963 is on the textbook website.
| Variable | Description |
| --- | --- |
| Q | Output |
| PL | Unit price of labor |
| PK | Unit price of capital |
| PF | Unit price of fuel |
- First, estimate an unrestricted Cobb-Douglas specification
Report parameter estimates and standard errors.
What is the economic meaning of the restriction ?
Estimate (9.23) by constrained least squares imposing . Report your parameter estimates and standard errors.
Estimate (9.23) by efficient minimum distance imposing . Report your parameter estimates and standard errors.
Test using a Wald statistic.
Test using a minimum distance statistic. Exercise 9.27 In Section 8.12 we reported estimates from Mankiw, Romer and Weil (1992). We reported estimation both by unrestricted least squares and by constrained estimation, imposing the constraint that three coefficients ( and coefficients) sum to zero as implied by the Solow growth theory. Using the same dataset MRW1992 estimate the unrestricted model and test the hypothesis that the three coefficients sum to zero.
Exercise 9.28 Using the cps09mar dataset and the subsample of non-Hispanic Black individuals (race code ) test the hypothesis that marriage status does not affect mean wages.
Take the regression reported in Table 4.1. Which variables will need to be omitted to estimate a regression for this subsample?
Express the hypothesis “marriage status does not affect mean wages” as a restriction on the coefficients. How many restrictions is this?
Find the Wald (or F) statistic for this hypothesis. What is the appropriate distribution for the test statistic? Calculate the p-value of the test.
What do you conclude?
Exercise 9.29 Using the cps09mar dataset and the subsample of non-Hispanic Black individuals (race code ) and white individuals (race code ) test the hypothesis that the return to education is common across groups.
Allow the return to education to vary across the four groups (white male, white female, Black male, Black female) by interacting dummy variables with education. Estimate an appropriate version of the regression reported in Table 4.1.
Find the Wald (or F) statistic for this hypothesis. What is the appropriate distribution for the test statistic? Calculate the p-value of the test.
What do you conclude?