Introduction
A time series $y_t \in \mathbb{R}^m$ is a process which is sequentially ordered over time. In this textbook we focus on discrete time series where the time index $t$ is an integer, though there is also a considerable literature on continuous-time processes. To denote the time period it is typical to use the subscript $t$. The time series is univariate if $m = 1$ and multivariate if $m > 1$. This chapter is primarily focused on univariate time series models, though we describe the concepts for the multivariate case when the added generality does not add extra complication.
Most economic time series are recorded at discrete intervals such as annual, quarterly, monthly, weekly, or daily. The number of observed periods per year is called the frequency. In most cases we will denote the observed sample by the periods $t = 1, \ldots, n$.
Because of the sequential nature of time series we expect that observations close in calendar time, e.g. $y_t$ and its lagged value $y_{t-1}$, will be dependent. This type of dependence structure requires a different distributional theory than for cross-sectional and clustered observations since we cannot divide the sample into independent groups. Many of the issues which distinguish time series from cross-section econometrics concern the modeling of these dependence relationships.
There are many excellent textbooks for time series analysis. The encyclopedic standard is Hamilton (1994). Others include Harvey (1990), Tong (1990), Brockwell and Davis (1991), Fan and Yao (2003), Lütkepohl (2005), Enders (2014), and Kilian and Lütkepohl (2017). For textbooks on the related subject of forecasting see Granger and Newbold (1986), Granger (1989), and Elliott and Timmermann (2016).
Examples
Many economic time series are macroeconomic variables. An excellent resource for U.S. macroeconomic data are the FRED-MD and FRED-QD databases which contain a wide set of monthly and quarterly variables, assembled and maintained by the St. Louis Federal Reserve Bank. See McCracken and Ng (2016, 2021). The datasets FRED-MD and FRED-QD for 1959-2017 are posted on the textbook website. FRED-MD has 129 variables over 708 months. FRED-QD has 248 variables over 236 quarters.
When working with time series data one of the first tasks is to plot the series against time. In Figures 14.1-14.2 we plot eight example time series from FRED-QD and FRED-MD. As is conventional, the x-axis displays calendar dates (in this case years) and the y-axis displays the level of the series. The series plotted are: (1a) real U.S. GDP; (1b) the U.S.-Canada exchange rate (excausx); (1c) the interest rate on the U.S. 10-year Treasury bond (gs10); (1d) the real crude oil price (oilpricex); (2a) the U.S. unemployment rate (unrate); (2b) the U.S. real non-durables consumption growth rate; (2c) the U.S. CPI inflation rate

Figure 14.1: GDP, Exchange Rate, Interest Rate, Oil Price
(growth rate of cpiaucsl); and (2d) the S&P 500 return. Series (1a) and (2b) are quarterly; the rest are monthly.
Many of the plots are smooth, meaning that the neighboring values (in calendar time) are similar to one another and hence are serially correlated. Some of the plots are non-smooth, meaning that the neighboring values are less similar and hence less correlated. At least one plot (real GDP) displays an upward trend.

Figure 14.2: Unemployment Rate, Consumption Growth Rate, Inflation Rate, and S&P 500 Return
Differences and Growth Rates
It is common to transform series by taking logarithms, differences, and/or growth rates. Three of the series in Figure 14.2 (consumption growth, inflation [growth rate of the CPI index], and the S&P 500 return) are displayed as growth rates. This may be done for a number of reasons. The most credible is that this is the suitable transformation for the desired analysis.
Many aggregate series such as real GDP are transformed by taking natural logarithms. This flattens the apparent exponential growth and makes fluctuations proportionate.
The first difference of a series is
$$\Delta y_t = y_t - y_{t-1}.$$
The second difference is
$$\Delta^2 y_t = \Delta y_t - \Delta y_{t-1} = y_t - 2 y_{t-1} + y_{t-2}.$$
Higher-order differences can be defined similarly but are rarely used in practice. The annual, or year-on-year, change of a series with frequency $s$ (observations per year) is
$$\Delta_s y_t = y_t - y_{t-s}.$$
There are several methods to calculate growth rates. The one-period growth rate is the percentage change from period $t-1$ to period $t$:
$$g_t = 100\left(\frac{y_t - y_{t-1}}{y_{t-1}}\right). \tag{14.1}$$
The multiplication by 100 is not essential but scales $g_t$ so that it is a percentage. This is the transformation used for the plots in Figure 14.2(b)-(d). For quarterly data, $g_t$ is the quarterly growth rate. For monthly data, $g_t$ is the monthly growth rate.
For non-annual data the one-period growth rate (14.1) may be unappealing for interpretation. Consequently, statistical agencies commonly report "annualized" growth rates, which give the annual growth which would occur if the one-period growth rate were compounded for a full year. For a series with frequency $s$ the annualized growth rate is
$$a_t = 100\left(\left(\frac{y_t}{y_{t-1}}\right)^{s} - 1\right). \tag{14.2}$$
Notice that $a_t$ is a nonlinear function of $g_t$.
Year-on-year growth rates are
$$100\left(\frac{y_t - y_{t-s}}{y_{t-s}}\right).$$
These do not need annualization.
Growth rates are closely related to logarithmic transformations. For small growth rates, $g_t$ and $a_t$ are approximately proportional to first differences in logarithms:
$$g_t \simeq 100\,\Delta \log y_t, \qquad a_t \simeq 100\, s\, \Delta \log y_t.$$
For analysis using growth rates I recommend the one-period growth rates (14.1) or differenced logarithms rather than the annualized growth rates (14.2). While annualized growth rates are preferred for reporting, they are a highly nonlinear transformation which is unnatural for statistical analysis. Differenced logarithms are the most common choice and are recommended for models which combine log-levels and growth rates, since then the model is linear in all variables.
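To make these transformations concrete, here is a minimal numpy sketch (the function name, the example series, and the quarterly frequency $s = 4$ are illustrative choices, not taken from the text) computing the one-period, annualized, and year-on-year growth rates together with the log-difference approximation.

```python
import numpy as np

def growth_rates(y, s):
    """Transformations of a level series y observed s times per year.

    Returns one-period growth (14.1), annualized growth (14.2),
    year-on-year growth, and 100 times the first difference of logs.
    """
    y = np.asarray(y, dtype=float)
    g = 100 * (y[1:] / y[:-1] - 1)           # one-period growth rate
    a = 100 * ((y[1:] / y[:-1]) ** s - 1)    # annualized growth rate
    yoy = 100 * (y[s:] / y[:-s] - 1)         # year-on-year growth rate
    dlog = 100 * np.diff(np.log(y))          # log-difference approximation
    return g, a, yoy, dlog

# Example: a quarterly series growing exactly 1% per quarter
y = 100 * 1.01 ** np.arange(12)
g, a, yoy, dlog = growth_rates(y, s=4)
print(g[0], a[0], yoy[0], dlog[0])   # about 1.00, 4.06, 4.06, 0.995
```

On this example the annualized and year-on-year rates coincide (about 4.06 percent) because the quarterly growth rate is constant, while the log difference is close to the one-period rate, illustrating the approximation above.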
Stationarity
Recall that cross-sectional observations are conventionally treated as random draws from an underlying population. This is not an appropriate model for time series processes due to serial dependence. Instead, we treat the observed sample $\left\{y_1, \ldots, y_n\right\}$ as a realization of a dependent stochastic process. It is often useful to view the sample as a subset of an underlying doubly-infinite sequence $\left\{\ldots, y_{t-1}, y_t, y_{t+1}, \ldots\right\}$.
A random vector $y_t$ can be characterized by its distribution. A set such as $\left(y_t, y_{t+1}, \ldots, y_{t+k}\right)$ can be characterized by its joint distribution. Important features of these distributions are their means, variances, and covariances. Since there is only one observed time series sample, in order to learn about these distributions there needs to be some sort of constancy. This may only hold after a suitable transformation such as growth rates (as discussed in the previous section).
The most commonly assumed form of constancy is stationarity. There are two definitions. The first is sufficient for construction of linear models.
Definition 14.1 $\left\{y_t\right\}$ is covariance stationary, or weakly stationary, if the expectation $\mu = E\left[y_t\right]$ and covariance matrix $\Sigma = \mathrm{var}\left[y_t\right]$ are finite and independent of $t$, and the autocovariances
$$\Gamma(k) = \mathrm{cov}\left(y_t, y_{t-k}\right)$$
are independent of $t$ for all $k$.
In the univariate case we typically write the variance as $\sigma^2 = \mathrm{var}\left[y_t\right]$ and the autocovariances as $\gamma(k) = \mathrm{cov}\left(y_t, y_{t-k}\right)$.
The expectation $\mu$ and variance $\Sigma$ are features of the marginal distribution of $y_t$ (the distribution of $y_t$ at a specific time period $t$). Their constancy as stated in the above definition means that these features of the distribution are stable over time.
The autocovariances $\Gamma(k)$ are features of the bivariate distributions of $\left(y_t, y_{t-k}\right)$. Their constancy as stated in the definition means that the correlation patterns between adjacent $y_t$ are stable over time and only depend on the number of time periods separating the variables. By symmetry we have $\Gamma(-k) = \Gamma(k)'$. In the univariate case this simplifies to $\gamma(-k) = \gamma(k)$. The autocovariances are finite under the assumption that the covariance matrix is finite by the Cauchy-Schwarz inequality.
The autocovariances summarize the linear dependence between $y_t$ and its lags. A scale-free measure of linear dependence in the univariate case is given by the autocorrelations
$$\rho(k) = \frac{\gamma(k)}{\gamma(0)} = \mathrm{corr}\left(y_t, y_{t-k}\right).$$
Notice by symmetry that $\rho(-k) = \rho(k)$.
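As an illustration, the following sketch (the function, the simulated series, and the seed are ours, not from the text) computes sample autocovariances and autocorrelations. For the moving-average series used as a check, the first-order sample autocorrelation should be near $0.8/(1+0.8^2) \approx 0.49$, with higher-order autocorrelations near zero.

```python
import numpy as np

def sample_autocorrelations(y, max_lag):
    """Sample autocovariances gamma(k) and autocorrelations rho(k) for k = 0..max_lag."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    ybar = y.mean()
    gamma = np.array([np.sum((y[k:] - ybar) * (y[:n - k] - ybar)) / n
                      for k in range(max_lag + 1)])
    rho = gamma / gamma[0]
    return gamma, rho

rng = np.random.default_rng(0)
e = rng.standard_normal(500)
y = e[1:] + 0.8 * e[:-1]            # a simple moving-average series
gamma, rho = sample_autocorrelations(y, 5)
print(rho.round(2))                 # rho(1) near 0.49, higher lags near 0
```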
The second definition of stationarity concerns the entire joint distribution.
Definition 14.2 $\left\{y_t\right\}$ is strictly stationary if the joint distribution of $\left(y_t, \ldots, y_{t+k}\right)$ is independent of $t$ for all $k \geq 0$. This is the natural generalization of the cross-section definition of identical distributions. Strict stationarity implies that the (marginal) distribution of $y_t$ does not vary over time. It also implies that the bivariate distributions of $\left(y_t, y_{t+1}\right)$ and multivariate distributions of $\left(y_t, \ldots, y_{t+k}\right)$ are stable over time. Under the assumption of a bounded variance a strictly stationary process is covariance stationary.
For formal statistical theory we generally require the stronger assumption of strict stationarity. Therefore if we label a process as “stationary” you should interpret it as meaning “strictly stationary”.
The core meaning of both weak and strict stationarity is the same: that the distribution of $y_t$ is stable over time. To understand the concept it may be useful to review the plots in Figures 14.1-14.2. Are these stationary processes? If so, we would expect the expectation and variance to be stable over time. This seems unlikely to apply to the series in Figure 14.1, as in each case it is difficult to describe what is the "typical" value of the series. Stationarity may be appropriate for the series in Figure 14.2 as each oscillates with a fairly regular pattern. It is difficult, however, to know whether or not a given time series is stationary simply by examining a time series plot.
A straightforward but essential relationship is that an i.i.d. process is strictly stationary.
Theorem 14.1 If $y_t$ is i.i.d. then it is strictly stationary.
Here are some examples of strictly stationary scalar processes. In each, is i.i.d. and .
Example 14.1 .
Example 14.2 for some random variable .
Example 14.3 for a random variable which is symmetrically distributed about 0 .
Here are some examples of processes which are not stationary.
Example 14.4 .
Example 14.5 .
Example 14.6 .
Example 14.7 .
Example 14.8 .
Example 14.9 with .
From the examples we can see that stationarity means that the distribution is constant over time. It does not mean, however, that the process has some sort of limited dependence, nor that there is an absence of periodic patterns. These restrictions are associated with the concepts of ergodicity and mixing which we shall introduce in subsequent sections.
More generally, the two forms of stationarity are non-nested classes, since a strictly stationary process with infinite variance is not covariance stationary, while a covariance stationary process need not be strictly stationary.
Convergent Series
A transformation which includes the full past history is an infinite-order moving average. For scalar $y_t$ and coefficient vectors $a_j$ define the vector process
$$x_t = \sum_{j=0}^{\infty} a_j y_{t-j}. \tag{14.3}$$
Many time-series models involve representations and transformations of the form (14.3).
The infinite series (14.3) exists if it is convergent, meaning that the partial sums $\sum_{j=0}^{N} a_j y_{t-j}$ have a finite limit as $N \to \infty$. Since the inputs $y_{t-j}$ are random we define this as a probability limit.
Definition 14.3 The infinite series (14.3) converges almost surely if $\sum_{j=0}^{N} a_j y_{t-j}$ has a finite limit as $N \to \infty$ with probability one. In this case we describe $x_t$ as convergent.
Theorem 14.3 If $y_t$ is strictly stationary, $E\left|y_t\right| < \infty$, and $\sum_{j=0}^{\infty}\left\|a_j\right\| < \infty$, then (14.3) converges almost surely. Furthermore, $x_t$ is strictly stationary.
The proof of Theorem 14.3 is provided in Section 14.47.
Ergodicity
Stationarity alone is not sufficient for the weak law of large numbers as there are strictly stationary processes with no time series variation. As we described earlier, an example of a stationary process is $y_t = Z$ for some random variable $Z$. This is random but constant over all time. An implication is that the sample mean of $y_t$ will be inconsistent for the population expectation.
What is a minimal assumption beyond stationarity so that the law of large numbers applies? This topic is called ergodicity. It is sufficiently important that it is treated as a separate area of study. We mention only a few highlights here. For a rigorous treatment see a standard textbook such as Walters (1982).
A time series is ergodic if all invariant events are trivial, meaning that any event which is unaffected by time-shifts has probability either zero or one. This definition is rather abstract and difficult to grasp but fortunately it is not needed by most economists.
A useful intuition is that if is ergodic then its sample paths will pass through all parts of the sample space never getting “stuck” in a subregion.
We will first describe the properties of ergodic series which are relevant for our needs and follow with the more rigorous technical definitions. For proofs of the results see Section 14.47.
First, many standard time series processes can be shown to be ergodic. A useful starting point is the observation that an i.i.d. sequence is ergodic.
Theorem 14.4 If $y_t$ is i.i.d. then it is strictly stationary and ergodic.
Second, ergodicity, like stationarity, is preserved by transformation.
Theorem 14.5 If $y_t$ is strictly stationary and ergodic and $x_t = \phi\left(y_t, y_{t-1}, y_{t-2}, \ldots\right)$ is a random vector, then $x_t$ is strictly stationary and ergodic.
As an example, the infinite-order moving average transformation (14.3) is ergodic if the input is ergodic and the coefficients are absolutely convergent.
Theorem 14.6 If $y_t$ is strictly stationary, ergodic, $E\left|y_t\right| < \infty$, and $\sum_{j=0}^{\infty}\left\|a_j\right\| < \infty$, then $x_t = \sum_{j=0}^{\infty} a_j y_{t-j}$ is strictly stationary and ergodic.
We now present a useful property. It is that the Cesàro sum of the autocovariances tends to zero.
Theorem 14.7 If $y_t$ is strictly stationary, ergodic, and $E\left[y_t^2\right] < \infty$, then
$$\frac{1}{n}\sum_{k=1}^{n} \gamma(k) \longrightarrow 0 \quad \text{as } n \to \infty. \tag{14.4}$$
The result (14.4) can be interpreted as saying that the autocovariances "on average" tend to zero. Some authors have mis-stated ergodicity as implying that the autocovariances tend to zero, but this is not correct, as (14.4) allows, for example, a non-convergent oscillating sequence such as $\gamma(k) \propto (-1)^k$. The reason why (14.4) is particularly useful is because it is sufficient for the WLLN, as we discover later in Theorem 14.9.
We now give the formal definition of ergodicity for interested readers. As the concepts will not be used again most readers can safely skip this discussion.
As we stated above, by definition the series is ergodic if all invariant events are trivial. To understand this we introduce some technical definitions. First, we can write an event as where is an infinite history and . Second, the time-shift of is defined as . Thus replaces each observation in by its shifted value . A time-shift of the event is . Third, an event is called invariant if it is unaffected by a time-shift, so that . Thus replacing any history with its shifted history doesn’t change the event. Invariant events are rather special. An example of an invariant event is . Fourth, an event is called trivial if either or . You can think of trivial events as essentially non-random. Recall, by definition is ergodic if all invariant events are trivial. This means that any event which is unaffected by a time shift is trivial-is essentially non-random. For example, again consider the invariant event . If for all then . Since this does not equal 0 or 1 then is not ergodic. However, if is i.i.d. then . This is a trivial event. For to be ergodic (it is in this case) all such invariant events must be trivial.
An important technical result is that ergodicity is equivalent to the following property.
Theorem 14.8 A strictly stationary series $y_t$ is ergodic if and only if for all events $A$ and $B$
$$\frac{1}{n} \sum_{m=1}^{n} P\left(A_m \cap B\right) \longrightarrow P(A) P(B), \tag{14.5}$$
where $A_m$ denotes the event $A$ shifted by $m$ time periods.
This result is rather deep so we do not prove it here. See Walters (1982), Corollary 1.14.2, or Davidson (1994), Theorem 14.7. The limit in (14.5) is the Cesàro sum of $P\left(A_m \cap B\right)$. The Theorem of Cesàro Means (Theorem A.4 of Probability and Statistics for Economists) shows that a sufficient condition for (14.5) is that $P\left(A_m \cap B\right) \to P(A) P(B)$ as $m \to \infty$, which is known as mixing. Thus mixing implies ergodicity. Mixing, roughly, means that separated events are asymptotically independent. Ergodicity is weaker, only requiring that the events are asymptotically independent "on average". We discuss mixing in Section 14.12.
Ergodic Theorem
The ergodic theorem is one of the most famous results in time series theory. There are actually several forms of the theorem, most of which concern almost sure convergence. For simplicity we state the theorem in terms of convergence in probability.
Theorem 14.9 (Ergodic Theorem) If $y_t$ is strictly stationary, ergodic, and $E\left\|y_t\right\| < \infty$, then as $n \to \infty$,
$$\bar{y} = \frac{1}{n}\sum_{t=1}^{n} y_t \underset{p}{\longrightarrow} \mu$$
and
$$E\left\|\bar{y} - \mu\right\| \longrightarrow 0,$$
where $\mu = E\left[y_t\right]$.
The ergodic theorem shows that ergodicity is sufficient for consistent estimation. The moment condition is the same as in the WLLN for i.i.d. observations.
We now provide a proof of the ergodic theorem for the scalar case under the additional assumption that $E\left[y_t^2\right] < \infty$. A proof which relaxes this assumption is provided in Section 14.47.
By direct calculation
$$\mathrm{var}\left[\bar{y}\right] = \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n} \mathrm{cov}\left(y_t, y_s\right) = \frac{1}{n^2}\sum_{t=1}^{n}\sum_{s=1}^{n} \gamma(t - s)$$
where $\gamma(k) = \mathrm{cov}\left(y_t, y_{t-k}\right)$. The double sum is over all elements of an $n \times n$ matrix whose $(t, s)$-th element is $\gamma(t - s)$. The diagonal elements are $\gamma(0)$, the first off-diagonal elements are $\gamma(1)$, the second off-diagonal elements are $\gamma(2)$, and so on. This means that there are precisely $n$ diagonal elements equalling $\gamma(0)$, $2(n-1)$ elements equalling $\gamma(1)$, $2(n-2)$ equalling $\gamma(2)$, etc. Thus the above equals
$$\mathrm{var}\left[\bar{y}\right] = \frac{\gamma(0)}{n} + \frac{2}{n}\sum_{k=1}^{n-1}\left(1 - \frac{k}{n}\right)\gamma(k). \tag{14.8}$$
This is a rather intriguing expression. It shows that the variance of the sample mean precisely equals $\gamma(0)/n$ (which is the variance of the sample mean under i.i.d. sampling) plus a weighted Cesàro mean of the autocovariances. The latter is zero under i.i.d. sampling but is non-zero otherwise. Theorem 14.7 shows that the Cesàro mean of the autocovariances converges to zero. Let $a_{nk} = 1 - k/n$, which satisfy the conditions of the Toeplitz Lemma (Theorem A.5 of Probability and Statistics for Economists). Then
$$\frac{2}{n}\sum_{k=1}^{n-1}\left(1 - \frac{k}{n}\right)\gamma(k) \longrightarrow 0.$$
Together, we have shown that (14.8) is $o(1)$ under ergodicity. Hence $\mathrm{var}\left[\bar{y}\right] \to 0$. Markov's inequality establishes that $\bar{y} \underset{p}{\longrightarrow} \mu$.
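To illustrate the role of ergodicity, the following simulation sketch (entirely illustrative; the autoregressive recursion used for the ergodic case anticipates the AR(1) model introduced later in the chapter) contrasts the sample mean of an ergodic stationary series with that of the non-ergodic process $y_t = Z$ discussed above. The dispersion of the sample mean shrinks with $n$ in the first case but not in the second.

```python
import numpy as np

rng = np.random.default_rng(0)

def sd_of_sample_means(y_fn, n_values, reps=100):
    """Standard deviation of the sample mean across replications, for each sample size."""
    return [float(np.std([y_fn(n).mean() for _ in range(reps)])) for n in n_values]

def ar1(n, alpha=0.5):
    """An ergodic stationary series: y_t = alpha*y_{t-1} + e_t (with a burn-in)."""
    e = rng.standard_normal(n + 100)
    y = np.zeros(n + 100)
    for t in range(1, n + 100):
        y[t] = alpha * y[t - 1] + e[t]
    return y[100:]

def constant_draw(n):
    """A non-ergodic stationary series: y_t = Z for all t, with Z a single draw."""
    return np.full(n, rng.standard_normal())

print(sd_of_sample_means(ar1, [100, 1000, 10000]))            # shrinks toward zero
print(sd_of_sample_means(constant_draw, [100, 1000, 10000]))  # stays near 1, the sd of Z
```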
Conditioning on Information Sets
In the past few sections we have introduced the concept of the infinite histories. We now consider conditional expectations given infinite histories.
First, some basics. Recall from probability theory that an outcome is an element of a sample space. An event is a set of outcomes. A probability law is a rule which assigns non-negative real numbers to events. When outcomes are infinite histories then events are collections of such histories and a probability law is a rule which assigns numbers to collections of infinite histories.
Now we wish to define a conditional expectation given an infinite past history. Specifically, we wish to define
$$E\left[y_t \mid y_{t-1}, y_{t-2}, \ldots\right], \tag{14.10}$$
the expected value of $y_t$ given the history up to time $t-1$. Intuitively, (14.10) is the mean of the conditional distribution, the latter reflecting the information in the history. Mathematically this cannot be defined using (2.6) as the latter requires a joint density for the infinite history, which does not make much sense. Instead, we can appeal to a deeper existence theorem which states that the conditional expectation (14.10) exists if $E\left|y_t\right| < \infty$ and the probabilities of events involving the infinite history are defined. The latter events are discussed in the previous paragraph. Thus the conditional expectation is well defined.
In this textbook we have avoided measure-theoretic terminology to keep the presentation accessible, and because it is my belief that measure theory is more distracting than helpful. However, it is standard in the time series literature to follow the measure-theoretic convention of writing (14.10) as the conditional expectation given a $\sigma$-field. So at the risk of being overly technical we will follow this convention and write the expectation (14.10) as $E\left[y_t \mid \mathscr{F}_{t-1}\right]$ where $\mathscr{F}_{t-1} = \sigma\left(y_{t-1}, y_{t-2}, \ldots\right)$ is the $\sigma$-field generated by the history $\left(y_{t-1}, y_{t-2}, \ldots\right)$. A $\sigma$-field (also known as a $\sigma$-algebra) is a collection of sets satisfying certain regularity conditions: it contains the universal set, is closed under complementation, and is closed under countable unions. See Probability and Statistics for Economists, Section 1.14. The $\sigma$-field generated by a random variable is the collection of measurable events involving that variable. Similarly, the $\sigma$-field generated by an infinite history is the collection of measurable events involving this history. Intuitively, $\mathscr{F}_{t-1}$ contains all the information available in the history $\left(y_{t-1}, y_{t-2}, \ldots\right)$. Consequently, economists typically call $\mathscr{F}_{t-1}$ an information set rather than a $\sigma$-field. As I said, in this textbook we endeavor to avoid measure-theoretic complications so will follow the economists' label rather than the probabilists', but use the latter's notation as is conventional. To summarize, we will write $\mathscr{F}_{t-1} = \sigma\left(y_{t-1}, y_{t-2}, \ldots\right)$ to indicate the information set generated by the infinite history $\left(y_{t-1}, y_{t-2}, \ldots\right)$, and will write the conditional expectation (14.10) as $E\left[y_t \mid \mathscr{F}_{t-1}\right]$.
We now describe some properties of the information sets $\mathscr{F}_t$.
First, they are nested: $\mathscr{F}_{t-1} \subset \mathscr{F}_t$. This means that information accumulates over time. Information is not lost.
Second, it is important to be precise about which variables are contained in the information set. Some economists are sloppy and refer to “the information set at time ” without specifying which variables are in the information set. It is better to be specific. For example, the information sets and are distinct even though they are both dated at time .
Third, the conditional expectations (14.10) follow the law of iterated expectations and the conditioning theorem, thus
$$E\big[E\left[y_t \mid \mathscr{F}_{t-1}\right] \mid \mathscr{F}_{t-2}\big] = E\left[y_t \mid \mathscr{F}_{t-2}\right]$$
and
$$E\left[y_{t-1} y_t \mid \mathscr{F}_{t-1}\right] = y_{t-1} E\left[y_t \mid \mathscr{F}_{t-1}\right].$$
Martingale Difference Sequences
An important concept in economics is unforecastability, meaning that the conditional expectation is the unconditional expectation. This is similar to the properties of a regression error. An unforecastable process is called a martingale difference sequence (MDS).
A MDS is defined with respect to a specific sequence of information sets $\mathscr{F}_t$. Most commonly the latter is the natural filtration $\mathscr{F}_t = \sigma\left(e_t, e_{t-1}, \ldots\right)$ (the past history of $e_t$) but it could be a larger information set. The only requirement is that $e_t$ is adapted to $\mathscr{F}_t$, meaning that $e_t$ is measurable with respect to $\mathscr{F}_t$.
Definition 14.4 The process $\left(e_t, \mathscr{F}_t\right)$ is a Martingale Difference Sequence (MDS) if $e_t$ is adapted to $\mathscr{F}_t$, $E\left|e_t\right| < \infty$, and $E\left[e_t \mid \mathscr{F}_{t-1}\right] = 0$.
In words, a MDS is unforecastable in the mean. It is useful to notice that if we apply iterated expectations then $E\left[e_t\right] = E\big[E\left[e_t \mid \mathscr{F}_{t-1}\right]\big] = 0$. Thus a MDS is mean zero.
The definition of a MDS requires the information sets to contain the information in , but is broader in the sense that it can contain more information. When no explicit definition is given it is standard to assume that is the natural filtration. However, it is best to explicitly specify the information sets so there is no confusion.
The term "martingale difference sequence" refers to the fact that the summed process $S_t = \sum_{j \leq t} e_j$ is a martingale and $e_t = S_t - S_{t-1}$ is its first difference. A martingale $S_t$ is a process which has a finite mean and satisfies
$$E\left[S_t \mid \mathscr{F}_{t-1}\right] = S_{t-1}.$$
If $e_t$ is i.i.d. and mean zero then it is a MDS, but the reverse is not the case. To see this, first suppose that $e_t$ is i.i.d. and mean zero. It is then independent of the past history $\mathscr{F}_{t-1} = \sigma\left(e_{t-1}, e_{t-2}, \ldots\right)$ so $E\left[e_t \mid \mathscr{F}_{t-1}\right] = E\left[e_t\right] = 0$. Thus an i.i.d. mean-zero shock is a MDS as claimed.
To show that the reverse is not true let $u_t$ be i.i.d. $\mathrm{N}(0, 1)$ and set
$$e_t = u_t u_{t-1}. \tag{14.11}$$
By the conditioning theorem,
$$E\left[e_t \mid \mathscr{F}_{t-1}\right] = E\left[u_t u_{t-1} \mid \mathscr{F}_{t-1}\right] = u_{t-1} E\left[u_t \mid \mathscr{F}_{t-1}\right] = u_{t-1} E\left[u_t\right] = 0,$$
so $e_t$ is a MDS (with respect to $\mathscr{F}_{t-1} = \sigma\left(u_{t-1}, u_{t-2}, \ldots\right)$). The process (14.11) is not, however, i.i.d. One way to see this is to calculate the first autocovariance of $e_t^2$, which is
$$\mathrm{cov}\left(e_t^2, e_{t-1}^2\right) = E\left[u_t^2 u_{t-1}^4 u_{t-2}^2\right] - \left(E\left[u_t^2 u_{t-1}^2\right]\right)^2 = E\left[u_t^4\right] - 1 = 2.$$
Since this covariance is non-zero, $e_t$ is not an independent sequence. Thus $e_t$ is a MDS but not i.i.d.
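A quick simulation sketch (illustrative only, with Gaussian $u_t$ and an arbitrary seed) confirms this behavior: the series $e_t = u_t u_{t-1}$ is serially uncorrelated, while its square is visibly autocorrelated, revealing the dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(100_000)
e = u[1:] * u[:-1]                 # the MDS example e_t = u_t * u_{t-1}

def acf1(x):
    """First-order sample autocorrelation."""
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

print(acf1(e))        # near 0: e_t is serially uncorrelated (it is a MDS)
print(acf1(e ** 2))   # clearly positive: e_t is not independent over time
```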
An important property of a square integrable MDS is that it is serially uncorrelated. To see this, observe that by iterated expectations, the conditioning theorem, and the definition of a MDS, for $k \geq 1$,
$$E\left[e_t e_{t-k}\right] = E\big[E\left[e_t e_{t-k} \mid \mathscr{F}_{t-1}\right]\big] = E\big[e_{t-k} E\left[e_t \mid \mathscr{F}_{t-1}\right]\big] = 0.$$
Thus the autocovariances and autocorrelations are zero. A process that is serially uncorrelated, however, is not necessarily a MDS. Take the process $e_t = u_t + u_{t-1} u_{t-2}$ with $u_t$ i.i.d. $\mathrm{N}(0,1)$. The process is not a MDS because $E\left[e_t \mid \mathscr{F}_{t-1}\right] = u_{t-1} u_{t-2} \neq 0$, where $\mathscr{F}_{t-1} = \sigma\left(u_{t-1}, u_{t-2}, \ldots\right)$. However,
$$E\left[e_t e_{t-1}\right] = E\left[\left(u_t + u_{t-1} u_{t-2}\right)\left(u_{t-1} + u_{t-2} u_{t-3}\right)\right] = 0.$$
Similarly, $E\left[e_t e_{t-k}\right] = 0$ for all $k \geq 1$. Thus $e_t$ is serially uncorrelated. We have proved the following.
Theorem 14.10 If $e_t$ is a MDS and $E\left[e_t^2\right] < \infty$ then $e_t$ is serially uncorrelated.
Another important special case is a homoskedastic martingale difference sequence.
Definition 14.5 The MDS $\left(e_t, \mathscr{F}_t\right)$ is a Homoskedastic Martingale Difference Sequence if $E\left[e_t^2 \mid \mathscr{F}_{t-1}\right] = \sigma^2 < \infty$.
A homoskedastic MDS should more properly be called a conditionally homoskedastic MDS because the property concerns the conditional distribution rather than the unconditional. That is, any strictly stationary MDS satisfies a constant unconditional variance $E\left[e_t^2\right] = \sigma^2$, but only a homoskedastic MDS has a constant conditional variance $E\left[e_t^2 \mid \mathscr{F}_{t-1}\right] = \sigma^2$.
A homoskedastic MDS is analogous to a conditionally homoskedastic regression error. It is intermediate between a MDS and an i.i.d. sequence. Specifically, a square integrable and mean zero i.i.d. sequence is a homoskedastic MDS and the latter is a MDS.
The reverse is not the case. First, a MDS is not necessarily conditionally homoskedastic. Consider the example $e_t = u_t u_{t-1}$ given previously, which we showed is a MDS. It is not conditionally homoskedastic, however, because
$$E\left[e_t^2 \mid \mathscr{F}_{t-1}\right] = E\left[u_t^2 u_{t-1}^2 \mid \mathscr{F}_{t-1}\right] = u_{t-1}^2 E\left[u_t^2\right] = u_{t-1}^2,$$
which is time-varying. Thus this MDS is conditionally heteroskedastic. Second, a homoskedastic MDS is not necessarily i.i.d. Consider the following example. Set , where is distributed as student with degree of freedom parameter . This is scaled so that and , and is thus a homoskedastic MDS. The conditional distribution of depends on through the degree of freedom parameter. Hence is not an independent sequence.
One way to think about the difference between MDS and i.i.d. shocks is in terms of forecastability. An i.i.d. process is fully unforecastable in that no function of an i.i.d. process is forecastable. A MDS is unforecastable in the mean but other moments may be forecastable.
As we mentioned above, the definition of a MDS allows for conditional heteroskedasticity, meaning that the conditional variance may be time-varying. In financial econometrics there are many models for conditional heteroskedasticity, including autoregressive conditional heteroskedasticity (ARCH), generalized ARCH (GARCH), and stochastic volatility. A good reference for this class of models is Campbell, Lo, and MacKinlay (1997).
CLT for Martingale Differences
We are interested in an asymptotic approximation for the distribution of the normalized sample mean
$$S_n = \frac{1}{\sqrt{n}}\sum_{t=1}^{n} u_t$$
where $u_t$ is mean zero with variance $\Sigma = E\left[u_t u_t'\right]$. In this section we present a CLT for the case where $u_t$ is a martingale difference sequence.
Theorem 14.11 MDS CLT If $u_t$ is a strictly stationary and ergodic martingale difference sequence and $E\left[u_t u_t'\right] = \Sigma < \infty$, then as $n \to \infty$,
$$\frac{1}{\sqrt{n}}\sum_{t=1}^{n} u_t \underset{d}{\longrightarrow} \mathrm{N}\left(0, \Sigma\right).$$
The conditions for Theorem 14.11 are similar to those of the Lindeberg-Lévy CLT. The only difference is that the i.i.d. assumption has been replaced by the assumption of a strictly stationary and ergodic MDS.
The proof of Theorem 14.11 is technically advanced so we do not present the full details, but instead refer readers to Hall and Heyde (1980) or Davidson (1994) (whose results are more general than Theorem 14.11, not requiring strict stationarity). To illustrate the role of the MDS assumption we give a sketch of the proof in Section 14.47.
Mixing
For many results, including a CLT for correlated (non-MDS) series, we need a stronger restriction on the dependence between observations than ergodicity.
Recalling the property (14.5) of ergodic sequences, we can measure the dependence between two events $A$ and $B$ by the discrepancy
$$\alpha\left(A, B\right) = \left| P\left(A \cap B\right) - P(A) P(B) \right|. \tag{14.13}$$
This equals 0 when $A$ and $B$ are independent and is positive otherwise. In general, $\alpha(A, B)$ can be used to measure the degree of dependence between the events $A$ and $B$.
Now consider the two information sets ($\sigma$-fields)
$$\mathscr{F}_{-\infty}^{t} = \sigma\left(\ldots, y_{t-1}, y_t\right), \qquad \mathscr{F}_{t}^{\infty} = \sigma\left(y_t, y_{t+1}, \ldots\right).$$
The first is the history of the series up until period $t$ and the second is the history of the series starting in period $t$ and going forward. We then separate the information sets by $\ell$ periods, that is, take $\mathscr{F}_{-\infty}^{t}$ and $\mathscr{F}_{t+\ell}^{\infty}$. We can measure the degree of dependence between the information sets by taking all events in each and then taking the largest discrepancy (14.13). This is
$$\alpha(\ell) = \sup_{A \in \mathscr{F}_{-\infty}^{t},\; B \in \mathscr{F}_{t+\ell}^{\infty}} \left| P\left(A \cap B\right) - P(A) P(B) \right|.$$
The constants $\alpha(\ell)$ are known as the strong mixing coefficients. We say that $y_t$ is strong mixing if $\alpha(\ell) \to 0$ as $\ell \to \infty$. This means that as the time separation increases between the information sets, the degree of dependence decreases, eventually reaching independence.
From the Theorem of Cesàro Means (Theorem A.4 of Probability and Statistics for Economists), strong mixing implies (14.5) which is equivalent to ergodicity. Thus a mixing process is ergodic.
An intuition concerning mixing can be colorfully illustrated by the following example due to Halmos (1956). A martini is a drink consisting of a large portion of gin and a small part of vermouth. Suppose that you pour a serving of gin into a martini glass, pour a small amount of vermouth on top, and then stir the drink with a swizzle stick. If your stirring process is mixing, with each turn of the stick the vermouth will become more evenly distributed throughout the gin, and asymptotically (as the number of stirs tends to infinity) the vermouth and gin distributions will become independent. If so, this is a mixing process. (Of course, if you really make an asymptotic number of stirs you will never finish stirring and you won't be able to enjoy the martini. Hence in practice it is advised to stop stirring before the number of stirs reaches infinity.)
For applications, mixing is often useful when we can characterize the rate at which the coefficients decline to zero. There are two types of conditions which are seen in asymptotic theory: rates and summation. Rate conditions take the form or . Summation conditions take the form or .
There are alternative measures of dependence beyond (14.13) and many have been proposed. Strong mixing is one of the weakest (and thus embraces a wide set of time series processes) but is insufficiently strong for some applications. Another popular dependence measure is known as absolute regularity or -mixing. The -mixing coefficients are
Absolute regularity is stronger than strong mixing in the sense that $\beta(\ell) \to 0$ implies $\alpha(\ell) \to 0$, and rate conditions for the $\beta$-mixing coefficients imply the same rates for the strong mixing coefficients.
One reason why mixing is useful for applications is that it is preserved by transformations.
Theorem 14.12 If $y_t$ has mixing coefficients $\alpha_y(\ell)$ and $x_t = \phi\left(y_t, y_{t-1}, \ldots, y_{t-q}\right)$, then $x_t$ has mixing coefficients $\alpha_x(\ell) \leq \alpha_y(\ell - q)$ (for $\ell > q$). The coefficients $\alpha_x(\ell)$ satisfy the same summation and rate conditions as $\alpha_y(\ell)$.
A limitation of the above result is that it is confined to a finite number of lags unlike the transformation results for stationarity and ergodicity.
Mixing can be a useful tool because of the following inequalities.
Theorem 14.13 Let $X$ and $Z$ be random variables constructed from (measurable with respect to) the separated information sets $\mathscr{F}_{-\infty}^{t}$ and $\mathscr{F}_{t+\ell}^{\infty}$, respectively.
1. If and then
2. If and for then
3. If and for then
The proof is given in Section 14.47. Our next result follows fairly directly from the definition of mixing.
Theorem 14.14 If $y_t$ is i.i.d. then it is strong mixing and ergodic.
Linear Projection
In Chapter 2 we extensively studied the properties of linear projection models. In the context of stationary time series we can use similar tools. An important extension is to allow for projections onto infinite-dimensional random vectors. For this analysis we assume that $y_t$ is covariance stationary.
Recall that when $(y, x)$ have a joint distribution with bounded variances, the linear projection of $y$ onto $x$ (the best linear predictor) is the minimizer of $E\left[\left(y - x'\beta\right)^2\right]$ and has the solution
$$\mathscr{P}\left[y \mid x\right] = x'\beta, \qquad \beta = \left(E\left[x x'\right]\right)^{-1} E\left[x y\right].$$
This projection is unique and has a unique projection error $e = y - x'\beta$.
This idea extends to any Hilbert space, including projections onto the infinite past history $\left(y_{t-1}, y_{t-2}, \ldots\right)$. From the projection theorem for Hilbert spaces (see Theorem 2.3.1 of Brockwell and Davis (1991)) the projection of $y_t$ onto this history is unique and has a unique projection error
$$e_t = y_t - \mathscr{P}\left[y_t \mid y_{t-1}, y_{t-2}, \ldots\right]. \tag{14.18}$$
The projection error $e_t$ is mean zero, has finite variance $\sigma^2 = E\left[e_t^2\right]$, and is serially uncorrelated. By Theorem 14.2, if $y_t$ is strictly stationary then $\mathscr{P}\left[y_t \mid y_{t-1}, y_{t-2}, \ldots\right]$ and $e_t$ are strictly stationary.
The property (14.18) implies that the projection errors are serially uncorrelated. We state these results formally.
Theorem 14.16 If $y_t$ is covariance stationary it has the projection equation
$$y_t = \mathscr{P}\left[y_t \mid y_{t-1}, y_{t-2}, \ldots\right] + e_t.$$
The projection error $e_t$ satisfies
$$E\left[e_t y_{t-j}\right] = 0 \quad \text{for all } j \geq 1$$
and
$$E\left[e_t^2\right] = \sigma^2 < \infty. \tag{14.19}$$
If $y_t$ is strictly stationary then $e_t$ is strictly stationary.
White Noise
The projection error is mean zero, has a finite variance, and is serially uncorrelated. This describes what is known as a white noise process.
Definition 14.6 The process $e_t$ is white noise if $E\left[e_t\right] = 0$, $E\left[e_t^2\right] = \sigma^2 < \infty$, and $\mathrm{cov}\left(e_t, e_{t-k}\right) = 0$ for $k \neq 0$.
A MDS is white noise (Theorem 14.10) but the reverse is not true as shown by the example given in Section 14.10, which is white noise but not a MDS. Therefore, the following types of shocks are nested: i.i.d., MDS, and white noise, with i.i.d. being the most narrow class and white noise the broadest. It is helpful to observe that a white noise process can be conditionally heteroskedastic as the conditional variance is unrestricted.
The Wold Decomposition
In the previous section we showed that a covariance stationary process has a white noise projection error. This result can be used to express the series as an infinite linear function of the projection errors. This is a famous result known as the Wold decomposition.
Theorem 14.17 The Wold Decomposition If $y_t$ is covariance stationary and $\sigma^2 > 0$, where $\sigma^2$ is the projection error variance (14.19), then $y_t$ has the linear representation
$$y_t = \mu_t + \sum_{j=0}^{\infty} b_j e_{t-j} \tag{14.20}$$
where $e_t$ are the white noise projection errors (14.18), $b_0 = 1$,
$$\sum_{j=0}^{\infty} b_j^2 < \infty, \tag{14.21}$$
and
$$\mu_t = \lim_{m \to \infty} \mathscr{P}\left[y_t \mid y_{t-m}, y_{t-m-1}, \ldots\right].$$
The Wold decomposition shows that $y_t$ can be written as a linear function of the white noise projection errors plus the term $\mu_t$. The infinite sum in (14.20) is also known as a linear process. The Wold decomposition is a foundational result for linear time series analysis. Since any covariance stationary process can be written in this format this justifies linear models as approximations.
The series $\mu_t$ is the projection of $y_t$ on the history from the infinite past. It is the part of $y_t$ which is perfectly predictable from its past values and is called the deterministic component. In most cases $\mu_t = \mu$, the unconditional mean of $y_t$. However, it is possible for stationary processes to have more substantive deterministic components. An example is
$$y_t = (-1)^t u, \qquad P\left(u = 1\right) = P\left(u = -1\right) = \frac{1}{2}.$$
This series is strictly stationary, mean zero, and has variance one. However, it is perfectly predictable given the previous history as it simply oscillates between $-1$ and $1$.
In practical applied time series analysis, deterministic components are typically excluded by assumption. We call a stationary time series non-deterministic if $\mu_t = \mu$, a constant. In this case the Wold decomposition has a simpler form.
Theorem 14.18 If $y_t$ is covariance stationary and non-deterministic then it has the linear representation
$$y_t = \mu + \sum_{j=0}^{\infty} b_j e_{t-j} \tag{14.22}$$
where the coefficients $b_j$ satisfy (14.21) and $e_t$ are the white noise projection errors (14.18).
A limitation of the Wold decomposition is the restriction to linearity. While it shows that there is a valid linear approximation, it may be that a nonlinear model provides a better approximation.
For a proof of the Wold decomposition see Section 14.47.
Most authors define purely non-deterministic as the case $\mu_t = 0$. We allow for a non-zero mean so as to accommodate practical time series applications.
Lag Operator
An algebraic construct which is useful for the analysis of time series models is the lag operator.
Definition 14.7 The lag operator $L$ satisfies $L y_t = y_{t-1}$.
Defining $L^2 = L L$, we see that $L^2 y_t = L y_{t-1} = y_{t-2}$. In general, $L^k y_t = y_{t-k}$.
Using the lag operator, the Wold decomposition can be written in the format
$$y_t = \mu + b(L) e_t$$
where $b(L) = 1 + b_1 L + b_2 L^2 + \cdots$ is an infinite-order polynomial in the lag operator. The expression $y_t = \mu + b(L) e_t$ is a compact way to write the Wold representation.
Autoregressive Wold Representation
From Theorem 14.16, $y_t$ satisfies a projection onto its infinite past. The Wold decomposition shows that this projection equals a linear function of the lagged projection errors. An alternative is to write the projection as a linear function of the lagged $y_t$. It turns out that to obtain a unique and convergent representation we need a strengthening of the conditions.
Theorem 14.19 If $y_t$ is covariance stationary, non-deterministic, with Wold representation $y_t = \mu + b(L) e_t$ such that $\left|b(z)\right| \geq \delta > 0$ for all complex $\left|z\right| \leq 1$, and for some integer $s \geq 1$ the Wold coefficients satisfy $\sum_{j=1}^{\infty} j^{s}\left|b_j\right| < \infty$, then $y_t$ has the representation
$$y_t = \alpha_0 + \sum_{k=1}^{\infty} a_k y_{t-k} + e_t \tag{14.23}$$
for some coefficients $\alpha_0$ and $a_k$. The coefficients satisfy $\sum_{k=1}^{\infty}\left|a_k\right| < \infty$ so (14.23) is convergent.
Equation (14.23) is known as an infinite-order autoregressive representation with autoregressive coefficients $a_k$.
A solution $z$ to the equation $b(z) = 0$ is a root of the polynomial $b(z)$. The assumption that $\left|b(z)\right| > 0$ for $\left|z\right| \leq 1$ means that the roots of $b(z)$ lie outside the unit circle (the circle in the complex plane with radius one). Theorem 14.19 makes the stronger restriction that $b(z)$ is bounded away from 0 for $z$ on or within the unit circle. The need for this strengthening is less intuitive but essentially excludes the possibility of an infinite number of roots outside but arbitrarily close to the unit circle. The summability assumption on the Wold coefficients ensures convergence of the autoregressive coefficients $a_k$. To understand the restriction on the roots of $b(z)$ consider the simple case $b(z) = 1 + \theta z$. (Below we call this a MA(1) model.) The requirement $\left|b(z)\right| \geq \delta$ for $\left|z\right| \leq 1$ means $\left|\theta\right| \leq 1 - \delta$, or $\left|\theta\right| < 1$. Thus the assumption in Theorem 14.19 bounds the coefficient $\theta$ strictly below 1 in absolute value. Now consider an infinite polynomial case . The assumption in Theorem 14.19 requires .
Theorem is attributed to Wiener and Masani (1958). For a recent treatment and proof see Corollary 6.1.17 of Politis and McElroy (2020). These authors (as is common in the literature) state their assumptions differently than we do in Theorem 14.19. First, instead of the condition on they bound from below the spectral density function of . We do not define the spectral density in this text so we restate their condition in terms of the linear process polynomial . Second, instead of the condition on the Wold coefficients they require that the autocovariances satisfy . This is implied by our stated summability condition on the (using the expression for in Section below and simplifying).
Linear Models
In the previous two sections we showed that any non-deterministic covariance stationary time series has the projection representation
$$y_t = \mu + \sum_{j=0}^{\infty} b_j e_{t-j}$$
and, under a restriction on the projection coefficients, satisfies the autoregressive representation
$$y_t = \alpha_0 + \sum_{k=1}^{\infty} a_k y_{t-k} + e_t.$$
In both equations the errors are white noise projection errors. These representations help us understand that linear models can be used as approximations for stationary time series.
For the next several sections we reverse the analysis. We will assume a specific linear model and then study the properties of the resulting time series. In particular we will be seeking conditions under which the stated process is stationary. This helps us understand the properties of linear models. Throughout, we assume that the error $e_t$ is a strictly stationary and ergodic white noise process. This allows as a special case the stronger assumption that $e_t$ is i.i.d. but is less restrictive. In particular, it allows for conditional heteroskedasticity.
Moving Average Processes
The first-order moving average process, denoted MA(1), is
$$y_t = \mu + e_t + \theta e_{t-1}$$
where $e_t$ is a strictly stationary and ergodic white noise process with $\mathrm{var}\left[e_t\right] = \sigma^2$. The model is called a "moving average" because $y_t$ is a weighted average of the shocks $e_t$ and $e_{t-1}$.
It is straightforward to calculate that a MA(1) process has the following moments:
$$E\left[y_t\right] = \mu, \qquad \mathrm{var}\left[y_t\right] = \sigma^2\left(1 + \theta^2\right), \qquad \rho(1) = \frac{\theta}{1 + \theta^2}, \qquad \rho(k) = 0 \text{ for } k \geq 2.$$
Thus the MA(1) process has a non-zero first autocorrelation with the remaining autocorrelations zero.
A MA(1) process with $\theta \neq 0$ is serially correlated, with each pair of adjacent observations correlated. If $\theta > 0$ the pair are positively correlated, while if $\theta < 0$ they are negatively correlated. The serial correlation is limited in that observations separated by two or more periods are uncorrelated.
The $q$th-order moving average process, denoted MA(q), is
$$y_t = \mu + e_t + \theta_1 e_{t-1} + \cdots + \theta_q e_{t-q}$$
where $e_t$ is a strictly stationary and ergodic white noise process. It is straightforward to calculate that a MA(q) process has the following moments:
$$E\left[y_t\right] = \mu, \qquad \gamma(k) = \sigma^2 \sum_{j=0}^{q-k} \theta_j \theta_{j+k} \text{ for } k \leq q \text{ (with } \theta_0 = 1\text{)}, \qquad \gamma(k) = 0 \text{ for } k > q.$$
In particular, a MA(q) process has non-zero autocorrelations $\rho(1), \ldots, \rho(q)$ with the autocorrelations at longer lags equal to zero.
A MA(q) process is strictly stationary and ergodic.
A MA(q) process with moderately large $q$ can have considerably more complicated dependence relations than a MA(1) process. One specific pattern which can be induced by a MA process is smoothing. Suppose that the coefficients $\theta_1, \ldots, \theta_q$ all equal 1. Then $y_t = \mu + e_t + e_{t-1} + \cdots + e_{t-q}$ is a smoothed version of the shocks $e_t$.
To illustrate, Figure 14.3(a) displays a plot of a simulated white noise (i.i.d. $\mathrm{N}(0,1)$) process. Figure 14.3(b) displays a plot of a MA(8) process constructed with the same innovations, with $\theta_1 = \cdots = \theta_8 = 1$. You can see that the white noise has no predictable behavior while the MA(8) is smooth.
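The smoothing effect is easy to reproduce. Below is a minimal sketch (the sample size and seed are arbitrary choices) which builds an MA(8) with unit coefficients from i.i.d. shocks and compares first-order sample autocorrelations; the theoretical value for the MA(8) is $8/9 \approx 0.89$ versus zero for white noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 120, 8
e = rng.standard_normal(n + q)          # i.i.d. N(0,1) shocks (white noise)

# MA(8) with all coefficients equal to 1: a moving sum of the last 9 shocks
y = np.array([e[t - q : t + 1].sum() for t in range(q, n + q)])

def acf1(x):
    """First-order sample autocorrelation."""
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

print(acf1(e[q:]))   # near 0 for white noise
print(acf1(y))       # near 8/9: neighboring MA(8) values are similar, i.e. smooth
```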
Infinite-Order Moving Average Process
An infinite-order moving average process, denoted MA($\infty$), also known as a linear process, is
$$y_t = \mu + \sum_{j=0}^{\infty} \theta_j e_{t-j}$$

Figure 14.3: White Noise and MA(8)
where $e_t$ is a strictly stationary and ergodic white noise process, $\theta_0 = 1$, and $\sum_{j=0}^{\infty}\left|\theta_j\right| < \infty$. From Theorem 14.6, $y_t$ is strictly stationary and ergodic. A linear process has the following moments:
$$E\left[y_t\right] = \mu, \qquad \mathrm{var}\left[y_t\right] = \sigma^2 \sum_{j=0}^{\infty} \theta_j^2, \qquad \gamma(k) = \sigma^2 \sum_{j=0}^{\infty} \theta_j \theta_{j+k}.$$
First-Order Autoregressive Process
The first-order autoregressive process, denoted AR(1), is
$$y_t = \alpha_0 + \alpha_1 y_{t-1} + e_t \tag{14.25}$$
where $e_t$ is a strictly stationary and ergodic white noise process with $\mathrm{var}\left[e_t\right] = \sigma^2$. The AR(1) model is probably the single most important model in econometric time series analysis.
As a simple motivating example, let $y_t$ be the employment level (number of jobs) in an economy. Suppose that a fixed fraction $\delta$ of employees lose their job each period and a random number $h_t$ of new employees are hired. Setting $\alpha_1 = 1 - \delta$ and $\alpha_0 + e_t = h_t$, this implies the law of motion (14.25).
To illustrate the behavior of the AR(1) process, Figure 14.4 plots two simulated AR(1) processes. Each is generated using the white noise process displayed in Figure 14.3(a). The plot in Figure 14.4(a) sets $\alpha_1$ to a moderate value

Figure 14.4: AR(1) Processes
and the plot in Figure 14.4(b) sets $\alpha_1$ to a value closer to one. You can see how both are smoother than the white noise process and that the smoothing increases with $\alpha_1$.
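A minimal simulation sketch of the AR(1) recursion follows (the coefficient values 0.5 and 0.95 are illustrative choices and not necessarily those used in the figure). Feeding the same shocks through both recursions shows how a larger coefficient produces smoother and more variable paths.

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal(120)

def simulate_ar1(alpha, e):
    """Simulate y_t = alpha * y_{t-1} + e_t starting from y_0 = 0."""
    y = np.zeros(len(e))
    for t in range(1, len(e)):
        y[t] = alpha * y[t - 1] + e[t]
    return y

y_mild = simulate_ar1(0.5, e)     # moderately persistent
y_strong = simulate_ar1(0.95, e)  # highly persistent: smoother, wider swings
print(round(y_mild.std(), 2), round(y_strong.std(), 2))  # the persistent series wanders further
```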
Our first goal is to obtain conditions under which (14.25) is stationary. We can do so by showing that $y_t$ can be written as a convergent linear process and then appealing to Theorem 14.5. To find a linear process representation for $y_t$ we can use backward recursion. Notice that $y_t$ in (14.25) depends on its previous value $y_{t-1}$. If we take (14.25) and lag it one period we find $y_{t-1} = \alpha_0 + \alpha_1 y_{t-2} + e_{t-1}$. Substituting this into (14.25) we find
$$y_t = \alpha_0\left(1 + \alpha_1\right) + \alpha_1^2 y_{t-2} + e_t + \alpha_1 e_{t-1}.$$
Similarly we can lag (14.25) twice to find $y_{t-2} = \alpha_0 + \alpha_1 y_{t-3} + e_{t-2}$, which can be used to substitute out $y_{t-2}$. Continuing this recursion $t$ times, we find
$$y_t = \alpha_0 \sum_{j=0}^{t-1} \alpha_1^j + \alpha_1^t y_0 + \sum_{j=0}^{t-1} \alpha_1^j e_{t-j}. \tag{14.26}$$
Thus $y_t$ equals an intercept, plus the scaled initial condition $\alpha_1^t y_0$, plus the moving average $\sum_{j=0}^{t-1} \alpha_1^j e_{t-j}$.
Now suppose we continue this recursion into the infinite past. By Theorem 14.6 the moving average component converges if $\left|\alpha_1\right| < 1$. The limit of the intercept is provided by the following well-known result.
Theorem 14.20 The series $\sum_{j=0}^{\infty} \alpha^j$ is absolutely convergent if $\left|\alpha\right| < 1$, in which case $\sum_{j=0}^{\infty} \alpha^j = \left(1 - \alpha\right)^{-1}$.
The series converges by the ratio test (see Theorem A.3 of Probability and Statistics for Economists). To find the limit, write
$$S = \sum_{j=0}^{\infty} \alpha^j = 1 + \alpha \sum_{j=0}^{\infty} \alpha^j = 1 + \alpha S.$$
Solving, we find $S = \left(1 - \alpha\right)^{-1}$.
Thus the intercept in (14.26) converges to $\alpha_0 / \left(1 - \alpha_1\right)$. We deduce the following:
Theorem 14.21 If $E\left|e_t\right| < \infty$ and $\left|\alpha_1\right| < 1$ then the AR(1) process (14.25) has the convergent representation
$$y_t = \mu + \sum_{j=0}^{\infty} \alpha_1^j e_{t-j} \tag{14.27}$$
where $\mu = \alpha_0 / \left(1 - \alpha_1\right)$. The AR(1) process $y_t$ is strictly stationary and ergodic.
We can compute the moments of $y_t$ from (14.27):
$$E\left[y_t\right] = \mu, \qquad \mathrm{var}\left[y_t\right] = \sigma^2 \sum_{j=0}^{\infty} \alpha_1^{2j} = \frac{\sigma^2}{1 - \alpha_1^2}.$$
Another way to calculate the moments is as follows. Apply expectations to both sides of (14.25):
$$E\left[y_t\right] = \alpha_0 + \alpha_1 E\left[y_{t-1}\right] + E\left[e_t\right].$$
Stationarity implies $E\left[y_t\right] = E\left[y_{t-1}\right]$. Solving we find $E\left[y_t\right] = \alpha_0 / \left(1 - \alpha_1\right) = \mu$. Similarly,
$$\mathrm{var}\left[y_t\right] = \alpha_1^2 \mathrm{var}\left[y_{t-1}\right] + \sigma^2.$$
Stationarity implies $\mathrm{var}\left[y_t\right] = \mathrm{var}\left[y_{t-1}\right]$. Solving we find $\mathrm{var}\left[y_t\right] = \sigma^2 / \left(1 - \alpha_1^2\right)$. This method is useful for calculation of autocovariances and autocorrelations. For simplicity set $\alpha_0 = 0$ so that $E\left[y_t\right] = 0$ and $\gamma(k) = E\left[y_t y_{t-k}\right]$. We find
$$\gamma(1) = E\left[y_t y_{t-1}\right] = E\left[\left(\alpha_1 y_{t-1} + e_t\right) y_{t-1}\right] = \alpha_1 \mathrm{var}\left[y_{t-1}\right] = \alpha_1 \frac{\sigma^2}{1 - \alpha_1^2}$$
so
$$\rho(1) = \frac{\gamma(1)}{\gamma(0)} = \alpha_1.$$
Furthermore,
$$\gamma(2) = E\left[y_t y_{t-2}\right] = E\left[\left(\alpha_1 y_{t-1} + e_t\right) y_{t-2}\right] = \alpha_1 \gamma(1),$$
so $\rho(2) = \alpha_1^2$. By recursion we obtain
$$\rho(k) = \alpha_1^k.$$
Thus the AR(1) process with $\alpha_1 \neq 0$ has non-zero autocorrelations at all lags, which decay to zero geometrically as $k$ increases. For $\alpha_1 > 0$ the autocorrelations are all positive. For $\alpha_1 < 0$ the autocorrelations alternate in sign.
We can also express the AR(1) process using the lag operator notation:
$$\left(1 - \alpha_1 L\right) y_t = \alpha_0 + e_t. \tag{14.28}$$
We can write this as $\alpha(L) y_t = \alpha_0 + e_t$ where $\alpha(z) = 1 - \alpha_1 z$. We call $\alpha(z)$ the autoregressive polynomial of $y_t$.
This suggests an alternative way of obtaining the representation (14.27). We can invert the operator $\left(1 - \alpha_1 L\right)$ to write $y_t$ as a function of the lagged errors. That is, suppose that the inverse operator $\left(1 - \alpha_1 L\right)^{-1}$ exists. Then we can apply this operator to (14.28) to find
$$y_t = \left(1 - \alpha_1 L\right)^{-1}\left(\alpha_0 + e_t\right). \tag{14.29}$$
What is the operator $\left(1 - \alpha_1 L\right)^{-1}$? Recall from Theorem 14.20 that for $\left|\alpha\right| < 1$,
$$\frac{1}{1 - \alpha} = \sum_{j=0}^{\infty} \alpha^j.$$
Evaluate this expression at $\alpha = \alpha_1 z$. We find
$$\frac{1}{1 - \alpha_1 z} = \sum_{j=0}^{\infty} \alpha_1^j z^j.$$
Setting $z = L$ this is
$$\left(1 - \alpha_1 L\right)^{-1} = \sum_{j=0}^{\infty} \alpha_1^j L^j.$$
Substituted into (14.29) we obtain
$$y_t = \sum_{j=0}^{\infty} \alpha_1^j L^j \left(\alpha_0 + e_t\right) = \frac{\alpha_0}{1 - \alpha_1} + \sum_{j=0}^{\infty} \alpha_1^j e_{t-j},$$
which is (14.27). This is valid for $\left|\alpha_1\right| < 1$.
This illustrates another important concept. We say that a polynomial $\alpha(L)$ is invertible if the inverse operator
$$\alpha(L)^{-1} = \sum_{j=0}^{\infty} c_j L^j$$
is absolutely convergent, meaning $\sum_{j=0}^{\infty}\left|c_j\right| < \infty$. In particular, the AR(1) autoregressive polynomial $\alpha(L) = 1 - \alpha_1 L$ is invertible if $\left|\alpha_1\right| < 1$. This is the same condition as for stationarity of the AR(1) process. Invertibility turns out to be a useful property.
Unit Root and Explosive AR(1) Processes
The AR(1) process (14.25) is stationary if $\left|\alpha_1\right| < 1$. What happens otherwise?
If $\alpha_1 = 1$ and $\alpha_0 = 0$ the model
$$y_t = y_{t-1} + e_t$$
is known as a random walk. This is also called a unit root process, a martingale, or an integrated process. By back-substitution
$$y_t = y_0 + \sum_{j=0}^{t-1} e_{t-j}.$$
Thus the initial condition $y_0$ does not disappear for large $t$. Consequently the series is non-stationary. The autoregressive polynomial $\alpha(L) = 1 - L$ is not invertible, meaning that $y_t$ cannot be written as a convergent function of the infinite past history of $e_t$.
The stochastic behavior of a random walk is noticeably different from a stationary AR(1) process. It wanders up and down with equal likelihood and is not mean-reverting. While it has no tendency to return to its previous values, the wandering nature of a random walk can give the illusion of mean reversion. The difference is that a random walk will take a very large number of time periods to "revert".

Figure 14.5: Random Walk Processes
To illustrate, Figure 14.5 plots two independent random walk processes. The plot in panel (a) uses the innovations from Figure 14.3(a). The plot in panel (b) uses an independent set of i.i.d. errors. You can see that the plot in panel (a) appears similar to the MA(8) and AR(1) plots in the sense that the series is smooth with long swings, but the difference is that the series does not return to a long-term mean. It appears to have drifted down over time. The plot in panel (b) appears to have quite different behavior, falling dramatically over a 5-year period and then appearing to stabilize. These are both common behaviors of random walk processes.
If $\alpha_1 > 1$ the process is explosive. The model (14.25) with $\alpha_1 > 1$ exhibits exponential growth and high sensitivity to initial conditions. Explosive autoregressive processes do not seem to be good descriptions for most economic time series. While aggregate time series such as the GDP process displayed in Figure 14.1(a) exhibit a similar exponential growth pattern, the exponential growth can typically be removed by taking logarithms.
The case $\alpha_1 < -1$ induces explosive oscillating growth and does not appear to be empirically relevant for economic applications.
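The contrast between a random walk and a stationary autoregression is easy to see in simulation. The sketch below (seed and comparison coefficient chosen arbitrarily) accumulates the same shocks two ways: the stationary AR(1) keeps reverting toward zero while the random walk wanders.

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.standard_normal(1000)

random_walk = np.cumsum(e)          # y_t = y_{t-1} + e_t
stationary = np.zeros(1000)
for t in range(1, 1000):            # y_t = 0.9 * y_{t-1} + e_t, for contrast
    stationary[t] = 0.9 * stationary[t - 1] + e[t]

# The random walk typically wanders much further from zero than the stationary series
print(np.abs(stationary).max(), np.abs(random_walk).max())
```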
Second-Order Autoregressive Process
The second-order autoregressive process, denoted AR(2), is
$$y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + e_t \tag{14.31}$$
where $e_t$ is a strictly stationary and ergodic white noise process. The dynamic patterns of an AR(2) process are more complicated than those of an AR(1) process.
As a motivating example consider the multiplier-accelerator model of Samuelson (1939). It might be a bit dated as a model but it is simple, so hopefully makes the point. Aggregate output (in an economy with no trade) is defined as $Y_t = \text{Consumption}_t + \text{Investment}_t + \text{Gov}_t$. Suppose that individuals base their consumption decisions on the previous period's income, $\text{Consumption}_t = b\, Y_{t-1}$, firms base their investment decisions on the change in consumption, $\text{Investment}_t = c\left(\text{Consumption}_t - \text{Consumption}_{t-1}\right)$, and government spending $\text{Gov}_t$ is random. Then aggregate output follows
$$Y_t = b(1 + c) Y_{t-1} - b c\, Y_{t-2} + \text{Gov}_t, \tag{14.32}$$
which is an AR(2) process.
Using the lag operator we can write (14.31) as
$$\left(1 - \alpha_1 L - \alpha_2 L^2\right) y_t = \alpha_0 + e_t$$
or $\alpha(L) y_t = \alpha_0 + e_t$ where $\alpha(z) = 1 - \alpha_1 z - \alpha_2 z^2$. We call $\alpha(z)$ the autoregressive polynomial of $y_t$.
We would like to find the conditions for the stationarity of $y_t$. It turns out that it is convenient to transform the process (14.31) into a VAR(1) process (to be studied in the next chapter). Set $x_t = \left(y_t, y_{t-1}\right)'$, which is stationary if and only if $y_t$ is stationary. Equation (14.31) implies that $x_t$ satisfies
$$\begin{pmatrix} y_t \\ y_{t-1} \end{pmatrix} = \begin{pmatrix} \alpha_0 \\ 0 \end{pmatrix} + \begin{pmatrix} \alpha_1 & \alpha_2 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} y_{t-1} \\ y_{t-2} \end{pmatrix} + \begin{pmatrix} e_t \\ 0 \end{pmatrix}$$
or
$$x_t = a_0 + A x_{t-1} + u_t \tag{14.33}$$
where $a_0 = \left(\alpha_0, 0\right)'$, $A = \begin{pmatrix} \alpha_1 & \alpha_2 \\ 1 & 0 \end{pmatrix}$, and $u_t = \left(e_t, 0\right)'$. Equation (14.33) falls in the class of VAR(1) models studied in Section 15.6. The theory presented there shows that the process is strictly stationary and ergodic if the innovations satisfy $E\left|e_t\right| < \infty$ and all eigenvalues of $A$ are less than one in absolute value. The eigenvalues $\lambda$ satisfy $\det\left(A - \lambda I_2\right) = 0$, where
$$\det\left(A - \lambda I_2\right) = \lambda^2 - \alpha_1 \lambda - \alpha_2 = \lambda^2 \alpha\left(1/\lambda\right)$$
and $\alpha(z)$ is the autoregressive polynomial. Thus the eigenvalues satisfy $\alpha\left(1/\lambda\right) = 0$. Factoring the autoregressive polynomial as $\alpha(z) = \left(1 - \lambda_1 z\right)\left(1 - \lambda_2 z\right)$, the solutions must equal $\lambda_1$ and $\lambda_2$. The quadratic formula shows that these equal
$$\lambda_1, \lambda_2 = \frac{\alpha_1 \pm \sqrt{\alpha_1^2 + 4\alpha_2}}{2}. \tag{14.34}$$
These eigenvalues are real if $\alpha_1^2 + 4\alpha_2 \geq 0$ and are complex conjugates otherwise. The AR(2) process is stationary if the solutions (14.34) satisfy $\left|\lambda_1\right| < 1$ and $\left|\lambda_2\right| < 1$.

Figure 14.6: Stationarity Region for the AR(2)
Using (14.34) to solve for the AR coefficients in terms of the eigenvalues we find $\alpha_1 = \lambda_1 + \lambda_2$ and $\alpha_2 = -\lambda_1 \lambda_2$. With some algebra (the details are deferred to Section 14.47) we can show that $\left|\lambda_1\right| < 1$ and $\left|\lambda_2\right| < 1$ iff the following restrictions hold on the autoregressive coefficients:
$$\alpha_1 + \alpha_2 < 1 \tag{14.35}$$
$$\alpha_2 - \alpha_1 < 1 \tag{14.36}$$
$$\alpha_2 > -1. \tag{14.37}$$
These restrictions describe a triangle in $\left(\alpha_1, \alpha_2\right)$ space which is shown in Figure 14.6. Coefficients within this triangle correspond to a stationary AR(2) process.
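The equivalence between the eigenvalue condition and the triangle restrictions (14.35)-(14.37) can be checked numerically. The sketch below (the function name and the coefficient values are illustrative) computes the eigenvalues of the companion matrix and the triangle inequalities for two parameter pairs.

```python
import numpy as np

def ar2_is_stationary(a1, a2):
    """Check AR(2) stationarity two equivalent ways."""
    # Eigenvalues of the companion matrix [[a1, a2], [1, 0]]
    lam = np.linalg.eigvals(np.array([[a1, a2], [1.0, 0.0]]))
    by_eigenvalues = bool(np.all(np.abs(lam) < 1))
    # Triangle restrictions (14.35)-(14.37) on the coefficients
    by_triangle = (a1 + a2 < 1) and (a2 - a1 < 1) and (a2 > -1)
    return by_eigenvalues, by_triangle

print(ar2_is_stationary(1.3, -0.4))   # (True, True): stationary, real eigenvalues 0.8 and 0.5
print(ar2_is_stationary(0.5, 0.6))    # (False, False): outside the triangle, an eigenvalue exceeds 1
```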
Take the Samuelson multiplier-accelerator model (14.32). You can calculate that (14.35)-(14.37) are satisfied (and thus the process is strictly stationary) if $b < 1$ and $b c < 1$, which are reasonable restrictions on the model parameters. The most important restriction is $b < 1$, which in the language of old-school macroeconomics is that the marginal propensity to consume out of income is less than one.
Furthermore, the triangle is divided into two regions as marked in Figure 14.6: the region above the parabola $\alpha_2 = -\alpha_1^2/4$ producing real eigenvalues, and the region below the parabola producing complex eigenvalues. This is interesting because when the eigenvalues are complex the autocorrelations of $y_t$ display damped oscillations. For this reason the dynamic patterns of an AR(2) can be much more complicated than those of an AR(1).
Again, take the Samuelson multiplier-accelerator model (14.32). You can calculate that the model has real eigenvalues iff $b \geq 4c/(1+c)^2$, which holds when $b$ is large and $c$ is small, which are "stable" parameterizations. On the other hand, for any given $c > 0$ the model has complex eigenvalues (and thus oscillations) when $b$ is sufficiently small.
Theorem 14.22 If $E\left|e_t\right| < \infty$ and $\left|\lambda_1\right| < 1$ and $\left|\lambda_2\right| < 1$ for $\lambda_1, \lambda_2$ defined in (14.34), or equivalently if the inequalities (14.35)-(14.37) hold, then the AR(2) process (14.31) is absolutely convergent, strictly stationary, and ergodic.
The proof is presented in Section 14.47.

Figure 14.7: AR(2) Processes. Panel (a): real roots. Panel (b): complex roots.
To illustrate, Figure 14.7 displays two simulated AR(2) processes. The plot in panel (a) uses coefficients which produce real factors, so the process displays behavior similar to that of the AR(1) processes. The plot in panel (b) uses coefficients which produce complex factors, so the process displays oscillations.
AR(p) Processes
The $p$th-order autoregressive process, denoted AR(p), is
$$y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \cdots + \alpha_p y_{t-p} + e_t \tag{14.38}$$
where $e_t$ is a strictly stationary and ergodic white noise process.
Using the lag operator,
$$\left(1 - \alpha_1 L - \alpha_2 L^2 - \cdots - \alpha_p L^p\right) y_t = \alpha_0 + e_t,$$
or $\alpha(L) y_t = \alpha_0 + e_t$ where
$$\alpha(z) = 1 - \alpha_1 z - \alpha_2 z^2 - \cdots - \alpha_p z^p. \tag{14.39}$$
We call $\alpha(z)$ the autoregressive polynomial of $y_t$.
We find conditions for the stationarity of $y_t$ by a technique similar to that used for the AR(2) process. Set $x_t = \left(y_t, y_{t-1}, \ldots, y_{t-p+1}\right)'$ and $u_t = \left(e_t, 0, \ldots, 0\right)'$. Equation (14.38) implies that $x_t$ satisfies the VAR(1) equation (14.33) with
$$A = \begin{pmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_{p-1} & \alpha_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}. \tag{14.40}$$
As shown in the proof of Theorem 14.23 below, the eigenvalues of $A$ are the reciprocals of the roots of the autoregressive polynomial (14.39). The roots are the solutions $z$ of $\alpha(z) = 0$. Theorem 14.23 shows that stationarity of $y_t$ holds if the eigenvalues are less than one in absolute value, or equivalently when the roots are greater than one in absolute value. For complex numbers $z$ the equation $|z| = 1$ defines the unit circle (the circle with radius of unity). We therefore say that "$z$ lies outside the unit circle" if $|z| > 1$.
Theorem 14.23 If $E\left|e_t\right| < \infty$ and all roots of $\alpha(z)$ lie outside the unit circle then the AR(p) process (14.38) is absolutely convergent, strictly stationary, and ergodic.
When the roots of $\alpha(z)$ lie outside the unit circle the polynomial $\alpha(L)$ is invertible. Inverting the autoregressive representation we obtain the infinite-order moving average representation
$$y_t = \mu + \sum_{j=0}^{\infty} b_j e_{t-j} = \mu + b(L) e_t \tag{14.41}$$
where
$$b(L) = \alpha(L)^{-1} = \sum_{j=0}^{\infty} b_j L^j$$
and $\mu = \alpha_0 / \alpha(1) = \alpha_0 / \left(1 - \alpha_1 - \cdots - \alpha_p\right)$.
We have the following characterization of the moving average coefficients. Theorem 14.24 If all roots of the autoregressive polynomial satisfy then (14.41) holds with and where
The proof is presented in Section 14.47.
Impulse Response Function
The coefficients of the moving average representation
$$y_t = \mu + \sum_{j=0}^{\infty} b_j e_{t-j}$$
are known among economists as the impulse response function (IRF). Often the IRF is scaled by the standard deviation of $e_t$. We discuss this scaling at the end of the section. In linear models the impulse response function is defined as the change in $y_{t+j}$ due to a shock at time $t$. This is
$$\mathrm{IRF}(j) = \frac{\partial}{\partial e_t} y_{t+j} = b_j.$$
This means that the coefficient $b_j$ can be interpreted as the magnitude of the impact of a time-$t$ shock on the variable at time $t+j$. Plots of $b_j$ against $j$ can be used to assess the time-propagation of shocks.
It is desirable to have a convenient method to calculate the impulse responses from the coefficients of an autoregressive model (14.38). There are two methods which we now describe.
The first uses a simple recursion. In the linear model we can see that the coefficient $b_j$ is the simple derivative
$$b_j = \frac{\partial}{\partial e_t} y_{t+j}.$$
We can calculate this by generating a history and perturbing the shock $e_t$. Since this calculation is unaffected by all other shocks we can simply set $e_s = 0$ for $s \neq t$ and set $e_t = 1$. This implies the recursion
$$b_j = \alpha_1 b_{j-1} + \alpha_2 b_{j-2} + \cdots + \alpha_p b_{j-p}.$$
This recursion is conveniently calculated by the following simulation. Set $y_s = 0$ for $s < t$. Set $e_t = 1$ and $e_s = 0$ for $s > t$. Generate $y_s$ for $s \geq t$ by $y_s = \alpha_1 y_{s-1} + \cdots + \alpha_p y_{s-p} + e_s$. Then $b_j = y_{t+j}$.
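Here is a minimal implementation sketch of that recursion (the function name and the coefficient values are illustrative); for an AR(1) it reproduces $b_j = \alpha_1^j$.

```python
import numpy as np

def impulse_response(a, horizon):
    """Impulse response of an AR(p) with coefficients a = (a_1, ..., a_p).

    Feed in a unit shock at time 0, set all other shocks to zero,
    and iterate the autoregression forward.
    """
    p = len(a)
    irf = np.zeros(horizon + 1)
    irf[0] = 1.0                                            # b_0 = 1
    for j in range(1, horizon + 1):
        irf[j] = sum(a[i] * irf[j - 1 - i] for i in range(min(p, j)))
    return irf

print(impulse_response([0.8], 5))         # AR(1): 0.8**j
print(impulse_response([1.3, -0.4], 5))   # AR(2): damped responses
```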
A second method uses the vector representation (14.33) of the AR(p) model with coefficient matrix $A$ in (14.40). By recursion
$$x_{t+j} = \sum_{i=0}^{j-1} A^i \left(a_0 + u_{t+j-i}\right) + A^j x_t.$$
Here, $A^j$ means the matrix product of $A$ with itself $j$ times. Setting $u_t = \left(e_t, 0, \ldots, 0\right)'$ we find
$$\frac{\partial}{\partial e_t} x_{t+j} = A^j \frac{\partial}{\partial e_t} x_t = A^j \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$
By linearity
$$b_j = \frac{\partial}{\partial e_t} y_{t+j} = \left[A^j\right]_{1,1}.$$
Thus the coefficient $b_j$ can be calculated by forming the matrix $A$, its $j$-fold product $A^j$, and then taking the upper-left element.
As mentioned at the beginning of the section, it is often desirable to scale the IRF so that it is the response to a one standard deviation shock. Let $\sigma^2 = \mathrm{var}\left[e_t\right]$ and define $\varepsilon_t = e_t / \sigma$, which has unit variance. Then the scaled IRF at lag $j$ is
$$\frac{\partial}{\partial \varepsilon_t} y_{t+j} = \sigma b_j.$$
ARMA and ARIMA Processes
The autoregressive-moving-average process, denoted ARMA(p,q), is
$$y_t = \alpha_0 + \alpha_1 y_{t-1} + \cdots + \alpha_p y_{t-p} + e_t + \theta_1 e_{t-1} + \cdots + \theta_q e_{t-q} \tag{14.43}$$
where $e_t$ is a strictly stationary and ergodic white noise process. It can be written using lag operator notation as $\alpha(L) y_t = \alpha_0 + \theta(L) e_t$.
Theorem 14.25 The ARMA(p,q) process (14.43) is strictly stationary and ergodic if all roots of $\alpha(z)$ lie outside the unit circle. In this case we can write
$$y_t = \mu + b(L) e_t$$
where $b(L) = \alpha(L)^{-1} \theta(L)$ and $\mu = \alpha_0 / \alpha(1)$.
The process $y_t$ follows an autoregressive-integrated moving-average process, denoted ARIMA(p,d,q), if $\Delta^d y_t$ is ARMA(p,q). It can be written using lag operator notation as $\alpha(L)\left(1 - L\right)^d y_t = \alpha_0 + \theta(L) e_t$.
Mixing Properties of Linear Processes
There is a considerable probability literature investigating the mixing properties of time series processes. One challenge is that as autoregressive processes depend on the infinite past sequence of innovations it is not immediately obvious if they satisfy the mixing conditions.
In fact, a simple AR(1) is not necessarily mixing. A counter-example was developed by Andrews (1984). He showed that if the error has a two-point discrete distribution then an AR(1) is not strong mixing. The reason is that a discrete innovation combined with the autoregressive structure means that by observing $y_t$ you can deduce with near certainty the past history of the shocks $e_t$. The example seems rather special but shows the need to be careful with the theory. The intuition stemming from Andrews' example is that for an autoregressive process to be mixing it is necessary for the errors to have a continuous distribution.
A useful characterization was provided by Pham and Tran (1985).
Theorem 14.26 Suppose that satisfies the following conditions:
- is i.i.d. with for some and density which satisfies
for some .
1. All roots of lie outside the unit circle and .
- .
Then for some
and is absolutely regular and strong mixing.
The condition (14.44) is rather unusual, but specifies that has a smooth density. This rules out Andrews’ counter-example.
The summability condition on the coefficients in part 3 involves a trade-off with the number of moments . If has all moments finite (e.g. normal errors) then we can set and this condition simplifies to . For any finite the summability condition holds if has geometric decay.
It is instructive to deduce how the decay in the coefficients affects the rate for the mixing coefficients . If then so the rate is for . Mixing requires , which holds for sufficiently large . For example, if it holds for .
The primary message from this section is that linear processes, including autoregressive and ARMA processes, are mixing if the innovations satisfy suitable conditions. The mixing coefficients decay at rates related to the decay rates of the moving average coefficients.
Identification
The parameters of a model are identified if the parameters are uniquely determined by the probability distribution of the observations. In the case of linear time series analysis we typically focus on the first two moments of the observations (means, variances, covariances). We therefore say that the coefficients of a stationary MA, AR, or ARMA model are identified if they are uniquely determined by the autocorrelation function. That is, given the autocorrelation function , are the coefficients unique? It turns out that the answer is that MA and ARMA models are generally not identified. Identification is achieved by restricting the class of polynomial operators. In contrast, AR models are generally identified.
Let us start with the MA(1) model
$$y_t = e_t + \theta e_{t-1}.$$
It has first-order autocorrelation
$$\rho(1) = \frac{\theta}{1 + \theta^2}.$$
Set $\theta^* = 1/\theta$. Then
$$\frac{\theta^*}{1 + \theta^{*2}} = \frac{1/\theta}{1 + 1/\theta^2} = \frac{\theta}{\theta^2 + 1} = \rho(1).$$
Thus the MA(1) model with coefficient $\theta$ produces the same autocorrelations as the MA(1) model with coefficient $1/\theta$. For example, $\theta = 0.5$ and $\theta = 2$ each yield $\rho(1) = 0.4$. There is no empirical way to distinguish between the models with $\theta = 0.5$ and with $\theta = 2$. Thus the coefficient $\theta$ is not identified.
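The numerical equivalence is immediate to verify; in the sketch below (illustrative only) the values $\theta = 0.5$ and $\theta = 2$ return the same first autocorrelation.

```python
def ma1_rho1(theta):
    """First autocorrelation of an MA(1): theta / (1 + theta**2)."""
    return theta / (1 + theta ** 2)

print(ma1_rho1(0.5), ma1_rho1(2.0))   # both 0.4: theta and 1/theta are indistinguishable
```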
The standard solution is to select the parameter which produces an invertible moving average polynomial. Since there is only one such choice this yields a unique solution. This may be sensible when there is reason to believe that shocks have their primary impact in the contemporaneous period and secondary (lesser) impact in the second period.
Now consider the MA(2) model
$$y_t = e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2}.$$
The moving average polynomial can be factored as
$$\theta(z) = 1 + \theta_1 z + \theta_2 z^2 = \left(1 + r_1 z\right)\left(1 + r_2 z\right)$$
so that $\theta_1 = r_1 + r_2$ and $\theta_2 = r_1 r_2$. The process has first- and second-order autocorrelations
$$\rho(1) = \frac{\theta_1\left(1 + \theta_2\right)}{1 + \theta_1^2 + \theta_2^2} = \frac{\left(r_1 + r_2\right)\left(1 + r_1 r_2\right)}{1 + \left(r_1 + r_2\right)^2 + \left(r_1 r_2\right)^2}, \qquad \rho(2) = \frac{\theta_2}{1 + \theta_1^2 + \theta_2^2} = \frac{r_1 r_2}{1 + \left(r_1 + r_2\right)^2 + \left(r_1 r_2\right)^2}.$$
If we replace $r_1$ with $1/r_1$ we obtain
$$\rho(1) = \frac{\left(1/r_1 + r_2\right)\left(1 + r_2/r_1\right)}{1 + \left(1/r_1 + r_2\right)^2 + \left(r_2/r_1\right)^2} = \frac{\left(r_1 + r_2\right)\left(1 + r_1 r_2\right)}{1 + \left(r_1 + r_2\right)^2 + \left(r_1 r_2\right)^2},$$
which is unchanged. Similarly, if we replace $r_2$ with $1/r_2$ we obtain unchanged first- and second-order autocorrelations. It follows that in the MA(2) model neither the factors $r_1$ and $r_2$ nor the coefficients $\theta_1$ and $\theta_2$ are identified. Consequently there are four distinct models which are identifiably indistinguishable.
This analysis extends to the MA(q) model. The factors of the MA polynomial can be replaced by their inverses and consequently the coefficients are not identified.
The standard solution is to confine attention to MA(q) models with invertible roots. This technically solves the identification dilemma. This solution corresponds to the Wold decomposition, as it is defined in terms of the projection errors which correspond to the invertible representation.
A deeper identification failure occurs in ARMA models. Consider an ARMA(1,1) model
Written in lag operator notation
The identification failure is that when the autoregressive and moving average polynomials share a common root they cancel, and the model simplifies to white noise. This means that a continuum of coefficient pairs produce identical models, so the coefficients are not identified.
This extends to higher order ARMA models. Take the ARMA model written in factored lag operator notation
The models in which one of the autoregressive factors equals one of the moving average factors all simplify to an ARMA(1,1). Thus all these models are identical and hence the coefficients are not identified.
The problem is called “cancelling roots” due to the fact that it arises when there are two identical lag polynomial factors in the AR and MA polynomials.
The standard solution in the ARMA literature is to assume that there are no cancelling roots. The trouble with this solution is that this is an assumption about the true process which is unknown. Thus it is not really a solution to the identification problem. One recommendation is to be careful when using ARMA models and be aware that highly parameterized models may not have unique coefficients.
Now consider the model (14.38). It can be written as
where and . The MDS assumption implies that and . This means that the coefficient satisfies
The solution to this equation is unique if the second moment matrix of the regressors is positive definite. It turns out that this holds generically, so the coefficient is unique and identified.
Theorem 14.27 In the AR(p) model (14.38), if then and is unique and identified.
The assumption means that is not purely deterministic.
We can extend this result to approximating models. That is, consider the equation (14.45) without the assumption that is necessarily a true AR(p) with a MDS error. Instead, suppose that is a non-deterministic stationary process. (Recall, non-deterministic means that where is the projection error variance (14.19).) We then define the coefficient as the best linear predictor, which is (14.46). The error is defined by the equation (14.45). This is a linear projection model.
As in the case of any linear projection, the error satisfies . This means that and for . However, the error is not necessarily a MDS nor white noise.
The coefficient is identified if the second moment matrix of the regressors is invertible. The proof of Theorem 14.27 (presented in Section 14.47) does not make use of the assumption that the series is a true AR(p) with a MDS error. Rather, it only uses the invertibility of this matrix, which holds in the approximate model as well under the assumption that the series is non-deterministic. We conclude that any approximating AR(p) is identified.
Theorem 14.28 If is strictly stationary, not purely deterministic, and , then for any and thus the coefficient vector (14.46) is identified.
Estimation of Autoregressive Models
We consider estimation of an AR(p) model for a stationary, ergodic, and non-deterministic series. The model is (14.45), with the regressors given by an intercept and the first p lags of the series. The coefficient is defined by projection in (14.46). The error is defined by (14.45) and has a finite variance. This allows the series to follow a true AR(p) process but does not require it.
The least squares estimator is
This notation presumes that there are n + p total observations on the series, of which the first p are used as initial conditions so that the lagged regressors are defined. Effectively, this redefines the sample period. (An alternative notational choice is to keep the number of observations fixed and let the sums range over the final n - p observations.)
The least squares residuals are . The error variance can be estimated by or .
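To make this construction concrete, here is a minimal Python sketch (numpy only; the function name ar_ols and all variable names are illustrative, not the textbook's own code) of least squares estimation of an AR(p) with intercept, using the first p observations as initial conditions.

```python
import numpy as np

def ar_ols(y, p):
    """Least squares estimation of an AR(p) with intercept.

    The first p observations are used as initial conditions, so the
    effective sample is the remaining len(y) - p periods.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Regressor matrix: intercept plus lags 1..p
    X = np.column_stack([np.ones(n - p)] +
                        [y[p - j:n - j] for j in range(1, p + 1)])
    Y = y[p:]
    beta = np.linalg.solve(X.T @ X, X.T @ Y)   # (X'X)^{-1} X'Y
    e = Y - X @ beta                           # least squares residuals
    sigma2 = e @ e / len(Y)                    # estimated error variance
    return beta, e, sigma2, X

# Example on a simulated AR(2) series
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.2 + 0.4 * y[t - 1] + 0.2 * y[t - 2] + rng.standard_normal()
beta_hat, e_hat, s2_hat, X = ar_ols(y, 2)
```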
If is strictly stationary and ergodic then so are and . They have finite means if . Under these assumptions the Ergodic Theorem implies that
and
Theorem 14.28 shows that the limit design matrix is positive definite. Combined with the continuous mapping theorem we see that the least squares estimator is consistent.
It is straightforward to show that the error variance estimator is consistent as well.
Theorem 14.29 If is strictly stationary, ergodic, not purely deterministic, and , then for any and as .
This shows that under very mild conditions the coefficients of an AR(p) model can be consistently estimated by least squares. Once again, this does not require that the series is actually an AR(p) process. It holds for any stationary process with the coefficient defined by projection.
Asymptotic Distribution of Least Squares Estimator
The asymptotic distribution of the least squares estimator depends on the stochastic assumptions. In this section we derive the asymptotic distribution under the assumption of correct specification.
Specifically, we assume that the error is a MDS. An important implication of the MDS assumption is that since is part of the information set , by the conditioning theorem,
Thus is a MDS. It has a finite variance if has a finite fourth moment. To see this, by Theorem with . Using Minkowski’s Inequality,
Thus . The Cauchy-Schwarz inequality then shows that . We can then apply the martingale difference CLT (Theorem 14.11) to see that
where
Theorem 14.30 If follows the AR(p) model (14.38), all roots of lie outside the unit circle, , and , then as , where .
This is identical in form to the asymptotic distribution of least squares in cross-section regression. The implication is that asymptotic inference is the same. In particular, the asymptotic covariance matrix is estimated just as in the cross-section case.
Distribution Under Homoskedasticity
In cross-section regression we found that the covariance matrix simplifies under the assumption of conditional homoskedasticity. The same occurs in the time series context. Assume that the error is a homoskedastic MDS:
In this case
and the asymptotic distribution simplifies.
Theorem 14.31 Under the assumptions of Theorem 14.30, if in addition , then as where .
These results show that under correct specification (a MDS error) the format of the asymptotic distribution of the least squares estimator exactly parallels the cross-section case. In general the covariance matrix takes a sandwich form with components exactly equal to the cross-section case. Under conditional homoskedasticity the covariance matrix simplifies exactly as in the cross-section case. A particularly useful insight which can be derived from Theorem 14.31 is to focus on the simple AR(1) with no intercept, $Y_t = \alpha Y_{t-1} + e_t$. In this case $\mathbb{E}[Y_{t-1}^2] = \sigma^2/(1-\alpha^2)$ so the asymptotic distribution simplifies to
\[ \sqrt{n}\left(\widehat{\alpha}-\alpha\right) \xrightarrow{d} \mathrm{N}\left(0,\, 1-\alpha^{2}\right). \]
Thus the asymptotic variance depends only on $\alpha$ and is decreasing in $\alpha^{2}$. An intuition is that a larger $|\alpha|$ means greater signal and hence greater estimation precision. This result also shows that the asymptotic distribution is non-similar: the variance is a function of the parameter of interest. This means that we can expect (from advanced statistical theory) asymptotic inference to be less accurate than indicated by nominal levels.
In the context of cross-section data we argued that the homoskedasticity assumption was dubious except for occasional theoretical insight. For practical applications it is recommended to use heteroskedasticity-robust theory and methods when possible. The same argument applies to the time series case. While the distribution theory simplifies under conditional homoskedasticity there is no reason to expect homoskedasticity to hold in practice. Therefore in applications it is better to use the heteroskedasticity-robust distributional theory when possible.
Unfortunately, many existing time series textbooks report the homoskedastic distribution theory of Theorem 14.31. This has influenced computer software packages, many of which also by default (or exclusively) use the homoskedastic distribution theory. This is unfortunate.
Asymptotic Distribution Under General Dependence
If the model (14.38) holds with white noise errors, or if the AR(p) is an approximation with the coefficient defined as the best linear predictor, then the MDS central limit theory does not apply. Instead, if the series is strong mixing we can use the central limit theory for mixing processes (Theorem 14.15).
Theorem 14.32 Assume that is strictly stationary, ergodic, and for some and the mixing coefficients satisfy . Let be defined as the best linear projection coefficients (14.46) from an AR(p) model with projection errors . Let be the least squares estimator of . Then
is convergent and as , where .
This result is substantially different from the cross-section case. It shows that model misspecification (including misspecifying the order of the autoregression) renders invalid the conventional “heteroskedasticity-robust” covariance matrix formula. Misspecified models do not have unforecastable (martingale difference) errors, so the regression scores are potentially serially correlated. The asymptotic variance takes a sandwich form with central component the long-run variance (recall Section 14.13) of the regression scores.
Covariance Matrix Estimation
Under the assumption of correct specification covariance matrix estimation is identical to the crosssection case. The asymptotic covariance matrix estimator under homoskedasticity is
The estimator may be used instead of .
The heteroskedasticity-robust asymptotic covariance matrix estimator is
where
Degree-of-freedom adjustments may be made as in the cross-section case though a theoretical justification has not been developed.
Standard errors for individual coefficient estimates can be formed by taking the square roots of the scaled diagonal elements of the estimated covariance matrix.
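As a hedged illustration of the two covariance matrix estimators just described (a sketch only; the function ar_covariance and its interface are assumptions, not the textbook's code), the homoskedastic and heteroskedasticity-robust forms can be computed from the regressor matrix and residuals as follows.

```python
import numpy as np

def ar_covariance(X, e, robust=True):
    """Covariance matrix estimate for the least squares coefficients.

    robust=False: homoskedastic form  sigma2_hat * (X'X)^{-1}
    robust=True:  White form          (X'X)^{-1} (sum_t e_t^2 x_t x_t') (X'X)^{-1}
    """
    n = X.shape[0]
    XX_inv = np.linalg.inv(X.T @ X)
    if robust:
        meat = (X * (e ** 2)[:, None]).T @ X     # sum of e_t^2 x_t x_t'
        V = XX_inv @ meat @ XX_inv
    else:
        sigma2 = e @ e / n
        V = sigma2 * XX_inv
    se = np.sqrt(np.diag(V))                     # standard errors
    return V, se
```

A degree-of-freedom adjustment, if desired, can be applied by rescaling V by n/(n-k), where k is the number of coefficients.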
Theorem 14.33 Under the assumptions of Theorem 14.32, as and
Theorem 14.33 shows that standard covariance matrix estimation is consistent and the resulting t-ratios are asymptotically normal. This means that for stationary autoregressions, inference can proceed using conventional regression methods.
Covariance Matrix Estimation Under General Dependence
Under the assumptions of Theorem 14.32 the conventional covariance matrix estimators are inconsistent as they do not capture the serial dependence in the regression scores. To consistently estimate the covariance matrix we need an estimator of the long-run variance. The appropriate class of estimators are called Heteroskedasticity and Autocorrelation Consistent (HAC) or Heteroskedasticity and Autocorrelation Robust (HAR) covariance matrix estimators.
To understand the methods it is helpful to define the vector series and autocovariance matrices so that
Since this sum is convergent the autocovariance matrices converge to zero as the lag increases. Therefore the long-run variance can be approximated by taking a finite sum of autocovariances such as
The number of included autocovariances, call it $M$, is sometimes called the lag truncation number. Other authors call it the bandwidth. An estimator of the truncated sum is
where the sample autocovariance matrices are computed from the estimated scores. By the Ergodic Theorem each sample autocovariance is consistent for its population counterpart. Thus for any fixed $M$, the estimator
is consistent for the truncated sum.
If the serial correlation in the scores is known to be zero beyond a certain lag, then the truncated estimator (14.49) with that lag truncation is consistent for the long-run variance. This estimator was proposed by L. Hansen and Hodrick (1980) in the context of multi-period forecasts and by L. Hansen (1982) for the generalized method of moments.
In the general case we can select $M$ to increase with the sample size $n$. If the rate at which $M$ increases is sufficiently slow then the estimator will be consistent for the long-run variance, as first shown by White and Domowitz (1984).
Once we view the lag truncation number $M$ as a choice the estimator (14.49) has two potential deficiencies. One is that the estimate can change non-smoothly with $M$, which makes estimation results sensitive to the choice of $M$. The other is that the estimate may not be positive semi-definite and is therefore not a valid covariance matrix estimator. We can see this in the simple case of a scalar score with $M = 1$. In this case the estimate equals the sample variance plus twice the first sample autocovariance, which is negative when the first sample autocorrelation is less than $-1/2$. Thus if the data are strongly negatively autocorrelated the variance estimator can be negative. A negative variance estimator means that standard errors are ill-defined (a naïve computation will produce a complex standard error, which makes no sense).
These two deficiencies can be resolved if we amend (14.49) by a weighted sum of autocovariances. Newey and West (1987b) proposed
This is a weighted sum of the autocovariances. Other weight functions can be used; the one in (14.50) is known as the Bartlett kernel. Newey and West (1987b) showed that this estimator is positive semi-definite, solving the negative variance problem, and that it is also a smooth function of the lag truncation number $M$. Thus this estimator solves the two problems described above.
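The following numpy sketch shows one way the Newey-West estimator with Bartlett weights can be computed for the regression scores (a minimal illustration under the assumptions above; the function names and interface are hypothetical).

```python
import numpy as np

def newey_west_omega(X, e, M):
    """Newey-West estimate of the long-run variance of the scores u_t = x_t * e_t.

    Bartlett weights w_j = 1 - j/(M+1) guarantee a positive semi-definite estimate.
    """
    U = X * e[:, None]                  # scores, shape (n, k)
    n = U.shape[0]
    Omega = U.T @ U / n                 # lag-0 autocovariance
    for j in range(1, M + 1):
        Gamma_j = U[j:].T @ U[:-j] / n  # j-th sample autocovariance matrix
        w = 1.0 - j / (M + 1.0)
        Omega += w * (Gamma_j + Gamma_j.T)
    return Omega

def newey_west_cov(X, e, M):
    """Sandwich covariance matrix for the least squares coefficients."""
    XX_inv = np.linalg.inv(X.T @ X)
    n = X.shape[0]
    return XX_inv @ (n * newey_west_omega(X, e, M)) @ XX_inv
```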
For the Newey-West estimator to be consistent for the long-run variance the lag truncation number $M$ must increase to infinity with the sample size $n$. Sufficient conditions were established by B. E. Hansen (1992).
Theorem 14.34 Under the assumptions of Theorem 14.32, plus a slightly stronger summability condition on the mixing coefficients, if the lag truncation number $M$ increases to infinity with the sample size $n$ but sufficiently slowly, then the Newey-West estimator is consistent for the long-run variance as $n \to \infty$.
The rate condition technically means that $M$ grows no faster than a specific power of $n$, but this does not have a practical counterpart other than the implication that “$M$ should be much smaller than $n$”. The assumption on the mixing coefficients is slightly stronger than in Theorem 14.32, due to the technical nature of the derivation.
A common computational mishap is a complex standard error. This occurs when a covariance matrix estimator has negative elements on the diagonal.
See Andrews (1991b) for a description of popular options. In practice, the choice of weight function is much less important than the choice of the lag truncation number $M$. An important practical issue is how to select $M$. One way to think about it is that $M$ impacts the precision of the estimator through its bias and variance. Since each sample autocovariance is a sample average its variance is of order $1/n$, so we expect the variance of the estimator to be of order $M/n$. The bias is harder to calculate but depends on the rate at which the autocovariances decay to zero. Andrews (1991b) found that the $M$ which minimizes the mean squared error of the estimator satisfies the rate $M = C n^{1/3}$, where the constant $C$ depends on the autocovariances. Practical rules to estimate and implement this optimal lag truncation parameter have been proposed by Andrews (1991b) and Newey and West (1994). Andrews' rule for the Newey-West estimator (14.50) can be written as a function of the sample size and
a serial correlation parameter $\rho$. When the score is scalar, $\rho$ is its first autocorrelation. Andrews suggested plugging an estimator of $\rho$ into the rule to find $M$. An alternative is to use a default value of $\rho$. For example, setting $\rho = 0.25$ yields approximately $M = 0.75\, n^{1/3}$, which is a useful benchmark.
Testing the Hypothesis of No Serial Correlation
In some cases it may be of interest to test the hypothesis that the series is serially uncorrelated against the alternative that it is serially correlated. There have been many proposed tests of this hypothesis. The most appropriate is based on the least squares regression of an AR(p) model. Take the model
with a MDS. In this model the series is serially uncorrelated if the slope coefficients are all zero. Thus the hypothesis of interest is
The test can be implemented by a Wald or F test. Estimate the AR(p) model by least squares. Form the Wald or F statistic using the variance estimator (14.48). (The Newey-West estimator should not be used as there is no serial correlation under the null hypothesis.) Accept the hypothesis if the test statistic is smaller than a conventional critical value (or if the p-value exceeds the significance level) and reject the hypothesis otherwise.
Implementation of this test requires a choice of autoregressive order $p$. This choice affects the power of the test. A sufficient number of lags should be included so as to pick up potential serial correlation patterns, but not so many that the power of the test is diluted. A reasonable choice in many applications is to set $p$ equal to the seasonal periodicity: include four lags for quarterly data or twelve lags for monthly data.
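A sketch of how this test can be carried out in Python, continuing the hypothetical ar_ols and ar_covariance helpers from the earlier sketches (using scipy's chi-square tail probability for the p-value; this is one reasonable implementation, not the only one):

```python
import numpy as np
from scipy.stats import chi2

def test_no_serial_correlation(y, p):
    """Wald test of the hypothesis that all AR(p) slope coefficients are zero.

    Uses the heteroskedasticity-robust covariance matrix; the Newey-West
    estimator is not needed because the errors are serially uncorrelated
    under the null hypothesis.
    """
    beta, e, _, X = ar_ols(y, p)
    V, _ = ar_covariance(X, e, robust=True)
    R = np.hstack([np.zeros((p, 1)), np.eye(p)])  # selects the p slope coefficients
    r = R @ beta
    W = r @ np.linalg.solve(R @ V @ R.T, r)       # Wald statistic
    return W, chi2.sf(W, df=p)                    # statistic and p-value
```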
Testing for Omitted Serial Correlation
When using an AR(p) model it may be of interest to know if there is any remaining serial correlation. This can be expressed as a test for serial correlation in the error or equivalently as a test for a higher-order autoregressive model. Take the model
The null hypothesis is that is serially uncorrelated and the alternative hypothesis is that it is serially correlated. We can model the latter as a mean-zero autoregressive process
The hypothesis is
A seemingly natural test of this hypothesis uses a two-step method. First estimate (14.52) by least squares and obtain the residuals. Second, estimate (14.53) by least squares by regressing the residuals on their lagged values and obtain the Wald or F test that the autoregressive coefficients of the error are all zero. This seems like a natural approach but it is muddled by the fact that the distribution of the Wald statistic is distorted by the two-step procedure. The Wald statistic is not asymptotically chi-square so it is inappropriate to make a decision based on the conventional critical values. One approach to obtain the correct asymptotic distribution is to use the generalized method of moments, treating (14.52)-(14.53) as a two-equation just-identified system.
An easier solution is to re-write (14.52)-(14.53) as a higher-order autoregression so that we can use a standard test statistic. To illustrate how this works take the case of a single autoregressive lag in the error ($q = 1$). Take (14.52) and lag the equation once:
Multiply this lagged equation by the error's autoregressive coefficient and subtract it from (14.52) to find
or
This is an AR(p+1). It simplifies to an AR(p) when the error's autoregressive coefficient is zero. Thus the null hypothesis is equivalent to the restriction that the coefficient on the extra lag is zero.
Thus testing the null hypothesis of an AR(p) (14.52) against the alternative that the error is an AR(1) is equivalent to testing an AR(p) against an AR(p+1). The latter test is implemented as a t test on the coefficient on the additional lag.
More generally, testing the null hypothesis of an AR(p) (14.52) against the alternative that the error is an AR(q) is equivalent to testing that the series is an AR(p) against the alternative that it is an AR(p+q). The latter test is implemented as a Wald (or F) test on the coefficients on the final q lags. If the statistic is larger than the critical value (or the p-value is smaller than the significance level) then we reject the hypothesis that the AR(p) is correctly specified in favor of the alternative that there is omitted serial correlation. Otherwise we accept the hypothesis that the AR(p) model is correctly specified.
Another way of deriving the test is as follows. Write (14.52) and (14.53) in lag operator notation. Applying the error's autoregressive lag operator to the first equation we obtain an autoregression in the original series whose lag polynomial is the product of the two polynomials. This product is a polynomial of order p+q, so the series is an AR(p+q).
While this discussion is all good fun, it is unclear if there is good reason to use the test described in this section. Economic theory does not typically produce hypotheses concerning the autoregressive order. Consequently there is rarely a case where there is scientific interest in testing, say, the hypothesis that a series is an AR(4) or any other specific autoregressive order. Instead, practitioners tend to use hypothesis tests for another purpose - model selection. That is, in practice users want to know “What autoregressive model should be used” in a specific application and resort to hypothesis tests to aid in this decision. This is an inappropriate use of hypothesis tests because tests are designed to provide answers to scientific questions rather than being designed to select models with good approximation properties. Instead, model selection should be based on model selection tools. One is described in the following section.
Model Selection
What is an appropriate choice of autoregressive order ? This is the problem of model selection. A good choice is to minimize the Akaike information criterion (AIC)
where $\widehat{\sigma}^{2}(p)$ is the estimated residual variance from an AR(p). The AIC is a penalized version of the Gaussian log-likelihood function for the estimated regression model. It is an estimator of the divergence between the fitted model and the true conditional density (see Section 28.4). By selecting the model with the smallest value of the AIC you select the model with the smallest estimated divergence - the highest estimated fit between the estimated and true densities.
The AIC is also a monotonic transformation of an estimator of the one-step-ahead forecast mean squared error. Thus selecting the model with the smallest value of the AIC you are selecting the model with the smallest estimated forecast error.
One possible hiccup in computing the AIC for multiple models is that the sample size available for estimation changes as the order p changes. (If you increase p, you need more initial conditions.) This renders AIC comparisons across such models inappropriate. The same sample - the same number of observations - should be used for estimation of all models. This is because the AIC is a penalized likelihood, and if the samples are different then the likelihoods are not comparable. The appropriate remedy is to fix an upper value $\bar{p}$, reserve the first $\bar{p}$ observations as initial conditions, and then estimate all models on this (unified) sample.
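A minimal sketch of this procedure in Python (numpy only; the Gaussian form of the criterion, n*log(sigma2) + 2*(p+1), is one common convention, and any monotone variant yields the same ranking):

```python
import numpy as np

def select_ar_order_aic(y, p_max):
    """Select the AR order by AIC on a common estimation sample.

    The first p_max observations are reserved as initial conditions so
    every candidate model is fit to the same observations, keeping the
    AIC values comparable.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    Y = y[p_max:]
    m = n - p_max
    # Full regressor matrix: intercept plus lags 1..p_max
    X_full = np.column_stack([np.ones(m)] +
                             [y[p_max - j:n - j] for j in range(1, p_max + 1)])
    aic = {}
    for p in range(p_max + 1):
        X = X_full[:, :p + 1]                    # intercept + first p lags
        beta = np.linalg.lstsq(X, Y, rcond=None)[0]
        e = Y - X @ beta
        sigma2 = e @ e / m
        aic[p] = m * np.log(sigma2) + 2 * (p + 1)
    best = min(aic, key=aic.get)
    return best, aic
```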
The AIC of an estimated regression model can be displayed in Stata by using the estimates stats command.
Illustrations
We illustrate autoregressive estimation with three empirical examples using U.S. quarterly time series from the FRED-QD data file.
The first example is real GDP growth rates (growth rate of ). We estimate autoregressive models of order 0 through 4 using the sample from . This is a commonly estimated model in applied macroeconomic practice and is the empirical version of the Samuelson multiplier-accelerator model discussed in Section 14.24. The coefficient estimates, conventional (heteroskedasticity-robust) standard errors, Newey-West (with ) standard errors, and AIC, are displayed in Table 14.1. This sample has 152 observations. The model selected by the AIC criterion is the AR(2). The estimated model has positive and small values for the first two autoregressive coefficients. This means that quarterly output growth
rates are positively correlated from quarter to quarter, but only mildly so, and most of the correlation is captured by the first lag. The coefficients of this model lie in the region of real roots in Figure 14.6, meaning that the dynamics of the estimated model do not display oscillations. The coefficients of the estimated AR(4) model are nearly identical to those of the AR(2) model. The conventional and Newey-West standard errors are somewhat different from one another for the AR(0) and AR(4) models, but are nearly identical to one another for the remaining models. (This sub-sample was used for estimation as it has been argued that the growth rate of U.S. GDP slowed around this period. The goal was to estimate the model over a period of time when the series is plausibly stationary.)

Table 14.1: U.S. GDP AR Models
Standard errors robust to heteroskedasticity in parentheses.
Newey-West standard errors in square brackets.
Our second example is real non-durables consumption growth rates (growth rate of ). This is motivated by an influential paper by Robert Hall (1978) who argued that the permanent income hypothesis implies that changes in consumption should be unpredictable (martingale differences). To test this model Hall (1978) estimated an AR(4) model. Our estimated regression using the full sample is reported in the following equation.

Here, we report heteroskedasticity-robust standard errors. Hall's hypothesis is that all autoregressive coefficients should be zero. We test this joint hypothesis with an F statistic and obtain a p-value which is borderline significant at conventional levels. The first three autoregressive coefficients appear to be positive, but small, indicating positive serial correlation. This evidence is (mildly) inconsistent with Hall's hypothesis. We report heteroskedasticity-robust standard errors (not Newey-West standard errors) since the purpose is to test the hypothesis of no serial correlation.

Table 14.2: U.S. Inflation AR Models
Standard errors robust to heteroskedasticity in parentheses.
Newey-West standard errors in square brackets.
The third example is the first difference of CPI inflation (first difference of the growth rate of cpiaucsl). This is motivated by Stock and Watson (2007) who examined forecasting models for inflation rates. We estimate autoregressive models of order 1 through 8 using the full sample; we report models 1 through 5 in Table 14.2. The model with the lowest AIC is the AR(4). All four estimated autoregressive coefficients are negative, most particularly the first two. The two sets of standard errors are quite similar for the AR(4) model. There are meaningful differences only for the lower order AR models.
Time Series Regression Models
Least squares regression methods can be used broadly with stationary time series. Interpretation and usefulness can depend, however, on constructive dynamic specifications. Furthermore, it is necessary to be aware of the serial correlation properties of the series involved, and to use the appropriate covariance matrix estimator when the dynamics have not been explicitly modeled.
Let be paired observations with the dependent variable and a vector of regressors including an intercept. The regressors can contain lagged so this framework includes the autoregressive model as a special case. A linear regression model takes the form
The coefficient vector is defined by projection and therefore equals
The error is defined by (14.54) and thus its properties are determined by that relationship. Implicitly the model assumes that the variables have finite second moments and that the second moment matrix of the regressors is invertible; otherwise the projection coefficient is not uniquely defined and a regressor could be eliminated. By the property of projection the error is uncorrelated with the regressors.
The least squares estimator of is
Under the assumption that the joint series is strictly stationary and ergodic the estimator is consistent. Under mixing and moment conditions the estimator is asymptotically normal with a general (long-run) covariance matrix.
However, under the stronger assumption that the error is a MDS the asymptotic covariance matrix simplifies. It is worthwhile investigating this condition further. The necessary condition is that the error has mean zero conditional on an information set to which the regressors are adapted. This notation may appear somewhat odd, but recall that in the autoregressive context the regressor vector contains variables dated $t-1$ and earlier, so in this context it is a “time $t-1$” variable. The reason why we need the regressors to be adapted to the information set is that for the regression function to be the conditional mean of the dependent variable given that information set, the regressors must be part of the information set. Under this assumption
so the score (the product of the regressors and the error) is a MDS. This means we can apply the MDS CLT to obtain the asymptotic distribution.
We summarize this discussion with the following formal statement.
Theorem 14.35 If is strictly stationary, ergodic, with finite second moments, and , then in (14.55) is uniquely defined and the least squares estimator is consistent, .
If in addition, , where is an information set to which is adapted, , and , then
as , where
Alternatively, if for some , , , and the mixing coefficients for satisfy , then (14.56) holds with
Static, Distributed Lag, and Autoregressive Distributed Lag Models
In this section we describe standard linear time series regression models.
Let be paired observations with the dependent variable and an observed regressor vector which does not include lagged .
The simplest regression model is the static equation
This is (14.54) with the regressors entering only contemporaneously. Static models are motivated as descriptions of how the variables co-move. Their advantage is their simplicity. The disadvantage is that they are difficult to interpret. The coefficient is the best linear projection coefficient (14.55) but the equation is almost certainly dynamically misspecified. The regression of the dependent variable on contemporaneous regressors is difficult to interpret without a causal framework since the two may be simultaneously determined. If this regression is estimated it is important that the standard errors be calculated using the Newey-West method to account for serial correlation in the error.
A model which allows the regressor to have impact over several periods is called a distributed lag (DL) model. It takes the form
It is also possible to include the contemporaneous regressor. In this model the leading coefficient represents the initial impact of the regressor on the dependent variable, the next coefficient represents the impact in the second period, and so on. The cumulative impact is the sum of the coefficients, which is called the long-run multiplier.
The distributed lag model falls in the class (14.54) with the regressor vector containing the lagged values of the explanatory variable. While it allows for a lagged impact of the regressor on the dependent variable, the model does not incorporate serial correlation, so the error should be expected to be serially correlated. Thus the model is (typically) dynamically misspecified, which can make interpretation difficult. It is also necessary to use Newey-West standard errors to account for the serial correlation.
A more complete model combines autoregressive and distributed lags. It takes the form
This is called an autoregressive distributed lag (AR-DL) model. It nests both the autoregressive and distributed lag models, thereby combining serial correlation and dynamic impact. The AR-DL model falls in the class (14.54) with the regressor vector containing the lagged dependent variables and the lagged explanatory variables.
If the lag orders and are selected sufficiently large the AR-DL model will have an error which is approximately white noise in which case the model can be interpreted as dynamically well-specified and conventional standard error methods can be used.
In an AR-DL specification the long-run multiplier is
which is a nonlinear function of the coefficients.
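To illustrate in generic notation (the symbols here are illustrative assumptions, not necessarily those used in the chapter's displays), write the AR-DL model and its long-run multiplier as
\[
Y_t = \mu + \alpha_1 Y_{t-1} + \cdots + \alpha_p Y_{t-p} + \beta_1 X_{t-1} + \cdots + \beta_q X_{t-q} + e_t ,
\qquad
\theta = \frac{\beta_1 + \beta_2 + \cdots + \beta_q}{1 - \alpha_1 - \alpha_2 - \cdots - \alpha_p} .
\]
The denominator accumulates the feedback through the lagged dependent variable, which is why the multiplier is a nonlinear (ratio) function of the coefficients.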
Time Trends
Many economic time series have means which change over time. A useful way to think about this is the components model
where is the trend component and is the stochastic component. The latter can be modeled by a linear process or autoregression
The trend component is often modeled as a linear function in the time index
or a quadratic function in time
These models are typically not thought of as being literally true but rather as useful approximations.
When we write down time series models we write the index as $t = 1, \ldots, n$. But in practical applications the time index corresponds to a date, e.g. a specific calendar year. Furthermore, if the data are at a higher frequency than annual then the index is incremented in fractional units. This is not of fundamental importance; it merely changes the meaning of the intercept and slope coefficients. Consequently these should not be interpreted outside of how the time index is defined.
One traditional way of dealing with time trends is to “detrend” the data. This means using an estimation method to estimate the trend and subtract it off. The simplest method is least squares linear detrending. Given the linear model
the coefficients are estimated by least squares. The detrended series is the residual . More intricate methods can be used but they have a similar flavor.
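A minimal Python sketch of least squares detrending (numpy only; the function name is hypothetical):

```python
import numpy as np

def detrend_linear(y):
    """Regress y_t on an intercept and a linear time trend and return
    the residual (detrended) series along with the fitted coefficients."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n), np.arange(1, n + 1)])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # (intercept, trend slope)
    u = y - X @ beta                              # detrended series
    return u, beta
```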
To understand the properties of the detrending method we can apply an asymptotic approximation. A time trend is not a stationary process so we should be thoughtful before applying standard theory. We will study asymptotics for non-stationary processes in more detail in Chapter 16 so our treatment here will be brief. It turns out that most of our conventional procedures work just fine with time trends (and quadratics in time) as regressors. The rates of convergence change but this does not affect anything of practical importance.
Let us demonstrate that the least squares estimator of the coefficients in (14.57) is consistent. We can write the estimator as
We need to study the behavior of the sums in the design matrix. For this the following result is useful; it follows by taking the limit of the Riemann sum for the integral $\int_0^1 u^k \, du = 1/(k+1)$.
Theorem 14.36 For any $k \geq 0$, $\; n^{-(k+1)} \sum_{t=1}^{n} t^{k} \to \dfrac{1}{k+1}$ as $n \to \infty$.
Theorem 14.36 implies that
\[ \frac{1}{n^{2}}\sum_{t=1}^{n} t \to \frac{1}{2} \]
and
\[ \frac{1}{n^{3}}\sum_{t=1}^{n} t^{2} \to \frac{1}{3}. \]
What is interesting about these results is that the sums require normalizations other than $n$! To handle this in multiple regression it is convenient to define a scaling matrix which normalizes each element in the regression by its convergence rate. Define the diagonal scaling matrix whose first diagonal element (for the intercept) is $n^{1/2}$ and whose second diagonal element (for the time trend) is $n^{3/2}$. Then
Multiplying by we obtain
The denominator matrix satisfies
which is invertible. Setting , the numerator vector can be written as . It has variance
by Theorem if satisfies the mixing and moment conditions for the central limit theorem. This means that the numerator vector is . (It is also asymptotically normal but we defer this demonstration for now.) We conclude that
This shows that both coefficients are consistent: the intercept estimator converges at the standard $n^{1/2}$ rate, and the trend coefficient estimator converges at the faster $n^{3/2}$ rate.
The consistency of the coefficient estimators (and their rates of convergence) can be used to show that linear detrending (regression of on an intercept and time trend to obtain a residual ) is consistent for the error in (14.57).
An alternative is to include a time trend in the estimated regression. If we have an autoregression, a distributed lag, or an AR-DL model, we add a time trend to obtain a model of the form
Estimation by least squares is equivalent to estimation after linear detrending by the FWL theorem. Inclusion of a linear (and possibly quadratic) time trend in a regression model is typically the easiest method to incorporate time trends.
Illustration
We illustrate the models described in the previous section using a classical Phillips curve for inflation prediction. A. W. Phillips (1958) famously observed that the unemployment rate and the wage inflation rate are negatively correlated over time. Equations relating the inflation rate, or the change in the inflation rate, to macroeconomic indicators such as the unemployment rate are typically described as “Phillips curves”. A simple Phillips curve takes the form
where is price inflation and is the unemployment rate. This specification relates the change in inflation in a given period to the level of the unemployment rate in the previous period.
The least squares estimate of (14.58) using U.S. quarterly series from FRED-QD is reported in the first column of Table 14.3. Both heteroskedasticity-robust and Newey-West standard errors are reported. The Newey-West standard errors are the appropriate choice since the estimated equation is static - no modeling of the serial correlation. In this example the measured impact of the unemployment rate on inflation appears minimal. The estimate is consistent with a small effect of the unemployment rate on the inflation rate but it is not precisely estimated.
A distributed lag (DL) model takes the form
The least squares estimate of (14.59) is reported in the second column of Table 14.3. The estimates are quite different from the static model. We see large negative impacts in the first and third periods, countered by a large positive impact in the second period. The model suggests that the unemployment rate has a strong impact on the inflation rate but the long-run impact is mitigated. The long-run multiplier is reported at the bottom of the column. The point estimate of is quite small and similar to the static estimate. It implies that an increase in the unemployment rate by 5 percentage points (a typical recession) decreases the long-run annual inflation rate by about a half of a percentage point.
An AR-DL takes the form
The least squares estimate of is reported in the third column of Table 14.3. The coefficient estimates are similar to those from the distributed lag model. The point estimate of the long-run multiplier is also nearly identical but with a smaller standard error.
Granger Causality
In the AR-DL model (14.60) the unemployment rate has no predictive impact on the inflation rate under the coefficient restriction . This restriction is called Granger non-causality. When the coefficients are non-zero we say that the unemployment rate “Granger causes” the inflation rate. This definition of causality was developed by Granger (1969) and Sims (1972).
The reason why we call this “Granger causality” rather than “causality” is because this is not a structural definition. An alternative label is “predictive causality”.
To be precise, assume that we have two series . Consider the projection of onto the lagged history of both series
Table 14.3: Phillips Curve Regressions

Standard errors robust to heteroskedasticity in parentheses.
Newey-West standard errors in square brackets.

We say that $X_t$ does not Granger-cause $Y_t$ if the coefficients on the lagged values of $X_t$ in this projection are all zero. If some of these coefficients are non-zero then we say that $X_t$ Granger-causes $Y_t$.
It is important that the definition includes the projection on the past history of $Y_t$. Granger causality means that $X_t$ helps to predict $Y_t$ even after the past history of $Y_t$ has been accounted for.
The definition can alternatively be written in terms of conditional expectations rather than projections. In that case we say that $X_t$ does not Granger-cause $Y_t$ if the conditional mean of $Y_t$ given the joint past history of both series equals the conditional mean given the past history of $Y_t$ alone.
Granger causality can be tested in AR-DL models using a standard Wald or F test. In the context of model (14.60) we report the F statistic for . The test rejects the hypothesis (and thus finds evidence of Granger causality) if the statistic is larger than the critical value (if the p-value is small) and fails to reject the hypothesis (and thus finds no evidence of causality) if the statistic is smaller than the critical value.
For example, in the results presented in Table 14.3, the F statistic for this hypothesis, computed using the Newey-West covariance matrix, has a very small p-value. This is statistically significant at any conventional level so we can conclude that the unemployment rate has a predictively causal impact on inflation.
Granger causality should not be interpreted structurally outside the context of an economic model. For example, consider the regression of GDP growth rates on stock price growth rates. We use the quarterly series from FRED-QD, estimating an AR-DL specification with two lags.

The coefficients on the lagged stock price growth rates are small in magnitude but the first lag appears statistically significant. The F statistic for the exclusion of the lagged stock price growth rates is highly significant (its p-value is very small). We can therefore reject the hypothesis of no Granger causality and deduce that stock prices Granger-cause GDP growth. We should be wary of concluding that this is structurally causal - that stock market movements cause output fluctuations. A more reasonable explanation from economic theory is that stock prices are forward-looking measures of expected future profits. When corporate profits are forecasted to rise the value of corporate stock rises, bidding up stock prices. Thus stock prices move in advance of actual economic activity but are not necessarily structurally causal.
Testing for Serial Correlation in Regression Models
Consider the problem of testing for omitted serial correlation in an AR-DL model such as
The null hypothesis is that is serially uncorrelated and the alternative hypothesis is that it is serially correlated. We can model the latter as a mean-zero autoregressive process
The hypothesis is
There are two ways to implement a test of the null against this alternative. The first is to estimate equations (14.61)-(14.62) sequentially by least squares and construct a test of the hypothesis on the second equation. This test is complicated by the two-step estimation. Therefore this approach is not recommended.
The second approach is to combine equations (14.61)-(14.62) into a single model and execute the test as a restriction within this model. One way to make this combination is by using lag operator notation. Write (14.61)-(14.62) as
Applying the operator to the first equation we obtain
or
where the implied autoregressive lag polynomial is the product of the original autoregressive polynomial and the error's polynomial, and similarly for the distributed lag polynomial; each product polynomial has order $q$ higher than the original. The restriction under the null hypothesis is that these revert to their original orders. Thus we can implement the test by estimating an AR-DL model with $q$ additional lags of each variable and testing the exclusion of the final $q$ lags of the dependent variable and the regressors. This test has a conventional asymptotic distribution so it is simple to implement.
The basic message is that testing for omitted serial correlation can be implemented in regression models by estimating and contrasting different dynamic specifications.
Bootstrap for Time Series
Recall that the bootstrap approximates the sampling distribution of estimators and test statistics by the empirical distribution of the observations. The traditional nonparametric bootstrap is appropriate for independent observations. For dependent observations alternative methods should be used.
Bootstrapping for time series is considerably more complicated than the cross section case. Many methods have been proposed. One of the challenges is that theoretical justifications are more difficult to establish than in the independent observation case.
In this section we describe the most popular methods to implement bootstrap resampling for time series data.
Recursive Bootstrap
Estimate a complete model such as an AR(p), producing coefficient estimates and residuals.
Fix the initial conditions at the sample values.
Simulate i.i.d. draws from the empirical distribution of the residuals .
Create the bootstrap series by the recursive formula
This construction creates bootstrap samples with the stochastic properties of the estimated AR(p) model, including the auxiliary assumption that the errors are i.i.d. This method can work well if the true process is an AR(p). One flaw is that it imposes homoskedasticity on the errors, which may differ from the properties of the actual error process. Another limitation is that it is inappropriate for AR-DL models unless the conditioning variables are strictly exogenous.
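A sketch of the recursive bootstrap for an AR(p) with intercept, continuing the hypothetical ar_ols helper from the earlier sketch (one reasonable implementation under the assumptions described above):

```python
import numpy as np

def recursive_ar_bootstrap(y, p, B=1000, seed=0):
    """Recursive (residual) bootstrap for an AR(p) with intercept.

    Each bootstrap series is generated from the estimated coefficients
    with errors drawn i.i.d. from the empirical distribution of the
    residuals; the first p sample values serve as initial conditions.
    Returns the B bootstrap coefficient vectors.
    """
    rng = np.random.default_rng(seed)
    beta, e, _, _ = ar_ols(y, p)
    n = len(y)
    boot = np.empty((B, p + 1))
    for b in range(B):
        e_star = rng.choice(e, size=n)        # i.i.d. draws from the residuals
        y_star = np.empty(n)
        y_star[:p] = y[:p]                    # fix the initial conditions
        for t in range(p, n):
            lags = y_star[t - p:t][::-1]      # (y*_{t-1}, ..., y*_{t-p})
            y_star[t] = beta[0] + lags @ beta[1:] + e_star[t]
        boot[b] = ar_ols(y_star, p)[0]
    return boot
```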
There are alternative versions of this basic method. First, instead of fixing the initial conditions at the sample values a random block can be drawn from the sample. The difference is that this produces an unconditional distribution rather than a conditional one. Second, instead of drawing the errors from the residuals a parametric (typically normal) distribution can be used. This can improve precision when sample sizes are small but otherwise is not recommended.
Pairwise Bootstrap
Write the sample as where contains the lagged values used in estimation.
Apply the traditional nonparametric bootstrap which samples pairs i.i.d. from with replacement to create the bootstrap sample.
Create the bootstrap estimates on this bootstrap sample, e.g. regress on .
This construction is essentially the traditional nonparametric bootstrap but applied to the paired sample . It does not mimic the time series correlations across observations. However, it does produce bootstrap statistics with the correct first-order asymptotic distribution under MDS errors. This method may be useful when we are interested in the distribution of nonlinear functions of the coefficient estimates and therefore desire an improvement on the Delta Method approximation.
Fixed Design Residual Bootstrap
Write the sample as where contains the lagged values used in estimation and are the residuals.
Fix the regressors at their sample values.
Simulate i.i.d. draws from the empirical distribution of the residuals .
Set .
This construction is similar to the pairwise bootstrap but imposes an i.i.d. error. It is therefore only valid when the errors are i.i.d. (and thus excludes heteroskedasticity).
Fixed Design Wild Bootstrap
Write the sample as where contains the lagged values used in estimation and are the residuals.
Fix the regressors and residuals at their sample values.
Simulate i.i.d. auxiliary random variables with mean zero and variance one. See Section for a discussion of choices.
Set the bootstrap errors equal to the product of the auxiliary variables and the residuals, and set the bootstrap observations equal to the fitted values plus these bootstrap errors.
This construction is similar to the pairwise and fixed design bootstrap combined with the wild bootstrap. This imposes the conditional mean assumption on the error but allows heteroskedasticity.
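A minimal sketch of the fixed design wild bootstrap (numpy only; standard normal auxiliary draws are used here as one of the admissible mean-zero, unit-variance choices, and the function name is hypothetical):

```python
import numpy as np

def fixed_design_wild_bootstrap(y, X, B=1000, seed=0):
    """Fixed-design wild bootstrap for the regression of y on X.

    The regressors and residuals are held at their sample values; the
    bootstrap errors are e_t* = xi_t * e_t, which preserves the pattern
    of heteroskedasticity in the residuals.
    """
    rng = np.random.default_rng(seed)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta
    e = y - fitted
    boot = np.empty((B, X.shape[1]))
    for b in range(B):
        xi = rng.standard_normal(len(y))      # auxiliary draws, mean 0, variance 1
        y_star = fitted + xi * e              # bootstrap dependent variable
        boot[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    return boot
```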
Block Bootstrap
Write the sample as where contains the lagged values used in estimation.
Divide the sample of paired observations into blocks of length .
Resample complete blocks. For each simulated sample draw blocks.
Paste the blocks together to create the bootstrap time series .
This construction allows for arbitrary stationary serial correlation, heteroskedasticity, and model misspecification. One challenge is that the block bootstrap is sensitive to the block length and the way that the data are partitioned into blocks. The method may also work less well in small samples. Notice that the block bootstrap with block length one is equal to the pairwise bootstrap, and the latter is the traditional nonparametric bootstrap. Thus the block bootstrap is a natural generalization of the nonparametric bootstrap.
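A sketch of index construction for a non-overlapping block bootstrap (the choice of non-overlapping blocks and the handling of the final partial block are implementation assumptions, not the textbook's prescription):

```python
import numpy as np

def block_bootstrap_indices(n, block_length, seed=0):
    """Draw an index vector for a non-overlapping block bootstrap.

    The sample 0,...,n-1 is divided into consecutive blocks of the given
    length; complete blocks are drawn with replacement and pasted
    together until a series of length n is obtained.
    """
    rng = np.random.default_rng(seed)
    starts = np.arange(0, n - block_length + 1, block_length)  # block start points
    n_blocks = int(np.ceil(n / block_length))
    chosen = rng.choice(starts, size=n_blocks, replace=True)
    idx = np.concatenate([np.arange(s, s + block_length) for s in chosen])
    return idx[:n]

# Usage: resample the paired observations in blocks, then re-estimate
# idx = block_bootstrap_indices(len(Y), block_length=12)
# Y_star, X_star = Y[idx], X[idx]
```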
Technical Proofs*
Proof of Theorem 14.2 Define as the history of up to time . Write . Let be the pre-image of (the vectors such that . Then
Since is strictly stationary, is independent of . This means that the distribution of is independent of . This argument can be extended to show that the distribution of is independent of . This means that is strictly stationary as claimed.
Proof of Theorem 14.3 By the Cauchy criterion for convergence (see Theorem A.2 of Probability and Statistics for Economists), converges almost surely if for all ,
An astute reader may notice that the independence of from does not follow directly from the definition of strict stationarity. Indeed, a full derivation requires a measure-theoretic treatment. See Section 1.2.B of Petersen (1983) or Section of Stout (1974). Let be this event. Its complement is
This has probability
The second equality is Markov’s inequality (B.36) and the following is the triangle inequality (B.1). The limit is zero because and . Hence for all and . This means that converges with probability one, as claimed.
Since is strictly stationary then is as well by Theorem .
Proof of Theorem 14.4 See Theorem 14.14.
Proof of Theorem 14.5 Strict stationarity follows from Theorem 14.2. Let and be the histories of and . Write . Let be an invariant event for . We want to show or 1 . The event is a collection of histories, and occurs if and and only if an associated collection of histories occur. That is, for some sets and ,
The assumption that is invariant means it is unaffected by the time shift, thus can be written as
This means the event is invariant. Since is ergodic the event has probability 0 or 1. Hence or 1 , as desired.
Proof of Theorem 14.7 Suppose is discrete with support on and without loss of generality assume . Then by Theorem
which is (14.4). This can be extended to the case of continuous distributions using the monotone convergence theorem. See Corollary of Davidson (1994).
Proof of Theorem 14.9 We show (14.6). (14.7) follows by Markov’s inequality (B.36). Without loss of generality we focus on the scalar case and assume . Fix . Pick large enough such that
which is feasible because . Define
Notice that is a bounded transformation of the ergodic series . Thus by (14.4) and (14.9) there is an sufficiently large so that
By the triangle inequality (B.1)
By another application of the triangle inequality and (14.63)
By Jensen’s inequality (B.27), direct calculation, and (14.64)
Thus
Together, (14.65), (14.66) and (14.67) show that . Since is arbitrary, this establishes (14.6) as claimed.
Proof of Theorem 14.11 (sketch) By the Cramér-Wold device (Theorem from Probability and Statistics for Economists) it is sufficient to establish the result for scalar . Let . By a Taylor series expansion, for small . Taking exponentials and rearranging we obtain the approximation
Fix . Define
Since is strictly stationary and ergodic, by the Ergodic Theorem (Theorem 14.9). Since is a MDS
To see this, define . Note . By iterated expectations
This is (14.69).
The moment generating function of is
The approximation in (14.70) is (14.68). The approximation (14.71) is . (A rigorous justification which allows this substitution in the expectation is technical.) The final equality is (14.69). This shows that the moment generating function of is approximately that of , as claimed.
The assumption that is a MDS is critical for (14.69). is a nonlinear function of the errors so a white noise assumption cannot be used instead. The MDS assumption is exactly the minimal condition needed to obtain (14.69). This is why the MDS assumption cannot be easily replaced by a milder assumption such as white noise.
Proof of Theorem 14.13.1 Without loss of generality suppose and . Set . By iterated expectations, , and again using iterated expectations
Setting , by a similar argument (14.72) is bounded by . Set . We calculate
Together, as claimed.
Proof of Theorem 14.13.2 Assume and . We first show that if then
Indeed, if the result is immediate so assume . Set and . Using the triangle inequality (B.1) and then part 1, because and ,
Also,
using the definition of . Together we have
which is (14.73) as claimed.
Now set and . Using the triangle inequality and (14.73)
Since , using (14.73) and the definition of
Using Hölder’s inequality (B.31) and the definition of
Together we have
as claimed.
Proof of Theorem 14.13.3 Set which satisfies . Since is measurable, iterated expectations, using (14.73) with , the conditional Jensen's inequality (B.28), and iterated expectations,
as claimed.
Proof of Theorem 14.15 By the Cramér-Wold device (Theorem of Probability and Statistics for Economists) it is sufficient to prove the result for the scalar case. Our proof method is based on a MDS approximation. The trick is to establish the relationship
where is a strictly stationary and ergodic MDS with and . Defining , we have
The first component on the right side is asymptotically by the MDS CLT (Theorem 14.11). The second and third terms are by Markov’s inequality (B.36).
The desired relationship (14.74) holds as follows. Set ,
and
You can verify that these definitions satisfy (14.74) given . The variable has a finite expectation because by the triangle inequality (B.1), Theorem 14.13.3, and the assumptions
the final inequality because implies .
The series in (14.76) has a finite expectation by the same calculation as for . It is a MDS since by iterated expectations
It is strictly stationary and ergodic by Theorem because it is a function of the history .
The proof is completed by showing that has a finite variance which equals . The trickiest step is to show that . Since
(as shown in (14.17)) it follows that for sufficiently large. Using (14.75) and , for sufficiently large,
Now define which is a bounded MDS. By Theorem 14.11, where . Since the sequence is uniformly integrable this implies
using . We want to show that . Suppose not. Then as , so there will be some sufficiently large such that the right-side of (14.78) exceeds the right-side of (14.77). This is a contradiction. We deduce that .
Examining (14.75), we see that since var and then . Since is stationary, we deduce that . Equation (14.75) implies var . We deduce that as claimed.
Proof of Theorem 14.17 (Sketch) Consider the projection of onto . Since the projection errors are uncorrelated, the coefficients of this projection are the bivariate projection coefficients . The leading coefficient is
using Theorem 14.16. By Bessel’s Inequality (Brockwell and Davis, 1991, Corollary 2.4.1),
because by the assumption of covariance stationarity.
The error from the projection of onto is . The fact that this can be written as (14.22) is technical. See Theorem 5.7.1 of Brockwell and Davis (1991).
Proof of Theorem 14.22 In the text we showed that is sufficient for to be strictly stationary and ergodic. We now verify that is equivalent to (14.35)-(14.37). The roots are defined in (14.34). Consider separately the cases of real roots and complex roots.
Suppose that the roots are real, which occurs when . Then iff and
Equivalently, this holds iff
or equivalently iff
which are (14.35) and (14.36). and imply , which is (14.37).
Now suppose the roots are complex, which occurs when . The squared modulus of the roots are
Thus the requirement is satisfied iff , which is (14.37). and imply , so . and imply and which are (14.35) and (14.36).
Proof of Theorem 14.23 To complete the proof we need to establish that the eigenvalues of defined in (14.40) equal the reciprocals of the roots of the autoregressive polynomial of (14.39). Our goal is therefore to show that if satisfies then it satisfies .
Notice that
where , and is a lower-diagonal matrix with on the diagonal and 1 immediately below the diagonal. Notice that and by direct calculation
Using the properties of the determinant (Theorem A.1.5)
Thus if satisfies then as required.
Proof of Theorem 14.24 By the Fundamental Theorem of Algebra we can factor the autoregressive polynomial as where . By assumption . Inverting the autoregressive polynomial we obtain
with
Using the triangle inequality and the stars and bars theorem (Theorem of Probability and Statistics for Economists)
as claimed. We next verify the convergence of . Note that
By the ratio test (Theorem A.3.2 of Probability and Statistics for Economists) is convergent.
Proof of Theorem 14.27 If is singular then there is some such that . We can normalize to have a unit coefficient on (or the first non-zero coefficient other than the intercept). We then have that for some , or equivalently 0. Setting this implies . Since is the best linear predictor we must have . This implies . This contradicts the assumption . We conclude that is not singular.
Exercises
Exercise 14.1 For a scalar time series define the sample autocovariance and autocorrelation
Assume the series is strictly stationary, ergodic, and has finite second moments.
Show that and as . (Use the Ergodic Theorem.)
Exercise 14.2 Show that if is a MDS and is -measurable then is a MDS.
Exercise 14.3 Let . Show that is a MDS.
Exercise 14.4 Continuing the previous exercise, show that if then
Express in terms of the moments of .
Exercise 14.5 A stochastic volatility model is
where and are independent i.i.d. shocks.
Write down an information set for which is a MDS.
Show that if then is strictly stationary and ergodic.
Exercise 14.6 Verify the formula for a MA(1) process.
Exercise 14.7 Verify the formula for a process.
Exercise 14.8 Suppose with i.i.d. and . Find var . Is stationary?
Exercise 14.9 Take the AR(1) model with no intercept .
Find the impulse response function .
Let be the least squares estimator of . Find an estimator of .
Let be a standard error for . Use the delta method to find a 95% asymptotic confidence interval for
Exercise 14.10 Take the AR(2) model . (a) Find expressions for the impulse responses and .
Let be the least squares estimator. Find an estimator of .
Let be the estimated covariance matrix for the coefficients. Use the delta method to find a asymptotic confidence interval for .
Exercise 14.11 Show that the models
and
are identical. Find an expression for in terms of and .
Exercise 14.12 Take the model
where and are and order lag polynomials. Show that these equations imply that
for some lag polynomial . What is the order of ?
Exercise 14.13 Suppose that where and are mutually independent i.i.d. processes.
- Show that is a MA(1) process for a white noise error .
Hint: Calculate the autocorrelation function of .
Find an expression for in terms of .
. Find .
Exercise 14.14 Suppose that
where the errors and are mutually independent i.i.d. processes. Show that is an ARMA(1,1) process.
Exercise 14.15 A Gaussian AR model is an autoregression with i.i.d. errors. Consider the Gaussian AR(1) model
with . Show that the marginal distribution of is also normal:
Hint: Use the MA representation of . Exercise 14.16 Assume that is a Gaussian as in the previous exercise. Calculate the moments
A colleague suggests estimating the parameters of the Gaussian AR(1) model by GMM applied to the corresponding sample moments. He points out that there are three moments and three parameters, so it should be identified. Can you find a flaw in his approach?
Hint: This is subtle.
Exercise 14.17 Take the nonlinear process
where is i.i.d. with strictly positive support.
Find the condition under which is strictly stationary and ergodic.
Find an explicit expression for as a function of .
Exercise 14.18 Take the quarterly series pnfix (nonresidential real private fixed investment) from FRED-QD.
Transform the series into quarterly growth rates.
Estimate an AR(4) model. Report using heteroskedastic-consistent standard errors.
Repeat using the Newey-West standard errors, using .
Comment on the magnitude and interpretation of the coefficients.
Calculate (numerically) the impulse responses for .
Exercise 14.19 Take the quarterly series oilpricex (real price of crude oil) from FRED-QD.
Transform the series by taking first differences.
Estimate an AR(4) model. Report using heteroskedastic-consistent standard errors.
Test the hypothesis that the real oil prices is a random walk by testing that the four AR coefficients jointly equal zero.
Interpret the coefficient estimates and test result.
Exercise 14.20 Take the monthly series unrate (unemployment rate) from FRED-MD.
Estimate AR(1) through AR(8) models, using the sample starting in so that all models use the same observations.
Compute the AIC for each AR model and report.
Which AR model has the lowest AIC?
Report the coefficient estimates and standard errors for the selected model.
Exercise 14.21 Take the quarterly series unrate (unemployment rate) and claimsx (initial claims) from FRED-QD. “Initial claims” are the number of individuals who file for unemployment insurance.
Estimate a distributed lag regression of the unemployment rate on initial claims. Use lags 1 through 4. Which standard error method is appropriate?
Estimate an autoregressive distributed lag regression of the unemployment rate on initial claims. Use lags 1 through 4 for both variables.
Test the hypothesis that initial claims does not Granger cause the unemployment rate.
Interpret your results.
Exercise 14.22 Take the quarterly series gdpc1 (real GDP) and houst (housing starts) from FRED-QD. “Housing starts” are the number of new houses on which construction is started.
Transform the real GDP series into its one quarter growth rate.
Estimate a distributed lag regression of GDP growth on housing starts. Use lags 1 through 4. Which standard error method is appropriate?
Estimate an autoregressive distributed lag regression of GDP growth on housing starts. Use lags 1 through 2 for GDP growth and 1 through 4 for housing starts.
Test the hypothesis that housing starts does not Granger cause GDP growth.
Interpret your results.