14  Time Series

14.1 Introduction

A time series $Y_t \in \mathbb{R}^m$ is a process which is sequentially ordered over time. In this textbook we focus on discrete time series where $t$ is an integer, though there is also a considerable literature on continuous-time processes. To denote the time period it is typical to use the subscript $t$. The time series is univariate if $m = 1$ and multivariate if $m > 1$. This chapter is primarily focused on univariate time series models, though we describe the concepts for the multivariate case when the added generality does not add extra complication.

Most economic time series are recorded at discrete intervals such as annual, quarterly, monthly, weekly, or daily. The number of observed periods $s$ per year is called the frequency. In most cases we will denote the observed sample by the periods $t = 1, \ldots, n$.

Because of the sequential nature of time series we expect that observations close in calendar time, e.g. $Y_t$ and its lagged value $Y_{t-1}$, will be dependent. This type of dependence structure requires a different distributional theory than for cross-sectional and clustered observations since we cannot divide the sample into independent groups. Many of the issues which distinguish time series from cross-section econometrics concern the modeling of these dependence relationships.

There are many excellent textbooks for time series analysis. The encyclopedic standard is Hamilton (1994). Others include Harvey (1990), Tong (1990), Brockwell and Davis (1991), Fan and Yao (2003), Lütkepohl (2005), Enders (2014), and Kilian and Lütkepohl (2017). For textbooks on the related subject of forecasting see Granger and Newbold (1986), Granger (1989), and Elliott and Timmermann (2016).

14.2 Examples

Many economic time series are macroeconomic variables. An excellent resource for U.S. macroeconomic data is provided by the FRED-MD and FRED-QD databases, which contain a wide set of monthly and quarterly variables, assembled and maintained by the St. Louis Federal Reserve Bank. See McCracken and Ng (2016, 2021). The datasets FRED-MD and FRED-QD for 1959-2017 are posted on the textbook website. FRED-MD has 129 variables over 708 months. FRED-QD has 248 variables over 236 quarters.

When working with time series data one of the first tasks is to plot the series against time. In Figures 14.1-14.2 we plot eight example time series from FRED-QD and FRED-MD. As is conventional, the x-axis displays calendar dates (in this case years) and the y-axis displays the level of the series. The series plotted are: (1a) Real U.S. GDP (gdpc1); (1b) U.S.-Canada exchange rate (excausx); (1c) Interest rate on U.S. 10-year Treasury bond (gs10); (1d) Real crude oil price (oilpricex); (2a) U.S. unemployment rate (unrate); (2b) U.S. real non-durables consumption growth rate (growth rate of pcndx); (2c) U.S. CPI inflation rate

Figure 14.1: GDP, Exchange Rate, Interest Rate, Oil Price. Panels: U.S. Real GDP; Interest Rate on 10-Year Treasury; U.S.-Canada Exchange Rate; Real Crude Oil Price.

(growth rate of cpiaucsl); (2d) S&P 500 return (growth rate of sp500). Series (1a) and (2b) are quarterly; the rest are monthly.

Many of the plots are smooth, meaning that the neighboring values (in calendar time) are similar to one another and hence are serially correlated. Some of the plots are non-smooth, meaning that the neighboring values are less similar and hence less correlated. At least one plot (real GDP) displays an upward trend.

Figure 14.2: Unemployment Rate, Consumption Growth Rate, Inflation Rate, and S&P 500 Return. Panels: U.S. Unemployment Rate; U.S. Inflation Rate; Consumption Growth Rate; S&P 500 Return.

14.3 Differences and Growth Rates

It is common to transform series by taking logarithms, differences, and/or growth rates. Three of the series in Figure 14.2 (consumption growth, inflation [growth rate of CPI index], and S&P 500 return) are displayed as growth rates. This may be done for a number of reasons. The most credible is that this is the suitable transformation for the desired analysis.

Many aggregate series such as real GDP are transformed by taking natural logarithms. This flattens the apparent exponential growth and makes fluctuations proportionate.

The first difference of a series $Y_t$ is

$\Delta Y_t = Y_t - Y_{t-1}.$

The second difference is

$\Delta^2 Y_t = \Delta Y_t - \Delta Y_{t-1}.$

Higher-order differences can be defined similarly but are not used in practice. The annual, or year-on-year, change of a series $Y_t$ with frequency $s$ is

$\Delta_s Y_t = Y_t - Y_{t-s}.$

There are several methods to calculate growth rates. The one-period growth rate is the percentage change from period $t-1$ to period $t$:

$Q_t = 100 \left( \frac{\Delta Y_t}{Y_{t-1}} \right) = 100 \left( \frac{Y_t}{Y_{t-1}} - 1 \right). \quad (14.1)$

The multiplication by 100 is not essential but scales $Q_t$ so that it is a percentage. This is the transformation used for the plots in Figure 14.2(b)-(d). For quarterly data, $Q_t$ is the quarterly growth rate. For monthly data, $Q_t$ is the monthly growth rate.

For non-annual data the one-period growth rate (14.1) may be unappealing for interpretation. Consequently, statistical agencies commonly report "annualized" growth rates, which equal the annual growth which would occur if the one-period growth rate were compounded for a full year. For a series with frequency $s$ the annualized growth rate is

$A_t = 100 \left( \left( \frac{Y_t}{Y_{t-1}} \right)^s - 1 \right). \quad (14.2)$

Notice that $A_t$ is a nonlinear function of $Q_t$.

Year-on-year growth rates are

$G_t = 100 \left( \frac{\Delta_s Y_t}{Y_{t-s}} \right) = 100 \left( \frac{Y_t}{Y_{t-s}} - 1 \right).$

These do not need annualization.

Growth rates are closely related to logarithmic transformations. For small growth rates, $Q_t$, $A_t$, and $G_t$ are approximately first differences in logarithms:

$Q_t \simeq 100\, \Delta \log Y_t$
$A_t \simeq s \times 100\, \Delta \log Y_t$
$G_t \simeq 100\, \Delta_s \log Y_t.$

For analysis using growth rates I recommend the one-period growth rates (14.1) or differenced logarithms rather than the annualized growth rates (14.2). While annualized growth rates are preferred for reporting, they are a highly nonlinear transformation which is unnatural for statistical analysis. Differenced logarithms are the most common choice and are recommended for models which combine log-levels and growth rates, since then the model is linear in all variables.
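To make the transformations concrete, here is a minimal sketch in Python (the function name and the illustrative series are my own, not from the textbook) computing $Q_t$, $A_t$, $G_t$, and the log-difference approximation for a hypothetical quarterly series:

```python
import numpy as np

def growth_rates(y, s=4):
    """Growth-rate transformations of a series y observed s times per year.

    Returns one-period growth Q_t (14.1), annualized growth A_t (14.2),
    year-on-year growth G_t, and the log-difference approximation to Q_t.
    """
    y = np.asarray(y, dtype=float)
    Q = 100 * (y[1:] / y[:-1] - 1)          # one-period growth rate
    A = 100 * ((y[1:] / y[:-1]) ** s - 1)   # annualized growth rate
    G = 100 * (y[s:] / y[:-s] - 1)          # year-on-year growth rate
    dlog = 100 * np.diff(np.log(y))         # log-difference approximation to Q
    return Q, A, G, dlog

# Example: a hypothetical quarterly series growing about 1% per quarter
y = 100 * 1.01 ** np.arange(12)
Q, A, G, dlog = growth_rates(y, s=4)
print(Q[0], A[0], G[0], dlog[0])   # roughly 1.0, 4.06, 4.06, 0.995
```

For the roughly one percent quarterly growth in this example, the annualized and year-on-year rates nearly coincide, while the log difference slightly understates $Q_t$, consistent with the approximations above.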

14.4 Stationarity

Recall that cross-sectional observations are conventionally treated as random draws from an underlying population. This is not an appropriate model for time series processes due to serial dependence. Instead, we treat the observed sample $\{Y_1, \ldots, Y_n\}$ as a realization of a dependent stochastic process. It is often useful to view $\{Y_1, \ldots, Y_n\}$ as a subset of an underlying doubly-infinite sequence $\{\ldots, Y_{t-1}, Y_t, Y_{t+1}, \ldots\}$.

A random vector $Y_t$ can be characterized by its distribution. A set such as $(Y_t, Y_{t+1}, \ldots, Y_{t+\ell})$ can be characterized by its joint distribution. Important features of these distributions are their means, variances, and covariances. Since there is only one observed time series sample, in order to learn about these distributions there needs to be some sort of constancy. This may only hold after a suitable transformation such as growth rates (as discussed in the previous section).

The most commonly assumed form of constancy is stationarity. There are two definitions. The first is sufficient for construction of linear models.

Definition 14.1 $\{Y_t\}$ is covariance or weakly stationary if the expectation $\mu = E[Y_t]$ and covariance matrix $\Sigma = \mathrm{var}[Y_t] = E[(Y_t - \mu)(Y_t - \mu)']$ are finite and are independent of $t$, and the autocovariances

$\Gamma(k) = \mathrm{cov}(Y_t, Y_{t-k}) = E[(Y_t - \mu)(Y_{t-k} - \mu)']$

are independent of $t$ for all $k$.

In the univariate case we typically write the variance as $\sigma^2$ and the autocovariances as $\gamma(k)$.

The expectation $\mu$ and variance $\Sigma$ are features of the marginal distribution of $Y_t$ (the distribution of $Y_t$ at a specific time period $t$). Their constancy as stated in the above definition means that these features of the distribution are stable over time.

The autocovariances $\Gamma(k)$ are features of the bivariate distributions of $(Y_t, Y_{t-k})$. Their constancy as stated in the definition means that the correlation patterns between adjacent $Y_t$ are stable over time and only depend on the number of time periods $k$ separating the variables. By symmetry we have $\Gamma(k) = \Gamma(-k)'$. In the univariate case this simplifies to $\gamma(k) = \gamma(-k)$. The autocovariances $\Gamma(k)$ are finite under the assumption that the covariance matrix $\Sigma$ is finite by the Cauchy-Schwarz inequality.

The autocovariances summarize the linear dependence between $Y_t$ and its lags. A scale-free measure of linear dependence in the univariate case is the autocorrelation

$\rho(k) = \mathrm{corr}(Y_t, Y_{t-k}) = \frac{\mathrm{cov}(Y_t, Y_{t-k})}{\sqrt{\mathrm{var}[Y_t]\, \mathrm{var}[Y_{t-k}]}} = \frac{\gamma(k)}{\sigma^2} = \frac{\gamma(k)}{\gamma(0)}.$

Notice by symmetry that $\rho(k) = \rho(-k)$.
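As an illustration, here is a short sketch of how the sample analogues of $\gamma(k)$ and $\rho(k)$ might be computed (the function name is hypothetical; dividing each autocovariance by $n$ is one common convention):

```python
import numpy as np

def sample_acf(y, max_lag=10):
    """Sample autocovariances gamma(k) and autocorrelations rho(k), k = 0..max_lag."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    ybar = y.mean()
    gamma = np.array([np.sum((y[k:] - ybar) * (y[:n - k] - ybar)) / n
                      for k in range(max_lag + 1)])
    rho = gamma / gamma[0]
    return gamma, rho

# Example: for white noise the sample rho(k) should be near zero for k >= 1
rng = np.random.default_rng(0)
gamma, rho = sample_acf(rng.standard_normal(1000), max_lag=5)
print(np.round(rho, 3))
```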

The second definition of stationarity concerns the entire joint distribution.

Definition 14.2 $\{Y_t\}$ is strictly stationary if the joint distribution of $(Y_t, \ldots, Y_{t+\ell})$ is independent of $t$ for all $\ell$. This is the natural generalization of the cross-section definition of identical distributions. Strict stationarity implies that the (marginal) distribution of $Y_t$ does not vary over time. It also implies that the bivariate distributions of $(Y_t, Y_{t+1})$ and multivariate distributions of $(Y_t, \ldots, Y_{t+\ell})$ are stable over time. Under the assumption of a bounded variance a strictly stationary process is covariance stationary.¹

For formal statistical theory we generally require the stronger assumption of strict stationarity. Therefore if we label a process as “stationary” you should interpret it as meaning “strictly stationary”.

The core meaning of both weak and strict stationarity is the same: that the distribution of $Y_t$ is stable over time. To understand the concept it may be useful to review the plots in Figures 14.1-14.2. Are these stationary processes? If so, we would expect the expectation and variance to be stable over time. This seems unlikely to apply to the series in Figure 14.1, as in each case it is difficult to describe what is the "typical" value of the series. Stationarity may be appropriate for the series in Figure 14.2 as each oscillates with a fairly regular pattern. It is difficult, however, to know whether or not a given time series is stationary simply by examining a time series plot.

A straightforward but essential relationship is that an i.i.d. process is strictly stationary.

Theorem 14.1 If $Y_t$ is i.i.d., then it is strictly stationary.

Here are some examples of strictly stationary scalar processes. In each, $e_t$ is i.i.d. and $E[e_t] = 0$.

Example 14.1 $Y_t = e_t + \theta e_{t-1}$.

Example 14.2 $Y_t = Z$ for some random variable $Z$.

Example 14.3 $Y_t = (-1)^t Z$ for a random variable $Z$ which is symmetrically distributed about 0.

Here are some examples of processes which are not stationary.

Example 14.4 $Y_t = t$.

Example 14.5 $Y_t = (-1)^t$.

Example 14.6 $Y_t = \cos(\theta t)$.

Example 14.7 $Y_t = t e_t$.

Example 14.8 $Y_t = e_t + t^{1/2} e_{t-1}$.

Example 14.9 $Y_t = Y_{t-1} + e_t$ with $Y_0 = 0$.

From the examples we can see that stationarity means that the distribution is constant over time. It does not mean, however, that the process has some sort of limited dependence, nor that there is an absence of periodic patterns. These restrictions are associated with the concepts of ergodicity and mixing which we shall introduce in subsequent sections.

¹ More generally, the two classes are non-nested since strictly stationary infinite-variance processes are not covariance stationary.

14.5 Transformations of Stationary Processes

One of the important properties of strict stationarity is that it is preserved by transformation. That is, transformations of strictly stationary processes are also strictly stationary. This includes transformations which depend on the full history of $Y_t$.

Theorem 14.2 If $Y_t$ is strictly stationary and $X_t = \phi(Y_t, Y_{t-1}, Y_{t-2}, \ldots) \in \mathbb{R}^q$ is a random vector then $X_t$ is strictly stationary.

Theorem 14.2 is extremely useful both for the study of stochastic processes which are constructed from underlying errors and for the study of sample statistics such as linear regression estimators which are functions of sample averages of squares and cross-products of the original data.

We give the proof of Theorem 14.2 in Section 14.47.

14.6 Convergent Series

A transformation which includes the full past history is an infinite-order moving average. For scalar $Y_t$ and coefficients $a_j$ define the process

$X_t = \sum_{j=0}^{\infty} a_j Y_{t-j}. \quad (14.3)$

Many time-series models involve representations and transformations of the form (14.3).

The infinite series (14.3) exists if it is convergent, meaning that the sequence $\sum_{j=0}^{N} a_j Y_{t-j}$ has a finite limit as $N \to \infty$. Since the inputs $Y_t$ are random we define this as a probability limit.

Definition 14.3 The infinite series (14.3) converges almost surely if $\sum_{j=0}^{N} a_j Y_{t-j}$ has a finite limit as $N \to \infty$ with probability one. In this case we describe $X_t$ as convergent.

Theorem 14.3 If $Y_t$ is strictly stationary, $E|Y| < \infty$, and $\sum_{j=0}^{\infty} |a_j| < \infty$, then (14.3) converges almost surely. Furthermore, $X_t$ is strictly stationary.

The proof of Theorem 14.3 is provided in Section 14.47.

14.7 Ergodicity

Stationarity alone is not sufficient for the weak law of large numbers as there are strictly stationary processes with no time series variation. As we described earlier, an example of a stationary process is $Y_t = Z$ for some random variable $Z$. This is random but constant over all time. An implication is that the sample mean of $Y_t = Z$ will be inconsistent for the population expectation.

What is a minimal assumption beyond stationarity so that the law of large numbers applies? This topic is called ergodicity. It is sufficiently important that it is treated as a separate area of study. We mention only a few highlights here. For a rigorous treatment see a standard textbook such as Walters (1982).

A time series $Y_t$ is ergodic if all invariant events are trivial, meaning that any event which is unaffected by time-shifts has probability either zero or one. This definition is rather abstract and difficult to grasp but fortunately it is not needed by most economists.

A useful intuition is that if $Y_t$ is ergodic then its sample paths will pass through all parts of the sample space, never getting "stuck" in a subregion.

We will first describe the properties of ergodic series which are relevant for our needs and follow with the more rigorous technical definitions. For proofs of the results see Section 14.47.

First, many standard time series processes can be shown to be ergodic. A useful starting point is the observation that an i.i.d. sequence is ergodic.

Theorem 14.4 If $Y_t \in \mathbb{R}^m$ is i.i.d. then it is strictly stationary and ergodic.

Second, ergodicity, like stationarity, is preserved by transformation.

Theorem 14.5 If $Y_t \in \mathbb{R}^m$ is strictly stationary and ergodic and $X_t = \phi(Y_t, Y_{t-1}, Y_{t-2}, \ldots)$ is a random vector, then $X_t$ is strictly stationary and ergodic.

As an example, the infinite-order moving average transformation (14.3) is ergodic if the input is ergodic and the coefficients are absolutely convergent.

Theorem 14.6 If $Y_t$ is strictly stationary, ergodic, $E|Y| < \infty$, and $\sum_{j=0}^{\infty} |a_j| < \infty$, then $X_t = \sum_{j=0}^{\infty} a_j Y_{t-j}$ is strictly stationary and ergodic.

We now present a useful property. It is that the Cesàro sum of the autocovariances converges to zero.

Theorem 14.7 If $Y_t \in \mathbb{R}$ is strictly stationary, ergodic, and $E[Y^2] < \infty$, then

$\lim_{n \to \infty} \frac{1}{n} \sum_{\ell=1}^{n} \mathrm{cov}(Y_t, Y_{t+\ell}) = 0. \quad (14.4)$

The result (14.4) can be interpreted as saying that the autocovariances "on average" tend to zero. Some authors have mis-stated ergodicity as implying that the covariances tend to zero but this is not correct, as (14.4) allows, for example, the non-convergent sequence $\mathrm{cov}(Y_t, Y_{t+\ell}) = (-1)^{\ell}$. The reason why (14.4) is particularly useful is because it is sufficient for the WLLN as we discover later in Theorem 14.9.

We now give the formal definition of ergodicity for interested readers. As the concepts will not be used again most readers can safely skip this discussion.

As we stated above, by definition the series $Y_t \in \mathbb{R}^m$ is ergodic if all invariant events are trivial. To understand this we introduce some technical definitions. First, we can write an event as $A = \{\tilde{Y}_t \in G\}$ where $\tilde{Y}_t = (\ldots, Y_{t-1}, Y_t, Y_{t+1}, \ldots)$ is an infinite history and $G \subset \mathbb{R}^{m\infty}$. Second, the $\ell^{th}$ time-shift of $\tilde{Y}_t$ is defined as $\tilde{Y}_{t+\ell} = (\ldots, Y_{t-1+\ell}, Y_{t+\ell}, Y_{t+1+\ell}, \ldots)$. Thus $\tilde{Y}_{t+\ell}$ replaces each observation in $\tilde{Y}_t$ by its $\ell^{th}$ shifted value $Y_{t+\ell}$. A time-shift of the event $A = \{\tilde{Y}_t \in G\}$ is $A_\ell = \{\tilde{Y}_{t+\ell} \in G\}$. Third, an event $A$ is called invariant if it is unaffected by a time-shift, so that $A_\ell = A$. Thus replacing any history $\tilde{Y}_t$ with its shifted history $\tilde{Y}_{t+\ell}$ doesn't change the event. Invariant events are rather special. An example of an invariant event is $A = \{\max_{-\infty < t < \infty} Y_t \leq 0\}$. Fourth, an event $A$ is called trivial if either $P[A] = 0$ or $P[A] = 1$. You can think of trivial events as essentially non-random. Recall, by definition $Y_t$ is ergodic if all invariant events are trivial. This means that any event which is unaffected by a time shift is trivial, and hence essentially non-random. For example, again consider the invariant event $A = \{\max_{-\infty < t < \infty} Y_t \leq 0\}$. If $Y_t = Z \sim N(0,1)$ for all $t$ then $P[A] = P[Z \leq 0] = 0.5$. Since this does not equal 0 or 1, $Y_t = Z$ is not ergodic. However, if $Y_t$ is i.i.d. $N(0,1)$ then $P[\max_{-\infty < t < \infty} Y_t \leq 0] = 0$. This is a trivial event. For $Y_t$ to be ergodic (it is in this case) all such invariant events must be trivial.

An important technical result is that ergodicity is equivalent to the following property.

Theorem 14.8 A stationary series $Y_t \in \mathbb{R}^m$ is ergodic if and only if for all events $A$ and $B$, where $B_\ell$ denotes the $\ell^{th}$ time-shift of $B$,

$\lim_{n \to \infty} \frac{1}{n} \sum_{\ell=1}^{n} P[A \cap B_\ell] = P[A] P[B]. \quad (14.5)$

This result is rather deep so we do not prove it here. See Walters (1982), Corollary 1.14.2, or Davidson (1994), Theorem 14.7. The limit in (14.5) is the Cesàro sum of $P[A \cap B_\ell]$. The Theorem of Cesàro Means (Theorem A.4 of Probability and Statistics for Economists) shows that a sufficient condition for (14.5) is that $P[A \cap B_\ell] \to P[A] P[B]$ as $\ell \to \infty$, which is known as mixing. Thus mixing implies ergodicity. Mixing, roughly, means that separated events are asymptotically independent. Ergodicity is weaker, only requiring that the events are asymptotically independent "on average". We discuss mixing in Section 14.12.

14.8 Ergodic Theorem

The ergodic theorem is one of the most famous results in time series theory. There are actually several forms of the theorem, most of which concern almost sure convergence. For simplicity we state the theorem in terms of convergence in probability.

Theorem 14.9 Ergodic Theorem. If $Y_t \in \mathbb{R}^m$ is strictly stationary, ergodic, and $E\|Y\| < \infty$, then as $n \to \infty$,

$E \left\| \bar{Y} - \mu \right\| \to 0$

and

$\bar{Y} \to_p \mu$

where $\mu = E[Y]$.

The ergodic theorem shows that ergodicity is sufficient for consistent estimation. The moment condition $E\|Y\| < \infty$ is the same as in the WLLN for i.i.d. observations.

We now provide a proof of the ergodic theorem for the scalar case under the additional assumption that $\mathrm{var}[Y] = \sigma^2 < \infty$. A proof which relaxes this assumption is provided in Section 14.47.

By direct calculation

$\mathrm{var}[\bar{Y}] = \frac{1}{n^2} \sum_{t=1}^{n} \sum_{j=1}^{n} \gamma(t - j)$

where $\gamma(\ell) = \mathrm{cov}(Y_t, Y_{t+\ell})$. The double sum is over all elements of an $n \times n$ matrix whose $(t, j)^{th}$ element is $\gamma(t - j)$. The diagonal elements are $\gamma(0) = \sigma^2$, the first off-diagonal elements are $\gamma(1)$, the second off-diagonal elements are $\gamma(2)$, and so on. This means that there are precisely $n$ diagonal elements equalling $\sigma^2$, $2(n-1)$ equalling $\gamma(1)$, etc. Thus the above equals

$\mathrm{var}[\bar{Y}] = \frac{1}{n^2} \left( n \sigma^2 + 2(n-1)\gamma(1) + 2(n-2)\gamma(2) + \cdots + 2\gamma(n-1) \right) = \frac{\sigma^2}{n} + \frac{2}{n} \sum_{\ell=1}^{n-1} \left( 1 - \frac{\ell}{n} \right) \gamma(\ell). \quad (14.8)$

This is a rather intriguing expression. It shows that the variance of the sample mean precisely equals $\sigma^2/n$ (which is the variance of the sample mean under i.i.d. sampling) plus a weighted Cesàro mean of the autocovariances. The latter is zero under i.i.d. sampling but is non-zero otherwise. Theorem 14.7 shows that the Cesàro mean of the autocovariances converges to zero. Let $w_{n\ell} = 2(\ell/n^2)$, which satisfy the conditions of the Toeplitz Lemma (Theorem A.5 of Probability and Statistics for Economists). Then

$\frac{2}{n} \sum_{\ell=1}^{n-1} \left( 1 - \frac{\ell}{n} \right) \gamma(\ell) = \frac{2}{n^2} \sum_{\ell=1}^{n-1} \sum_{j=1}^{\ell} \gamma(j) = \sum_{\ell=1}^{n-1} w_{n\ell} \left( \frac{1}{\ell} \sum_{j=1}^{\ell} \gamma(j) \right) \to 0.$

Together, we have shown that (14.8) is $o(1)$ under ergodicity. Hence $\mathrm{var}[\bar{Y}] \to 0$. Markov's inequality establishes that $\bar{Y} \to_p \mu$.
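The variance formula (14.8) can be checked numerically. The sketch below, which assumes AR(1) dynamics with known autocovariances $\gamma(\ell) = \alpha_1^{\ell} \sigma^2/(1 - \alpha_1^2)$ (anticipating Section 14.22), compares the formula with a Monte Carlo estimate of $\mathrm{var}[\bar{Y}]$; all parameter choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a, n, reps, burn = 0.5, 200, 5000, 100

# Theoretical var[Ybar] from (14.8), using the AR(1) autocovariances
# gamma(l) = a**l / (1 - a**2) (unit innovation variance).
gamma = a ** np.arange(n) / (1 - a ** 2)
ell = np.arange(1, n)
var_theory = gamma[0] / n + (2 / n) * np.sum((1 - ell / n) * gamma[ell])

# Monte Carlo: simulate many AR(1) paths, discard a burn-in, average each path.
e = rng.standard_normal((reps, n + burn))
y = np.zeros((reps, n + burn))
for t in range(1, n + burn):
    y[:, t] = a * y[:, t - 1] + e[:, t]
means = y[:, burn:].mean(axis=1)

print(var_theory, means.var())   # the two values should be close (about 0.02)
```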

14.9 Conditioning on Information Sets

In the past few sections we have introduced the concept of the infinite histories. We now consider conditional expectations given infinite histories.

First, some basics. Recall from probability theory that an outcome is an element of a sample space. An event is a set of outcomes. A probability law is a rule which assigns non-negative real numbers to events. When outcomes are infinite histories then events are collections of such histories and a probability law is a rule which assigns numbers to collections of infinite histories.

Now we wish to define a conditional expectation given an infinite past history. Specifically, we wish to define

$E_{t-1}[Y_t] = E[Y_t \mid Y_{t-1}, Y_{t-2}, \ldots], \quad (14.10)$

the expected value of $Y_t$ given the history $\tilde{Y}_{t-1} = (Y_{t-1}, Y_{t-2}, \ldots)$ up to time $t-1$. Intuitively, $E_{t-1}[Y_t]$ is the mean of the conditional distribution, the latter reflecting the information in the history. Mathematically this cannot be defined using (2.6) as the latter requires a joint density for $(Y_t, Y_{t-1}, Y_{t-2}, \ldots)$ which does not make much sense. Instead, we can appeal to Theorem 2.13 which states that the conditional expectation (14.10) exists if $E|Y_t| < \infty$ and the probabilities $P[\tilde{Y}_{t-1} \in A]$ are defined. The latter events are discussed in the previous paragraph. Thus the conditional expectation is well defined.

In this textbook we have avoided measure-theoretic terminology to keep the presentation accessible, and because it is my belief that measure theory is more distracting than helpful. However, it is standard in the time series literature to follow the measure-theoretic convention of writing (14.10) as the conditional expectation given a σ-field. So at the risk of being overly-technical we will follow this convention and write the expectation (14.10) as $E[Y_t \mid \mathcal{F}_{t-1}]$ where $\mathcal{F}_{t-1} = \sigma(\tilde{Y}_{t-1})$ is the σ-field generated by the history $\tilde{Y}_{t-1}$. A σ-field (also known as a σ-algebra) is a collection of sets satisfying certain regularity conditions.² See Probability and Statistics for Economists, Section 1.14. The σ-field generated by a random variable $Y$ is the collection of measurable events involving $Y$. Similarly, the σ-field generated by an infinite history is the collection of measurable events involving this history. Intuitively, $\mathcal{F}_{t-1}$ contains all the information available in the history $\tilde{Y}_{t-1}$. Consequently, economists typically call $\mathcal{F}_{t-1}$ an information set rather than a σ-field. As I said, in this textbook we endeavor to avoid measure theoretic complications so will follow the economists' label rather than the probabilists', but use the latter's notation as is conventional. To summarize, we will write $\mathcal{F}_t = \sigma(Y_t, Y_{t-1}, \ldots)$ to indicate the information set generated by the infinite history $(Y_t, Y_{t-1}, \ldots)$, and will write (14.10) as $E[Y_t \mid \mathcal{F}_{t-1}]$.

We now describe some properties of the information sets $\mathcal{F}_t$.

First, they are nested: $\mathcal{F}_{t-1} \subset \mathcal{F}_t$. This means that information accumulates over time. Information is not lost.

Second, it is important to be precise about which variables are contained in the information set. Some economists are sloppy and refer to "the information set at time $t$" without specifying which variables are in the information set. It is better to be specific. For example, the information sets $\mathcal{F}_{1t} = \sigma(Y_t, Y_{t-1}, \ldots)$ and $\mathcal{F}_{2t} = \sigma(Y_t, X_t, Y_{t-1}, X_{t-1}, \ldots)$ are distinct even though they are both dated at time $t$.

Third, the conditional expectations (14.10) follow the law of iterated expectations and the conditioning theorem, thus

$E\left[ E[Y_t \mid \mathcal{F}_{t-1}] \mid \mathcal{F}_{t-2} \right] = E[Y_t \mid \mathcal{F}_{t-2}]$
$E\left[ E[Y_t \mid \mathcal{F}_{t-1}] \right] = E[Y_t]$

and

$E[Y_{t-1} Y_t \mid \mathcal{F}_{t-1}] = Y_{t-1} E[Y_t \mid \mathcal{F}_{t-1}].$

14.10 Martingale Difference Sequences

An important concept in economics is unforecastability, meaning that the conditional expectation is the unconditional expectation. This is similar to the properties of a regression error. An unforecastable process is called a martingale difference sequence (MDS).

² A σ-field contains the universal set, is closed under complementation, and is closed under countable unions.

A MDS $e_t$ is defined with respect to a specific sequence of information sets $\mathcal{F}_t$. Most commonly the latter is the natural filtration $\mathcal{F}_t = \sigma(e_t, e_{t-1}, \ldots)$ (the past history of $e_t$) but it could be a larger information set. The only requirement is that $e_t$ is adapted to $\mathcal{F}_t$, meaning that $E[e_t \mid \mathcal{F}_t] = e_t$.

Definition 14.4 The process $(e_t, \mathcal{F}_t)$ is a Martingale Difference Sequence (MDS) if $e_t$ is adapted to $\mathcal{F}_t$, $E|e_t| < \infty$, and $E[e_t \mid \mathcal{F}_{t-1}] = 0$.

In words, a MDS $e_t$ is unforecastable in the mean. It is useful to notice that if we apply iterated expectations, $E[e_t] = E\left[ E[e_t \mid \mathcal{F}_{t-1}] \right] = 0$. Thus a MDS is mean zero.

The definition of a MDS requires the information sets $\mathcal{F}_t$ to contain the information in $e_t$, but is broader in the sense that it can contain more information. When no explicit definition is given it is standard to assume that $\mathcal{F}_t$ is the natural filtration. However, it is best to explicitly specify the information sets so there is no confusion.

The term "martingale difference sequence" refers to the fact that the summed process $S_t = \sum_{j=1}^{t} e_j$ is a martingale and $e_t$ is its first difference. A martingale $S_t$ is a process which has a finite mean and satisfies $E[S_t \mid \mathcal{F}_{t-1}] = S_{t-1}$.

If $e_t$ is i.i.d. and mean zero it is a MDS but the reverse is not the case. To see this, first suppose that $e_t$ is i.i.d. and mean zero. It is then independent of $\mathcal{F}_{t-1} = \sigma(e_{t-1}, e_{t-2}, \ldots)$ so $E[e_t \mid \mathcal{F}_{t-1}] = E[e_t] = 0$. Thus an i.i.d. shock is a MDS as claimed.

To show that the reverse is not true let $u_t$ be i.i.d. $N(0,1)$ and set

$e_t = u_t u_{t-1}. \quad (14.11)$

By the conditioning theorem

$E[e_t \mid \mathcal{F}_{t-1}] = u_{t-1} E[u_t \mid \mathcal{F}_{t-1}] = 0$

so $e_t$ is a MDS. The process (14.11) is not, however, i.i.d. One way to see this is to calculate the first autocovariance of $e_t^2$, which is

$\mathrm{cov}(e_t^2, e_{t-1}^2) = E[e_t^2 e_{t-1}^2] - E[e_t^2] E[e_{t-1}^2] = E[u_t^2] E[u_{t-1}^4] E[u_{t-2}^2] - 1 = 2 \neq 0.$

Since the covariance is non-zero, $e_t$ is not an independent sequence. Thus $e_t$ is a MDS but not i.i.d.
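A quick simulation illustrates the distinction (a sketch; the seed, sample size, and helper function are arbitrary): the sample first autocorrelation of $e_t = u_t u_{t-1}$ is near zero, while that of $e_t^2$ is near $2/8 = 0.25$, matching the calculation above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.standard_normal(n + 1)
e = u[1:] * u[:-1]                     # e_t = u_t * u_{t-1}, a MDS but not i.i.d.

def acf1(x):
    """First-order sample autocorrelation."""
    x = x - x.mean()
    return np.sum(x[1:] * x[:-1]) / np.sum(x * x)

print(acf1(e))        # near 0: a MDS is serially uncorrelated
print(acf1(e ** 2))   # near 0.25: e_t^2 is autocorrelated, so e_t is not i.i.d.
```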

An important property of a square integrable MDS is that it is serially uncorrelated. To see this, observe that by iterated expectations, the conditioning theorem, and the definition of a MDS, for $k > 0$,

$\mathrm{cov}(e_t, e_{t-k}) = E[e_t e_{t-k}] = E\left[ E[e_t e_{t-k} \mid \mathcal{F}_{t-1}] \right] = E\left[ E[e_t \mid \mathcal{F}_{t-1}] e_{t-k} \right] = E[0 \cdot e_{t-k}] = 0.$

Thus the autocovariances and autocorrelations are zero. A process that is serially uncorrelated, however, is not necessarily a MDS. Take the process $e_t = u_t + u_{t-1} u_{t-2}$ with $u_t$ i.i.d. $N(0,1)$. The process $e_t$ is not a MDS because $E[e_t \mid \mathcal{F}_{t-1}] = u_{t-1} u_{t-2} \neq 0$. However,

$\mathrm{cov}(e_t, e_{t-1}) = E[e_t e_{t-1}] = E[(u_t + u_{t-1} u_{t-2})(u_{t-1} + u_{t-2} u_{t-3})]$
$= E[u_t u_{t-1} + u_t u_{t-2} u_{t-3} + u_{t-1}^2 u_{t-2} + u_{t-1} u_{t-2}^2 u_{t-3}]$
$= E[u_t] E[u_{t-1}] + E[u_t] E[u_{t-2}] E[u_{t-3}] + E[u_{t-1}^2] E[u_{t-2}] + E[u_{t-1}] E[u_{t-2}^2] E[u_{t-3}]$
$= 0.$

Similarly, $\mathrm{cov}(e_t, e_{t-k}) = 0$ for $k \neq 0$. Thus $e_t$ is serially uncorrelated. We have proved the following.

Theorem 14.10 If $(e_t, \mathcal{F}_t)$ is a MDS and $E[e_t^2] < \infty$ then $e_t$ is serially uncorrelated.

Another important special case is a homoskedastic martingale difference sequence.

Definition 14.5 The MDS $(e_t, \mathcal{F}_t)$ is a Homoskedastic Martingale Difference Sequence if $E[e_t^2 \mid \mathcal{F}_{t-1}] = \sigma^2$.

A homoskedastic MDS should more properly be called a conditionally homoskedastic MDS because the property concerns the conditional distribution rather than the unconditional. That is, any strictly stationary MDS has a constant unconditional variance $E[e_t^2]$ but only a homoskedastic MDS has a constant conditional variance $E[e_t^2 \mid \mathcal{F}_{t-1}]$.

A homoskedastic MDS is analogous to a conditionally homoskedastic regression error. It is intermediate between a MDS and an i.i.d. sequence. Specifically, a square integrable and mean zero i.i.d. sequence is a homoskedastic MDS and the latter is a MDS.

The reverse is not the case. First, a MDS is not necessarily conditionally homoskedastic. Consider the example $e_t = u_t u_{t-1}$ given previously which we showed is a MDS. It is not conditionally homoskedastic, however, because

$E[e_t^2 \mid \mathcal{F}_{t-1}] = u_{t-1}^2 E[u_t^2 \mid \mathcal{F}_{t-1}] = u_{t-1}^2$

which is time-varying. Thus this MDS $e_t$ is conditionally heteroskedastic. Second, a homoskedastic MDS is not necessarily i.i.d. Consider the following example. Set $e_t = \sqrt{1 - 2/\eta_{t-1}}\, T_t$, where $T_t$ is distributed as student $t$ with degree of freedom parameter $\eta_{t-1} = 2 + e_{t-1}^2$. This is scaled so that $E[e_t \mid \mathcal{F}_{t-1}] = 0$ and $E[e_t^2 \mid \mathcal{F}_{t-1}] = 1$, and is thus a homoskedastic MDS. The conditional distribution of $e_t$ depends on $e_{t-1}$ through the degree of freedom parameter. Hence $e_t$ is not an independent sequence.

One way to think about the difference between MDS and i.i.d. shocks is in terms of forecastability. An i.i.d. process is fully unforecastable in that no function of an i.i.d. process is forecastable. A MDS is unforecastable in the mean but other moments may be forecastable.

As we mentioned above, the definition of a MDS $e_t$ allows for conditional heteroskedasticity, meaning that the conditional variance $\sigma_t^2 = E[e_t^2 \mid \mathcal{F}_{t-1}]$ may be time-varying. In financial econometrics there are many models for conditional heteroskedasticity, including autoregressive conditional heteroskedasticity (ARCH), generalized ARCH (GARCH), and stochastic volatility. A good reference for this class of models is Campbell, Lo, and MacKinlay (1997).

14.11 CLT for Martingale Differences

We are interested in an asymptotic approximation for the distribution of the normalized sample mean

$S_n = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} u_t \quad (14.12)$

where $u_t$ is mean zero with variance $E[u_t u_t'] = \Sigma < \infty$. In this section we present a CLT for the case where $u_t$ is a martingale difference sequence.

Theorem 14.11 MDS CLT. If $u_t$ is a strictly stationary and ergodic martingale difference sequence and $E[u_t u_t'] = \Sigma < \infty$, then as $n \to \infty$,

$S_n = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} u_t \to_d N(0, \Sigma).$

The conditions for Theorem 14.11 are similar to the Lindeberg-Lévy CLT. The only difference is that the i.i.d. assumption has been replaced by the assumption of a strictly stationary and ergodic MDS.

The proof of Theorem 14.11 is technically advanced so we do not present the full details, but instead refer readers to Theorem 3.2 of Hall and Heyde (1980) or Theorem 25.3 of Davidson (1994) (which are more general than Theorem 14.11, not requiring strict stationarity). To illustrate the role of the MDS assumption we give a sketch of the proof in Section 14.47.

14.12 Mixing

For many results, including a CLT for correlated (non-MDS) series, we need a stronger restriction on the dependence between observations than ergodicity.

Recalling the property (14.5) of ergodic sequences we can measure the dependence between two events $A$ and $B$ by the discrepancy

$\alpha(A, B) = \left| P[A \cap B] - P[A] P[B] \right|. \quad (14.13)$

This equals 0 when $A$ and $B$ are independent and is positive otherwise. In general, $\alpha(A, B)$ can be used to measure the degree of dependence between the events $A$ and $B$.

Now consider the two information sets (σ-fields)

$\mathcal{F}_{-\infty}^{t} = \sigma(\ldots, Y_{t-1}, Y_t)$
$\mathcal{F}_{t}^{\infty} = \sigma(Y_t, Y_{t+1}, \ldots).$

The first is the history of the series up until period $t$ and the second is the history of the series starting in period $t$ and going forward. We then separate the information sets by $\ell$ periods, that is, take $\mathcal{F}_{-\infty}^{t}$ and $\mathcal{F}_{t+\ell}^{\infty}$. We can measure the degree of dependence between the information sets by taking all events in each and then taking the largest discrepancy (14.13). This is

$\alpha(\ell) = \sup_{A \in \mathcal{F}_{-\infty}^{t},\ B \in \mathcal{F}_{t+\ell}^{\infty}} \alpha(A, B).$

The constants $\alpha(\ell)$ are known as the strong mixing coefficients. We say that $Y_t$ is strong mixing if $\alpha(\ell) \to 0$ as $\ell \to \infty$. This means that as the time separation increases between the information sets, the degree of dependence decreases, eventually reaching independence.

From the Theorem of Cesàro Means (Theorem A.4 of Probability and Statistics for Economists), strong mixing implies (14.5) which is equivalent to ergodicity. Thus a mixing process is ergodic.

An intuition concerning mixing can be colorfully illustrated by the following example due to Halmos (1956). A martini is a drink consisting of a large portion of gin and a small part of vermouth. Suppose that you pour a serving of gin into a martini glass, pour a small amount of vermouth on top, and then stir the drink with a swizzle stick. If your stirring process is mixing, with each turn of the stick the vermouth will become more evenly distributed throughout the gin, and asymptotically (as the number of stirs tends to infinity) the vermouth and gin distributions will become independent.³ If so, this is a mixing process.

For applications, mixing is often useful when we can characterize the rate at which the coefficients $\alpha(\ell)$ decline to zero. There are two types of conditions which are seen in asymptotic theory: rates and summation. Rate conditions take the form $\alpha(\ell) = O(\ell^{-r})$ or $\alpha(\ell) = o(\ell^{-r})$. Summation conditions take the form $\sum_{\ell=0}^{\infty} \alpha(\ell)^{r} < \infty$ or $\sum_{\ell=0}^{\infty} \ell^{s} \alpha(\ell)^{r} < \infty$.

There are alternative measures of dependence beyond (14.13) and many have been proposed. Strong mixing is one of the weakest (and thus embraces a wide set of time series processes) but is insufficiently strong for some applications. Another popular dependence measure is known as absolute regularity or β-mixing. The β-mixing coefficients are

$\beta(\ell) = \sup_{A \in \mathcal{F}_{t+\ell}^{\infty}} E \left| P[A \mid \mathcal{F}_{-\infty}^{t}] - P[A] \right|.$

Absolute regularity is stronger than strong mixing in the sense that $\beta(\ell) \to 0$ implies $\alpha(\ell) \to 0$, and rate conditions for the β-mixing coefficients imply the same rates for the strong mixing coefficients.

One reason why mixing is useful for applications is that it is preserved by transformations.

Theorem 14.12 If $Y_t$ has mixing coefficients $\alpha_Y(\ell)$ and $X_t = \phi(Y_t, Y_{t-1}, Y_{t-2}, \ldots, Y_{t-q})$ then $X_t$ has mixing coefficients $\alpha_X(\ell) \leq \alpha_Y(\ell - q)$ (for $\ell \geq q$). The coefficients $\alpha_X(\ell)$ satisfy the same summation and rate conditions as $\alpha_Y(\ell)$.

A limitation of the above result is that it is confined to a finite number of lags unlike the transformation results for stationarity and ergodicity.

Mixing can be a useful tool because of the following inequalities.

³ Of course, if you really make an asymptotic number of stirs you will never finish stirring and you won't be able to enjoy the martini. Hence in practice it is advised to stop stirring before the number of stirs reaches infinity.

Theorem 14.13 Let $\mathcal{F}_{-\infty}^{t}$ and $\mathcal{F}_{t+\ell}^{\infty}$ be constructed from the pair $(X_t, Z_t)$.

1. If $|X_t| \leq C_1$ and $|Z_{t+\ell}| \leq C_2$ then

$\left| \mathrm{cov}(X_t, Z_{t+\ell}) \right| \leq 4 C_1 C_2 \alpha(\ell).$

2. If $E|X_t|^r < \infty$ and $E|Z_{t+\ell}|^q < \infty$ for $1/r + 1/q < 1$ then

$\left| \mathrm{cov}(X_t, Z_{t+\ell}) \right| \leq 8 \left( E|X_t|^r \right)^{1/r} \left( E|Z_{t+\ell}|^q \right)^{1/q} \alpha(\ell)^{1 - 1/r - 1/q}.$

3. If $E[Z_{t+\ell}] = 0$ and $E|Z_{t+\ell}|^r < \infty$ for $r \geq 1$ then

$E \left| E[Z_{t+\ell} \mid \mathcal{F}_{-\infty}^{t}] \right| \leq 6 \left( E|Z_{t+\ell}|^r \right)^{1/r} \alpha(\ell)^{1 - 1/r}.$

The proof is given in Section 14.47. Our next result follows fairly directly from the definition of mixing.

Theorem 14.14 If Yt is i.i.d. then it is strong mixing and ergodic.

14.13 CLT for Correlated Observations

In this section we develop a CLT for the normalized mean $S_n$ defined in (14.12) allowing the variables $u_t$ to be serially correlated.

In (14.8) we found that in the scalar case

$\mathrm{var}[S_n] = \sigma^2 + 2 \sum_{\ell=1}^{n-1} \left( 1 - \frac{\ell}{n} \right) \gamma(\ell) \quad (14.14)$

where $\sigma^2 = \mathrm{var}[u_t]$ and $\gamma(\ell) = \mathrm{cov}(u_t, u_{t-\ell})$. Since $\gamma(\ell) = \gamma(-\ell)$ this can be written as

$\mathrm{var}[S_n] = \sum_{\ell=-n}^{n} \left( 1 - \frac{|\ell|}{n} \right) \gamma(\ell).$

In the vector case define the variance $\Sigma = E[u_t u_t']$ and the matrix covariances $\Gamma(\ell) = E[u_t u_{t-\ell}']$ which satisfy $\Gamma(-\ell) = \Gamma(\ell)'$. We obtain by a calculation analogous to (14.14)

$\mathrm{var}[S_n] = \Sigma + \sum_{\ell=1}^{n-1} \left( 1 - \frac{\ell}{n} \right) \left( \Gamma(\ell) + \Gamma(\ell)' \right) = \sum_{\ell=-n}^{n} \left( 1 - \frac{|\ell|}{n} \right) \Gamma(\ell). \quad (14.15)$

A necessary condition for $S_n$ to converge to a normal distribution is that the variance (14.15) converges to a limit. Indeed, as $n \to \infty$,

$\sum_{\ell=1}^{n-1} \left( 1 - \frac{\ell}{n} \right) \Gamma(\ell) = \frac{1}{n} \sum_{\ell=1}^{n-1} \sum_{j=1}^{\ell} \Gamma(j) \to \sum_{\ell=1}^{\infty} \Gamma(\ell) \quad (14.16)$

where the convergence holds by the Theorem of Cesàro Means if the limit in (14.16) is convergent. A necessary condition for this to hold is that the covariances $\Gamma(\ell)$ decline to zero as $\ell \to \infty$. A sufficient condition is that the covariances are absolutely summable, which can be verified using a mixing inequality. Using the triangle inequality (B.16) and Theorem 14.13.2, for any $r > 2$,

$\sum_{\ell=0}^{\infty} \left\| \Gamma(\ell) \right\| \leq 8 \left( E \|u_t\|^r \right)^{2/r} \sum_{\ell=0}^{\infty} \alpha(\ell)^{1 - 2/r}.$

This implies that (14.15) converges if $E\|u_t\|^r < \infty$ and $\sum_{\ell=0}^{\infty} \alpha(\ell)^{1 - 2/r} < \infty$. We conclude that under these assumptions

$\mathrm{var}[S_n] \to \sum_{\ell=-\infty}^{\infty} \Gamma(\ell) \overset{\text{def}}{=} \Omega. \quad (14.17)$

The matrix $\Omega$ plays a special role in the inference theory for time series. It is often called the long-run variance of $u_t$ as it is the variance of sample means in large samples.
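Estimation of $\Omega$ is treated later in the textbook; as a rough sketch only, one common approach truncates and downweights the sample autocovariances (a Bartlett or Newey-West style weighting). The function below is illustrative and not the textbook's estimator, written for the scalar case with a user-chosen bandwidth $M$:

```python
import numpy as np

def long_run_variance(u, M=12):
    """Scalar long-run variance estimate: Bartlett-weighted sum of sample
    autocovariances up to lag M (a Newey-West-style sketch)."""
    u = np.asarray(u, dtype=float)
    u = u - u.mean()
    n = len(u)
    gamma = np.array([np.sum(u[k:] * u[:n - k]) / n for k in range(M + 1)])
    weights = 1 - np.arange(1, M + 1) / (M + 1)
    return gamma[0] + 2 * np.sum(weights * gamma[1:])

# For an AR(1) with coefficient 0.5 and unit shocks, Omega = 1/(1 - 0.5)**2 = 4.
rng = np.random.default_rng(0)
e = rng.standard_normal(50_000)
u = np.zeros(50_000)
for t in range(1, 50_000):
    u[t] = 0.5 * u[t - 1] + e[t]
print(long_run_variance(u, M=40))   # roughly 4 (somewhat downward biased at this bandwidth)
```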

It turns out that these conditions are sufficient for the CLT.

Theorem 14.15 If $u_t$ is strictly stationary with mixing coefficients $\alpha(\ell)$, $E[u_t] = 0$, and for some $r > 2$, $E\|u_t\|^r < \infty$ and $\sum_{\ell=0}^{\infty} \alpha(\ell)^{1 - 2/r} < \infty$, then (14.17) is convergent and $S_n = n^{-1/2} \sum_{t=1}^{n} u_t \to_d N(0, \Omega)$.

The proof is in Section 14.47.

The theorem requires $r > 2$ finite moments, which is stronger than the MDS CLT. This $r$ does not need to be an integer, meaning that the theorem holds under slightly more than two finite moments. The summability condition on the mixing coefficients in Theorem 14.15 is considerably stronger than ergodicity. There is a trade-off involving the choice of $r$. A larger $r$ means that more finite moments are required but a slower decay in the coefficients $\alpha(\ell)$ is allowed. A smaller $r$ is less restrictive regarding moments but requires a faster decay rate in the mixing coefficients.

14.14 Linear Projection

In Chapter 2 we extensively studied the properties of linear projection models. In the context of stationary time series we can use similar tools. An important extension is to allow for projections onto infinite-dimensional random vectors. For this analysis we assume that $Y_t$ is covariance stationary.

Recall that when $(Y, X)$ have a joint distribution with bounded variances the linear projection of $Y$ onto $X$ (the best linear predictor) is the minimizer of $S(\beta) = E[(Y - \beta'X)^2]$ and has the solution

$\mathcal{P}[Y \mid X] = X' \left( E[XX'] \right)^{-1} E[XY].$

This projection is unique and has a unique projection error $e = Y - \mathcal{P}[Y \mid X]$.

This idea extends to any Hilbert space including the infinite past history $\tilde{Y}_{t-1} = (\ldots, Y_{t-2}, Y_{t-1})$. From the projection theorem for Hilbert spaces (see Theorem 2.3.1 of Brockwell and Davis (1991)) the projection $\mathcal{P}_{t-1}[Y_t] = \mathcal{P}[Y_t \mid \tilde{Y}_{t-1}]$ of $Y_t$ onto $\tilde{Y}_{t-1}$ is unique and has a unique projection error

$e_t = Y_t - \mathcal{P}_{t-1}[Y_t]. \quad (14.18)$

The projection error is mean zero, has finite variance $\sigma^2 = E[e_t^2] \leq E[Y_t^2] < \infty$, and is serially uncorrelated. By Theorem 14.2, if $Y_t$ is strictly stationary then $\mathcal{P}_{t-1}[Y_t]$ and $e_t$ are strictly stationary.

The property (14.18) implies that the projection errors are serially uncorrelated. We state these results formally.

Theorem 14.16 If $Y_t \in \mathbb{R}$ is covariance stationary it has the projection equation

$Y_t = \mathcal{P}_{t-1}[Y_t] + e_t.$

The projection error $e_t$ satisfies

$E[e_t] = 0$
$E[e_{t-j} e_t] = 0, \quad j \geq 1,$

and

$\sigma^2 = E[e_t^2] \leq E[Y_t^2] < \infty. \quad (14.19)$

If $Y_t$ is strictly stationary then $e_t$ is strictly stationary.

14.15 White Noise

The projection error $e_t$ is mean zero, has a finite variance, and is serially uncorrelated. This describes what is known as a white noise process.

Definition 14.6 The process $e_t$ is white noise if $E[e_t] = 0$, $E[e_t^2] = \sigma^2 < \infty$, and $\mathrm{cov}(e_t, e_{t-k}) = 0$ for $k \neq 0$.

A MDS is white noise (Theorem 14.10) but the reverse is not true, as shown by the example $e_t = u_t + u_{t-1} u_{t-2}$ given in Section 14.10, which is white noise but not a MDS. Therefore, the following types of shocks are nested: i.i.d., MDS, and white noise, with i.i.d. being the most narrow class and white noise the broadest. It is helpful to observe that a white noise process can be conditionally heteroskedastic as the conditional variance is unrestricted.

14.16 The Wold Decomposition

In Section 14.14 we showed that a covariance stationary process has a white noise projection error. This result can be used to express the series as an infinite linear function of the projection errors. This is a famous result known as the Wold decomposition.

Theorem 14.17 The Wold Decomposition. If $Y_t$ is covariance stationary and $\sigma^2 > 0$ where $\sigma^2$ is the projection error variance (14.19), then $Y_t$ has the linear representation

$Y_t = \mu_t + \sum_{j=0}^{\infty} b_j e_{t-j} \quad (14.20)$

where $e_t$ are the white noise projection errors (14.18), $b_0 = 1$,

$\sum_{j=1}^{\infty} b_j^2 < \infty, \quad (14.21)$

and

$\mu_t = \lim_{m \to \infty} \mathcal{P}_{t-m}[Y_t].$

The Wold decomposition shows that $Y_t$ can be written as a linear function of the white noise projection errors plus $\mu_t$. The infinite sum in (14.20) is also known as a linear process. The Wold decomposition is a foundational result for linear time series analysis. Since any covariance stationary process can be written in this format this justifies linear models as approximations.

The series $\mu_t$ is the projection of $Y_t$ on the history from the infinite past. It is the part of $Y_t$ which is perfectly predictable from its past values and is called the deterministic component. In most cases $\mu_t = \mu$, the unconditional mean of $Y_t$. However, it is possible for stationary processes to have more substantive deterministic components. An example is

$\mu_t = \begin{cases} (-1)^t & \text{with probability } 1/2 \\ (-1)^{t+1} & \text{with probability } 1/2. \end{cases}$

This series is strictly stationary, mean zero, and variance one. However, it is perfectly predictable given the previous history as it simply oscillates between $-1$ and $1$.

In practical applied time series analysis, deterministic components are typically excluded by assumption. We call a stationary time series non-deterministic⁴ if $\mu_t = \mu$, a constant. In this case the Wold decomposition has a simpler form.

Theorem 14.18 If $Y_t$ is covariance stationary and non-deterministic then $Y_t$ has the linear representation

$Y_t = \mu + \sum_{j=0}^{\infty} b_j e_{t-j},$

where $b_j$ satisfy (14.21) and $e_t$ are the white noise projection errors (14.18).

A limitation of the Wold decomposition is the restriction to linearity. While it shows that there is a valid linear approximation, it may be that a nonlinear model provides a better approximation.

For a proof of Theorem 14.17 see Section 14.47.

⁴ Most authors define purely non-deterministic as the case $\mu_t = 0$. We allow for a non-zero mean so as to accommodate practical time series applications.

14.17 Lag Operator

An algebraic construct which is useful for the analysis of time series models is the lag operator.

Definition 14.7 The lag operator $L$ satisfies $L Y_t = Y_{t-1}$.

Defining $L^2 = L L$, we see that $L^2 Y_t = L Y_{t-1} = Y_{t-2}$. In general, $L^k Y_t = Y_{t-k}$.

Using the lag operator the Wold decomposition can be written in the format

$Y_t = \mu + b_0 e_t + b_1 L e_t + b_2 L^2 e_t + \cdots = \mu + (b_0 + b_1 L + b_2 L^2 + \cdots) e_t = \mu + b(L) e_t$

where $b(z) = b_0 + b_1 z + b_2 z^2 + \cdots$ is an infinite-order polynomial. The expression $Y_t = \mu + b(L) e_t$ is a compact way to write the Wold representation.

14.18 Autoregressive Wold Representation

From Theorem 14.16, $Y_t$ satisfies a projection onto its infinite past. Theorem 14.18 shows that this projection equals a linear function of the lagged projection errors. An alternative is to write the projection as a linear function of the lagged $Y_t$. It turns out that to obtain a unique and convergent representation we need a strengthening of the conditions.

Theorem 14.19 If $Y_t$ is covariance stationary, non-deterministic, with Wold representation $Y_t = b(L) e_t$, such that $|b(z)| \geq \delta > 0$ for all complex $|z| \leq 1$, and for some integer $s \geq 0$ the Wold coefficients satisfy $\sum_{j=0}^{\infty} \left( \sum_{k=0}^{\infty} k^s b_{j+k} \right)^2 < \infty$, then $Y_t$ has the representation

$Y_t = \mu + \sum_{j=1}^{\infty} a_j Y_{t-j} + e_t \quad (14.23)$

for some coefficients $\mu$ and $a_j$. The coefficients satisfy $\sum_{k=0}^{\infty} k^s |a_k| < \infty$ so (14.23) is convergent.

Equation (14.23) is known as an infinite-order autoregressive representation with autoregressive coefficients $a_j$.

A solution to the equation $b(z) = 0$ is a root of the polynomial $b(z)$. The assumption $|b(z)| > 0$ for $|z| \leq 1$ means that the roots of $b(z)$ lie outside the unit circle $|z| = 1$ (the circle in the complex plane with radius one). Theorem 14.19 makes the stronger restriction that $|b(z)|$ is bounded away from 0 for $z$ on or within the unit circle. The need for this strengthening is less intuitive but essentially excludes the possibility of an infinite number of roots outside but arbitrarily close to the unit circle. The summability assumption on the Wold coefficients ensures convergence of the autoregressive coefficients $a_j$. To understand the restriction on the roots of $b(z)$ consider the simple case $b(z) = 1 - b_1 z$. (Below we call this a MA(1) model.) The requirement $|b(z)| \geq \delta$ for $|z| \leq 1$ means⁵ $|b_1| \leq 1 - \delta$. Thus the assumption in Theorem 14.19 bounds the coefficient strictly below 1. Now consider an infinite polynomial case $b(z) = \prod_{j=1}^{\infty} (1 - b_j z)$. The assumption in Theorem 14.19 requires $\sup_j |b_j| < 1$.

Theorem 14.19 is attributed to Wiener and Masani (1958). For a recent treatment and proof see Corollary 6.1.17 of Politis and McElroy (2020). These authors (as is common in the literature) state their assumptions differently than we do in Theorem 14.19. First, instead of the condition on $b(z)$ they bound from below the spectral density function $f(\lambda)$ of $Y_t$. We do not define the spectral density in this text so we restate their condition in terms of the linear process polynomial $b(z)$. Second, instead of the condition on the Wold coefficients they require that the autocovariances satisfy $\sum_{k=0}^{\infty} k^s |\gamma(k)| < \infty$. This is implied by our stated summability condition on the $b_j$ (using the expression for $\gamma(k)$ in Section 14.21 below and simplifying).

14.19 Linear Models

In the previous two sections we showed that any non-deterministic covariance stationary time series has the projection representation

$Y_t = \mu + \sum_{j=0}^{\infty} b_j e_{t-j}$

and under a restriction on the projection coefficients satisfies the autoregressive representation

$Y_t = \mu + \sum_{j=1}^{\infty} a_j Y_{t-j} + e_t.$

In both equations the errors $e_t$ are white noise projection errors. These representations help us understand that linear models can be used as approximations for stationary time series.

For the next several sections we reverse the analysis. We will assume a specific linear model and then study the properties of the resulting time series. In particular we will be seeking conditions under which the stated process is stationary. This helps us understand the properties of linear models. Throughout, we assume that the error $e_t$ is a strictly stationary and ergodic white noise process. This allows as a special case the stronger assumption that $e_t$ is i.i.d. but is less restrictive. In particular, it allows for conditional heteroskedasticity.

14.20 Moving Average Processes

The first-order moving average process, denoted MA(1), is

$Y_t = \mu + e_t + \theta e_{t-1}$

where $e_t$ is a strictly stationary and ergodic white noise process with $\mathrm{var}[e_t] = \sigma^2$. The model is called a "moving average" because $Y_t$ is a weighted average of the shocks $e_t$ and $e_{t-1}$.

⁵ To see this, focus on the case $b_1 \geq 0$. The requirement $|1 - b_1 z| \geq \delta$ for $|z| \leq 1$ means $\min_{|z| \leq 1} |1 - b_1 z| = 1 - b_1 \geq \delta$, or $b_1 \leq 1 - \delta$.

It is straightforward to calculate that a MA(1) has the following moments:

$E[Y_t] = \mu$
$\mathrm{var}[Y_t] = (1 + \theta^2) \sigma^2$
$\gamma(1) = \theta \sigma^2$
$\rho(1) = \frac{\theta}{1 + \theta^2}$
$\gamma(k) = \rho(k) = 0, \quad k \geq 2.$

Thus the MA(1) process has a non-zero first autocorrelation with the remainder zero.

A MA(1) process with $\theta \neq 0$ is serially correlated with each pair of adjacent observations $(Y_{t-1}, Y_t)$ correlated. If $\theta > 0$ the pair are positively correlated, while if $\theta < 0$ they are negatively correlated. The serial correlation is limited in that observations separated by multiple periods are mutually independent.

The $q^{th}$-order moving average process, denoted MA(q), is

$Y_t = \mu + \theta_0 e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \cdots + \theta_q e_{t-q}$

where $\theta_0 = 1$. It is straightforward to calculate that a MA(q) has the following moments:

$E[Y_t] = \mu$
$\mathrm{var}[Y_t] = \left( \sum_{j=0}^{q} \theta_j^2 \right) \sigma^2$
$\gamma(k) = \left( \sum_{j=0}^{q-k} \theta_{j+k} \theta_j \right) \sigma^2, \quad k \leq q$
$\rho(k) = \frac{\sum_{j=0}^{q-k} \theta_{j+k} \theta_j}{\sum_{j=0}^{q} \theta_j^2}, \quad k \leq q$
$\gamma(k) = \rho(k) = 0, \quad k > q.$

In particular, a MA(q) has $q$ non-zero autocorrelations with the remainder zero.

A MA(q) process $Y_t$ is strictly stationary and ergodic.

A MA(q) process with moderately large $q$ can have considerably more complicated dependence relations than a MA(1) process. One specific pattern which can be induced by a MA process is smoothing. Suppose that the coefficients $\theta_j$ all equal 1. Then $Y_t$ is a smoothed version of the shocks $e_t$.

To illustrate, Figure 14.3(a) displays a plot of a simulated white noise (i.i.d. $N(0,1)$) process with $n = 120$ observations. Figure 14.3(b) displays a plot of a MA(8) process constructed with the same innovations, with $\theta_j = 1$, $j = 1, \ldots, 8$. You can see that the white noise has no predictable behavior while the MA(8) is smooth.
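The smoothing effect can be reproduced in a few lines of simulation (a sketch; the seed and helper function are arbitrary). The first-order sample autocorrelation of the MA(8) is close to its theoretical value $8/9 \approx 0.89$ (up to sampling noise in a short series), while that of the white noise is near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120
e = rng.standard_normal(n + 8)                # i.i.d. N(0,1) innovations
wn = e[8:]                                    # white noise series
ma8 = np.array([e[t - 8:t + 1].sum() for t in range(8, n + 8)])   # theta_j = 1, j = 0,...,8

def acf1(x):
    """First-order sample autocorrelation."""
    x = x - x.mean()
    return np.sum(x[1:] * x[:-1]) / np.sum(x * x)

print(acf1(wn))    # near 0
print(acf1(ma8))   # near 8/9: adjacent MA(8) values share eight of their nine shocks
```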

14.21 Infinite-Order Moving Average Process

An infinite-order moving average process, denoted MA($\infty$), also known as a linear process, is

$Y_t = \mu + \sum_{j=0}^{\infty} \theta_j e_{t-j}$

Figure 14.3: White Noise and MA(8). Panels: (a) White Noise; (b) MA(8).

where $e_t$ is a strictly stationary and ergodic white noise process, $\mathrm{var}[e_t] = \sigma^2$, and $\sum_{j=0}^{\infty} |\theta_j| < \infty$. From Theorem 14.6, $Y_t$ is strictly stationary and ergodic. A linear process has the following moments:

$E[Y_t] = \mu$
$\mathrm{var}[Y_t] = \left( \sum_{j=0}^{\infty} \theta_j^2 \right) \sigma^2$
$\gamma(k) = \left( \sum_{j=0}^{\infty} \theta_{j+k} \theta_j \right) \sigma^2$
$\rho(k) = \frac{\sum_{j=0}^{\infty} \theta_{j+k} \theta_j}{\sum_{j=0}^{\infty} \theta_j^2}.$

14.22 First-Order Autoregressive Process

The first-order autoregressive process, denoted AR(1), is

$Y_t = \alpha_0 + \alpha_1 Y_{t-1} + e_t \quad (14.25)$

where $e_t$ is a strictly stationary and ergodic white noise process with $\mathrm{var}[e_t] = \sigma^2$. The AR(1) model is probably the single most important model in econometric time series analysis.

As a simple motivating example let $Y_t$ be the employment level (number of jobs) in an economy. Suppose that a fixed fraction $1 - \alpha_1$ of employees lose their job and a random number $u_t$ of new employees are hired each period. Setting $\alpha_0 = E[u_t]$ and $e_t = u_t - \alpha_0$, this implies the law of motion (14.25).

To illustrate the behavior of the AR(1) process, Figure 14.4 plots two simulated AR(1) processes. Each is generated using the white noise process $e_t$ displayed in Figure 14.3(a). The plot in Figure 14.4(a) sets

Figure 14.4: AR(1) Processes. Panels: (a) AR(1) with $\alpha_1 = 0.5$; (b) AR(1) with $\alpha_1 = 0.95$.

$\alpha_1 = 0.5$ and the plot in Figure 14.4(b) sets $\alpha_1 = 0.95$. You can see how both are smoother than the white noise process and that the smoothing increases with $\alpha_1$.

Our first goal is to obtain conditions under which (14.25) is stationary. We can do so by showing that $Y_t$ can be written as a convergent linear process and then appealing to Theorem 14.5. To find a linear process representation for $Y_t$ we can use backward recursion. Notice that $Y_t$ in (14.25) depends on its previous value $Y_{t-1}$. If we take (14.25) and lag it one period we find $Y_{t-1} = \alpha_0 + \alpha_1 Y_{t-2} + e_{t-1}$. Substituting this into (14.25) we find

$Y_t = \alpha_0 + \alpha_1 (\alpha_0 + \alpha_1 Y_{t-2} + e_{t-1}) + e_t = \alpha_0 + \alpha_1 \alpha_0 + \alpha_1^2 Y_{t-2} + \alpha_1 e_{t-1} + e_t.$

Similarly we can lag (14.25) twice to find $Y_{t-2} = \alpha_0 + \alpha_1 Y_{t-3} + e_{t-2}$, which can be used to substitute out $Y_{t-2}$. Continuing recursively $t$ times, we find

$Y_t = \alpha_0 (1 + \alpha_1 + \alpha_1^2 + \cdots + \alpha_1^{t-1}) + \alpha_1^t Y_0 + \alpha_1^{t-1} e_1 + \alpha_1^{t-2} e_2 + \cdots + e_t = \alpha_0 \sum_{j=0}^{t-1} \alpha_1^j + \alpha_1^t Y_0 + \sum_{j=0}^{t-1} \alpha_1^j e_{t-j}. \quad (14.26)$

Thus $Y_t$ equals an intercept plus the scaled initial condition $\alpha_1^t Y_0$ and the moving average $\sum_{j=0}^{t-1} \alpha_1^j e_{t-j}$.

Now suppose we continue this recursion into the infinite past. By Theorem 14.3 this converges if $\sum_{j=0}^{\infty} |\alpha_1|^j < \infty$. The limit is provided by the following well-known result.

Theorem 14.20 $\sum_{k=0}^{\infty} \beta^k = \frac{1}{1 - \beta}$ is absolutely convergent if $|\beta| < 1$.

The series converges by the ratio test (see Theorem A.3 of Probability and Statistics for Economists). To find the limit,

$A = \sum_{k=0}^{\infty} \beta^k = 1 + \sum_{k=1}^{\infty} \beta^k = 1 + \beta \sum_{k=0}^{\infty} \beta^k = 1 + \beta A.$

Solving, we find $A = 1/(1 - \beta)$.

Thus the intercept in (14.26) converges to $\alpha_0/(1 - \alpha_1)$. We deduce the following:

Theorem 14.21 If $E|e_t| < \infty$ and $|\alpha_1| < 1$ then the AR(1) process (14.25) has the convergent representation

$Y_t = \mu + \sum_{j=0}^{\infty} \alpha_1^j e_{t-j} \quad (14.27)$

where $\mu = \alpha_0/(1 - \alpha_1)$. The AR(1) process $Y_t$ is strictly stationary and ergodic.

We can compute the moments of $Y_t$ from (14.27):

$E[Y_t] = \mu + \sum_{k=0}^{\infty} \alpha_1^k E[e_{t-k}] = \mu$
$\mathrm{var}[Y_t] = \sum_{k=0}^{\infty} \alpha_1^{2k} \mathrm{var}[e_{t-k}] = \frac{\sigma^2}{1 - \alpha_1^2}.$

Another way to calculate the moments is as follows. Apply expectations to both sides of (14.25):

$E[Y_t] = \alpha_0 + \alpha_1 E[Y_{t-1}] + E[e_t] = \alpha_0 + \alpha_1 E[Y_{t-1}].$

Stationarity implies $E[Y_{t-1}] = E[Y_t]$. Solving, we find $E[Y_t] = \alpha_0/(1 - \alpha_1)$. Similarly,

$\mathrm{var}[Y_t] = \mathrm{var}[\alpha_1 Y_{t-1} + e_t] = \alpha_1^2 \mathrm{var}[Y_{t-1}] + \mathrm{var}[e_t] = \alpha_1^2 \mathrm{var}[Y_{t-1}] + \sigma^2.$

Stationarity implies $\mathrm{var}[Y_{t-1}] = \mathrm{var}[Y_t]$. Solving, we find $\mathrm{var}[Y_t] = \sigma^2/(1 - \alpha_1^2)$. This method is useful for calculation of autocovariances and autocorrelations. For simplicity set $\alpha_0 = 0$ so that $E[Y_t] = 0$ and $E[Y_t^2] = \mathrm{var}[Y_t]$. We find

$\gamma(1) = E[Y_{t-1} Y_t] = E[Y_{t-1} (\alpha_1 Y_{t-1} + e_t)] = \alpha_1 \mathrm{var}[Y_t]$

so

$\rho(1) = \gamma(1)/\mathrm{var}[Y_t] = \alpha_1.$

Furthermore,

$\gamma(k) = E[Y_{t-k} Y_t] = E[Y_{t-k} (\alpha_1 Y_{t-1} + e_t)] = \alpha_1 \gamma(k-1).$

By recursion we obtain

$\gamma(k) = \alpha_1^k \mathrm{var}[Y_t]$
$\rho(k) = \alpha_1^k.$

Thus the AR(1) process with $\alpha_1 \neq 0$ has non-zero autocorrelations of all orders which decay to zero geometrically as $k$ increases. For $\alpha_1 > 0$ the autocorrelations are all positive. For $\alpha_1 < 0$ the autocorrelations alternate in sign.
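A short simulation check of the geometric decay $\rho(k) = \alpha_1^k$ (a sketch; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha1, n = 0.8, 50_000
e = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = alpha1 * y[t - 1] + e[t]          # AR(1) with alpha_0 = 0

yc = y - y.mean()
rho = [np.sum(yc[k:] * yc[:n - k]) / np.sum(yc * yc) for k in range(1, 5)]
print(np.round(rho, 3))                       # sample autocorrelations rho(1),...,rho(4)
print(np.round(alpha1 ** np.arange(1, 5), 3)) # theoretical values 0.8, 0.64, 0.512, 0.41
```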

We can also express the AR(1) process using the lag operator notation:

$(1 - \alpha_1 L) Y_t = \alpha_0 + e_t. \quad (14.28)$

We can write this as $\alpha(L) Y_t = \alpha_0 + e_t$ where $\alpha(L) = 1 - \alpha_1 L$. We call $\alpha(z) = 1 - \alpha_1 z$ the autoregressive polynomial of $Y_t$.

This suggests an alternative way of obtaining the representation (14.27). We can invert the operator $(1 - \alpha_1 L)$ to write $Y_t$ as a function of lagged $e_t$. That is, suppose that the inverse operator $(1 - \alpha_1 L)^{-1}$ exists. Then we can apply this operator to (14.28) to find

$Y_t = (1 - \alpha_1 L)^{-1} (1 - \alpha_1 L) Y_t = (1 - \alpha_1 L)^{-1} (\alpha_0 + e_t). \quad (14.29)$

What is the operator $(1 - \alpha_1 L)^{-1}$? Recall from Theorem 14.20 that for $|x| < 1$,

$\sum_{j=0}^{\infty} x^j = \frac{1}{1 - x} = (1 - x)^{-1}.$

Evaluate this expression at $x = \alpha_1 z$. We find

$(1 - \alpha_1 z)^{-1} = \sum_{j=0}^{\infty} \alpha_1^j z^j.$

Setting $z = L$ this is

$(1 - \alpha_1 L)^{-1} = \sum_{j=0}^{\infty} \alpha_1^j L^j.$

Substituted into (14.29) we obtain

$Y_t = (1 - \alpha_1 L)^{-1} (\alpha_0 + e_t) = \left( \sum_{j=0}^{\infty} \alpha_1^j L^j \right) (\alpha_0 + e_t) = \sum_{j=0}^{\infty} \alpha_1^j L^j (\alpha_0 + e_t) = \sum_{j=0}^{\infty} \alpha_1^j (\alpha_0 + e_{t-j}) = \frac{\alpha_0}{1 - \alpha_1} + \sum_{j=0}^{\infty} \alpha_1^j e_{t-j}$

which is (14.27). This is valid for $|\alpha_1| < 1$.

This illustrates another important concept. We say that a polynomial $\alpha(z)$ is invertible if

$\alpha(z)^{-1} = \sum_{j=0}^{\infty} a_j z^j$

is absolutely convergent. In particular, the AR(1) autoregressive polynomial $\alpha(z) = 1 - \alpha_1 z$ is invertible if $|\alpha_1| < 1$. This is the same condition as for stationarity of the AR(1) process. Invertibility turns out to be a useful property.

14.23 Unit Root and Explosive AR(1) Processes

The AR(1) process (14.25) is stationary if $|\alpha_1| < 1$. What happens otherwise?

If $\alpha_0 = 0$ and $\alpha_1 = 1$ the model is known as a random walk:

$Y_t = Y_{t-1} + e_t.$

This is also called a unit root process, a martingale, or an integrated process. By back-substitution

$Y_t = Y_0 + \sum_{j=1}^{t} e_j.$

Thus the initial condition does not disappear for large $t$. Consequently the series is non-stationary. The autoregressive polynomial $\alpha(z) = 1 - z$ is not invertible, meaning that $Y_t$ cannot be written as a convergent function of the infinite past history of $e_t$.

The stochastic behavior of a random walk is noticeably different from a stationary AR(1) process. It wanders up and down with equal likelihood and is not mean-reverting. While it has no tendency to return to its previous values, the wandering nature of a random walk can give the illusion of mean reversion. The difference is that a random walk will take a very large number of time periods to "revert".

Figure 14.5: Random Walk Processes. Panels: (a) Example 1; (b) Example 2.

To illustrate, Figure 14.5 plots two independent random walk processes. The plot in panel (a) uses the innovations from Figure 14.3(a). The plot in panel (b) uses an independent set of i.i.d. $N(0,1)$ errors. You can see that the plot in panel (a) appears similar to the MA(8) and AR(1) plots in the sense that the series is smooth with long swings, but the difference is that the series does not return to a long-term mean. It appears to have drifted down over time. The plot in panel (b) appears to have quite different behavior, falling dramatically over a 5-year period, and then appearing to stabilize. These are both common behaviors of random walk processes.

If $\alpha_1 > 1$ the process is explosive. The model (14.25) with $\alpha_1 > 1$ exhibits exponential growth and high sensitivity to initial conditions. Explosive autoregressive processes do not seem to be good descriptions for most economic time series. While aggregate time series such as the GDP process displayed in Figure 14.1(a) exhibit a similar exponential growth pattern, the exponential growth can typically be removed by taking logarithms.

The case $\alpha_1 < -1$ induces explosive oscillating growth and does not appear to be empirically relevant for economic applications.
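The contrast between a stationary AR(1) and a random walk can be seen in a small simulation (a sketch; the parameter values are illustrative): across simulated paths the AR(1) variance settles down near $\sigma^2/(1 - \alpha_1^2)$, while the random walk variance grows roughly in proportion to $t$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 400, 2000
e = rng.standard_normal((reps, n))

rw = np.cumsum(e, axis=1)                     # random walks: Y_t = Y_{t-1} + e_t, Y_0 = 0
ar = np.zeros((reps, n))
for t in range(1, n):
    ar[:, t] = 0.9 * ar[:, t - 1] + e[:, t]   # stationary AR(1) with alpha_1 = 0.9

# Cross-path variances: the AR(1) is near 1/(1 - 0.81), about 5.3, at both dates,
# while the random walk variance keeps growing with t.
print(ar[:, 100].var(), ar[:, 399].var())
print(rw[:, 100].var(), rw[:, 399].var())
```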

14.24 Second-Order Autoregressive Process

The second-order autoregressive process, denoted AR(2), is

$Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + e_t \quad (14.31)$

where $e_t$ is a strictly stationary and ergodic white noise process. The dynamic patterns of an AR(2) process are more complicated than those of an AR(1) process.

As a motivating example consider the multiplier-accelerator model of Samuelson (1939). It might be a bit dated as a model but it is simple so hopefully makes the point. Aggregate output (in an economy with no trade) is defined as $Y_t = \text{Consumption}_t + \text{Investment}_t + \text{Government}_t$. Suppose that individuals base their consumption decisions on the previous period's income, $\text{Consumption}_t = b Y_{t-1}$, firms base their investment decisions on the change in consumption, $\text{Investment}_t = d\, \Delta \text{Consumption}_t$, and government spending is random, $\text{Government}_t = a + e_t$. Then aggregate output follows

$Y_t = a + b(1 + d) Y_{t-1} - b d Y_{t-2} + e_t \quad (14.32)$

which is an AR(2) process.

Using the lag operator we can write (14.31) as

$Y_t - \alpha_1 L Y_t - \alpha_2 L^2 Y_t = \alpha_0 + e_t$

or $\alpha(L) Y_t = \alpha_0 + e_t$ where $\alpha(L) = 1 - \alpha_1 L - \alpha_2 L^2$. We call $\alpha(z)$ the autoregressive polynomial of $Y_t$.

We would like to find the conditions for the stationarity of $Y_t$. It turns out that it is convenient to transform the process (14.31) into a VAR(1) process (to be studied in the next chapter). Set $\tilde{Y}_t = (Y_t, Y_{t-1})$, which is stationary if and only if $Y_t$ is stationary. Equation (14.31) implies that $\tilde{Y}_t$ satisfies

$\begin{pmatrix} Y_t \\ Y_{t-1} \end{pmatrix} = \begin{pmatrix} \alpha_1 & \alpha_2 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} Y_{t-1} \\ Y_{t-2} \end{pmatrix} + \begin{pmatrix} \alpha_0 + e_t \\ 0 \end{pmatrix}$

or

$\tilde{Y}_t = A \tilde{Y}_{t-1} + \tilde{e}_t \quad (14.33)$

where $A = \begin{pmatrix} \alpha_1 & \alpha_2 \\ 1 & 0 \end{pmatrix}$ and $\tilde{e}_t = (\alpha_0 + e_t, 0)$. Equation (14.33) falls in the class of VAR(1) models studied in Section 15.6. Theorem 15.6 shows that the VAR(1) process is strictly stationary and ergodic if the innovations satisfy $E\|\tilde{e}_t\| < \infty$ and all eigenvalues $\lambda$ of $A$ are less than one in absolute value. The eigenvalues satisfy $\det(A - I_2 \lambda) = 0$, where

$\det(A - I_2 \lambda) = \det \begin{pmatrix} \alpha_1 - \lambda & \alpha_2 \\ 1 & -\lambda \end{pmatrix} = \lambda^2 - \lambda \alpha_1 - \alpha_2 = \lambda^2 \alpha(1/\lambda)$

and $\alpha(z) = 1 - \alpha_1 z - \alpha_2 z^2$ is the autoregressive polynomial. Thus the eigenvalues satisfy $\alpha(1/\lambda) = 0$. Factoring the autoregressive polynomial as $\alpha(z) = (1 - \lambda_1 z)(1 - \lambda_2 z)$, the solutions of $\alpha(1/\lambda) = 0$ must equal $\lambda_1$ and $\lambda_2$. The quadratic formula shows that these equal

$\lambda_j = \frac{\alpha_1 \pm \sqrt{\alpha_1^2 + 4 \alpha_2}}{2}. \quad (14.34)$

These eigenvalues are real if $\alpha_1^2 + 4\alpha_2 \geq 0$ and are complex conjugates otherwise. The AR(2) process is stationary if the solutions (14.34) satisfy $|\lambda_j| < 1$.

Figure 14.6: Stationarity Region for AR(2)

Using (14.34) to solve for the AR coefficients in terms of the eigenvalues we find α1=λ1+λ2 and α2=λ1λ2. With some algebra (the details are deferred to Section 14.47) we can show that |λ1|<1 and |λ2|<1 iff the following restrictions hold on the autoregressive coefficients:

α1+α2<1α2α1<1α2>1.

These restrictions describe a triangle in (α1,α2) space which is shown in Figure 14.6. Coefficients within this triangle correspond to a stationary AR(2) process.

Take the Samuelson multiplier-accelerator model (14.32). You can calculate that (14.35)-(14.37) are satisfied (and thus the process is strictly stationary) if 0b<1 and 0d1, which are reasonable restrictions on the model parameters. The most important restriction is b<1, which in the language of old-school macroeconomics is that the marginal propensity to consume out of income is less than one.

Furthermore, the triangle is divided into two regions as marked in Figure 14.6: the region above the parabola α12+4α2=0 producing real eigenvalues λj, and the region below the parabola producing complex eigenvalues λj. This is interesting because when the eigenvalues are complex the autocorrelations of Yt display damped oscillations. For this reason the dynamic patterns of an AR(2) can be much more complicated than those of an AR(1).

Again, take the Samuelson multiplier-accelerator model (14.32). You can calculate that if $b > 0$, the model has real eigenvalues if and only if $b \ge 4d/(1+d)^2$, which holds for $b$ large and $d$ small, which are "stable" parameterizations. On the other hand, the model has complex eigenvalues (and thus oscillations) for sufficiently small $b$ and large $d$.

Theorem 14.22 If $E|e_t| < \infty$ and $|\lambda_j| < 1$ for $\lambda_j$ defined in (14.34), or equivalently if the inequalities (14.35)-(14.37) hold, then the AR(2) process (14.31) is absolutely convergent, strictly stationary, and ergodic.

The proof is presented in Section 14.47.

Figure 14.7: AR(2) Processes. (a) AR(2); (b) AR(2) with Complex Roots

To illustrate, Figure 14.7 displays two simulated AR(2) processes. The plot in panel (a) sets $\alpha_1 = \alpha_2 = 0.4$. These coefficients produce real factors so the process displays behavior similar to that of the AR(1) processes. The plot in panel (b) sets $\alpha_1 = 1.3$ and $\alpha_2 = -0.8$. These coefficients produce complex factors so the process displays oscillations.
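The two panels can be reproduced with a short simulation. The following is a minimal sketch (assuming Gaussian white noise errors, a zero intercept, and a discarded burn-in period); the function name and sample length are illustrative choices, not from the textbook.

```python
import numpy as np

def simulate_ar2(a1, a2, n=240, burn=100, seed=0):
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n + burn)
    y = np.zeros(n + burn)
    for t in range(2, n + burn):
        y[t] = a1 * y[t-1] + a2 * y[t-2] + e[t]   # AR(2) recursion
    return y[burn:]                                # drop the burn-in

y_real = simulate_ar2(0.4, 0.4)      # real factors: AR(1)-like behavior
y_complex = simulate_ar2(1.3, -0.8)  # complex factors: damped oscillations
```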

14.25 AR(p) Processes

The pth -order autoregressive process, denoted AR(p), is

$$Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + \cdots + \alpha_p Y_{t-p} + e_t \tag{14.38}$$

where et is a strictly stationary and ergodic white noise process.

Using the lag operator,

$$Y_t - \alpha_1 L Y_t - \alpha_2 L^2 Y_t - \cdots - \alpha_p L^p Y_t = \alpha_0 + e_t$$

or $\alpha(L) Y_t = \alpha_0 + e_t$ where

$$\alpha(L) = 1 - \alpha_1 L - \alpha_2 L^2 - \cdots - \alpha_p L^p. \tag{14.39}$$

We call α(z) the autoregressive polynomial of Yt.

We find conditions for the stationarity of $Y_t$ by a technique similar to that used for the AR(2) process. Set $\widetilde{Y}_t = (Y_t, Y_{t-1}, \ldots, Y_{t-p+1})'$ and $\widetilde{e}_t = (\alpha_0 + e_t, 0, \ldots, 0)'$. Equation (14.38) implies that $\widetilde{Y}_t$ satisfies the VAR(1) equation (14.33) with

$$A = \begin{pmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_{p-1} & \alpha_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}. \tag{14.40}$$

As shown in the proof of Theorem 14.23 below, the eigenvalues λj of A are the reciprocals of the roots rj of the autoregressive polynomial (14.39). The roots rj are the solutions to α(rj)=0. Theorem 15.6 shows that stationarity of Y~t holds if the eigenvalues λj are less than one in absolute value, or equivalently when the roots rj are greater than one in absolute value. For complex numbers the equation |z|=1 defines the unit circle (the circle with radius of unity). We therefore say that ” z lies outside the unit circle” if |z|>1.

Theorem 14.23 If $E|e_t| < \infty$ and all roots of $\alpha(z)$ lie outside the unit circle then the AR(p) process (14.38) is absolutely convergent, strictly stationary, and ergodic.
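The condition of Theorem 14.23 is easy to verify numerically for a given coefficient vector. Below is a minimal sketch (not from the textbook) which computes the roots of $\alpha(z) = 1 - \alpha_1 z - \cdots - \alpha_p z^p$ and checks that they all lie outside the unit circle; the function name is hypothetical.

```python
import numpy as np

def ar_roots_outside_unit_circle(alpha):
    # alpha = (α_1, ..., α_p); np.roots expects coefficients from highest power down
    coefs = np.concatenate(([1.0], -np.asarray(alpha, dtype=float)))[::-1]
    roots = np.roots(coefs)
    return roots, bool(np.all(np.abs(roots) > 1))

roots, stationary = ar_roots_outside_unit_circle([1.3, -0.8])
print(roots, stationary)   # complex roots with modulus > 1, so stationary
```

The roots are the reciprocals of the companion-matrix eigenvalues, so this is equivalent to the eigenvalue check used for the AR(2).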

When the roots of α(z) lie outside the unit circle then the polynomial α(z) is invertible. Inverting the autoregressive representation α(L)Yt=α0+et we obtain an infinite-order moving average representation

$$Y_t = \mu + b(L) e_t \tag{14.41}$$

where

$$b(z) = \alpha(z)^{-1} = \sum_{j=0}^{\infty} b_j z^j$$

and $\mu = \alpha(1)^{-1}\alpha_0$.

We have the following characterization of the moving average coefficients.

Theorem 14.24 If all roots $r_j$ of the autoregressive polynomial $\alpha(z)$ satisfy $|r_j| > 1$ then (14.41) holds with $|b_j| \le (j+1)^p \lambda^j$ and $\sum_{j=0}^{\infty}|b_j| < \infty$, where $\lambda = \max_{1\le j\le p} |r_j^{-1}| < 1$.

The proof is presented in Section 14.47.

14.26 Impulse Response Function

The coefficients of the moving average representation

$$Y_t = b(L) e_t = \sum_{j=0}^{\infty} b_j e_{t-j} = b_0 e_t + b_1 e_{t-1} + b_2 e_{t-2} + \cdots$$

are known among economists as the impulse response function (IRF). Often the IRF is scaled by the standard deviation of et. We discuss this scaling at the end of the section. In linear models the impulse response function is defined as the change in Yt+j due to a shock at time t. This is

$$\frac{\partial}{\partial e_t} Y_{t+j} = b_j.$$

This means that the coefficient bj can be interpreted as the magnitude of the impact of a time t shock on the time t+j variable. Plots of bj can be used to assess the time-propagation of shocks.

It is desirable to have a convenient method to calculate the impulse responses bj from the coefficients of an autoregressive model (14.38). There are two methods which we now describe.

The first uses a simple recursion. In the linear AR(p) model, we can see that the coefficient bj is the simple derivative

$$b_j = \frac{\partial}{\partial e_t} Y_{t+j} = \frac{\partial}{\partial e_0} Y_j.$$

We can calculate bj by generating a history and perturbing the shock e0. Since this calculation is unaffected by all other shocks we can simply set et=0 for t0 and set e0=1. This implies the recursion

$$\begin{aligned}
b_0 &= 1 \\
b_1 &= \alpha_1 b_0 \\
b_2 &= \alpha_1 b_1 + \alpha_2 b_0 \\
&\;\;\vdots \\
b_j &= \alpha_1 b_{j-1} + \alpha_2 b_{j-2} + \cdots + \alpha_p b_{j-p}.
\end{aligned}$$

This recursion is conveniently calculated by the following simulation. Set $Y_t = 0$ for $t < 0$. Set $e_0 = 1$ and $e_t = 0$ for $t \ge 1$. Generate $Y_t$ for $t \ge 0$ by $Y_t = \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + \cdots + \alpha_p Y_{t-p} + e_t$. Then $Y_j = b_j$.
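The following is a minimal sketch of this recursion (not from the textbook; the function name and horizon are illustrative). It sets $e_0 = 1$, all other shocks to zero, and iterates the autoregression, returning $(b_0, b_1, \ldots)$.

```python
import numpy as np

def irf_by_recursion(alpha, horizon=20):
    alpha = np.asarray(alpha, dtype=float)   # (α_1, ..., α_p)
    p = len(alpha)
    b = np.zeros(horizon + 1)
    b[0] = 1.0                               # b_0 = 1: response to e_0 = 1
    for j in range(1, horizon + 1):
        lags = b[max(j - p, 0):j][::-1]      # b_{j-1}, ..., b_{j-p} (or fewer early on)
        b[j] = np.dot(alpha[:len(lags)], lags)
    return b

print(irf_by_recursion([1.3, -0.8], horizon=8))
```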

A second method uses the vector representation (14.33) of the AR(p) model with coefficient matrix (14.40). By recursion

$$\widetilde{Y}_t = \sum_{j=0}^{\infty} A^j \widetilde{e}_{t-j}.$$

Here, $A^j = A \cdots A$ denotes the $j$-fold matrix product of $A$ with itself. Setting $S = (1, 0, \ldots, 0)'$ we find

$$Y_t = \sum_{j=0}^{\infty} S' A^j S\, e_{t-j}.$$

By linearity

$$b_j = \frac{\partial}{\partial e_t} Y_{t+j} = S' A^j S.$$

Thus the coefficient bj can be calculated by forming the matrix A, its j-fold product Aj, and then taking the upper-left element.
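A minimal sketch of this second method follows (not from the textbook; the function name is hypothetical). It builds the companion matrix (14.40), accumulates its matrix powers, and reads off the upper-left element at each horizon.

```python
import numpy as np

def irf_by_companion(alpha, horizon=20):
    alpha = np.asarray(alpha, dtype=float)
    p = len(alpha)
    A = np.zeros((p, p))
    A[0, :] = alpha                  # first row: (α_1, ..., α_p)
    A[1:, :-1] = np.eye(p - 1)       # shifted identity below
    b = np.empty(horizon + 1)
    Aj = np.eye(p)                   # A^0
    for j in range(horizon + 1):
        b[j] = Aj[0, 0]              # S' A^j S with S = (1, 0, ..., 0)'
        Aj = Aj @ A
    return b

print(irf_by_companion([1.3, -0.8], horizon=8))   # matches the recursion method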

As mentioned at the beginning of the section it is often desirable to scale the IRF so that it is the response to a one standard deviation shock. Let $\sigma^2 = \text{var}[e_t]$ and define $\varepsilon_t = e_t/\sigma$, which has unit variance. Then the IRF at lag $j$ is

$$\mathrm{IRF}_j = \frac{\partial}{\partial \varepsilon_t} Y_{t+j} = \sigma b_j.$$

14.27 ARMA and ARIMA Processes

The autoregressive-moving-average process, denoted ARMA(p,q), is

$$Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + \cdots + \alpha_p Y_{t-p} + \theta_0 e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \cdots + \theta_q e_{t-q} \tag{14.43}$$

where $e_t$ is a strictly stationary and ergodic white noise process. It can be written using lag operator notation as $\alpha(L) Y_t = \alpha_0 + \theta(L) e_t$.

Theorem 14.25 The ARMA(p,q) process (14.43) is strictly stationary and ergodic if all roots of α(z) lie outside the unit circle. In this case we can write

$$Y_t = \mu + b(L) e_t$$

where $b_j = O(j^p \beta^j)$ and $\sum_{j=0}^{\infty} |b_j| < \infty$.

The process Yt follows an autoregressive-integrated moving-average process, denoted ARIMA(p,d,q), if ΔdYt is ARMA(p,q). It can be written using lag operator notation as α(L)(1L)dYt=α0+θ(L)et.

14.28 Mixing Properties of Linear Processes

There is a considerable probability literature investigating the mixing properties of time series processes. One challenge is that as autoregressive processes depend on the infinite past sequence of innovations et it is not immediately obvious if they satisfy the mixing conditions.

In fact, a simple AR(1) is not necessarily mixing. A counter-example was developed by Andrews (1984). He showed that if the error $e_t$ has a two-point discrete distribution then an AR(1) is not strong mixing. The reason is that a discrete innovation combined with the autoregressive structure means that by observing $Y_t$ you can deduce with near certainty the past history of the shocks $e_t$. The example seems rather special but shows the need to be careful with the theory. The intuition stemming from Andrews' example is that for an autoregressive process to be mixing it is necessary for the errors $e_t$ to be continuously distributed.

A useful characterization was provided by Pham and Tran (1985).

Theorem 14.26 Suppose that $Y_t = \mu + \sum_{j=0}^{\infty} \theta_j e_{t-j}$ satisfies the following conditions:

1. $e_t$ is i.i.d. with $E|e_t|^r < \infty$ for some $r > 0$ and density $f(x)$ which satisfies

$$\int_{-\infty}^{\infty} |f(x-u) - f(x)|\, dx \le C|u| \tag{14.44}$$

for some $C < \infty$.

2. All roots of $\theta(z) = 0$ lie outside the unit circle and $\sum_{j=0}^{\infty} |\theta_j| < \infty$.

3. $\sum_{k=1}^{\infty} \left( \sum_{j=k}^{\infty} |\theta_j| \right)^{r/(1+r)} < \infty$.

Then for some $B < \infty$

$$\alpha(\ell) \le 4\beta(\ell) \le B \sum_{k=\ell}^{\infty} \left( \sum_{j=k}^{\infty} |\theta_j| \right)^{r/(1+r)}$$

and $Y_t$ is absolutely regular and strong mixing.

The condition (14.44) is rather unusual, but specifies that et has a smooth density. This rules out Andrews’ counter-example.

The summability condition on the coefficients in part 3 involves a trade-off with the number of moments $r$. If $e_t$ has all moments finite (e.g. normal errors) then we can set $r = \infty$ and the condition simplifies to $\sum_{k=1}^{\infty} k|\theta_k| < \infty$. For any finite $r$ the summability condition holds if $\theta_j$ has geometric decay.

It is instructive to deduce how the decay in the coefficients $\theta_j$ affects the rate for the mixing coefficients $\alpha(\ell)$. If $|\theta_j| \le O(j^{-\eta})$ then $\sum_{j=k}^{\infty} |\theta_j| \le O(k^{-(\eta-1)})$, so the rate is $\alpha(\ell) \le 4\beta(\ell) \le O(\ell^{-s})$ for $s = (\eta-1)r/(1+r) - 1$. Mixing requires $s > 0$, which holds for sufficiently large $\eta$. For example, if $r = 4$ it holds for $\eta > 9/4$.

The primary message from this section is that linear processes, including autoregressive and ARMA processes, are mixing if the innovations satisfy suitable conditions. The mixing coefficients decay at rates related to the decay rates of the moving average coefficients.

14.29 Identification

The parameters of a model are identified if the parameters are uniquely determined by the probability distribution of the observations. In the case of linear time series analysis we typically focus on the first two moments of the observations (means, variances, covariances). We therefore say that the coefficients of a stationary MA, AR, or ARMA model are identified if they are uniquely determined by the autocorrelation function. That is, given the autocorrelation function ρ(k), are the coefficients unique? It turns out that the answer is that MA and ARMA models are generally not identified. Identification is achieved by restricting the class of polynomial operators. In contrast, AR models are generally identified.

Let us start with the MA(1) model

$$Y_t = e_t + \theta e_{t-1}.$$

It has first-order autocorrelation

$$\rho(1) = \frac{\theta}{1 + \theta^2}.$$

Set $\omega = 1/\theta$. Then

$$\frac{\omega}{1+\omega^2} = \frac{1/\theta}{1 + (1/\theta)^2} = \frac{\theta}{1+\theta^2} = \rho(1).$$

Thus the MA(1) model with coefficient ω=1/θ produces the same autocorrelations as the MA(1) model with coefficient θ. For example, θ=1/2 and ω=2 each yield ρ(1)=2/5. There is no empirical way to distinguish between the models Yt=et+θet1 and Yt=et+ωet1. Thus the coefficient θ is not identified.
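A one-line numeric check of this identification failure (purely illustrative, not from the textbook) confirms that the reciprocal coefficient yields the same first-order autocorrelation.

```python
theta, omega = 0.5, 2.0                 # omega = 1/theta
rho_theta = theta / (1 + theta**2)      # first-order autocorrelation under theta
rho_omega = omega / (1 + omega**2)      # first-order autocorrelation under 1/theta
print(rho_theta, rho_omega)             # both equal 0.4 = 2/5
```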

The standard solution is to select the parameter which produces an invertible moving average polynomial. Since there is only one such choice this yields a unique solution. This may be sensible when there is reason to believe that shocks have their primary impact in the contemporaneous period and secondary (lesser) impact in the second period.

Now consider the MA(2) model

$$Y_t = e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2}.$$

The moving average polynomial can be factored as

$$\theta(z) = (1 - \beta_1 z)(1 - \beta_2 z)$$

so that $\beta_1\beta_2 = \theta_2$ and $\beta_1 + \beta_2 = -\theta_1$. The process has first- and second-order autocorrelations

$$\rho(1) = \frac{\theta_1 + \theta_1\theta_2}{1 + \theta_1^2 + \theta_2^2} = \frac{-\beta_1 - \beta_2 - \beta_1^2\beta_2 - \beta_1\beta_2^2}{1 + \beta_1^2 + \beta_2^2 + 2\beta_1\beta_2 + \beta_1^2\beta_2^2}$$
$$\rho(2) = \frac{\theta_2}{1 + \theta_1^2 + \theta_2^2} = \frac{\beta_1\beta_2}{1 + \beta_1^2 + \beta_2^2 + 2\beta_1\beta_2 + \beta_1^2\beta_2^2}.$$

If we replace $\beta_1$ with $\omega_1 = 1/\beta_1$ we obtain

$$\rho(1) = \frac{-1/\beta_1 - \beta_2 - \beta_2/\beta_1^2 - \beta_2^2/\beta_1}{1 + 1/\beta_1^2 + \beta_2^2 + 2\beta_2/\beta_1 + \beta_2^2/\beta_1^2} = \frac{-\beta_1 - \beta_2 - \beta_1^2\beta_2 - \beta_1\beta_2^2}{\beta_1^2 + 1 + \beta_1^2\beta_2^2 + 2\beta_1\beta_2 + \beta_2^2}$$
$$\rho(2) = \frac{\beta_2/\beta_1}{1 + 1/\beta_1^2 + \beta_2^2 + 2\beta_2/\beta_1 + \beta_2^2/\beta_1^2} = \frac{\beta_1\beta_2}{\beta_1^2 + 1 + \beta_1^2\beta_2^2 + 2\beta_1\beta_2 + \beta_2^2}$$

which is unchanged. Similarly if we replace $\beta_2$ with $\omega_2 = 1/\beta_2$ we obtain unchanged first- and second-order autocorrelations. It follows that in the MA(2) model neither the factors $\beta_1$ and $\beta_2$ nor the coefficients $\theta_1$ and $\theta_2$ are identified. Consequently there are four distinct MA(2) models which are identifiably indistinguishable.

This analysis extends to the MA(q) model. The factors of the MA polynomial can be replaced by their inverses and consequently the coefficients are not identified.

The standard solution is to confine attention to MA(q) models with invertible roots. This technically solves the identification dilemma. This solution corresponds to the Wold decomposition, as it is defined in terms of the projection errors which correspond to the invertible representation.

A deeper identification failure occurs in ARMA models. Consider an ARMA(1,1) model

$$Y_t = \alpha Y_{t-1} + e_t + \theta e_{t-1}.$$

Written in lag operator notation

$$(1 - \alpha L) Y_t = (1 + \theta L) e_t.$$

The identification failure is that when $\alpha = -\theta$ the two lag polynomials cancel and the model simplifies to $Y_t = e_t$. This means that the continuum of models with $\alpha = -\theta$ are all identical and the coefficients are not identified.

This extends to higher order ARMA models. Take the ARMA (2,2) model written in factored lag operator notation

$$(1 - \alpha_1 L)(1 - \alpha_2 L) Y_t = (1 + \theta_1 L)(1 + \theta_2 L) e_t.$$

The models with $\alpha_1 = -\theta_1$, $\alpha_1 = -\theta_2$, $\alpha_2 = -\theta_1$, or $\alpha_2 = -\theta_2$ all simplify to an ARMA(1,1). Thus all these models are identical and hence the coefficients are not identified.

The problem is called “cancelling roots” due to the fact that it arises when there are two identical lag polynomial factors in the AR and MA polynomials.

The standard solution in the ARMA literature is to assume that there are no cancelling roots. The trouble with this solution is that this is an assumption about the true process which is unknown. Thus it is not really a solution to the identification problem. One recommendation is to be careful when using ARMA models and be aware that highly parameterized models may not have unique coefficients.

Now consider the AR(p) model (14.38). It can be written as

$$Y_t = X_t'\alpha + e_t \tag{14.45}$$

where $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)'$ and $X_t = (1, Y_{t-1}, \ldots, Y_{t-p})'$. The MDS assumption implies that $E[e_t] = 0$ and $E[X_t e_t] = 0$. This means that the coefficient $\alpha$ satisfies

$$\alpha = \left(E[X_t X_t']\right)^{-1}\left(E[X_t Y_t]\right). \tag{14.46}$$

This equation is unique if Q=E[XtXt] is positive definite. It turns out that this is generically true so α is unique and identified.

Theorem 14.27 In the AR(p) model (14.38), if 0<σ2< then Q>0 and α is unique and identified.

The assumption σ2>0 means that Yt is not purely deterministic.

We can extend this result to approximating AR(p) models. That is, consider the equation (14.45) without the assumption that Yt is necessarily a true AR(p) with a MDS error. Instead, suppose that Yt is a non-deterministic stationary process. (Recall, non-deterministic means that σ2>0 where σ2 is the projection error variance (14.19).) We then define the coefficient α as the best linear predictor, which is (14.46). The error et is defined by the equation (14.45). This is a linear projection model.

As in the case of any linear projection, the error et satisfies E[Xtet]=0. This means that E[et]=0 and E[Ytjet]=0 for j=1,,p. However, the error et is not necessarily a MDS nor white noise.

The coefficient α is identified if Q>0. The proof of Theorem 14.27 (presented in Section 14.47) does not make use of the assumption that Yt is an AR(p) with a MDS error. Rather, it only uses the assumption that σ2>0. This holds in the approximate AR(p) model as well under the assumption that Yt is nondeterministic. We conclude that any approximating AR(p) is identified.

Theorem 14.28 If Yt is strictly stationary, not purely deterministic, and E[Yt2]<, then for any p,Q=E[XtXt]>0 and thus the coefficient vector (14.46) is identified.

14.30 Estimation of Autoregressive Models

We consider estimation of an AR(p) model for stationary, ergodic, and non-deterministic Yt. The model is (14.45) where Xt=(1,Yt1,,Ytp). The coefficient α is defined by projection in (14.46). The error is defined by (14.45) and has variance σ2=E[et2]. This allows Yt to follow a true AR(p) process but it is not necessary.

The least squares estimator is

$$\hat{\alpha} = \left(\sum_{t=1}^{n} X_t X_t'\right)^{-1}\left(\sum_{t=1}^{n} X_t Y_t\right).$$

This notation presumes that there are $n+p$ total observations on $Y_t$, from which the first $p$ are used as initial conditions so that $X_1 = (1, Y_0, Y_{-1}, \ldots, Y_{-p+1})'$ is defined. Effectively, this redefines the sample period. (An alternative notational choice is to define the periods so the sums range from observations $p+1$ to $n$.)

The least squares residuals are $\hat{e}_t = Y_t - X_t'\hat{\alpha}$. The error variance can be estimated by $\hat{\sigma}^2 = n^{-1}\sum_{t=1}^{n}\hat{e}_t^2$ or $s^2 = (n-p-1)^{-1}\sum_{t=1}^{n}\hat{e}_t^2$.
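A minimal sketch of this estimator follows (assuming `y` is a 1-D numpy array holding the full series, including the $p$ initial conditions; the function name is hypothetical). It builds the lagged regressor matrix and returns the coefficient estimates, residuals, and both variance estimators.

```python
import numpy as np

def ar_ols(y, p):
    y = np.asarray(y, dtype=float)
    Y = y[p:]                                             # dependent variable
    X = np.column_stack([np.ones(len(Y))] +               # intercept
                        [y[p - j:-j] for j in range(1, p + 1)])   # Y_{t-1},...,Y_{t-p}
    alpha_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)     # least squares coefficients
    e_hat = Y - X @ alpha_hat                             # residuals
    n = len(Y)
    sigma2_hat = e_hat @ e_hat / n                        # \hat{sigma}^2
    s2 = e_hat @ e_hat / (n - p - 1)                      # s^2
    return alpha_hat, e_hat, sigma2_hat, s2, X
```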

If Yt is strictly stationary and ergodic then so are XtXt and XtYt. They have finite means if E[Yt2]<. Under these assumptions the Ergodic Theorem implies that

$$\frac{1}{n}\sum_{t=1}^{n} X_t Y_t \xrightarrow{p} E[X_t Y_t]$$

and

$$\frac{1}{n}\sum_{t=1}^{n} X_t X_t' \xrightarrow{p} E[X_t X_t'] = Q.$$

Theorem 14.28 shows that $Q > 0$. Combined with the continuous mapping theorem we see that

$$\hat{\alpha} = \left(\frac{1}{n}\sum_{t=1}^{n} X_t X_t'\right)^{-1}\left(\frac{1}{n}\sum_{t=1}^{n} X_t Y_t\right) \xrightarrow{p} \left(E[X_t X_t']\right)^{-1} E[X_t Y_t] = \alpha.$$

It is straightforward to show that σ^2 is consistent as well.

Theorem 14.29 If Yt is strictly stationary, ergodic, not purely deterministic, and E[Yt2]<, then for any p,α^pα and σ^2pσ2 as n.

This shows that under very mild conditions the coefficients of an AR(p) model can be consistently estimated by least squares. Once again, this does not require that the series Yt is actually an AR(p) process. It holds for any stationary process with the coefficient defined by projection.

14.31 Asymptotic Distribution of Least Squares Estimator

The asymptotic distribution of the least squares estimator α^ depends on the stochastic assumptions. In this section we derive the asymptotic distribution under the assumption of correct specification.

Specifically, we assume that the error et is a MDS. An important implication of the MDS assumption is that since Xt=(1,Yt1,,Ytp) is part of the information set Ft1, by the conditioning theorem,

$$E[X_t e_t \mid \mathcal{F}_{t-1}] = X_t\, E[e_t \mid \mathcal{F}_{t-1}] = 0.$$

Thus $X_t e_t$ is a MDS. It has a finite variance if $e_t$ has a finite fourth moment. To see this, by Theorem 14.24, $Y_t = \mu + \sum_{j=0}^{\infty} b_j e_{t-j}$ with $\sum_{j=0}^{\infty} |b_j| < \infty$. Using Minkowski's inequality,

$$\left(E|Y_t|^4\right)^{1/4} \le \sum_{j=0}^{\infty} |b_j| \left(E|e_{t-j}|^4\right)^{1/4} < \infty.$$

Thus $E[Y_t^4] < \infty$. The Cauchy-Schwarz inequality then shows that $E\|X_t e_t\|^2 < \infty$. We can then apply the martingale difference CLT (Theorem 14.11) to see that

$$\frac{1}{\sqrt{n}}\sum_{t=1}^{n} X_t e_t \xrightarrow{d} \mathrm{N}(0, \Sigma)$$

where $\Sigma = E[X_t X_t' e_t^2]$.

Theorem 14.30 If $Y_t$ follows the AR(p) model (14.38), all roots of $\alpha(z)$ lie outside the unit circle, $E[e_t \mid \mathcal{F}_{t-1}] = 0$, $E[e_t^4] < \infty$, and $E[e_t^2] > 0$, then as $n \to \infty$, $\sqrt{n}(\hat{\alpha} - \alpha) \xrightarrow{d} \mathrm{N}(0, V)$ where $V = Q^{-1}\Sigma Q^{-1}$.

This is identical in form to the asymptotic distribution of least squares in cross-section regression. The implication is that asymptotic inference is the same. In particular, the asymptotic covariance matrix is estimated just as in the cross-section case.

14.32 Distribution Under Homoskedasticity

In cross-section regression we found that the covariance matrix simplifies under the assumption of conditional homoskedasticity. The same occurs in the time series context. Assume that the error is a homoskedastic MDS:

$$E[e_t \mid \mathcal{F}_{t-1}] = 0, \qquad E[e_t^2 \mid \mathcal{F}_{t-1}] = \sigma^2.$$

In this case

$$\Sigma = E\big[X_t X_t'\, E[e_t^2 \mid \mathcal{F}_{t-1}]\big] = Q\sigma^2$$

and the asymptotic distribution simplifies.

Theorem 14.31 Under the assumptions of Theorem 14.30, if in addition $E[e_t^2 \mid \mathcal{F}_{t-1}] = \sigma^2$, then as $n \to \infty$, $\sqrt{n}(\hat{\alpha} - \alpha) \xrightarrow{d} \mathrm{N}(0, V^0)$ where $V^0 = \sigma^2 Q^{-1}$.

These results show that under correct specification (a MDS error) the format of the asymptotic distribution of the least squares estimator exactly parallels the cross-section case. In general the covariance matrix takes a sandwich form with components exactly equal to the cross-section case. Under conditional homoskedasticity the covariance matrix simplifies exactly as in the cross-section case. A particularly useful insight which can be derived from Theorem 14.31 comes from the simple AR(1) with no intercept. In this case $Q = E[Y_t^2] = \sigma^2/(1 - \alpha_1^2)$ so the asymptotic distribution simplifies to

$$\sqrt{n}(\hat{\alpha}_1 - \alpha_1) \xrightarrow{d} \mathrm{N}(0, 1 - \alpha_1^2).$$

Thus the asymptotic variance depends only on $\alpha_1$ and is decreasing in $\alpha_1^2$. An intuition is that larger $\alpha_1^2$ means greater signal and hence greater estimation precision. This result also shows that the asymptotic distribution is non-similar: the variance is a function of the parameter of interest. This means that we can expect (from advanced statistical theory) asymptotic inference to be less accurate than indicated by nominal levels.

In the context of cross-section data we argued that the homoskedasticity assumption was dubious except for occasional theoretical insight. For practical applications it is recommended to use heteroskedasticity-robust theory and methods when possible. The same argument applies to the time series case. While the distribution theory simplifies under conditional homoskedasticity there is no reason to expect homoskedasticity to hold in practice. Therefore in applications it is better to use the heteroskedasticity-robust distributional theory when possible.

Unfortunately, many existing time series textbooks report the distribution theory from Theorem 14.31. This has influenced computer software packages, many of which also by default (or exclusively) use the homoskedastic distribution theory. This is unfortunate.

14.33 Asymptotic Distribution Under General Dependence

If the AR(p) model (14.38) holds with white noise errors or if the AR(p) is an approximation with α defined as the best linear predictor then the MDS central limit theory does not apply. Instead, if Yt is strong mixing we can use the central limit theory for mixing processes (Theorem 14.15).

Theorem 14.32 Assume that $Y_t$ is strictly stationary, ergodic, and for some $r > 4$, $E|Y_t|^r < \infty$ and the mixing coefficients satisfy $\sum_{\ell=1}^{\infty} \alpha(\ell)^{1-4/r} < \infty$. Let $\alpha$ be defined as the best linear projection coefficients (14.46) from an AR(p) model with projection errors $e_t$. Let $\hat{\alpha}$ be the least squares estimator of $\alpha$. Then

$$\Omega = \sum_{\ell=-\infty}^{\infty} E\big[X_t X_{t-\ell}'\, e_t e_{t-\ell}\big]$$

is convergent and $\sqrt{n}(\hat{\alpha} - \alpha) \xrightarrow{d} \mathrm{N}(0, V)$ as $n \to \infty$, where $V = Q^{-1}\Omega Q^{-1}$.

This result is substantially different from the cross-section case. It shows that model misspecification (including misspecifying the order of the autoregression) renders invalid the conventional "heteroskedasticity-robust" covariance matrix formula. Misspecified models do not have unforecastable (martingale difference) errors so the regression scores $X_t e_t$ are potentially serially correlated. The asymptotic variance takes a sandwich form with the central component $\Omega$ the long-run variance (recall Section 14.13) of the regression scores $X_t e_t$.

14.34 Covariance Matrix Estimation

Under the assumption of correct specification covariance matrix estimation is identical to the crosssection case. The asymptotic covariance matrix estimator under homoskedasticity is

$$\hat{V}^0 = \hat{\sigma}^2\hat{Q}^{-1}, \qquad \hat{Q} = \frac{1}{n}\sum_{t=1}^{n} X_t X_t'.$$

The estimator s2 may be used instead of σ^2.

The heteroskedasticity-robust asymptotic covariance matrix estimator is

$$\hat{V} = \hat{Q}^{-1}\hat{\Sigma}\hat{Q}^{-1} \tag{14.48}$$

where

$$\hat{\Sigma} = \frac{1}{n}\sum_{t=1}^{n} X_t X_t' \hat{e}_t^2.$$

Degree-of-freedom adjustments may be made as in the cross-section case though a theoretical justification has not been developed.
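The following is a minimal sketch of the heteroskedasticity-robust estimator $\hat{V} = \hat{Q}^{-1}\hat{\Sigma}\hat{Q}^{-1}$ (not from the textbook), using the regressor matrix and residuals from a fitted autoregression such as the `ar_ols` sketch above; no degree-of-freedom adjustment is applied.

```python
import numpy as np

def robust_vcov(X, e_hat):
    n = X.shape[0]
    Q_hat = X.T @ X / n                              # \hat{Q} = (1/n) sum X_t X_t'
    Sigma_hat = (X * (e_hat**2)[:, None]).T @ X / n  # \hat{Sigma} = (1/n) sum X_t X_t' e_t^2
    Q_inv = np.linalg.inv(Q_hat)
    return Q_inv @ Sigma_hat @ Q_inv                 # sandwich form

# Standard errors for the coefficients: np.sqrt(np.diag(robust_vcov(X, e_hat)) / n)
```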

Standard errors s(α^j) for individual coefficient estimates can be formed by taking the scaled diagonal elements of V^

Theorem 14.33 Under the assumptions of Theorem 14.32, as $n \to \infty$, $\hat{V} \xrightarrow{p} V$ and $(\hat{\alpha}_j - \alpha_j)/s(\hat{\alpha}_j) \xrightarrow{d} \mathrm{N}(0,1)$.

Theorem 14.33 shows that standard covariance matrix estimation is consistent and the resulting t-ratios are asymptotically normal. This means that for stationary autoregressions, inference can proceed using conventional regression methods.

14.35 Covariance Matrix Estimation Under General Dependence

Under the assumptions of Theorem 14.32 the conventional covariance matrix estimators are inconsistent as they do not capture the serial dependence in the regression scores Xtet. To consistently estimate the covariance matrix we need an estimator of the long-run variance Ω. The appropriate class of estimators are called Heteroskedasticity and Autocorrelation Consistent (HAC) or Heteroskedasticity and Autocorrelation Robust (HAR) covariance matrix estimators.

To understand the methods it is helpful to define the vector series $u_t = X_t e_t$ and autocovariance matrices $\Gamma(\ell) = E[u_t u_{t-\ell}']$ so that

$$\Omega = \sum_{\ell=-\infty}^{\infty} \Gamma(\ell).$$

Since this sum is convergent the autocovariance matrices converge to zero as $\ell \to \infty$. Therefore $\Omega$ can be approximated by taking a finite sum of autocovariances such as

$$\Omega_M = \sum_{\ell=-M}^{M} \Gamma(\ell).$$

The number M is sometimes called the lag truncation number. Other authors call it the bandwidth. An estimator of Γ() is

$$\hat{\Gamma}(\ell) = \frac{1}{n}\sum_{t=\ell+1}^{n} \hat{u}_t \hat{u}_{t-\ell}'$$

where $\hat{u}_t = X_t\hat{e}_t$. By the ergodic theorem we can show that for any $\ell$, $\hat{\Gamma}(\ell) \xrightarrow{p} \Gamma(\ell)$. Thus for any fixed $M$, the estimator

$$\hat{\Omega}_M = \sum_{\ell=-M}^{M} \hat{\Gamma}(\ell) \tag{14.49}$$

is consistent for ΩM.

If the serial correlation in Xtet is known to be zero after M lags, then ΩM=Ω and the estimator (14.49) is consistent for Ω. This estimator was proposed by L. Hansen and Hodrick (1980) in the context of multiperiod forecasts and by L. Hansen (1982) for the generalized method of moments.

In the general case we can select M to increase with sample size n. If the rate at which M increases is sufficiently slow then Ω^M will be consistent for Ω, as first shown by White and Domowitz (1984).

Once we view the lag truncation number M as a choice the estimator (14.49) has two potential deficiencies. One is that Ω^M can change non-smoothly with M which makes estimation results sensitive to the choice of M. The other is that Ω^M may not be positive semi-definite and is therefore not a valid covariance matrix estimator. We can see this in the simple case of scalar ut and M=1. In this case Ω^1=γ^(0)(1+2ρ^(1)) which is negative when ρ^(1)<1/2. Thus if the data are strongly negatively autocorrelated the variance estimator can be negative. A negative variance estimator means that standard errors are ill-defined (a naïve computation will produce a complex standard error which makes no sense 6 ).

These two deficiencies can be resolved if we amend (14.49) by a weighted sum of autocovariances. Newey and West (1987b) proposed

$$\hat{\Omega}_{\mathrm{nw}} = \sum_{\ell=-M}^{M} \left(1 - \frac{|\ell|}{M+1}\right)\hat{\Gamma}(\ell). \tag{14.50}$$

This is a weighted sum of the autocovariances. Other weight functions can be used; the one in (14.50) is known as the Bartlett kernel.[7] Newey and West (1987b) showed that this estimator has the algebraic property that $\hat{\Omega}_{\mathrm{nw}} \ge 0$ (it is positive semi-definite), solving the negative variance problem, and it is also a smooth function of $M$. Thus this estimator solves the two problems described above.
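A minimal sketch of the estimator (14.50) follows (not from the textbook; the function name is hypothetical), applied to the scores $u_t = X_t e_t$ from a fitted regression. It combines the sample autocovariances with Bartlett weights.

```python
import numpy as np

def newey_west(X, e_hat, M):
    u = X * e_hat[:, None]                   # rows are u_t = X_t e_t
    n = u.shape[0]
    Omega = u.T @ u / n                      # \hat{Gamma}(0)
    for ell in range(1, M + 1):
        Gamma = u[ell:].T @ u[:-ell] / n     # \hat{Gamma}(ell)
        w = 1 - ell / (M + 1)                # Bartlett weight
        Omega += w * (Gamma + Gamma.T)       # add \hat{Gamma}(ell) and \hat{Gamma}(-ell)
    return Omega

# Combine as Q_inv @ newey_west(X, e_hat, M) @ Q_inv with Q_hat = X'X/n for \hat{V}.
```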

For Ω^nw to be consistent for Ω the lag trunction number M must increase to infinity with n. Sufficient conditions were established by B. E. Hansen (1992).

Theorem 14.34 Under the assumptions of Theorem 14.32 plus $\sum_{\ell=1}^{\infty} \alpha(\ell)^{1/2-4/r} < \infty$, if $M \to \infty$ yet $M^3/n = O(1)$, then as $n \to \infty$, $\hat{\Omega}_{\mathrm{nw}} \xrightarrow{p} \Omega$.

The assumption $M^3/n = O(1)$ technically means that $M$ grows no faster than $n^{1/3}$, but this does not have a practical counterpart other than the implication that "$M$ should be much smaller than $n$". The assumption on the mixing coefficients is slightly stronger than in Theorem 14.32, due to the technical nature of the derivation.

6 A common computational mishap is a complex standard error. This occurs when a covariance matrix estimator has negative elements on the diagonal.

7 See Andrews (1991b) for a description of popular options. In practice, the choice of weight function is much less important than the choice of lag truncation number M.

An important practical issue is how to select $M$. One way to think about it is that $M$ impacts the precision of the estimator $\hat{\Omega}_{\mathrm{nw}}$ through its bias and variance. Since $\hat{\Gamma}(\ell)$ is a sample average its variance is $O(1/n)$, so we expect the variance of $\hat{\Omega}_M$ to be of order $O(M/n)$. The bias of $\hat{\Omega}_{\mathrm{nw}}$ for $\Omega$ is harder to calculate but depends on the rate at which the covariances $\Gamma(\ell)$ decay to zero. Andrews (1991b) found that the $M$ which minimizes the mean squared error of $\hat{\Omega}_{\mathrm{nw}}$ satisfies the rate $M = C n^{1/3}$ where the constant $C$ depends on the autocovariances. Practical rules to estimate and implement this optimal lag truncation parameter have been proposed by Andrews (1991b) and Newey and West (1994). Andrews' rule for the Newey-West estimator (14.50) can be written as

$$M = \left(\frac{6\rho^2}{(1-\rho^2)^2}\right)^{1/3} n^{1/3}$$

where $\rho$ is a serial correlation parameter. When $u_t$ is scalar, $\rho$ is the first autocorrelation of $u_t$. Andrews suggested using an estimator of $\rho$ to plug into this formula to find $M$. An alternative is to use a default value of $\rho$. For example, if we set $\rho = 0.5$ then the Andrews rule is $M = 1.4\, n^{1/3}$, which is a useful benchmark.

14.36 Testing the Hypothesis of No Serial Correlation

In some cases it may be of interest to test the hypothesis that the series Yt is serially uncorrelated against the alternative that it is serially correlated. There have been many proposed tests of this hypothesis. The most appropriate is based on the least squares regression of an AR(p) model. Take the model

Yt=α0+α1Yt1+α2Yt2++αpYtp+et

with et a MDS. In this model the series Yt is serially uncorrelated if the slope coefficients are all zero. Thus the hypothesis of interest is

$$H_0: \alpha_1 = \cdots = \alpha_p = 0$$
$$H_1: \alpha_j \ne 0 \text{ for some } j \ge 1.$$

The test can be implemented by a Wald or F test. Estimate the AR(p) model by least squares. Form the Wald or F statistic using the variance estimator (14.48). (The Newey-West estimator should not be used as there is no serial correlation under the null hypothesis.) Accept the hypothesis if the test statistic is smaller than a conventional critical value (or if the p-value exceeds the significance level) and reject the hypothesis otherwise.

Implementation of this test requires a choice of autoregressive order $p$. This choice affects the power of the test. A sufficient number of lags should be included so as to pick up potential serial correlation patterns, but not so many that the power of the test is diluted. A reasonable choice in many applications is to set $p$ equal to $s$, the seasonal periodicity. Thus include four lags for quarterly data or twelve lags for monthly data.

14.37 Testing for Omitted Serial Correlation

When using an AR(p) model it may be of interest to know if there is any remaining serial correlation. This can be expressed as a test for serial correlation in the error or equivalently as a test for a higher-order autoregressive model. Take the AR(p) model

Yt=α0+α1Yt1+α2Yt2++αpYtp+ut.

The null hypothesis is that ut is serially uncorrelated and the alternative hypothesis is that it is serially correlated. We can model the latter as a mean-zero autoregressive process

ut=θ1ut1++θqutq+et.

The hypothesis is

$$H_0: \theta_1 = \cdots = \theta_q = 0$$
$$H_1: \theta_j \ne 0 \text{ for some } j \ge 1.$$

A seemingly natural test for $H_0$ uses a two-step method. First estimate (14.52) by least squares and obtain the residuals $\hat{u}_t$. Second, estimate (14.53) by least squares by regressing $\hat{u}_t$ on its lagged values and obtain the Wald or F test for $H_0$. This seems like a natural approach but it is muddled by the fact that the distribution of the Wald statistic is distorted by the two-step procedure. The Wald statistic is not asymptotically chi-square so it is inappropriate to make a decision based on the conventional critical values. One approach to obtain the correct asymptotic distribution is to use the generalized method of moments, treating (14.52)-(14.53) as a two-equation just-identified system.

An easier solution is to re-write (14.52)-(14.53) as a higher-order autoregression so that we can use a standard test statistic. To illustrate how this works take the case q=1. Take (14.52) and lag the equation once:

Yt1=α0+α1Yt2+α2Yt3++αpYtp1+ut1.

Multiply this by θ1 and subtract from (14.52) to find

$$Y_t - \theta_1 Y_{t-1} = \alpha_0 + \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + \cdots + \alpha_p Y_{t-p} + u_t - \theta_1\alpha_0 - \theta_1\alpha_1 Y_{t-2} - \theta_1\alpha_2 Y_{t-3} - \cdots - \theta_1\alpha_p Y_{t-p-1} - \theta_1 u_{t-1}$$

or

$$Y_t = \alpha_0(1-\theta_1) + (\alpha_1 + \theta_1) Y_{t-1} + (\alpha_2 - \theta_1\alpha_1) Y_{t-2} + \cdots - \theta_1\alpha_p Y_{t-p-1} + e_t.$$

This is an AR(p+1). It simplifies to an AR(p) when θ1=0. Thus H0 is equivalent to the restriction that the coefficient on Ytp1 is zero.

Thus testing the null hypothesis of an AR(p) (14.52) against the alternative that the error is an AR(1) is equivalent to testing an AR(p) against an AR(p+1). The latter test is implemented as a t test on the coefficient on Ytp1.

More generally, testing the null hypothesis of an AR(p) (14.52) against the alternative that the error is an AR(q) is equivalent to testing that $Y_t$ is an AR(p) against the alternative that $Y_t$ is an AR(p+q). The latter test is implemented as a Wald (or F) test on the coefficients on $Y_{t-p-1}, \ldots, Y_{t-p-q}$. If the statistic is larger than the critical value (or the p-value is smaller than the significance level) then we reject the hypothesis that the AR(p) is correctly specified in favor of the alternative that there is omitted serial correlation. Otherwise we accept the hypothesis that the AR(p) model is correctly specified.

Another way of deriving the test is as follows. Write (14.52) and (14.53) using lag operator notation $\alpha(L) Y_t = \alpha_0 + u_t$ with $\theta(L) u_t = e_t$. Applying the operator $\theta(L)$ to the first equation we obtain $\theta(L)\alpha(L) Y_t = \alpha_0^* + e_t$ where $\alpha_0^* = \theta(1)\alpha_0$. The product $\theta(L)\alpha(L)$ is a polynomial of order $p+q$, so $Y_t$ is an AR(p+q).

While this discussion is all good fun, it is unclear if there is good reason to use the test described in this section. Economic theory does not typically produce hypotheses concerning the autoregressive order. Consequently there is rarely a case where there is scientific interest in testing, say, the hypothesis that a series is an AR(4) or any other specific autoregressive order. Instead, practitioners tend to use hypothesis tests for another purpose - model selection. That is, in practice users want to know “What autoregressive model should be used” in a specific application and resort to hypothesis tests to aid in this decision. This is an inappropriate use of hypothesis tests because tests are designed to provide answers to scientific questions rather than being designed to select models with good approximation properties. Instead, model selection should be based on model selection tools. One is described in the following section.

14.38 Model Selection

What is an appropriate choice of autoregressive order p ? This is the problem of model selection. A good choice is to minimize the Akaike information criterion (AIC)

$$\mathrm{AIC}(p) = n\log\hat{\sigma}^2(p) + 2p$$

where σ^2(p) is the estimated residual variance from an AR(p). The AIC is a penalized version of the Gaussian log-likelihood function for the estimated regression model. It is an estimator of the divergence between the fitted model and the true conditional density (see Section 28.4). By selecting the model with the smallest value of the AIC you select the model with the smallest estimated divergence - the highest estimated fit between the estimated and true densities.

The AIC is also a monotonic transformation of an estimator of the one-step-ahead forecast mean squared error. Thus selecting the model with the smallest value of the AIC you are selecting the model with the smallest estimated forecast error.

One possible hiccup in computing the AIC criterion for multiple models is that the sample size available for estimation changes as $p$ changes. (If you increase $p$, you need more initial conditions.) This renders AIC comparisons inappropriate. The same sample - the same number of observations - should be used for estimation of all models. This is because AIC is a penalized likelihood, and if the samples are different then the likelihoods are not comparable. The appropriate remedy is to fix an upper value $\bar{p}$ and then reserve the first $\bar{p}$ observations as initial conditions. Then estimate the models AR(1), AR(2), ..., AR($\bar{p}$) on this (unified) sample.
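A minimal sketch of this procedure follows (not from the textbook; the helper and function names are hypothetical). Every candidate order is estimated after dropping the same $\bar{p}$ initial observations, so the AIC values are comparable.

```python
import numpy as np

def ar_sigma2(y, p):
    # least squares fit of an AR(p); returns the residual variance and effective n
    Y = y[p:]
    X = np.column_stack([np.ones(len(Y))] + [y[p - j:-j] for j in range(1, p + 1)])
    alpha_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    e_hat = Y - X @ alpha_hat
    return e_hat @ e_hat / len(Y), len(Y)

def select_ar_order(y, p_max):
    y = np.asarray(y, dtype=float)
    aic = {}
    for p in range(1, p_max + 1):
        sigma2, n = ar_sigma2(y[p_max - p:], p)   # same effective sample for every p
        aic[p] = n * np.log(sigma2) + 2 * p       # AIC(p) = n log sigma2(p) + 2p
    return min(aic, key=aic.get), aic
```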

The AIC of an estimated regression model can be displayed in Stata by using the estimates stats command.

14.39 Illustrations

We illustrate autoregressive estimation with three empirical examples using U.S. quarterly time series from the FRED-QD data file.

The first example is real GDP growth rates (growth rate of gdpc1). We estimate autoregressive models of order 0 through 4 using the sample from 1980-2017.[8] This is a commonly estimated model in applied macroeconomic practice and is the empirical version of the Samuelson multiplier-accelerator model discussed in Section 14.24. The coefficient estimates, conventional (heteroskedasticity-robust) standard errors, Newey-West (with M=5) standard errors, and AIC, are displayed in Table 14.1. This sample has 152 observations. The model selected by the AIC criterion is the AR(2). The estimated model has positive and small values for the first two autoregressive coefficients. This means that quarterly output growth

8 This sub-sample was used for estimation as it has been argued that the growth rate of U.S. GDP slowed around this period. The goal was to estimate the model over a period of time when the series is plausibly stationary.

Table 14.1: U.S. GDP AR Models

        AR(0)    AR(1)    AR(2)    AR(3)    AR(4)
α0      0.65     0.40     0.34     0.34     0.34
        (0.06)   (0.08)   (0.10)   (0.10)   (0.11)
        [0.09]   [0.08]   [0.09]   [0.09]   [0.09]
α1               0.39     0.34     0.33     0.34
                 (0.09)   (0.10)   (0.10)   (0.10)
                 [0.10]   [0.10]   [0.10]   [0.10]
α2                        0.14     0.13     0.13
                          (0.11)   (0.13)   (0.14)
                          [0.10]   [0.10]   [0.11]
α3                                 0.02     0.03
                                   (0.11)   (0.12)
                                   [0.07]   [0.09]
α4                                          0.02
                                            (0.12)
AIC     329      306      305      307      309
  1. Standard errors robust to heteroskedasticity in parenthesis.

  2. Newey-West standard errors in square brackets, with M=5.

rates are positively correlated from quarter to quarter, but only mildly so, and most of the correlation is captured by the first lag. The coefficients of this model lie in the real-eigenvalue region of Figure 14.6, meaning that the dynamics of the estimated model do not display oscillations. The coefficients of the estimated AR(4) model are nearly identical to those of the AR(2) model. The conventional and Newey-West standard errors are somewhat different from one another for the AR(0) and AR(4) models, but are nearly identical to one another for the AR(1) and AR(2) models.

Our second example is real non-durables consumption growth rates Ct (growth rate of pcndx ). This is motivated by an influential paper by Robert Hall (1978) who argued that the permanent income hypothesis implies that changes in consumption should be unpredictable (martingale differences). To test this model Hall (1978) estimated an AR(4) model. Our estimated regression using the full sample (n=231) is reported in the following equation.

Here, we report heteroskedasticity-robust standard errors. Hall's hypothesis is that all autoregressive coefficients should be zero. We test this joint hypothesis with an F statistic and find F=3.32 with a p-value of p=0.012. This is significant at the 5% level and close to the 1% level. The first three autoregressive coefficients appear to be positive, but small, indicating positive serial correlation. This evidence is (mildly) inconsistent with Hall's hypothesis. We report heteroskedasticity-robust standard errors (not Newey-West standard errors) since the purpose was to test the hypothesis of no serial correlation.

Table 14.2: U.S. Inflation AR Models

        AR(1)    AR(2)    AR(3)    AR(4)    AR(5)
α0      0.004    0.003    0.003    0.003    0.003
        (0.034)  (0.032)  (0.032)  (0.032)  (0.032)
        [0.023]  [0.028]  [0.029]  [0.031]  [0.032]
α1      0.26     0.36     0.36     0.36     0.37
        (0.08)   (0.07)   (0.07)   (0.07)   (0.07)
        [0.05]   [0.07]   [0.07]   [0.07]   [0.07]
α2               0.36     0.37     0.42     0.43
                 (0.07)   (0.06)   (0.06)   (0.06)
                 [0.06]   [0.05]   [0.07]   [0.07]
α3                        0.00     0.06     0.08
                          (0.09)   (0.10)   (0.11)
                          [0.09]   [0.12]   [0.13]
α4                                 0.16     0.18
                                   (0.08)   (0.08)
                                   [0.09]   [0.09]
α5                                          0.04
                                            (0.07)
                                            [0.06]
AIC     342      312      314      310      312
  1. Standard errors robust to heteroskedasticity in parenthesis.

  2. Newey-West standard errors in square brackets, with M=5.

The third example is the first difference of CPI inflation (first difference of growth rate of cpiaucsl). This is motivated by Stock and Watson (2007) who examined forecasting models for inflation rates. We estimate autoregressive models of order 1 through 8 using the full sample ( n=226); we report models 1 through 5 in Table 14.2. The model with the lowest AIC is the AR(4). All four estimated autoregressive coefficients are negative, most particularly the first two. The two sets of standard errors are quite similar for the AR(4) model. There are meaningful differences only for the lower order AR models.

14.40 Time Series Regression Models

Least squares regression methods can be used broadly with stationary time series. Interpretation and usefulness can depend, however, on constructive dynamic specifications. Furthermore, it is necessary to be aware of the serial correlation properties of the series involved, and to use the appropriate covariance matrix estimator when the dynamics have not been explicitly modeled.

Let (Yt,Xt) be paired observations with Yt the dependent variable and Xt a vector of regressors including an intercept. The regressors can contain lagged Yt so this framework includes the autoregressive model as a special case. A linear regression model takes the form

Yt=Xtβ+et.

The coefficient vector is defined by projection and therefore equals

β=(E[XtXt])1E[XtYt].

The error et is defined by (14.54) and thus its properties are determined by that relationship. Implicitly the model assumes that the variables have finite second moments and E[XtXt]>0, otherwise the model is not uniquely defined and a regressor could be eliminated. By the property of projection the error is uncorrelated with the regressors E[Xtet]=0.

The least squares estimator of β is

β^=(t=1nXtXt)1(t=1nXtYt).

Under the assumption that the joint series (Yt,Xt) is strictly stationary and ergodic the estimator is consistent. Under the mixing and moment conditions of Theorem 14.32 the estimator is asymptotically normal with a general covariance matrix

However, under the stronger assumption that the error is a MDS the asymptotic covariance matrix simplifies. It is worthwhile investigating this condition further. The necessary condition is $E[e_t \mid \mathcal{F}_{t-1}] = 0$ where $\mathcal{F}_{t-1}$ is an information set to which $(e_{t-1}, X_t)$ is adapted. This notation may appear somewhat odd, but recall in the autoregressive context that $X_t = (1, Y_{t-1}, \ldots, Y_{t-p})'$ contains variables dated time $t-1$ and previously; thus $X_t$ in this context is a "time $t-1$" variable. The reason why we need $(e_{t-1}, X_t)$ to be adapted to $\mathcal{F}_{t-1}$ is that for the regression function $X_t'\beta$ to be the conditional mean of $Y_t$ given $\mathcal{F}_{t-1}$, $X_t$ must be part of the information set $\mathcal{F}_{t-1}$. Under this assumption

$$E[X_t e_t \mid \mathcal{F}_{t-1}] = X_t\, E[e_t \mid \mathcal{F}_{t-1}] = 0$$

so (Xtet,Ft) is a MDS. This means we can apply the MDS CLT to obtain the asymptotic distribution.

We summarize this discussion with the following formal statement.

Theorem 14.35 If (Yt,Xt) is strictly stationary, ergodic, with finite second moments, and Q=E[XtXt]>0, then β in (14.55) is uniquely defined and the least squares estimator is consistent, β^pβ.

If in addition, E[etFt1]=0, where Ft1 is an information set to which (et1,Xt) is adapted, E|Yt|4<, and EXt4<, then

n(β^β)dN(0,Q1ΩQ1)

as n, where Ω=E[XtXtet2]

Alternatively, if for some $r > 4$, $E|Y_t|^r < \infty$, $E\|X_t\|^r < \infty$, and the mixing coefficients for $(Y_t, X_t)$ satisfy $\sum_{\ell=1}^{\infty} \alpha(\ell)^{1-4/r} < \infty$, then (14.56) holds with

$$\Omega = \sum_{\ell=-\infty}^{\infty} E\big[X_t X_{t-\ell}'\, e_t e_{t-\ell}\big].$$

14.41 Static, Distributed Lag, and Autoregressive Distributed Lag Models

In this section we describe standard linear time series regression models.

Let (Yt,Zt) be paired observations with Yt the dependent variable and Zt an observed regressor vector which does not include lagged Yt.

The simplest regression model is the static equation

Yt=α+Ztβ+et.

This is (14.54) with $X_t = (1, Z_t')'$. Static models are motivated to describe how $Y_t$ and $Z_t$ co-move. Their advantage is their simplicity. The disadvantage is that they are difficult to interpret. The coefficient is the best linear predictor (14.55) but almost certainly is dynamically misspecified. The regression of $Y_t$ on contemporaneous $Z_t$ is difficult to interpret without a causal framework since the two may be simultaneous. If this regression is estimated it is important that the standard errors be calculated using the Newey-West method to account for serial correlation in the error.

A model which allows the regressor to have impact over several periods is called a distributed lag (DL) model. It takes the form

Yt=α+Zt1β1+Zt2β2++Ztqβq+et.

It is also possible to include the contemporaneous regressor $Z_t$. In this model the leading coefficient $\beta_1$ represents the initial impact of $Z_t$ on $Y_t$, $\beta_2$ represents the impact in the second period, and so on. The cumulative impact is the sum of the coefficients $\beta_1 + \cdots + \beta_q$, which is called the long-run multiplier.

The distributed lag model falls in the class (14.54) by setting Xt=(1,Zt1,Zt2,,Ztq). While it allows for a lagged impact of Zt on Yt, the model does not incorporate serial correlation so the error et should be expected to be serially correlated. Thus the model is (typically) dynamically misspecified which can make interpretation difficult. It is also necessary to use Newey-West standard errors to account for the serial correlation.

A more complete model combines autoregressive and distributed lags. It takes the form

Yt=α0+α1Yt1++αpYtp+Zt1β1++Ztqβq+et.

This is called an autoregressive distributed lag (AR-DL) model. It nests both the autoregressive and distributed lag models thereby combining serial correlation and dynamic impact. The AR-DL model falls in the class (14.54) by setting Xt=(1,Yt1,,Ytp,Zt1,,Ztq).

If the lag orders p and q are selected sufficiently large the AR-DL model will have an error which is approximately white noise in which case the model can be interpreted as dynamically well-specified and conventional standard error methods can be used.

In an AR-DL specification the long-run multiplier is

$$\frac{\beta_1 + \cdots + \beta_q}{1 - \alpha_1 - \cdots - \alpha_p}$$

which is a nonlinear function of the coefficients.
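As a small illustration (not from the textbook), the long-run multiplier is a one-line calculation from the estimated coefficients; `alpha_hat` and `beta_hat` below are hypothetical arrays holding the estimated autoregressive and distributed-lag coefficients (excluding the intercept).

```python
import numpy as np

def long_run_multiplier(alpha_hat, beta_hat):
    # alpha_hat = (α_1, ..., α_p), beta_hat = (β_1, ..., β_q)
    return np.sum(beta_hat) / (1 - np.sum(alpha_hat))
```

Since this is a nonlinear function of the coefficients, a standard error can be obtained by the delta method.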

14.43 Illustration

We illustrate the models described in the previous section using a classical Phillips curve for inflation prediction. A. W. Phillips (1958) famously observed that the unemployment rate and the wage inflation rate are negatively correlated over time. Equations relating the inflation rate, or the change in the inflation rate, to macroeconomic indicators such as the unemployment rate are typically described as “Phillips curves”. A simple Phillips curve takes the form

$$\Delta\pi_t = \alpha + \beta U_{t-1} + e_t \tag{14.58}$$

where πt is price inflation and Ut is the unemployment rate. This specification relates the change in inflation in a given period to the level of the unemployment rate in the previous period.

The least squares estimate of (14.58) using U.S. quarterly series from FRED-QD is reported in the first column of Table 14.3. Both heteroskedasticity-robust and Newey-West standard errors are reported. The Newey-West standard errors are the appropriate choice since the estimated equation is static - no modeling of the serial correlation. In this example the measured impact of the unemployment rate on inflation appears minimal. The estimate is consistent with a small effect of the unemployment rate on the inflation rate but it is not precisely estimated.

A distributed lag (DL) model takes the form

Δπt=α+β1Ut1+β2Ut2++βqUtq+et.

The least squares estimate of (14.59) is reported in the second column of Table 14.3. The estimates are quite different from the static model. We see large negative impacts in the first and third periods, countered by a large positive impact in the second period. The model suggests that the unemployment rate has a strong impact on the inflation rate but the long-run impact is mitigated. The long-run multiplier is reported at the bottom of the column. The point estimate of -0.022 is quite small and similar to the static estimate. It implies that an increase in the unemployment rate by 5 percentage points (a typical recession) decreases the long-run annual inflation rate by about half of a percentage point.

An AR-DL takes the form

Δπt=α0+α1Δπt1++αpΔπtp+β1Ut1++βqUtq+et.

The least squares estimate of (14.60) is reported in the third column of Table 14.3. The coefficient estimates are similar to those from the distributed lag model. The point estimate of the long-run multiplier is also nearly identical but with a smaller standard error.

14.44 Granger Causality

In the AR-DL model (14.60) the unemployment rate has no predictive impact on the inflation rate under the coefficient restriction β1==βq=0. This restriction is called Granger non-causality. When the coefficients are non-zero we say that the unemployment rate “Granger causes” the inflation rate. This definition of causality was developed by Granger (1969) and Sims (1972).

The reason why we call this “Granger causality” rather than “causality” is because this is not a structural definition. An alternative label is “predictive causality”.

To be precise, assume that we have two series (Yt,Zt). Consider the projection of Yt onto the lagged history of both series

$$Y_t = \mathcal{P}_{t-1}(Y_t) + e_t = \alpha_0 + \sum_{j=1}^{\infty}\alpha_j Y_{t-j} + \sum_{j=1}^{\infty}\beta_j Z_{t-j} + e_t$$

Table 14.3: Phillips Curve Regressions

  1. Standard errors robust to heteroskedasticity in parenthesis.

  2. Newey-West standard errors in square brackets with M=5.

We say that $Z_t$ does not Granger-cause $Y_t$ if $\beta_j = 0$ for all $j$. If $\beta_j \ne 0$ for some $j$ then we say that $Z_t$ Granger-causes $Y_t$.

It is important that the definition includes the projection on the past history of Yt. Granger causality means that Zt helps to predict Yt even after the past history of Yt has been accounted for.

The definition can alternatively be written in terms of conditional expectations rather than projections. We can say that Zt does not Granger-cause Yt if

$$E[Y_t \mid Y_{t-1}, Y_{t-2}, \ldots; Z_{t-1}, Z_{t-2}, \ldots] = E[Y_t \mid Y_{t-1}, Y_{t-2}, \ldots].$$

Granger causality can be tested in AR-DL models using a standard Wald or F test. In the context of model (14.60) we report the F statistic for β1==βq=0. The test rejects the hypothesis (and thus finds evidence of Granger causality) if the statistic is larger than the critical value (if the p-value is small) and fails to reject the hypothesis (and thus finds no evidence of causality) if the statistic is smaller than the critical value.

For example, in the results presented in Table 14.3 the F statistic for the hypothesis β1==β4=0 using the Newey-West covariance matrix is F=6.98 with a p-value of 0.000. This is statistically significant at any conventional level so we can conclude that the unemployment rate has a predictively causal impact on inflation.

Granger causality should not be interpreted structurally outside the context of an economic model. For example consider the regression of GDP growth rates Yt on stock price growth rates Rt. We use the quarterly series from FRED-QD, estimating an AR-DL specification with two lags

The coefficients on the lagged stock price growth rates are small in magnitude but the first lag appears statistically significant. The F statistic for exclusion of (Rt1,Rt2) is F=9.3 with a p-value of 0.0002, which is highly significant. We can therefore reject the hypothesis of no Granger causality and deduce that stock prices Granger-cause GDP growth. We should be wary of concluding that this is structurally causal - that stock market movements cause output fluctuations. A more reasonable explanation from economic theory is that stock prices are forward-looking measures of expected future profits. When corporate profits are forecasted to rise the value of corporate stock rises, bidding up stock prices. Thus stock prices move in advance of actual economic activity but are not necessarily structurally causal.

14.45 Testing for Serial Correlation in Regression Models

Consider the problem of testing for omitted serial correlation in an AR-DL model such as

Yt=α0+α1Yt1++αpYtp+β1Zt1++βqZtq+ut.

The null hypothesis is that ut is serially uncorrelated and the alternative hypothesis is that it is serially correlated. We can model the latter as a mean-zero autoregressive process

ut=θ1ut1++θrutr+et.

The hypothesis is

$$H_0: \theta_1 = \cdots = \theta_r = 0$$
$$H_1: \theta_j \ne 0 \text{ for some } j \ge 1.$$

There are two ways to implement a test of H0 against H1. The first is to estimate equations (14.61)(14.62) sequentially by least squares and construct a test for H0 on the second equation. This test is complicated by the two-step estimation. Therefore this approach is not recommended.

The second approach is to combine equations (14.61)-(14.62) into a single model and execute the test as a restriction within this model. One way to make this combination is by using lag operator notation. Write (14.61)-(14.62) as

$$\alpha(L) Y_t = \alpha_0 + \beta(L) Z_{t-1} + u_t, \qquad \theta(L) u_t = e_t.$$

Applying the operator $\theta(L)$ to the first equation we obtain

$$\theta(L)\alpha(L) Y_t = \theta(L)\alpha_0 + \theta(L)\beta(L) Z_{t-1} + \theta(L) u_t$$

or

$$\alpha^*(L) Y_t = \alpha_0^* + \beta^*(L) Z_{t-1} + e_t$$

where $\alpha^*(L) = \theta(L)\alpha(L)$ is a $p+r$ order polynomial and $\beta^*(L) = \theta(L)\beta(L)$ is a $q+r$ order polynomial. The restriction $H_0$ is that these are $p$ and $q$ order polynomials. Thus we can implement a test of $H_0$ against $H_1$ by estimating an AR-DL model with $p+r$ and $q+r$ lags, and testing the exclusion of the final $r$ lags of $Y_t$ and $Z_t$. This test has a conventional asymptotic distribution so is simple to implement.

The basic message is that testing for omitted serial correlation can be implemented in regression models by estimating and contrasting different dynamic specifications.

14.46 Bootstrap for Time Series

Recall that the bootstrap approximates the sampling distribution of estimators and test statistics by the empirical distribution of the observations. The traditional nonparametric bootstrap is appropriate for independent observations. For dependent observations alternative methods should be used.

Bootstrapping for time series is considerably more complicated than the cross section case. Many methods have been proposed. One of the challenges is that theoretical justifications are more difficult to establish than in the independent observation case.

In this section we describe the most popular methods to implement bootstrap resampling for time series data.

14.47 Recursive Bootstrap

  1. Estimate a complete model such as an AR(p), producing coefficient estimates $\hat{\alpha}$ and residuals $\hat{e}_t$.

  2. Fix the initial conditions $(Y_{-p+1}, Y_{-p+2}, \ldots, Y_0)$.

  3. Simulate i.i.d. draws $e_t^*$ from the empirical distribution of the residuals $\{\hat{e}_1, \ldots, \hat{e}_n\}$.

  4. Create the bootstrap series $Y_t^*$ by the recursive formula

$$Y_t^* = \hat{\alpha}_0 + \hat{\alpha}_1 Y_{t-1}^* + \hat{\alpha}_2 Y_{t-2}^* + \cdots + \hat{\alpha}_p Y_{t-p}^* + e_t^*.$$

This construction creates bootstrap samples $Y_t^*$ with the stochastic properties of the estimated AR(p) model, including the auxiliary assumption that the errors are i.i.d. This method can work well if the true process is an AR(p). One flaw is that it imposes homoskedasticity on the errors $e_t^*$, which may be different from the properties of the actual $e_t$. Another limitation is that it is inappropriate for AR-DL models unless the conditioning variables are strictly exogenous.
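The following is a minimal sketch of one bootstrap replication implementing steps 1-4 (not from the textbook). It assumes `alpha_hat` and `e_hat` are the estimates and residuals from a fitted AR(p) (for instance from the `ar_ols` sketch earlier), with `alpha_hat[0]` the intercept; after generating `y_star` one would re-estimate the model on it.

```python
import numpy as np

def recursive_bootstrap_sample(y, alpha_hat, e_hat, rng):
    p = len(alpha_hat) - 1                             # alpha_hat = (α_0, α_1, ..., α_p)
    n = len(e_hat)
    e_star = rng.choice(e_hat, size=n, replace=True)   # i.i.d. draws from the residuals
    y_star = list(y[:p])                               # fix the initial conditions
    for t in range(n):
        lags = y_star[-1:-p-1:-1]                      # Y*_{t-1}, ..., Y*_{t-p}
        y_star.append(alpha_hat[0] + np.dot(alpha_hat[1:], lags) + e_star[t])
    return np.array(y_star)

rng = np.random.default_rng(0)
# y_star = recursive_bootstrap_sample(y, alpha_hat, e_hat, rng)  # then re-estimate
```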

There are alternative versions of this basic method. First, instead of fixing the initial conditions at the sample values a random block can be drawn from the sample. The difference is that this produces an unconditional distribution rather than a conditional one. Second, instead of drawing the errors from the residuals a parametric (typically normal) distribution can be used. This can improve precision when sample sizes are small but otherwise is not recommended.

14.48 Pairwise Bootstrap

  1. Write the sample as {Yt,Xt} where Xt=(Yt1,,Ytp) contains the lagged values used in estimation.

  2. Apply the traditional nonparametric bootstrap which samples pairs (Yt,Xt) i.i.d. from {Yt,Xt} with replacement to create the bootstrap sample.

  3. Create the bootstrap estimates on this bootstrap sample, e.g. regress Yt on Xt.

This construction is essentially the traditional nonparametric bootstrap but applied to the paired sample {Yt,Xt}. It does not mimic the time series correlations across observations. However, it does produce bootstrap statistics with the correct first-order asymptotic distribution under MDS errors. This method may be useful when we are interested in the distribution of nonlinear functions of the coefficient estimates and therefore desire an improvement on the Delta Method approximation.

14.49 Fixed Design Residual Bootstrap

  1. Write the sample as {Yt,Xt,e^t} where Xt=(Yt1,,Ytp) contains the lagged values used in estimation and e^t are the residuals.

  2. Fix the regressors Xt at their sample values.

  3. Simulate i.i.d. draws et from the empirical distribution of the residuals {e^1,,e^n}.

  4. Set $Y_t^* = X_t'\hat{\beta} + e_t^*$.

This construction is similar to the pairwise bootstrap but imposes an i.i.d. error. It is therefore only valid when the errors are i.i.d. (and thus excludes heteroskedasticity).

14.50 Fixed Design Wild Bootstrap

  1. Write the sample as {Yt,Xt,e^t} where Xt=(Yt1,,Ytp) contains the lagged values used in estimation and e^t are the residuals.

  2. Fix the regressors Xt and residuals e^t at their sample values.

  3. Simulate i.i.d. auxiliary random variables ξt with mean zero and variance one. See Section 10.29 for a discussion of choices.

  4. Set $e_t^* = \xi_t\hat{e}_t$ and $Y_t^* = X_t'\hat{\beta} + e_t^*$.

This construction is similar to the pairwise and fixed design bootstrap combined with the wild bootstrap. This imposes the conditional mean assumption on the error but allows heteroskedasticity.

14.51 Block Bootstrap

  1. Write the sample as {Yt,Xt} where Xt=(Yt1,,Ytp) contains the lagged values used in estimation.

  2. Divide the sample of paired observations {Yt,Xt} into n/m blocks of length m.

  3. Resample complete blocks. For each simulated sample draw n/m blocks.

  4. Paste the blocks together to create the bootstrap time series {Yt,Xt}.

This construction allows for arbitrary stationary serial correlation, heteroskedasticity, and model misspecification. One challenge is that the block bootstrap is sensitive to the block length and the way that the data are partitioned into blocks. The method may also work less well in small samples. Notice that the block bootstrap with m=1 is equal to the pairwise bootstrap and the latter is the traditional nonparametric bootstrap. Thus the block bootstrap is a natural generalization of the nonparametric bootstrap.
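A minimal sketch of the resampling step follows (not from the textbook; it uses non-overlapping blocks, and the function name is hypothetical). It draws $n/m$ block starting points with replacement and pastes the corresponding index ranges together; the resulting index vector is then applied to the paired sample.

```python
import numpy as np

def block_bootstrap_indices(n, m, rng):
    n_blocks = n // m
    starts = rng.integers(0, n_blocks, size=n_blocks) * m   # draw n/m block starts
    return np.concatenate([np.arange(s, s + m) for s in starts])

rng = np.random.default_rng(0)
idx = block_bootstrap_indices(n=200, m=10, rng=rng)
# Y_star, X_star = Y[idx], X[idx]   # then re-estimate on the bootstrap sample
```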

14.52 Technical Proofs*

Proof of Theorem 14.2 Define Y~t=(Yt,Yt1,Yt2,)Rm× as the history of Yt up to time t. Write Xt=ϕ(Y~t). Let B be the pre-image of {Xtx} (the vectors Y~Rm× such that ϕ(Y~)x). Then

P[Xtx]=P[ϕ(Y~t)x]=P[Y~tB].

Since Yt is strictly stationary, P[Y~tB] is independent 9 of t. This means that the distribution of Xt is independent of t. This argument can be extended to show that the distribution of (Xt,,Xt+) is independent of t. This means that Xt is strictly stationary as claimed.

Proof of Theorem 14.3 By the Cauchy criterion for convergence (see Theorem A.2 of Probability and Statistics for Economists), SN=j=0NajYtj converges almost surely if for all ϵ>0,

infNsupj>N|SN+jSN|ϵ.

9 An astute reader may notice that the independence of P[Y~tB] from t does not follow directly from the definition of strict stationarity. Indeed, a full derivation requires a measure-theoretic treatment. See Section 1.2.B of Petersen (1983) or Section 3.5 of Stout (1974). Let Aϵ be this event. Its complement is

Aϵc=N=1{supj>N|i=N+1N+jaiYti|>ϵ}.

This has probability

$$P[A_\epsilon^c] \le \lim_{N\to\infty} P\left[\sup_{j>N}\left|\sum_{i=N+1}^{N+j} a_i Y_{t-i}\right| > \epsilon\right] \le \lim_{N\to\infty} \frac{1}{\epsilon}\, E\left[\sup_{j>N}\left|\sum_{i=N+1}^{N+j} a_i Y_{t-i}\right|\right] \le \frac{1}{\epsilon}\lim_{N\to\infty} \sum_{i=N+1}^{\infty} |a_i|\, E|Y_{t-i}| = 0.$$

The second inequality is Markov's inequality (B.36) and the third is the triangle inequality (B.1). The limit is zero because $\sum_{i=0}^{\infty}|a_i| < \infty$ and $E|Y_t| < \infty$. Hence for all $\epsilon > 0$, $P[A_\epsilon^c] = 0$ and $P[A_\epsilon] = 1$. This means that $S_N$ converges with probability one, as claimed.

Since Yt is strictly stationary then Xt is as well by Theorem 14.2.

Proof of Theorem 14.4 See Theorem 14.14.

Proof of Theorem 14.5 Strict stationarity follows from Theorem 14.2. Let $\widetilde{Y}_t$ and $\widetilde{X}_t$ be the histories of $Y_t$ and $X_t$. Write $X_t=\phi(\widetilde{Y}_t)$. Let $A$ be an invariant event for $X_t$. We want to show $P[A]=0$ or $1$. The event $A$ is a collection of $\widetilde{X}_t$ histories, and occurs if and only if an associated collection of $\widetilde{Y}_t$ histories occurs. That is, for some sets $G$ and $H$,

$$A=\{\widetilde{X}_t\in G\}=\{\phi(\widetilde{Y}_t)\in G\}=\{\widetilde{Y}_t\in H\}.$$

The assumption that $A$ is invariant means that it is unaffected by the time shift, and thus can be written as

$$A=\{\widetilde{X}_{t+\ell}\in G\}=\{\widetilde{Y}_{t+\ell}\in H\}.$$

This means the event $\{\widetilde{Y}_{t+\ell}\in H\}$ is invariant. Since $Y_t$ is ergodic the event has probability 0 or 1. Hence $P[A]=0$ or $1$, as desired.

Proof of Theorem 14.7 Suppose $Y_t$ is discrete with support on $(\tau_1,\ldots,\tau_N)$ and without loss of generality assume $E[Y_t]=0$. Then by Theorem 14.8

$$\begin{aligned}
\lim_{n\to\infty}\frac{1}{n}\sum_{\ell=1}^{n}\operatorname{cov}\left(Y_t,Y_{t+\ell}\right)
&=\lim_{n\to\infty}\frac{1}{n}\sum_{\ell=1}^{n}E\left[Y_tY_{t+\ell}\right]
=\lim_{n\to\infty}\frac{1}{n}\sum_{\ell=1}^{n}\sum_{j=1}^{N}\sum_{k=1}^{N}\tau_j\tau_kP\left[Y_t=\tau_j,Y_{t+\ell}=\tau_k\right]\\
&=\sum_{j=1}^{N}\sum_{k=1}^{N}\tau_j\tau_k\lim_{n\to\infty}\frac{1}{n}\sum_{\ell=1}^{n}P\left[Y_t=\tau_j,Y_{t+\ell}=\tau_k\right]
=\sum_{j=1}^{N}\sum_{k=1}^{N}\tau_j\tau_kP\left[Y_t=\tau_j\right]P\left[Y_{t+\ell}=\tau_k\right]\\
&=E[Y_t]E[Y_{t+\ell}]=0,
\end{aligned}$$

which is (14.4). This can be extended to the case of continuous distributions using the monotone convergence theorem. See Corollary 14.8 of Davidson (1994).

Proof of Theorem 14.9 We show (14.6). (14.7) follows by Markov’s inequality (B.36). Without loss of generality we focus on the scalar case and assume E[Yt]=0. Fix ϵ>0. Pick B large enough such that

$$E\left|Y_t1\{|Y_t|>B\}\right|\le\frac{\epsilon}{4} \tag{14.63}$$

which is feasible because $E|Y_t|<\infty$. Define

$$W_t=Y_t1\{|Y_t|\le B\}-E\left[Y_t1\{|Y_t|\le B\}\right], \qquad Z_t=Y_t1\{|Y_t|>B\}-E\left[Y_t1\{|Y_t|>B\}\right].$$

Notice that Wt is a bounded transformation of the ergodic series Yt. Thus by (14.4) and (14.9) there is an n sufficiently large so that

$$\frac{\operatorname{var}[W_t]}{n}+\frac{2}{n}\sum_{m=1}^{n}\left(1-\frac{m}{n}\right)\operatorname{cov}\left(W_t,W_{t-m}\right)\le\frac{\epsilon^2}{4}. \tag{14.64}$$

By the triangle inequality (B.1)

$$E|\bar{Y}|=E|\bar{W}+\bar{Z}|\le E|\bar{W}|+E|\bar{Z}|. \tag{14.65}$$

By another application of the triangle inequality and (14.63)

$$E|\bar{Z}|\le E|Z_t|\le 2E\left|Y_t1\{|Y_t|>B\}\right|\le\frac{\epsilon}{2}. \tag{14.66}$$

By Jensen's inequality (B.27), direct calculation, and (14.64)

$$\left(E|\bar{W}|\right)^2\le E\left[|\bar{W}|^2\right]=\frac{1}{n^2}\sum_{t=1}^{n}\sum_{j=1}^{n}E\left[W_tW_j\right]=\frac{\operatorname{var}[W_t]}{n}+\frac{2}{n}\sum_{m=1}^{n}\left(1-\frac{m}{n}\right)\operatorname{cov}\left(W_t,W_{t-m}\right)\le\frac{\epsilon^2}{4}.$$

Thus

$$E|\bar{W}|\le\frac{\epsilon}{2}. \tag{14.67}$$

Together, (14.65), (14.66) and (14.67) show that $E|\bar{Y}|\le\epsilon$. Since $\epsilon$ is arbitrary, this establishes (14.6) as claimed.
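As an aside, the conclusion of Theorem 14.9 is easy to illustrate by simulation. The sketch below (Python with numpy; the AR(1) specification and parameter values are arbitrary choices made for the illustration, not part of the proof) shows the sample mean of a strictly stationary, ergodic AR(1) approaching its population mean as $n$ grows.

```python
import numpy as np

# Illustration of the Ergodic Theorem (Theorem 14.9), not part of the proof:
# the sample mean of a stationary, ergodic AR(1) converges to its mean mu.
rng = np.random.default_rng(0)
alpha, mu = 0.8, 2.0                        # arbitrary illustrative values
for n in (10**2, 10**3, 10**4, 10**5):
    y = np.empty(n)
    y[0] = mu                               # start at the mean
    for t in range(1, n):
        y[t] = mu * (1 - alpha) + alpha * y[t - 1] + rng.standard_normal()
    print(n, y.mean())                      # approaches mu = 2 as n grows
```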

Proof of Theorem 14.11 (sketch) By the Cramér-Wold device (Theorem 8.4 of Probability and Statistics for Economists) it is sufficient to establish the result for scalar $u_t$. Let $\sigma^2=E[u_t^2]$. By a Taylor series expansion, for small $x$, $\log(1+x)\approx x-x^2/2$. Taking exponentials and rearranging we obtain the approximation

$$\exp(x)\approx(1+x)\exp\left(\frac{x^2}{2}\right). \tag{14.68}$$

Fix $\lambda$. Define

$$T_j=\prod_{i=1}^{j}\left(1+\frac{\lambda}{\sqrt{n}}u_i\right), \qquad V_n=\frac{1}{n}\sum_{t=1}^{n}u_t^2.$$

Since $u_t$ is strictly stationary and ergodic, $V_n\to_p\sigma^2$ by the Ergodic Theorem (Theorem 14.9). Since $u_t$ is a MDS,

$$E[T_n]=1. \tag{14.69}$$

To see this, define $\mathcal{F}_t=\sigma(\ldots,u_{t-1},u_t)$. Note that $T_j=T_{j-1}\left(1+\frac{\lambda}{\sqrt{n}}u_j\right)$. By iterated expectations

$$E[T_n]=E\left[E[T_n\mid\mathcal{F}_{n-1}]\right]=E\left[T_{n-1}E\left[1+\frac{\lambda}{\sqrt{n}}u_n\,\Big|\,\mathcal{F}_{n-1}\right]\right]=E[T_{n-1}]=\cdots=E[T_1]=1.$$

This is (14.69).

The moment generating function of $S_n$ is

$$\begin{aligned}
E\left[\exp\left(\frac{\lambda}{\sqrt{n}}\sum_{t=1}^{n}u_t\right)\right]
&=E\left[\prod_{t=1}^{n}\exp\left(\frac{\lambda}{\sqrt{n}}u_t\right)\right]\\
&\approx E\left[\prod_{t=1}^{n}\left(1+\frac{\lambda}{\sqrt{n}}u_t\right)\exp\left(\frac{\lambda^2}{2n}u_t^2\right)\right]
&& (14.70)\\
&=E\left[T_n\exp\left(\frac{\lambda^2V_n}{2}\right)\right]\\
&\approx E\left[T_n\exp\left(\frac{\lambda^2\sigma^2}{2}\right)\right]
&& (14.71)\\
&=\exp\left(\frac{\lambda^2\sigma^2}{2}\right).
\end{aligned}$$

The approximation in (14.70) is (14.68). The approximation in (14.71) is $V_n\to_p\sigma^2$. (A rigorous justification which allows this substitution inside the expectation is technical.) The final equality is (14.69). This shows that the moment generating function of $S_n$ is approximately that of $\mathrm{N}(0,\sigma^2)$, as claimed.

The assumption that $u_t$ is a MDS is critical for (14.69). $T_n$ is a nonlinear function of the errors $u_t$, so a white noise assumption cannot be used instead; the MDS assumption is exactly the minimal condition needed to obtain (14.69).
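The content of the theorem can also be illustrated by simulation. The sketch below (Python with numpy; the ARCH(1) specification and parameter values are arbitrary choices made for the example) uses an error that is a MDS but not i.i.d., and checks that the variance of $n^{-1/2}\sum_{t=1}^{n}u_t$ across replications is close to $\sigma^2=E[u_t^2]=\omega/(1-a)$.

```python
import numpy as np

# Illustration (not part of the proof): an ARCH(1) error is a MDS but not i.i.d.
# Theorem 14.11 implies n^{-1/2} * sum(u_t) is approximately N(0, sigma^2) with
# sigma^2 = E[u_t^2] = omega / (1 - a).
rng = np.random.default_rng(0)
omega, a, n, reps = 0.2, 0.5, 1000, 2000    # arbitrary illustrative values
S = np.empty(reps)
for r in range(reps):
    u = np.empty(n)
    u_prev = 0.0
    for t in range(n):
        u_prev = np.sqrt(omega + a * u_prev**2) * rng.standard_normal()
        u[t] = u_prev
    S[r] = u.sum() / np.sqrt(n)
print(S.var(), omega / (1 - a))             # the two values should be close
```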

Proof of Theorem 14.13.1 Without loss of generality suppose $E[X_{t-m}]=0$ and $E[Z_t]=0$. Set $\eta_{t-m}=\operatorname{sgn}\left(E[Z_t\mid\mathcal{F}_{t-m}]\right)$. By iterated expectations, $|X_{t-m}|\le C_1$, $\left|E[Z_t\mid\mathcal{F}_{t-m}]\right|=\eta_{t-m}E[Z_t\mid\mathcal{F}_{t-m}]$, and again using iterated expectations,

$$\begin{aligned}
\left|\operatorname{cov}\left(X_{t-m},Z_t\right)\right|
&=\left|E\left[E\left[X_{t-m}Z_t\mid\mathcal{F}_{t-m}\right]\right]\right|
=\left|E\left(X_{t-m}E\left[Z_t\mid\mathcal{F}_{t-m}\right]\right)\right|\\
&\le C_1E\left|E\left[Z_t\mid\mathcal{F}_{t-m}\right]\right|
=C_1E\left[\eta_{t-m}E\left[Z_t\mid\mathcal{F}_{t-m}\right]\right]
=C_1E\left[E\left[\eta_{t-m}Z_t\mid\mathcal{F}_{t-m}\right]\right]\\
&=C_1E\left[\eta_{t-m}Z_t\right]
=C_1\operatorname{cov}\left(\eta_{t-m},Z_t\right).
\end{aligned} \tag{14.72}$$

Setting $\xi_t=\operatorname{sgn}\left(E[X_{t-m}\mid\mathcal{F}_t]\right)$, by a similar argument (14.72) is bounded by $C_1C_2\left|\operatorname{cov}\left(\eta_{t-m},\xi_t\right)\right|$. Set $A_1=\{\eta_{t-m}=1\}$, $A_2=\{\eta_{t-m}=-1\}$, $B_1=\{\xi_t=1\}$, $B_2=\{\xi_t=-1\}$. We calculate

$$\left|\operatorname{cov}\left(\eta_{t-m},\xi_t\right)\right|
=\Big|P[A_1\cap B_1]+P[A_2\cap B_2]-P[A_2\cap B_1]-P[A_1\cap B_2]-P[A_1]P[B_1]-P[A_2]P[B_2]+P[A_2]P[B_1]+P[A_1]P[B_2]\Big|\le 4\alpha(m).$$

Together, $\left|\operatorname{cov}\left(X_{t-m},Z_t\right)\right|\le 4C_1C_2\alpha(m)$ as claimed.

Proof of Theorem 14.13.2 Assume $E[X_t]=0$ and $E[Z_t]=0$. We first show that if $|X_t|\le C$ then

$$\left|\operatorname{cov}\left(X_t,Z_t\right)\right|\le 6C\left(E|Z_t|^r\right)^{1/r}\alpha(\ell)^{1-1/r}. \tag{14.73}$$

Indeed, if $\alpha(\ell)=0$ the result is immediate, so assume $\alpha(\ell)>0$. Set $D=\alpha(\ell)^{-1/r}\left(E|Z_t|^r\right)^{1/r}$, $V_t=Z_t1\{|Z_t|\le D\}$, and $W_t=Z_t1\{|Z_t|>D\}$. Using the triangle inequality (B.1) and then part 1, because $|X_t|\le C$ and $|V_t|\le D$,

$$\left|\operatorname{cov}\left(X_t,Z_t\right)\right|\le\left|\operatorname{cov}\left(X_t,V_t\right)\right|+\left|\operatorname{cov}\left(X_t,W_t\right)\right|\le 4CD\alpha(\ell)+2CE|W_t|.$$

Also,

$$E|W_t|=E\left|Z_t1\{|Z_t|>D\}\right|=E\left|\frac{|Z_t|^r}{|Z_t|^{r-1}}1\{|Z_t|>D\}\right|\le\frac{E|Z_t|^r}{D^{r-1}}=\alpha(\ell)^{(r-1)/r}\left(E|Z_t|^r\right)^{1/r}$$

using the definition of $D$. Together we have

$$\left|\operatorname{cov}\left(X_t,Z_t\right)\right|\le 6C\left(E|Z_t|^r\right)^{1/r}\alpha(\ell)^{1-1/r},$$

which is (14.73) as claimed.

Now set $C=\alpha(\ell)^{-1/r}\left(E|X_t|^r\right)^{1/r}$, $V_t=X_t1\{|X_t|\le C\}$, and $W_t=X_t1\{|X_t|>C\}$. Using the triangle inequality and (14.73)

$$\left|\operatorname{cov}\left(X_t,Z_t\right)\right|\le\left|\operatorname{cov}\left(V_t,Z_t\right)\right|+\left|\operatorname{cov}\left(W_t,Z_t\right)\right|.$$

Since $|V_t|\le C$, using (14.73) and the definition of $C$

$$\left|\operatorname{cov}\left(V_t,Z_t\right)\right|\le 6C\left(E|Z_t|^q\right)^{1/q}\alpha(\ell)^{1-1/q}=6\left(E|X_t|^r\right)^{1/r}\left(E|Z_t|^q\right)^{1/q}\alpha(\ell)^{1-1/q-1/r}.$$

Using Hölder's inequality (B.31) and the definition of $C$

$$\begin{aligned}
\left|\operatorname{cov}\left(W_t,Z_t\right)\right|
&\le 2\left(E|W_t|^{q/(q-1)}\right)^{(q-1)/q}\left(E|Z_t|^q\right)^{1/q}\\
&=2\left(E\left[|X_t|^{q/(q-1)}1\{|X_t|>C\}\right]\right)^{(q-1)/q}\left(E|Z_t|^q\right)^{1/q}\\
&=2\left(E\left[\frac{|X_t|^r}{|X_t|^{r-q/(q-1)}}1\{|X_t|>C\}\right]\right)^{(q-1)/q}\left(E|Z_t|^q\right)^{1/q}\\
&\le 2C^{1-r(q-1)/q}\left(E|X_t|^r\right)^{(q-1)/q}\left(E|Z_t|^q\right)^{1/q}\\
&=2\left(E|X_t|^r\right)^{1/r}\left(E|Z_t|^q\right)^{1/q}\alpha(\ell)^{1-1/q-1/r}.
\end{aligned}$$

Together we have

$$\left|\operatorname{cov}\left(X_t,Z_t\right)\right|\le 8\left(E|X_t|^r\right)^{1/r}\left(E|Z_t|^q\right)^{1/q}\alpha(\ell)^{1-1/r-1/q}$$

as claimed.

Proof of Theorem 14.13.3 Set $\eta_t=\operatorname{sgn}\left(E[Z_t\mid\mathcal{F}_t]\right)$, which satisfies $|\eta_t|\le 1$. Since $\eta_t$ is $\mathcal{F}_t$-measurable, using iterated expectations, (14.73) with $C=1$, the conditional Jensen's inequality (B.28), and iterated expectations again,

$$E\left|E[Z_t\mid\mathcal{F}_t]\right|
=E\left[\eta_tE[Z_t\mid\mathcal{F}_t]\right]
=E\left[E[\eta_tZ_t\mid\mathcal{F}_t]\right]
=E\left[\eta_tZ_t\right]
\le 6\left(E\left|E[Z_t\mid\mathcal{F}_t]\right|^r\right)^{1/r}\alpha(\ell)^{1-1/r}
\le 6\left(E\left(E\left[|Z_t|^r\mid\mathcal{F}_t\right]\right)\right)^{1/r}\alpha(\ell)^{1-1/r}
=6\left(E|Z_t|^r\right)^{1/r}\alpha(\ell)^{1-1/r}$$

as claimed.

Proof of Theorem 14.15 By the Cramér-Wold device (Theorem 8.4 of Probability and Statistics for Economists) it is sufficient to prove the result for the scalar case. Our proof method is based on a MDS approximation. The trick is to establish the relationship

$$u_t=e_t+Z_t-Z_{t+1} \tag{14.74}$$

where $e_t$ is a strictly stationary and ergodic MDS with $E[e_t^2]=\Omega$ and $E|Z_t|<\infty$. Defining $S_n^e=\frac{1}{\sqrt{n}}\sum_{t=1}^{n}e_t$, we have

$$S_n=\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\left(e_t+Z_t-Z_{t+1}\right)=S_n^e+\frac{Z_1}{\sqrt{n}}-\frac{Z_{n+1}}{\sqrt{n}}. \tag{14.75}$$

The first component on the right side is asymptotically N(0,Ω) by the MDS CLT (Theorem 14.11). The second and third terms are op(1) by Markov’s inequality (B.36).

The desired relationship (14.74) holds as follows. Set $\mathcal{F}_t=\sigma(\ldots,u_{t-1},u_t)$,

$$e_t=\sum_{\ell=0}^{\infty}\left(E[u_{t+\ell}\mid\mathcal{F}_t]-E[u_{t+\ell}\mid\mathcal{F}_{t-1}]\right) \tag{14.76}$$

and

$$Z_t=\sum_{\ell=0}^{\infty}E[u_{t+\ell}\mid\mathcal{F}_{t-1}].$$

You can verify that these definitions satisfy (14.74) given $E[u_t\mid\mathcal{F}_t]=u_t$. The variable $Z_t$ has a finite expectation because by the triangle inequality (B.1), Theorem 14.13.3, and the assumptions,

$$E|Z_t|=E\left|\sum_{\ell=0}^{\infty}E[u_{t+\ell}\mid\mathcal{F}_{t-1}]\right|\le 6\left(E|u_t|^r\right)^{1/r}\sum_{\ell=0}^{\infty}\alpha(\ell)^{1-1/r}<\infty,$$

the final inequality because $\sum_{\ell=0}^{\infty}\alpha(\ell)^{1-2/r}<\infty$ implies $\sum_{\ell=0}^{\infty}\alpha(\ell)^{1-1/r}<\infty$.

The series et in (14.76) has a finite expectation by the same calculation as for Zt. It is a MDS since by iterated expectations

$$\begin{aligned}
E\left[e_t\mid\mathcal{F}_{t-1}\right]
&=E\left[\sum_{\ell=0}^{\infty}\left(E[u_{t+\ell}\mid\mathcal{F}_t]-E[u_{t+\ell}\mid\mathcal{F}_{t-1}]\right)\Big|\,\mathcal{F}_{t-1}\right]\\
&=\sum_{\ell=0}^{\infty}\left(E\left[E[u_{t+\ell}\mid\mathcal{F}_t]\mid\mathcal{F}_{t-1}\right]-E\left[E[u_{t+\ell}\mid\mathcal{F}_{t-1}]\mid\mathcal{F}_{t-1}\right]\right)\\
&=\sum_{\ell=0}^{\infty}\left(E[u_{t+\ell}\mid\mathcal{F}_{t-1}]-E[u_{t+\ell}\mid\mathcal{F}_{t-1}]\right)=0.
\end{aligned}$$

It is strictly stationary and ergodic by Theorem 14.2 because it is a function of the history $(\ldots,u_{t-1},u_t)$.

The proof is completed by showing that $e_t$ has a finite variance which equals $\Omega$. The trickiest step is to show that $\operatorname{var}[e_t]<\infty$. Since

$$E|S_n|\le\left(\operatorname{var}[S_n]\right)^{1/2}\to\Omega^{1/2}$$

(as shown in (14.17)) it follows that $E|S_n|\le 2\Omega^{1/2}$ for $n$ sufficiently large. Using (14.75) and $E|Z_t|<\infty$, for $n$ sufficiently large,

$$E|S_n^e|\le E|S_n|+\frac{E|Z_1|}{\sqrt{n}}+\frac{E|Z_{n+1}|}{\sqrt{n}}\le 3\Omega^{1/2}. \tag{14.77}$$

Now define $e_{Bt}=e_t1\{|e_t|\le B\}-E\left[e_t1\{|e_t|\le B\}\mid\mathcal{F}_{t-1}\right]$, which is a bounded MDS. By Theorem 14.11, $\frac{1}{\sqrt{n}}\sum_{t=1}^{n}e_{Bt}\to_d\mathrm{N}(0,\sigma_B^2)$ where $\sigma_B^2=E[e_{Bt}^2]$. Since the sequence is uniformly integrable this implies

$$E\left|\frac{1}{\sqrt{n}}\sum_{t=1}^{n}e_{Bt}\right|\to E\left|\mathrm{N}(0,\sigma_B^2)\right|=\sqrt{\frac{2}{\pi}}\,\sigma_B \tag{14.78}$$

using $E|\mathrm{N}(0,1)|=\sqrt{2/\pi}$. We want to show that $\operatorname{var}[e_t]<\infty$. Suppose not. Then $\sigma_B\to\infty$ as $B\to\infty$, so there is some $B$ sufficiently large such that the right side of (14.78) exceeds the right side of (14.77). This is a contradiction. We deduce that $\operatorname{var}[e_t]<\infty$.

Examining (14.75), since $\operatorname{var}[S_n]\to\Omega<\infty$ and $\operatorname{var}[S_n^e]=\operatorname{var}[e_t]<\infty$, it follows that $\operatorname{var}\left[Z_1-Z_{n+1}\right]/n<\infty$. Since $Z_t$ is stationary, we deduce that $\operatorname{var}\left[Z_1-Z_{n+1}\right]<\infty$. Equation (14.75) then implies $\operatorname{var}[e_t]=\operatorname{var}[S_n^e]=\operatorname{var}[S_n]+o(1)\to\Omega$. We deduce that $\operatorname{var}[e_t]=\Omega$ as claimed.

Proof of Theorem 14.17 (Sketch) Consider the projection of $Y_t$ onto $(\ldots,e_{t-1},e_t)$. Since the projection errors $e_t$ are uncorrelated, the coefficients of this projection are the bivariate projection coefficients $b_j=E[Y_te_{t-j}]/E[e_{t-j}^2]$. The leading coefficient is

$$b_0=\frac{E[Y_te_t]}{\sigma^2}=\frac{\sum_{j=1}^{\infty}\alpha_jE[Y_{t-j}e_t]+E[e_t^2]}{\sigma^2}=1$$

using Theorem 14.16. By Bessel’s Inequality (Brockwell and Davis, 1991, Corollary 2.4.1),

$$\sum_{j=1}^{\infty}b_j^2=\sigma^{-4}\sum_{j=1}^{\infty}\left(E[Y_te_{t-j}]\right)^2\le\sigma^{-4}\left(E[Y_t^2]\right)^2<\infty$$

because $E[Y_t^2]<\infty$ by the assumption of covariance stationarity.

The error from the projection of $Y_t$ onto $(\ldots,e_{t-1},e_t)$ is $\mu_t=Y_t-\sum_{j=0}^{\infty}b_je_{t-j}$. The fact that this can be written as (14.22) is technical. See Theorem 5.7.1 of Brockwell and Davis (1991).

Proof of Theorem 14.22 In the text we showed that $|\lambda_j|<1$ is sufficient for $Y_t$ to be strictly stationary and ergodic. We now verify that $|\lambda_j|<1$ is equivalent to (14.35)-(14.37). The roots $\lambda_j$ are defined in (14.34). Consider separately the cases of real roots and complex roots.

Suppose that the roots are real, which occurs when $\alpha_1^2+4\alpha_2\ge 0$. Then $|\lambda_j|<1$ iff $|\alpha_1|<2$ and

$$\frac{\alpha_1+\sqrt{\alpha_1^2+4\alpha_2}}{2}<1 \quad\text{ and }\quad -1<\frac{\alpha_1-\sqrt{\alpha_1^2+4\alpha_2}}{2}.$$

Equivalently, this holds iff

$$\alpha_1^2+4\alpha_2<\left(2-\alpha_1\right)^2=4-4\alpha_1+\alpha_1^2 \quad\text{ and }\quad \alpha_1^2+4\alpha_2<\left(2+\alpha_1\right)^2=4+4\alpha_1+\alpha_1^2$$

or equivalently iff

$$\alpha_2<1-\alpha_1 \quad\text{ and }\quad \alpha_2<1+\alpha_1$$

which are (14.35) and (14.36). $\alpha_1^2+4\alpha_2\ge 0$ and $|\alpha_1|<2$ imply $\alpha_2\ge-\alpha_1^2/4>-1$, which is (14.37).

Now suppose the roots are complex, which occurs when $\alpha_1^2+4\alpha_2<0$. The squared modulus of the roots $\lambda_j=\left(\alpha_1\pm\sqrt{\alpha_1^2+4\alpha_2}\right)/2$ is

$$|\lambda_j|^2=\left(\frac{\alpha_1}{2}\right)^2-\left(\frac{\sqrt{\alpha_1^2+4\alpha_2}}{2}\right)^2=-\alpha_2.$$

Thus the requirement $|\lambda_j|<1$ is satisfied iff $\alpha_2>-1$, which is (14.37). $\alpha_1^2+4\alpha_2<0$ and $\alpha_2>-1$ imply $\alpha_1^2<-4\alpha_2<4$, so $|\alpha_1|<2$. $\alpha_1^2+4\alpha_2<0$ and $|\alpha_1|<2$ imply $\alpha_1+\alpha_2<\alpha_1-\alpha_1^2/4<1$ and $\alpha_2-\alpha_1<-\alpha_1^2/4-\alpha_1<1$, which are (14.35) and (14.36).
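The equivalence just established can be checked numerically. The sketch below (Python with numpy; the sampling scheme and function names are assumptions made for the example) verifies that the roots of $\lambda^2-\alpha_1\lambda-\alpha_2=0$ lie strictly inside the unit circle exactly when (14.35)-(14.37) hold.

```python
import numpy as np

# Numerical check (illustrative) of the equivalence proved above.
def roots_inside(a1, a2):
    lam = np.roots([1.0, -a1, -a2])          # roots of lambda^2 - a1*lambda - a2
    return bool(np.all(np.abs(lam) < 1))

def region(a1, a2):
    return (a2 < 1 - a1) and (a2 < 1 + a1) and (a2 > -1)   # (14.35)-(14.37)

rng = np.random.default_rng(0)
for _ in range(10000):
    a1, a2 = rng.uniform(-2.5, 2.5, size=2)
    assert roots_inside(a1, a2) == region(a1, a2)
print("root condition and (14.35)-(14.37) agree at all sampled points")
```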

Proof of Theorem 14.23 To complete the proof we need to establish that the eigenvalues $\lambda_j$ of $\boldsymbol{A}$ defined in (14.40) equal the reciprocals of the roots $r_j$ of the autoregressive polynomial $\alpha(z)$ of (14.39). Our goal is therefore to show that if $\lambda$ satisfies $\det\left(\boldsymbol{A}-\boldsymbol{I}_p\lambda\right)=0$ then it satisfies $\alpha(1/\lambda)=0$.

Notice that

$$\boldsymbol{A}-\boldsymbol{I}_p\lambda=\begin{pmatrix}-\lambda+\alpha_1 & \widetilde{\alpha}\\ a & \boldsymbol{B}\end{pmatrix}$$

where $\widetilde{\alpha}=(\alpha_2,\ldots,\alpha_p)$, $a=(1,0,\ldots,0)'$, and $\boldsymbol{B}$ is the lower triangular matrix with $-\lambda$ on the diagonal and $1$ immediately below the diagonal. Notice that $\det(\boldsymbol{B})=(-\lambda)^{p-1}$ and by direct calculation

$$\boldsymbol{B}^{-1}=\begin{pmatrix}
-\lambda^{-1} & 0 & \cdots & 0 & 0\\
-\lambda^{-2} & -\lambda^{-1} & \cdots & 0 & 0\\
-\lambda^{-3} & -\lambda^{-2} & \cdots & 0 & 0\\
\vdots & \vdots & & \vdots & \vdots\\
-\lambda^{-p+1} & -\lambda^{-p+2} & \cdots & -\lambda^{-2} & -\lambda^{-1}
\end{pmatrix}.$$

Using the properties of the determinant (Theorem A.1.5)

$$\det\left(\boldsymbol{A}-\boldsymbol{I}_p\lambda\right)=\det\begin{pmatrix}-\lambda+\alpha_1 & \widetilde{\alpha}\\ a & \boldsymbol{B}\end{pmatrix}=\det\left(\boldsymbol{B}\right)\left(-\lambda+\alpha_1-\widetilde{\alpha}\boldsymbol{B}^{-1}a\right)=(-\lambda)^{p}\left(1-\alpha_1\lambda^{-1}-\alpha_2\lambda^{-2}-\alpha_3\lambda^{-3}-\cdots-\alpha_p\lambda^{-p}\right)=(-\lambda)^{p}\alpha(1/\lambda).$$

Thus if $\lambda$ satisfies $\det\left(\boldsymbol{A}-\boldsymbol{I}_p\lambda\right)=0$ then $\alpha(1/\lambda)=0$ as required.
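This relationship is straightforward to verify numerically. The sketch below (Python with numpy; the AR(3) coefficient values are arbitrary choices made for the illustration) builds the companion matrix and compares its eigenvalues with the reciprocals of the roots of $\alpha(z)$.

```python
import numpy as np

# Numerical check (illustrative): eigenvalues of the companion matrix A equal
# the reciprocals of the roots of alpha(z) = 1 - alpha_1 z - ... - alpha_p z^p.
alpha = np.array([0.5, 0.2, -0.1])           # arbitrary (alpha_1, ..., alpha_p)
p = len(alpha)
A = np.zeros((p, p))
A[0, :] = alpha                              # first row holds the AR coefficients
A[1:, :-1] = np.eye(p - 1)                   # ones immediately below the diagonal
eigenvalues = np.linalg.eigvals(A)
coefs = np.concatenate((-alpha[::-1], [1.0]))    # alpha(z), highest power first
roots = np.roots(coefs)                          # the roots r_j of alpha(z)
print(np.sort_complex(eigenvalues))
print(np.sort_complex(1.0 / roots))              # matches the eigenvalues
```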

Proof of Theorem 14.24 By the Fundamental Theorem of Algebra we can factor the autoregressive polynomial as $\alpha(z)=\prod_{\ell=1}^{p}\left(1-\lambda_\ell z\right)$ where $\lambda_\ell=r_\ell^{-1}$. By assumption $|\lambda_\ell|<1$. Inverting the autoregressive polynomial we obtain

$$\alpha(z)^{-1}=\prod_{\ell=1}^{p}\left(1-\lambda_\ell z\right)^{-1}=\prod_{\ell=1}^{p}\left(\sum_{j=0}^{\infty}\lambda_\ell^jz^j\right)=\sum_{j=0}^{\infty}\left(\sum_{i_1+\cdots+i_p=j}\lambda_1^{i_1}\cdots\lambda_p^{i_p}\right)z^j=\sum_{j=0}^{\infty}b_jz^j$$

with $b_j=\sum_{i_1+\cdots+i_p=j}\lambda_1^{i_1}\cdots\lambda_p^{i_p}$.

Using the triangle inequality and the stars and bars theorem (Theorem 1.10 of Probability and Statistics for Economists)

$$|b_j|\le\sum_{i_1+\cdots+i_p=j}|\lambda_1|^{i_1}\cdots|\lambda_p|^{i_p}\le\sum_{i_1+\cdots+i_p=j}\lambda^j=\binom{p+j-1}{j}\lambda^j=\frac{(p+j-1)!}{(p-1)!\,j!}\lambda^j\le(j+1)^p\lambda^j$$

as claimed. We next verify the convergence of $\sum_{j=0}^{\infty}|b_j|\le\sum_{j=0}^{\infty}(j+1)^p\lambda^j$. Note that

$$\lim_{j\to\infty}\frac{(j+1)^p\lambda^j}{j^p\lambda^{j-1}}=\lambda<1.$$

By the ratio test (Theorem A.3.2 of Probability and Statistics for Economists) $\sum_{j=0}^{\infty}(j+1)^p\lambda^j$ is convergent.
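The bound can be illustrated numerically. In the sketch below (Python with numpy; the AR coefficients are arbitrary and $\lambda$ is taken here to be the largest modulus among the inverse roots $\lambda_\ell$), the coefficients $b_j$ are computed recursively from $b_j=\alpha_1b_{j-1}+\cdots+\alpha_pb_{j-p}$ (with $b_0=1$) and compared with $(j+1)^p\lambda^j$.

```python
import numpy as np

# Numerical check (illustrative) of the bound |b_j| <= (j+1)^p lambda^j.
alpha = np.array([0.5, 0.2, -0.1])               # arbitrary AR(3) coefficients
p = len(alpha)
A = np.zeros((p, p))
A[0, :] = alpha
A[1:, :-1] = np.eye(p - 1)
lam = np.max(np.abs(np.linalg.eigvals(A)))       # lambda = max_l |lambda_l|
J = 50
b = np.zeros(J)
b[0] = 1.0                                       # b_0 = 1
for j in range(1, J):                            # b_j from alpha(L) b(L) = 1
    b[j] = sum(alpha[k] * b[j - 1 - k] for k in range(min(p, j)))
bound = (np.arange(J) + 1.0) ** p * lam ** np.arange(J)
assert np.all(np.abs(b) <= bound + 1e-12)
print("largest ratio |b_j| / bound_j:", np.max(np.abs(b) / bound))
```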

Proof of Theorem 14.27 If $\boldsymbol{Q}$ is singular then there is some $\gamma\neq 0$ such that $\gamma'\boldsymbol{Q}\gamma=0$. We can normalize $\gamma$ to have a unit coefficient on $Y_{t-1}$ (or on the first non-zero coefficient other than the intercept). We then have that $E\left[\left(Y_{t-1}-(1,Y_{t-2},\ldots,Y_{t-p})\phi\right)^2\right]=0$ for some $\phi$, or equivalently $E\left[\left(Y_t-(1,Y_{t-1},\ldots,Y_{t-p+1})\phi\right)^2\right]=0$. Setting $\beta=(\phi',0)'$ this implies $E\left[\left(Y_t-\beta'X_t\right)^2\right]=0$. Since $\alpha$ is the best linear predictor we must have $\beta=\alpha$. This implies $\sigma^2=E\left[\left(Y_t-\alpha'X_t\right)^2\right]=0$. This contradicts the assumption $\sigma^2>0$. We conclude that $\boldsymbol{Q}$ is not singular.

14.53 Exercises

Exercise 14.1 For a scalar time series Yt define the sample autocovariance and autocorrelation

$$\widehat{\gamma}(k)=n^{-1}\sum_{t=k+1}^{n}\left(Y_t-\bar{Y}\right)\left(Y_{t-k}-\bar{Y}\right), \qquad
\widehat{\rho}(k)=\frac{\widehat{\gamma}(k)}{\widehat{\gamma}(0)}=\frac{\sum_{t=k+1}^{n}\left(Y_t-\bar{Y}\right)\left(Y_{t-k}-\bar{Y}\right)}{\sum_{t=1}^{n}\left(Y_t-\bar{Y}\right)^2}.$$

Assume the series is strictly stationary and ergodic with $E[Y_t^2]<\infty$.

Show that $\widehat{\gamma}(k)\to_p\gamma(k)$ and $\widehat{\rho}(k)\to_p\rho(k)$ as $n\to\infty$. (Use the Ergodic Theorem.)

Exercise 14.2 Show that if $(e_t,\mathcal{F}_t)$ is a MDS and $X_t$ is $\mathcal{F}_t$-measurable then $u_t=X_{t-1}e_t$ is a MDS.

Exercise 14.3 Let $\sigma_t^2=E[e_t^2\mid\mathcal{F}_{t-1}]$. Show that $u_t=e_t^2-\sigma_t^2$ is a MDS.

Exercise 14.4 Continuing the previous exercise, show that if $E[e_t^4]<\infty$ then

$$n^{-1/2}\sum_{t=1}^{n}\left(e_t^2-\sigma_t^2\right)\to_d\mathrm{N}(0,v^2).$$

Express v2 in terms of the moments of et.

Exercise 14.5 A stochastic volatility model is

$$Y_t=\sigma_te_t, \qquad \log\sigma_t^2=\omega+\beta\log\sigma_{t-1}^2+u_t$$

where et and ut are independent i.i.d. N(0,1) shocks.

  1. Write down an information set for which Yt is a MDS.

  2. Show that if |β|<1 then Yt is strictly stationary and ergodic.

Exercise 14.6 Verify the formula ρ(1)=θ/(1+θ2) for a MA(1) process.

Exercise 14.7 Verify the formula $\rho(k)=\left(\sum_{j=0}^{q-k}\theta_{j+k}\theta_j\right)/\left(\sum_{j=0}^{q}\theta_j^2\right)$ for a MA(q) process.

Exercise 14.8 Suppose Yt=Yt1+et with et i.i.d. (0,1) and Y0=0. Find var [Yt]. Is Yt stationary?

Exercise 14.9 Take the AR(1) model with no intercept Yt=α1Yt1+et.

  1. Find the impulse response function $b_j=\dfrac{\partial}{\partial e_t}Y_{t+j}$.

  2. Let $\widehat{\alpha}_1$ be the least squares estimator of $\alpha_1$. Find an estimator of $b_j$.

  3. Let $s(\widehat{\alpha}_1)$ be a standard error for $\widehat{\alpha}_1$. Use the delta method to find a 95% asymptotic confidence interval for $b_j$.

Exercise 14.10 Take the AR(2) model $Y_t=\alpha_1Y_{t-1}+\alpha_2Y_{t-2}+e_t$.

  1. Find expressions for the impulse responses $b_1$, $b_2$, $b_3$, and $b_4$.

  2. Let $(\widehat{\alpha}_1,\widehat{\alpha}_2)$ be the least squares estimator. Find an estimator of $b_2$.

  3. Let $\widehat{\boldsymbol{V}}$ be the estimated covariance matrix for the coefficients. Use the delta method to find a 95% asymptotic confidence interval for $b_2$.

Exercise 14.11 Show that the models

$$\alpha(L)Y_t=\alpha_0+e_t$$

and

$$Y_t=\mu+u_t, \qquad \alpha(L)u_t=e_t$$

are identical. Find an expression for μ in terms of α0 and α(L).

Exercise 14.12 Take the model

$$\alpha(L)Y_t=u_t, \qquad \beta(L)u_t=e_t$$

where $\alpha(L)$ and $\beta(L)$ are lag polynomials of order $p$ and $q$, respectively. Show that these equations imply that

$$\gamma(L)Y_t=e_t$$

for some lag polynomial $\gamma(L)$. What is the order of $\gamma(L)$?

Exercise 14.13 Suppose that Yt=et+ut+θut1 where ut and et are mutually independent i.i.d. (0,1) processes.

  1. Show that Yt is a MA(1) process Yt=ηt+ψηt1 for a white noise error ηt.

Hint: Calculate the autocorrelation function of Yt.

  2. Find an expression for $\psi$ in terms of $\theta$.

  3. Suppose $\theta=1$. Find $\psi$.

Exercise 14.14 Suppose that

$$Y_t=X_t+e_t, \qquad X_t=\alpha X_{t-1}+u_t$$

where the errors et and ut are mutually independent i.i.d. processes. Show that Yt is an ARMA(1,1) process.

Exercise 14.15 A Gaussian AR model is an autoregression with i.i.d. N(0,σ2) errors. Consider the Gaussian AR(1) model

$$Y_t=\alpha_0+\alpha_1Y_{t-1}+e_t, \qquad e_t\sim\mathrm{N}(0,\sigma^2)$$

with $|\alpha_1|<1$. Show that the marginal distribution of $Y_t$ is also normal:

$$Y_t\sim\mathrm{N}\left(\frac{\alpha_0}{1-\alpha_1},\frac{\sigma^2}{1-\alpha_1^2}\right).$$

Hint: Use the MA representation of $Y_t$.

Exercise 14.16 Assume that $Y_t$ is a Gaussian AR(1) as in the previous exercise. Calculate the moments

$$\mu=E[Y_t], \qquad \sigma_Y^2=E\left[(Y_t-\mu)^2\right], \qquad \kappa=E\left[(Y_t-\mu)^4\right].$$

A colleague suggests estimating the parameters (α0,α1,σ2) of the Gaussian AR(1) model by GMM applied to the corresponding sample moments. He points out that there are three moments and three parameters, so it should be identified. Can you find a flaw in his approach?

Hint: This is subtle.

Exercise 14.17 Take the nonlinear process

$$Y_t=Y_{t-1}^{\alpha}u_t^{1-\alpha}$$

where ut is i.i.d. with strictly positive support.

  1. Find the condition under which Yt is strictly stationary and ergodic.

  2. Find an explicit expression for $Y_t$ as a function of $(u_t,u_{t-1},\ldots)$.

Exercise 14.18 Take the quarterly series pnfix (nonresidential real private fixed investment) from FRED-QD.

  1. Transform the series into quarterly growth rates.

  2. Estimate an AR(4) model. Report using heteroskedasticity-consistent standard errors.

  3. Repeat using Newey-West standard errors with M=5.

  4. Comment on the magnitude and interpretation of the coefficients.

  5. Calculate (numerically) the impulse responses for $j=1,\ldots,10$.

Exercise 14.19 Take the quarterly series oilpricex (real price of crude oil) from FRED-QD.

  1. Transform the series by taking first differences.

  2. Estimate an AR(4) model. Report using heteroskedasticity-consistent standard errors.

  3. Test the hypothesis that the real oil price is a random walk by testing that the four AR coefficients jointly equal zero.

  4. Interpret the coefficient estimates and test result.

Exercise 14.20 Take the monthly series unrate (unemployment rate) from FRED-MD.

  1. Estimate AR(1) through AR(8) models, using the sample starting in 1960 m1 so that all models use the same observations.

  2. Compute the AIC for each AR model and report.

  3. Which AR model has the lowest AIC?

  4. Report the coefficient estimates and standard errors for the selected model.

Exercise 14.21 Take the quarterly series unrate (unemployment rate) and claimsx (initial claims) from FRED-QD. “Initial claims” are the number of individuals who file for unemployment insurance.

  1. Estimate a distributed lag regression of the unemployment rate on initial claims. Use lags 1 through 4. Which standard error method is appropriate?

  2. Estimate an autoregressive distributed lag regression of the unemployment rate on initial claims. Use lags 1 through 4 for both variables.

  3. Test the hypothesis that initial claims does not Granger cause the unemployment rate.

  4. Interpret your results.

Exercise 14.22 Take the quarterly series gdpc1 (real GDP) and houst (housing starts) from FRED-QD. “Housing starts” are the number of new houses on which construction is started.

  1. Transform the real GDP series into its one quarter growth rate.

  2. Estimate a distributed lag regression of GDP growth on housing starts. Use lags 1 through 4. Which standard error method is appropriate?

  3. Estimate an autoregressive distributed lag regression of GDP growth on housing starts. Use lags 1 through 2 for GDP growth and 1 through 4 for housing starts.

  4. Test the hypothesis that housing starts does not Granger cause GDP growth.

  5. Interpret your results.