Univariate I: Classical Decomposition Approach to Univariate Time Series

What is this all about?

Suppose given a univariate time series data \(x_t\), \(t=1,\ldots,T\), where \(t\) refers to time and \(T\) to the sample size. We would like to have a statistical/probability model \(\{X_t\}_{t\in \mathbb{Z}}\) for the series, where \(X_t\)’s are random variables. Having a statistical model, we can use it for forecasting (prediction), which is one of the basic tasks of time series analysis.

Note: We shall focus on the situations where there is some dependence across times \(t\). Typically, there should be some “physical” reason for temporal dependence. Examples?

This lecture is about a classical approach to time series modeling based on the decomposition (possibly after a preliminary transformation): \[ X_t = T_t + S_t + Y_t,\quad t\in \mathbb{Z}, \] where \(\{T_t\}_{t\in\mathbb{Z}}\) is a (typically deterministic) model for trend, \(\{S_t\}_{t\in\mathbb{Z}}\) is a (typically deterministic periodic) model for seasonal variations, and \(\{Y_t\}_{t\in\mathbb{Z}}\) is a stationary time series model.

Motivating example: Lake Huron data

Annual measurements of the level, in feet, of Lake Huron 1875–1972.

data(LakeHuron)
ts <- LakeHuron
ts

## Time Series:
## Start = 1875 
## End = 1972 
## Frequency = 1 
##  [1] 580.38 581.86 580.97 580.80 579.79 580.39 580.42 580.82 581.40 581.32
## [11] 581.44 581.68 581.17 580.53 580.01 579.91 579.14 579.16 579.55 579.67
## [21] 578.44 578.24 579.10 579.09 579.35 578.82 579.32 579.01 579.00 579.80
## [31] 579.83 579.72 579.89 580.01 579.37 578.69 578.19 578.67 579.55 578.92
## [41] 578.09 579.37 580.13 580.14 579.51 579.24 578.66 578.86 578.05 577.79
## [51] 576.75 576.75 577.82 578.64 580.58 579.48 577.38 576.90 576.94 576.24
## [61] 576.84 576.85 576.90 577.79 578.18 577.51 577.23 578.42 579.61 579.05
## [71] 579.26 579.22 579.38 579.10 577.95 578.12 579.75 580.85 580.41 579.96
## [81] 579.61 578.76 578.18 577.21 577.13 579.10 578.25 577.91 576.89 575.96
## [91] 576.80 577.68 578.38 578.52 579.74 579.31 579.89 579.96

par(mfrow = c(1, 2)) 
plot.ts(ts,ylab = 'Depth in Feet',xlab='Year')
lts <- length(ts)
plot(y=ts[2:lts],x=ts[1:(lts-1)],ylab = 'Depth in feet',xlab='Previous Depth in Feet')

cor(ts[2:lts],ts[1:(lts-1)])

## [1] 0.8388905

A possibility is also to remove a linear trend.

years <- time(LakeHuron)
lm.LakeHuron <- lm(LakeHuron ~ years)
summary(lm.LakeHuron)

## 
## Call:
## lm(formula = LakeHuron ~ years)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.50997 -0.72726  0.00083  0.74402  2.53565 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 625.554918   7.764293  80.568  < 2e-16 ***
## years        -0.024201   0.004036  -5.996 3.55e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.13 on 96 degrees of freedom
## Multiple R-squared:  0.2725, Adjusted R-squared:  0.2649 
## F-statistic: 35.95 on 1 and 96 DF,  p-value: 3.545e-08

ts2 <- resid(lm.LakeHuron)
ts2 <- ts(ts2,start=c(1875),end=c(1972),frequency=1)

par(mfrow = c(1, 2)) 
plot(ts2,ylab = 'Feet',xlab='Year')
lts2 <- length(ts2)
plot(y=ts2[2:lts2],x=ts2[1:(lts2-1)],ylab = 'Depth in feet',xlab='Previous Depth in Feet')

cor(ts2[2:lts2],ts2[1:(lts2-1)])

## [1] 0.7763291

lm.LakeHuron2 <- lm(ts2[2:lts2] ~ ts2[1:(lts2-1)])
summary(lm.LakeHuron2)

## 
## Call:
## lm(formula = ts2[2:lts2] ~ ts2[1:(lts2 - 1)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.95881 -0.49932  0.00171  0.41780  1.89561 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.01529    0.07272    0.21    0.834    
## ts2[1:(lts2 - 1)]  0.79112    0.06590   12.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7161 on 95 degrees of freedom
## Multiple R-squared:  0.6027, Adjusted R-squared:  0.5985 
## F-statistic: 144.1 on 1 and 95 DF,  p-value: < 2.2e-16

ts3 <- resid(lm.LakeHuron2)

par(mfrow = c(1, 2)) 
plot(ts3, type="l")
lts3 <- length(ts3)
plot(y=ts3[2:lts3],x=ts3[1:(lts3-1)],ylab = 'ts3 current',xlab='ts3 previous')

cor(ts3[2:lts3],ts3[1:(lts3-1)])

## [1] 0.2183583

Possible conclusion:

Use the model \(X_t = a + \phi X_{t-1}+Z_t\) or \(X_t = a + bt + Y_t\) with \(Y_t =\phi Y_{t-1}+Z_t\), where \(Z_t\)’s are uncorrelated across time \(t\)?

Data/Theory goals of the lecture

In fact, after fitting a linear trend \(X_t = a + bt + Y_t\), we will choose the model \(Y_t =\phi_1 Y_{t-1}+\phi_2 Y_{t-2}+Z_t\) for the residuals and use it in forecasting to arrive at the following forecasting plot:

By the end of the lecture, you should understand how the model is fitted and where such a forecasting plot comes from. On the way, a number of important notions of time series analysis, such as stationarity, some stationary models, ACF, partial ACF, and others, will also be introduced and discussed.

Note: Though it may sometimes appear “simple”, the material of this lecture is most important. For example, it lays foundation for whatever will be done later in the course. Even when dealing with multivariate or high-dimensional or nonlinear or … time series, the univariate series modeling based on the introduced models is always the baseline that is to be improved upon. The lecture will also give you an idea of the mathematical level of the course.

General discussion on classical decomposition

Estimating/Removing trends and seasonality

Several methods:

Fitting a trend (e.g. linear)
Moving average/Filtering (smoothing)
Differencing (later)

In particular, a moving average estimate of e.g. a trend is defined as: \[ \widehat T_t = \sum_{j=-\infty}^\infty a_j x_{t+j} = \ldots + a_{-2} x_{t-2} + a_{-1} x_{t-1} + a_0 x_t + a_{1} x_{t+1} + a_{2} x_{t+2} + \ldots \] with a filter \(a = (a_j)\). Often, \(a_j = a_{-j}\) (symmetric) and \(a_j=0\) for \(|j|\geq J\) (finite). (An issue is what to do at the “boundary”.) Another basic requirement: \(\sum_j a_j = 1\) so that \(\sum_j a_j T_{t+j} = T_t\) for \(T_t = m\). Downside: How does one forecast such trend into the future?

For example: Spencer’s \(15\)-point moving average filter: \[ (a_0\ a_1\ldots a_7) = \frac{1}{320} (74\ 67\ 46\ 21\ 3\ -5\ -6\ -3),\quad a_j = a_{-j}, \quad a_j = 0\ \ \mbox{for}\ \ j>7. \] It reproduces polynomials of order \(3\), that is, \(\sum_j a_j T_{t+j} = T_t\) for all \(T_t = a+bt+ct^2+dt^3\). Creating filters with “good” properties is almost a science in itself, and there are a number of well-known filters for various purposes (e.g. Hodrick-Prescott, Baxter-King, etc).

In the absence of trend, a seasonal component is estimated as: for \(t=1,\ldots,s\), and supposing \(T = ms\), \[ \widehat S_t = \frac{x_t + x_{t+s} + x_{t+2s}+\ldots+x_{t+(m-1)s}}{m} \] and for \(t>s\), is extended periodically by using \(\widehat S_t = \widehat S_{t-s}\). That is, it is estimated as the seasonal mean.

In the presence of trend and seasonal component, a preliminary estimate of trend is often taken as \[ \widehat T_t = \left\{ \begin{array}{cl} \frac{1}{s}(0.5x_{t-\frac{s}{2}}+x_{t-\frac{s}{2}+1}+\ldots + x_{t+\frac{s}{2}-1}+ 0.5x_{t+\frac{s}{2}}), & \mbox{if}\ s\ \mbox{is even},\\ \frac{1}{s}(x_{t-\frac{s-1}{2}}+x_{t-\frac{s-1}{2}+1}+\ldots + x_{t+\frac{s-1}{2}-1}+ x_{t+\frac{s-1}{2}}), & \mbox{if}\ s\ \mbox{is odd} \end{array} \right. \] The preliminary trend is removed and a seasonal component is estimated. The seasonal component is then removed and a trend is reestimated e.g. through some other method. This is the idea of the so-called classical decomposition method.

Remarks

Filtering/moving average is one of the most basic and important operations in time series analysis. E.g. Application of two filters \(a\) and \(b\) sequentially is the same as applying one filter which is the convolution of the two filters \(a*b\), defined as \((a*b)_k = \sum_j a_j b_{k-j}\).
Depending on the application, the trend or seasonal component may drive much of the variability, and hence be the only components of practical interest. Thus, the science of creating filters should not be surprising.
Filtering and various filters are best understood in the so-called spectral domain. Spectral perspective of time series will be discussed in another lecture.
Different methods will lead to different models. A particular model may be preferred because: it is (relatively) simpler, has a more conventional fit (e.g. the residuals show normality), or provide a better out-of-sample forecasting performance.

Stationary time series models

After removing trend and seasonal components (if necessary), much of the focus of the univariate time series analysis is on modeling stationary time series.

Definition 1.1: A time series (model) \(X = \{X_t\}_{t\in\mathbb{Z}}\) is called stationary if \(EX_t^2<\infty\) for all \(t\in\mathbb{Z}\), \[ E X_t = m,\quad \mbox{for all}\ t\in\mathbb{Z}, \] that is, it has a constant mean, and \[ \mbox{Cov}(X_t,X_{t+h}) = E(X_t - EX_t)(X_{t+h}-EX_{t+h}) = E X_t X_{t+h} - m^2 = \gamma_X(h),\quad \mbox{for all}\ t,h\in\mathbb{Z}, \] for a function \(\gamma_X\), called the autocovariance function (ACVF) of \(X\), that is, its covariance between \(X_t\) and \(X_{t+h}\) depends only on the lag \(h\).

Note: This is the notion of weak stationarity. It is implied by the so-called strict stationarity, with the converse implication holding in general only for Gaussian series.

A notion related to ACVF is that of autocorrelation function (ACF) of a stationary series \(X\) defined as \[ \mbox{Corr}(X_t,X_{t+h}) = \frac{\mbox{Cov}(X_t,X_{t+h})}{\sqrt{\mbox{Var}(X_t)\mbox{Var}(X_{t+h})}} = \frac{\gamma_X(h)}{\gamma_X(0)} =: \rho_X(h),\quad \mbox{for all}\ t,h\in\mathbb{Z}, \] The ACF \(\rho_X(h)\) measures the correlation in time series \(X\) at lag \(h\). It satisfies \(\rho_X(0)=1\), \(|\rho_X(h)|\leq 1\), \(\rho_X(-h)=\rho_X(h)\) and it is non-negative definite.

Sample quantities:

The sample ACVF of \(x_1,\ldots,x_T\) is defined as \[ \widehat \gamma(h) = \frac{1}{T} \sum_{t=1}^{T-h} (x_{t+h} - \overline x) (x_t - \overline x) \] for \(h=0,1,\ldots,T-1\), and \(\widehat \gamma(h) = \widehat \gamma(-h)\) for \(h=-(T-1),\ldots,-1\), where \(\overline x = \frac{1}{T}\sum_{t=1}^T x_t\) is the sample mean. The sample ACF of \(x_1,\ldots,x_T\) is defined as \[ \widehat \rho(h) = \frac{\widehat \gamma(h)}{\widehat \gamma(0)},\quad h = -(T-1),\ldots,T-1. \] The plot of \(\widehat \rho(h)\) against \(h=0,\ldots,M\) is called a correlogram, and is used as a graphical tool in choosing a model.

Example 1.1 (White noise): A stationary series \(Z=\{Z_t\}_{t\in\mathbb{Z}}\) with \(E Z_t = 0\) and \[ \gamma_Z(h) = \left\{ \begin{array}{cl} EZ_t^2 = \sigma^2_Z, & h=0,\\ 0, & h\neq 0, \end{array} \right. \] is called white noise. The notation is \(\{Z_t\}\sim \mbox{WN}(0,\sigma^2_Z)\). Any series of iid random variables with finite variance is white noise, in which case one writes \(\{Z_t\}\sim \mbox{IID}(0,\sigma^2_Z)\).

wn1 <- rnorm(100, mean=0, sd=1)
par(mfrow = c(1, 2)) 
plot(wn1, type="l")
acf(wn1, lag.max=20, ylim=c(-1, 1))

wn2 <- rpois(100,lambda=1)
par(mfrow = c(1, 2)) 
plot(wn2, type="l")
acf(wn2, lag.max=20, ylim=c(-1, 1))

Example 1.2 (Autoregressive series): A stationary series \(X=\{X_t\}_{t\in\mathbb{Z}}\) with \(E X_t = 0\) is called autoregressive of order \(p\) (AR(\(p\)), for short) if it satisfies the AR equation: \[ X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \ldots + \phi_p X_{t-p} + Z_t, \] where \(\{Z_t\}\sim \mbox{WN}(0,\sigma^2_Z)\). The AR model can be conveniently written using the backward shift operator \(B\), defined as \[ B^k X_t = X_{t-k},\quad t,k\in \mathbb{Z}. \] The AR equation then reads \[ \phi(B) X_t = Z_t, \] where \(\phi(z) = 1 - \phi_1 z - \ldots - \phi_p z^p\) is referred to as the AR polynomial.

Much is known about AR series. For example:

The AR equation has a unique stationary solution if and only if \(\phi(z)\neq 0\) for \(|z|=1\) (that is, the polynomial \(\phi(z)\) has no roots on the unit circle). Moreover, the stationary solution is “causal” in terms of \(\{Z_t\}\) and identifiable when \(\phi(z)\neq 0\) for \(|z|<1\).
There are methods to compute explicitly its ACVF and ACF. For example, for AR\((1)\) series, \[ \gamma_X(h) = \frac{\sigma^2_Z \phi_1^{|h|}}{1 - \phi_1^2},\quad \rho_X(h) = \phi_1^{|h|}. \] For AR\((2)\) series, when \(\xi_1 = re^{i\theta}\) and \(\xi_2 = re^{-i\theta}\) are two complex roots of the polynomial \(\phi(z) = 1 - \phi_1 z - \phi_2 z^2\) with \(r>1\) and \(0<\theta<\pi\), \[ \gamma_X(h) = C_1(r,\theta) r^{-|h|} \sin\Big(\theta |h| + C_2(r,\theta)\Big). \]
Quite different behaviors could be captured already using AR(\(1\)) and AR(\(2\)) series. AR(\(1\)) with \(0<\phi_1<1\): positively correlated time series; AR(\(1\)) with \(-1<\phi_1<0\): negatively correlated time series; AR(\(2\)) series with two complex roots as above: series exhibiting cyclical behavior.

Remark: AR\((p)\) series are special cases of the more general ARMA\((p,q)\) time series satisfying the equation \[ X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \ldots + \phi_p X_{t-p} + Z_t + \theta_1 Z_{t-1}+...+\theta_q Z_{t-q}. \] (AR model is obtained when \(q=0\).) But AR series are flexible enough to capture any correlation structure. They are also ARIMA\((p,d,q)\), defined later, with \(d=0\).

ar1_1 <- arima.sim(list(order = c(1,0,0), ar = 0.7), n = 100)
par(mfrow = c(1, 2)) 
plot(ar1_1, type="l")
acf(ar1_1, lag.max=20, ylim=c(-1, 1))

ar1_2 <- arima.sim(list(order = c(1,0,0), ar = -0.7), n = 100)
par(mfrow = c(1, 2)) 
plot(ar1_2, type="l")
acf(ar1_2, lag.max=20, ylim=c(-1, 1))

# r = 1.1, theta = pi/8 so that phi_1 = xi_1^(-1)+ xi_2^(-1) = 2*(1/r)*cos(theta), phi_2 = -xi_1^(-1)xi_2^(-1) = - 1/r^2

phi1 <- 2*(1/1.1)*cos(pi/8)
phi2 <- -1/1.1^2
ar2 <- arima.sim(list(order = c(2,0,0), ar = c(phi1,phi2) ), n = 200)
par(mfrow = c(1, 2)) 
plot(ar2, type="l")
acf(ar2, lag.max=40, ylim=c(-1, 1))

Fitting AR models and choosing their order

There are a number of ways to fit a stationary parametric model to a time series: e.g. Gaussian ML estimation through a fast calculation of the likelihood; (non-)linear least squares; moment estimation; and others. For AR model with fixed \(p\), all these methods essentially correspond to OLS estimation. For more general parametric families, the methods are not equivalent asymptotically and ML is most efficient.

ar1.model <- arima(ts2, order=c(1,0,0), include.mean = FALSE, method = "ML")
ar1.model

## 
## Call:
## arima(x = ts2, order = c(1, 0, 0), include.mean = FALSE, method = "ML")
## 
## Coefficients:
##          ar1
##       0.7826
## s.e.  0.0635
## 
## sigma^2 estimated as 0.4975:  log likelihood = -105.32,  aic = 214.65

ar1.model <- arima(ts2, order=c(1,0,0), include.mean = FALSE, method = "CSS")
ar1.model

## 
## Call:
## arima(x = ts2, order = c(1, 0, 0), include.mean = FALSE, method = "CSS")
## 
## Coefficients:
##          ar1
##       0.7909
## s.e.  0.0649
## 
## sigma^2 estimated as 0.5024:  part log likelihood = -105.33

lm.LakeHuron2 <- lm(ts2[2:lts2] ~ ts2[1:(lts2-1)] -1)
summary(lm.LakeHuron2)

## 
## Call:
## lm(formula = ts2[2:lts2] ~ ts2[1:(lts2 - 1)] - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.94335 -0.48386  0.01758  0.43251  1.91083 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## ts2[1:(lts2 - 1)]  0.79084    0.06556   12.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7125 on 96 degrees of freedom
## Multiple R-squared:  0.6025, Adjusted R-squared:  0.5984 
## F-statistic: 145.5 on 1 and 96 DF,  p-value: < 2.2e-16

ar2.model <- arima(ts2, order=c(2,0,0), include.mean = FALSE, method = "ML")
ar2.model

## 
## Call:
## arima(x = ts2, order = c(2, 0, 0), include.mean = FALSE, method = "ML")
## 
## Coefficients:
##          ar1      ar2
##       1.0050  -0.2925
## s.e.  0.0976   0.1002
## 
## sigma^2 estimated as 0.4572:  log likelihood = -101.26,  aic = 208.51

ar3.model <- arima(ts2, order=c(3,0,0), include.mean = FALSE, method = "ML")
ar3.model

## 
## Call:
## arima(x = ts2, order = c(3, 0, 0), include.mean = FALSE, method = "ML")
## 
## Coefficients:
##          ar1      ar2     ar3
##       1.0240  -0.3566  0.0639
## s.e.  0.1023   0.1451  0.1047
## 
## sigma^2 estimated as 0.4554:  log likelihood = -101.07,  aic = 210.14

arma11.model <- arima(ts2, order=c(1,0,1), include.mean = FALSE, method = "ML")
arma11.model

## 
## Call:
## arima(x = ts2, order = c(1, 0, 1), include.mean = FALSE, method = "ML")
## 
## Coefficients:
##          ar1     ma1
##       0.6513  0.3577
## s.e.  0.0945  0.1148
## 
## sigma^2 estimated as 0.4573:  log likelihood = -101.27,  aic = 208.53

Confidence intervals: How are the s.e. and confidence intervals computed? What is different from the usual regression?

Order selection:

The order \(p\) of AR model (more generally, the orders \(p,q\) of ARMA model) can be chosen through:

Using some information criterion (e.g. AIC, BIC, etc)
Partial ACF
Testing, e.g. likelihood ratio test
Cross validation

Using an information criterion: An information criterion chooses \(p\) that minimizes \[ - 2 \log L + r(T,p), \] where \(L\) is the likelihood (evaluated at ML estimates) and \(r(T,p)\) is a positive penalty depending on the sample size \(T\) and the order \(p\) (the larger the order/the number of parameters, the larger the penalty; likewise, the smaller the negative log-likelihood). The following penalties correspond to the listed criteria: \[ \begin{array}{ccl} T \log \frac{T+p}{T-p} & : & \mbox{FPE} \\ 2(p+1) & : & \mbox{AIC}\\ \frac{2(p+1)T}{T-p-2} & : & \mbox{AICC}\\ \log T (p+1) & : & \mbox{BIC} \end{array} \]

Notes:

One has \(- 2 \log L = C + T\log \widehat \sigma^2\), where one can think of \(\widehat \sigma^2\) as the sum of the residuals squared and \(C\) depends only on the sample size \(T\).
AIC is known to have a tendency to choose larger \(p\) (overfit the model).

library(forecast)

auto.arima(ts2,max.p=5,max.q=5,ic="aic",allowmean = FALSE)

## Series: ts2 
## ARIMA(1,0,1) with zero mean     
## 
## Coefficients:
##          ar1     ma1
##       0.6513  0.3577
## s.e.  0.0945  0.1148
## 
## sigma^2 estimated as 0.4668:  log likelihood=-101.27
## AIC=208.53   AICc=208.79   BIC=216.29

auto.arima(ts2,max.p=5,max.q=5,ic="bic",allowmean  = FALSE)

## Series: ts2 
## ARIMA(2,0,0) with zero mean     
## 
## Coefficients:
##          ar1      ar2
##       1.0050  -0.2925
## s.e.  0.0976   0.1002
## 
## sigma^2 estimated as 0.4667:  log likelihood=-101.26
## AIC=208.51   AICc=208.77   BIC=216.27

auto.arima(ts2,max.p=5,max.q=0,ic="aic",allowmean  = FALSE)

## Series: ts2 
## ARIMA(2,0,0) with zero mean     
## 
## Coefficients:
##          ar1      ar2
##       1.0050  -0.2925
## s.e.  0.0976   0.1002
## 
## sigma^2 estimated as 0.4667:  log likelihood=-101.26
## AIC=208.51   AICc=208.77   BIC=216.27

auto.arima(ts2,max.p=5,max.q=0,ic="bic",allowmean  = FALSE)

## Series: ts2 
## ARIMA(2,0,0) with zero mean     
## 
## Coefficients:
##          ar1      ar2
##       1.0050  -0.2925
## s.e.  0.0976   0.1002
## 
## sigma^2 estimated as 0.4667:  log likelihood=-101.26
## AIC=208.51   AICc=208.77   BIC=216.27

Using partial autocorrelation function:

Definition 1.2: The partial ACF (PACF) of \(X = \{X_t\}_{t\in\mathbb{Z}}\) is defined as \[ \alpha_X(0) = 1,\quad \alpha_X(h) = \phi_{hh},\quad h\geq 1, \] where \(\phi_{hh}\) is the coefficient at \(X_1\) in a linear predictor of \(X_{h+1}\), that is, \[ \widehat X_{h+1} = \phi_{h0} + \phi_{h1} X_{h}+\ldots + \phi_{hh} X_{1}. \]

The sample PACF \(\widehat \alpha(h)\), \(h\geq 1\), is defined as the coefficient \(\widehat \phi_{hh}\) in the regression of \(x_t\) on \(x_{t-1},\ldots,x_{t-h}\), that is, \[ x_t = \widehat \phi_{h0} + \widehat \phi_{h1} x_{t-1}+\ldots + \widehat \phi_{hh} x_{t-h} + \widehat z_t. \]

Note: There are other equivalent definitions, e.g. using the Frish-Waugh-Lovell theorem.

Note: For AR\((p)\) series, it is known that \[ \alpha_X(h) = \left\{ \begin{array}{cl} 1, & h=0,\\ \rho_X(1), & h = 1,\\ ..., & h=2,\ldots,p-1,\\ \phi_p, & h = p,\\ 0, & h\geq p+1. \end{array} \right. \]

par(mfrow = c(1, 2)) 
plot(ts2,ylab = 'Feet',xlab='Year')
pacf(ts2, lag.max=20, ylim=c(-1, 1))

Testing (e.g. likelihood ratio): Use the fact that \[ -2 \log L(p) + 2 \log L(p+1) = - 2 \log \Big( \frac{L(p)}{L(p+1)} \Big) \sim \chi^2(1), \]

under the hypothesis that the reduced AR\((p)\) model should be preferred.

qchisq(.95, df=1)

## [1] 3.841459

qchisq(.975, df=1)

## [1] 5.023886

qchisq(.999, df=1)

## [1] 10.82757

ar1.model <- arima(ts2, order=c(1,0,0), include.mean = FALSE, method = "ML")
ar2.model <- arima(ts2, order=c(2,0,0), include.mean = FALSE, method = "ML")
ar3.model <- arima(ts2, order=c(3,0,0), include.mean = FALSE, method = "ML")

t12 <- -2*ar1.model$loglik + 2*ar2.model$loglik
t12

## [1] 8.136994

t23 <- -2*ar2.model$loglik + 2*ar3.model$loglik
t23

## [1] 0.3712609

Using cross validation: Write an AR\((p)\) model as a linear regression \(Y_t = \mathbf{\Phi}(p)' \mathbf{X}_t(p) + \epsilon_t\) with \(Y_t = X_t\) and \(\mathbf{\Phi(p)} = (\phi_1 \ \phi_2\ \ldots \phi_p)'\) and \(\mathbf{X}_t(p) = (X_{t-1}\ X_{t-2}\ldots X_{t-p})'\). Choose \(p\) as \[ \mathop{\rm argmin}_{p} \mbox{CV}(p), \] where CV\((p)\) is the \(k\)-fold cross validation error for the AR(\(p\)) model. What does the latter mean? (Typical choice is \(k=10\).)

Model diagnostics

As with any regression model, a model fit should be examined through the model diagnostics.

resid.ar2 <- resid(ar2.model)
plot(resid.ar2,ylab = 'Residuals',xlab='Year')

par(mfrow = c(1, 2)) 
acf(resid.ar2, lag.max=20, ylim=c(-1, 1))
pacf(resid.ar2, lag.max=20, ylim=c(-1, 1))

qqnorm(resid.ar2, main="")
qqline(resid.ar2)

Box.test(resid.ar2, lag=32, type="Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  resid.ar2
## X-squared = 15.897, df = 32, p-value = 0.9922

Forecasting

How does one forecast based on an AR\((p)\) model? What is the forecast error?

h = 5
ts2.forecast <- predict(ar2.model, h)
round(ts2.forecast$pred, 3)

## Time Series:
## Start = 1973 
## End = 1977 
## Frequency = 1 
## [1] 1.545 0.930 0.483 0.213 0.073

round(ts2.forecast$se, 3)

## Time Series:
## Start = 1973 
## End = 1977 
## Frequency = 1 
## [1] 0.676 0.959 1.074 1.113 1.122

plus.or.minus <- 1.96 * ts2.forecast$se
lower <- ts2.forecast$pred - plus.or.minus
upper <- ts2.forecast$pred + plus.or.minus

fyears = (1973:1977)

plot(ts2, xlim=c(1875, 1977))
lines(fyears, ts2.forecast$pred, lwd=2)
lines(fyears, lower, lty=2, lwd=2)
lines(fyears, upper, lty=2, lwd=2)
title("Forecasts for the detrended time series")

trend.forecast <- lm.LakeHuron$coefficients[1] + lm.LakeHuron$coefficients[2]*fyears
ts.forecast.pred <- trend.forecast + ts2.forecast$pred 
lower = ts.forecast.pred - plus.or.minus
upper = ts.forecast.pred + plus.or.minus

plot(ts, xlim=c(1875, 1977))
lines(fyears, ts.forecast.pred, lwd=2)
lines(fyears, lower, lty=2, lwd=2)
lines(fyears, upper, lty=2, lwd=2)
title("Forecasts for the time series")

Some final notes

Preliminary transforms: The Box-Cox transformation of \(x_t\): \[ f_\lambda(x_t) = \left\{ \begin{array}{cl} \frac{x_t^\lambda - 1}{\lambda}, & x_t>0,\ \lambda>0,\\ \log x_t, & x_t>0,\ \lambda=0. \end{array} \right. \] It can also be applied to \(x_t+C\), especially when needing to have \(x_t+C>0\). For example: Quarterly earnings (dollars) per Johnson and Johnson share 1960–80:

data("JohnsonJohnson")
par(mfrow = c(1, 2)) 
plot.ts(JohnsonJohnson)
plot.ts(log(JohnsonJohnson))

The approach to time series described above was popularized by George Box and Gwilym Jenkins in the 1970’s; see e.g. Box and Jenkins (1970) (the most recent edition is (Box et al. 2016)). It is now the basis of all introductory books on time series analysis, including:
- Brockwell and Davis (2016). The more advanced, graduate version is Brockwell and Davis (2006).
- Shumway and Stoffer (2011). It has nice accompanying examples and code in R.
- Cowpertwait and Metcalfe (2009); Montgomery et al. (2008); and many others (in particular, in applied fields).
Various R packages for time series analysis are discussed in McLeod et al. (2011). A number of packages will be used as needed in the future lectures.
Many packages also allow to include in AR models other exogenous variables. E.g. the arima function used above allows this.

Questions/Homework

Q1: Apply the methodology outlined above to obtain a classical decomposition model and use it in forecasting for some real time series.

Q2: What is the idea of some of the model selection criteria, and why do they work for time series models?

Q3: How does one set confidence intervals for the mean \(\mu = E X_t\), the autocovariance \(\gamma(h)\) (including the variance when \(h=0\)) and the autocorrelation \(\rho(h)\) for fixed \(h\)?

Q4: Explain the various estimation methods “ML”, “CSS”, etc, referred to above. Are there any numerical pitfalls in performing estimation?

Q5: In what sense are AR models flexible enough to capture almost any correlation structure in stationary time series?

Q6: Discuss examples of parametric stationary models where different estimation methods (ML, moment estimation, etc.) have different asymptotic efficiency.

References

Box, G. E. P., and G. M. Jenkins. 1970. Times Series Analysis. Forecasting and Control. Holden-Day, San Francisco, Calif.-London-Amsterdam.

Box, G. E. P., G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. 2016. Time Series Analysis. Fifth. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ.

Brockwell, P. J., and R. A. Davis. 2006. Time Series: Theory and Methods. Springer Series in Statistics. Springer, New York.

———. 2016. Introduction to Time Series and Forecasting. Third. Springer Texts in Statistics. Springer.

Cowpertwait, P. S. P., and A. V. Metcalfe. 2009. Introductory Time Series with R. Springer.

McLeod, A. I., H. Yu, and E. Mahdi. 2011. “Time Series Analysis with R.” Handbook of Statistics 30. Elsevier: 78–93.

Montgomery, D. C., C. L. Jennings, and M. Kulahci. 2008. Introduction to Time Series Analysis and Forecasting. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ.

Shumway, R. H., and D. S. Stoffer. 2011. Time Series Analysis and Its Applications. Third. Springer Texts in Statistics. Springer, New York.