Multivariate II: Variable Selection and Penalized Estimation (LASSO, etc.)

What is this all about?

Vector Autoregressions (VARs) are canonical models for (stationary) multivariate time series data \(\mathbf{X}_t = (X_{1,t},\ldots,X_{d,t})'\), \(t=1,\ldots,T\). Note that the number of VAR\((p)\) model parameters is (assuming zero mean) \[ pd^2 + d(d-1)/2. \] This can be pretty large, even for moderate \(d\)’s. E.g.

p <- 2
d <- 5

p*d^2 + d*(d-1)/2

## [1] 60

p <- 3
d <- 10

p*d^2 + d*(d-1)/2

## [1] 345

# In the example used below:

p <- 2
d <- 51

p*d^2 + d*(d-1)/2

## [1] 6477

# While T = 281

Many of these coefficients are expected to be “non-significant” (i.e. zero). Choosing “significant” (i.e. non-zero) coefficients and “significant” variables/components (and their lags) is known as variable selection.

Data/Theory goals of the lecture

The goal is to review some of the variable selection methods. (A related question is choosing the order \(p\).) As dimension \(d\) increases, these variable selection methods become computationally expensive and/or do not work. In this case, penalized estimation (regularized estimation) methods such as LASSO and its variants come handy. We shall discuss some of these penalized estimation methods as well. The methods will be illustrated through the following Google flu trends time series.

Motivating example: Google flu trends

The following is a data set of Google flu trends that attracted quite a bit of attention (see Ginsberg et al. (2009), Dugas et al. (2013), Lazer et al. (2014)). The data are weakly measurements for 51 US states and the period September, 2003 - April, 2013, derived from flu related Google searches per state. Several states have records starting only a number of years following 2003.

The states are separated below by the CDC (Center for Disease Control) regions, as designated on their website. For example, “CT”, “ME”, “MA”, “NH”, “RI” and “VT” make one region designated by the CDC.

stateflu <- read.csv("states_abbrev.csv", sep=",", header=TRUE)
stateflu <- as.matrix(stateflu[,c(-1)])
stateflu <- log(na.omit(stateflu))


#plot data by state
x <- 1:dim(stateflu)[1]
matplot(x, stateflu, type = "l", lty = 1)

library(vars)
library(corrplot)

lag <- 2
cols <- dim(stateflu)[2]
rows <- dim(stateflu)[1]

fitvar <- VAR(stateflu, p = lag, type="const")

#get intercept estimate, coefficient matrices
mu <- matrix(, nrow = 1, ncol = cols)
states <- names(fitvar$varresult)
count <- 1
for (s in states) {
    temp <- fitvar$varresult[s]
    end <- length(coef(temp[[1]]))
      mu[count] = coef(temp[[1]])[end]
    count = count + 1
}

A1var <- Acoef(fitvar)[[1]]
A2var <- Acoef(fitvar)[[2]]

#change labels to state abbreviations
colnames(A1var) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
colnames(A2var) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")

rownames(A1var) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
rownames(A2var) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")

#reorder A1 by CDC regions
A1varbyregn <-  A1var[c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA"),c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA")]

#reorder A2 by CDC regions
A2varbyregn <-  A2var[c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA"),c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA")]

corrplot(round(A1varbyregn,1), mar=c(1,1,2,1), method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="VAR(2), A_1")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")

corrplot(round(A2varbyregn,1), mar=c(1,1,2,1), method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="VAR(2), A_2")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")

Note a number of small coefficients. Numerical “instability” (larger variability) of estimates is also expected since a large number of coefficients needs to be estimated from relatively small sample of size \(T\), and the regression matrix “\(X'X\)” might be ill-conditioned.

Question: What is the point of selecting a smaller number of variables?

Variable selection

A number of approaches have been suggested, and work well for smaller dimensions (number of parameters). The following is an approach based on \(t\)-statistics: Let \(\widehat \Phi_r(j,k) = \widehat \Phi_{r,jk}\), \(r=1,\ldots,p\), \(j,k=1,\ldots,d\), be the OLS/MLE estimators of the VAR coefficients.

Compute the \(t\)-statistics: \[ t_{r,j,k} = \frac{\widehat \Phi_r(j,k)}{\mbox{s.e.}(\widehat \Phi_r(j,k))}. \]
Create a sequence \(Q\) of the \(d^2p\) triplets \((r,j,k)\) by ranking the absolute values of the above \(t\)-ratios from highest to lowest.
For each \(m \in \{0, 1,\ldots, d^2p\}\), consider the model that selects the \(m\) non-zero VAR coefficients corresponding to the top \(m\) triplets in the sequence \(Q\). Under this parameter non-zero constraint, execute the constrained parameter estimation (see notes below) and compute the corresponding BIC according to \(\mbox{BIC}(m) = −2 \log L + \log T \cdot m\).
Choose \(m^*\) that gives the minimum BIC value.

The method seems to be working quite well in lower-dimensional settings. See several illustrations below.

Notes

Wasserman (2004), p. 222, refers to this as Zheng-Loh Model Selection Method, after Zheng and Loh (1995), who showed that in the regression setting and under appropriate conditions, this method chooses the true “sparse” model with probability tending to one as the sample size increases. This method for model selection avoids having to search through all possible models.
Two other common methods for variable selection are forward and backward stepwise regression. In forward stepwise regression, one starts with no covariates in the model. One then adds the one variable that leads to the best “score” (e.g. the lowest \(p\)-value). One continues adding variables one at a time until the “score” does not improve (e.g. the \(p\)-values are larger than some critical value). Backwards stepwise regression is the same except that one starts with the biggest model and drops one variable at a time. These can be used for VAR per equation.
Linearly constrained estimation for VARs is well-studied, with explicit formulae and asymptotic results for OLS, GLS and EGLS, the latter having the same efficiency as MLE. See, for example, Lutkepohl (2005), Ch. 5, p. 193; Hamilton (1994), Sec. 11.3, p. 309. More specifically, this is ususally expressed through a “restriction” matrix \(R\) so that \[ R\, \mbox{vec}(\Phi_1,\ldots,\Phi_p) = 0. \] Here is an example through the vars package for the data considered in the previous lecture, where we found the income at lag 1 to be non-significant for consumption.

library(vars)
library(fpp)

data(usconsumption)

var_consumption <- VAR(usconsumption,p=1,type="const")
restrict <- matrix(c(1, 0, 1,
                     1, 1, 1),
                       nrow=2, ncol=3, byrow=TRUE)
restrict(var_consumption, method = "manual", resmat = restrict)

## 
## VAR Estimation Results:
## ======================= 
## 
## Estimated coefficients for equation consumption: 
## ================================================ 
## Call:
## consumption = consumption.l1 + const 
## 
## consumption.l1          const 
##      0.3573904      0.4865219 
## 
## 
## Estimated coefficients for equation income: 
## =========================================== 
## Call:
## income = consumption.l1 + income.l1 + const 
## 
## consumption.l1      income.l1          const 
##      0.5633422     -0.2316257      0.4840611

The following are a few examples, for simulated and real data, illustrating the variable selection method described above.

  source("VAR-library.R")

  # # Generating sVAR(1) with dim=3
  # Tt <- 500
  # delta <- 1
  # Sigma <- matrix(c(1, delta/4, delta/4, delta/4, 1, 0, delta/4, 0, 1), ncol=3)
  # A <- matrix(c(.9, 0, .1, 0, -.3, 0, 0, 0, .2), ncol=3)
  # nrep <- 100
  # 
  # Asum <- matrix(0,3,3)
  # Afre <- matrix(0,3,3)
  # for(r in 1:nrep){
  #   # print(r)
  #   y <- VAR.sim(Tt, A, Sigma=Sigma)
  #   sVar.out <- sVAR.tstat(y, p=1)
  #   Asum <- Asum + sVar.out$hatA
  #   Afre <- Afre + 1*(sVar.out$hatA != 0)
  #   # print(sVar.out$hatA)
  # }
  # Amean <- Asum/nrep
  # Afre <- 100*Afre/nrep
  # # print(Amean)
  # # print(Afre)
  # save(A,Amean,Afre,file="VarSelectSimul.RData")
  

  load("VarSelectSimul.RData")
  library(corrplot)
  par(mfrow=c(1,3))
  corrplot(A, mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7,  addCoef.col=TRUE)
  title("True")
  corrplot(Amean, mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7, addCoef.col=TRUE)
  title("Average estimated A_1")
  corrplot(Afre, mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7, addCoef.col=TRUE)
  title("Frequency of non-zero estimates")

# ptm <- proc.time()
# stateflu.some <- stateflu[,c("CT","ME","MA","NH","RI","VT")]
# sVar.out <- sVAR.tstat(t(stateflu.some), p=2)
# ltm <- proc.time() - ptm
# save(sVar.out,ltm,file="VarSelect6States.RData")
  
load("VarSelect6States.RData")
ltm

##    user  system elapsed 
##  10.833   0.410  11.274

par(mfrow=c(1,2))
corrplot(sVar.out$hatA[,1:6], mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7,  addCoef.col=TRUE)
corrplot(sVar.out$hatA[,7:12], mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7,  addCoef.col=TRUE)

# ptm <- proc.time()
# stateflu.some2 <- stateflu[,c("CT","ME","MA","NH","RI","VT","NJ","NY")]
# sVar.out2 <- sVAR.tstat(t(stateflu.some2), p=2)
# ltm <- proc.time() - ptm
# save(sVar.out2,ltm,file="VarSelect8States.RData")
  
load("VarSelect8States.RData")
ltm

##    user  system elapsed 
##  44.440   2.743  47.432

par(mfrow=c(1,2))
corrplot(sVar.out2$hatA[,1:8], mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7,  addCoef.col=TRUE)
corrplot(sVar.out2$hatA[,9:16], mar=c(1,1,2,1), is.corr=FALSE, cl.pos=FALSE, cl.cex = .5, tl.cex=.7,  addCoef.col=TRUE)

Note: The method is time consuming for larger \(d\), e.g. running it for 14 states took almost one hour.

Question: What can be done then for larger dimension \(d\)? At least two approaches:

Select a rough smaller VAR model first, and then apply the variable selection. E.g. a smaller model can be selected using the so-called Partial Spectral Coherences as in Davis et al. (2016), Baek et al. (2017).
Use penalized regression as explained next.

Penalized estimation: LASSO and its variants

LASSO (Least Absolute Shrinkage and Selection Operator) of Tibshirani (1996) estimates the coefficients through \[ (\widehat \Phi_1,\ldots,\widehat \Phi_p) = \mbox{argmin}_{\Phi_1,\ldots,\Phi_p} \frac{1}{2T} \sum_{t=p+1}^T \| \mathbf{X}_t - \Phi_1 \mathbf{X}_{t-1} - \ldots - \Phi_p \mathbf{X}_{t-p}\|^2_2 + \lambda \sum_{r=1}^p \|\Phi_r\|_1, \] where \(\|\Phi_r\|_1 = \sum_{j,k=1}^d |\Phi_{r,jk}|\) is the \(\ell_1\)-penalty and \(\lambda\) is a tuning regularization/penalty parameter. There exists no closed-form solution for the LASSO estimator but the above optimization can be solved numerically in order to obtain the estimator. This is implemented through the R package glmnet, which allows for more general elastic-net penalty as in: \[ (\widehat \Phi_1,\ldots,\widehat \Phi_p) = \mbox{argmin}_{\Phi_1,\ldots,\Phi_p} \frac{1}{2T} \sum_{t=p+1}^T \| \mathbf{X}_t - \Phi_1 \mathbf{X}_{t-1} - \ldots - \Phi_p \mathbf{X}_{t-p}\|^2_2 + \lambda \Big(\alpha \sum_{r=1}^p \|\Phi_r\|_1 + \frac{(1-\alpha)}{2} \sum_{r=1}^p \|\Phi_r\|_2^2 \Big), \] where \(\alpha =1\) (the default) corresponds to LASSO and \(\alpha=0\) to the Ridge regression.

In fact, there is an efficient algorithm (LARS; see Efron et al. (2004)) that produces the solutions to LASSO estimation for all \(\lambda\)’s, referred to as a LASSO solution path. The glmnet, on the othet hand, uses the coordinate discent.

Notes

The key feature of the \(\ell_1\)-penalty is that some of the parameter estimates will typically be equal to \(0\). The larger the penalty parameter \(\lambda\), the more parameter estimates will be set to \(0\).
LASSO is known to be “consistent” (both for model selection and parameter estimation) only under suitable assumptions, the so-called Irrepresentable, Restricted Eigenvalue, and other conditions. These essentially do not allow the regressors to be too correlated.
The tuning parameter \(\lambda\) is commonly chosen through Cross Validation.
A nice R Vignette on Glmnet can be found on Tibshirani’s website. The following example is an illustration on a small univariate AR model.

set.seed(2)
T <- 1000
p1 <- 0.4
p2 <- -0.3
p3 <- -0.1
y <- arima.sim(list(order = c(3,0,0), ar = c(p1,p2,p3)), n = T)
plot(y)

response <- y[5:T]
input <- cbind(y[4:(T-1)],y[3:(T-2)],y[2:(T-3)],y[1:(T-4)])

response <- (response - mean(response))/sqrt(var(response))  
library(matrixStats)
input <- (input - colMeans(input))/colSds(input)

library(glmnet)

fit <- glmnet(input, response)
plot(fit,xvar = "lambda",label = TRUE)

#print(fit)
#coef(fit,s=0.1)

cvfit <- cv.glmnet(input, response)
plot(cvfit)

cvfit$lambda.min

## [1] 0.01339785

coef(cvfit, s = "lambda.min")

## 5 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  0.001057139
## V1           0.396988092
## V2          -0.225247532
## V3          -0.110871611
## V4           .

Google flu trends: LASSO

# nf <- 10
# p <- 2 
# data <- t(stateflu)
# 
# y <- data - rowMeans(data)
# T1 <- dim(y)[2]-p
# k <- dim(y)[1]
# # create vector X1 and Y1
# X1 <- matrix(0, k*p, T1)
# Y1 <- matrix(0, k, T1)
# for(j in 1:T1){
#     # ar term
#     id <- seq(from= j+p-1, to = j, by=-1)
#     x <- as.vector(y[,id])
#     X1[,j] <- x
#     Y1[,j] <- y[,(j+p)]
#   }
# ty <- t(y)
# ty <- data.frame(ty)
# 
# y1 <- as.vector(Y1)
# x1 <- kronecker(t(X1), diag(1, k))
# 
# cvfit <- cv.glmnet(x1, y1, alpha = 1, intercept=TRUE, standardize=TRUE, type.measure = "mse", nfolds=nf)
# save(cvfit,k,file="LassoStand.RData")
# # runs 10 minutes
# # standardize=FALSE gives slightly different results visually

load("LassoStand.RData")
#load("LassoStandno.RData")

cf.cv <- coef(cvfit, s = "lambda.min")
cf.cv <- cf.cv[-1]
hA1 <- matrix(cf.cv, nrow=k, byrow=FALSE)

A1svar <- hA1[,1:51]
A2svar <- hA1[,52:102]

colnames(A1svar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
colnames(A2svar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")

rownames(A1svar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
rownames(A2svar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")

A1svarbyregn <-  A1svar[c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA"),c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA")]

A2svarbyregn <-  A2svar[c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA"),c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA")]


corrplot(round(A1svarbyregn,1), mar=c(1,1,2,1),  method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="sVAR(2), A_1")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")

corrplot(round(A2svarbyregn,1), mar=c(1,1,2,1),  method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="sVAR(2), A_2")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")

Improvements in interpretation

E.g. the cross-dependence in the first CDC region.

Improvements in forecasts

lag <- 2
cols <- dim(stateflu)[2]
rows <- dim(stateflu)[1] 
holdoutsize <- 10
trainsize <- rows - holdoutsize

stateflu.train <- stateflu[1:trainsize,]
stateflu.mean <- colMeans(stateflu.train)
fitvar <- VAR(stateflu.train - stateflu.mean, p = lag, type="none")

# nf <- 10
# p <- 2
# data <- t(stateflu.train)
# y <- data - rowMeans(data)
# T1 <- dim(y)[2]-p
# k <- dim(y)[1]
# # create vector X1 and Y1
# X1 <- matrix(0, k*p, T1)
# Y1 <- matrix(0, k, T1)
# for(j in 1:T1){
#      # ar term
#      id <- seq(from= j+p-1, to = j, by=-1)
#      x <- as.vector(y[,id])
#      X1[,j] <- x
#      Y1[,j] <- y[,(j+p)]
#    }
#  ty <- t(y)
#  ty <- data.frame(ty)
# 
#  y1 <- as.vector(Y1)
#  x1 <- kronecker(t(X1), diag(1, k))
# 
#  cvfit <- cv.glmnet(x1, y1, alpha = 1, intercept=TRUE, standardize=TRUE, type.measure = "mse", nfolds=nf)
#  save(cvfit,k,file="LassoStandNoTrain.RData")

load("LassoStandNoTrain.RData")
hatA0 <- Bcoef(fitvar)
 
cf.cv <- coef(cvfit, s = "lambda.min")
cf.cv <- cf.cv[-1]
hatA1 <- matrix(cf.cv, nrow=k, byrow=FALSE)

sum(hatA0 != 0)

## [1] 5202

sum(hatA1 != 0)

## [1] 1974

ff0 <- ff1 <- matrix(0, 51, holdoutsize) 
for( i in 1:holdoutsize){
train <- t(stateflu[1:(trainsize+i-1),])
mf <- rowMeans(train)
train <- train - mf
ff0[,i] <- VAR.forecast(train, h=1, hatA0) + mf
ff1[,i] <- VAR.forecast(train, h=1, hatA1) + mf
}
 
## 1-step out-of-sample forecast MSPE 
YTF <- t(stateflu[-(1:trainsize),])

sum((YTF - ff0)^2)

## [1] 14.22553

sum((YTF - ff1)^2)

## [1] 7.913374

plot.ts(YTF[1,], col=1)
lines(ff0[1,], col=2)
lines(ff1[1,], col=3)
legend("bottomleft", c("TRUE", "hatA0", "hatA1"), lty=c(1,1,1), col= c(1,2,3))
title("AL")

LASSO variants

It is quite well known that regular LASSO overestimates the number of non-zero coefficients. Adaptive LASSO (A-LASSO) tends to correct this tendency. A-LASSO estimation of Zou (2006) is defined as \[ (\widehat \Phi_1,\ldots,\widehat \Phi_p) = \mbox{argmin}_{\Phi_1,\ldots,\Phi_p} \frac{1}{2T} \sum_{t=p+1}^T \| \mathbf{X}_t - \Phi_1 \mathbf{X}_{t-1} - \ldots - \Phi_p \mathbf{X}_{t-p}\|^2_2 + \lambda\sum_{r=1}^p \sum_{j,k=1}^d \widehat w_{r,jk} |\Phi_{r,jk}|, \] where e.g. supposing a VAR model can be fitted through OLS (i.e. the number of parameters being smaller than the number of observations), \[ \widehat w_{r,jk} = \frac{1}{|\widehat \Phi_{r,jk}^{ols}|}. \] In fact, the A-LASSO version which I typically use in my work is based on \[ (\widehat \Phi_1,\ldots,\widehat \Phi_p) = \mbox{argmin}_{\Phi_1,\ldots,\Phi_p} \frac{1}{2T} \sum_{t=p+1}^T \mbox{vec}(\mathbf{X}_t - \Phi_1 \mathbf{X}_{t-1} - \ldots - \Phi_p \mathbf{X}_{t-p})' \Sigma_{\mathbf Z} \mbox{vec}(\mathbf{X}_t - \Phi_1 \mathbf{X}_{t-1} - \ldots - \Phi_p \mathbf{X}_{t-p}) \] \[ + \lambda\sum_{r=1}^p \sum_{j,k=1}^d \widehat w_{r,jk} |\Phi_{r,jk}|, \] which is implemented numerically by iteration over plugging in the estimated \(\widehat \Sigma_{\mathbf Z}\) till convergence. (The same or similar objective functions are used for regular LASSO as well.) It is known that in “lower dimensions” and VARs, the A-LASSO estimates the correct model asymptotically with increasing sample size. In particular, no additional assumptions as in LASSO are needed.

Improvements in forecasts: A-LASSO

# The saved A-LASSO estimates were obtained using the A-LASSO function in the library VAR-library.R
load("LassoAStandNoTrain.RData")
hatA2 <- fitsvar$hatA

sum(hatA2 != 0)

## [1] 545

A1savar <- hatA2[,1:51]
A2savar <- hatA2[,52:102]

colnames(A1savar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
colnames(A2savar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")

rownames(A1savar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")
rownames(A2savar) <- c("AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY")

A1savarbyregn <-  A1savar[c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA"),c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA")]

A2savarbyregn <-  A2savar[c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA"),c("CT","ME","MA","NH","RI","VT","NJ","NY","DE","DC","MD","PA","VA","WV","AL","FL","GA","KY","MS","NC","SC","TN",
"IL","IN","MI","MN","OH","WI","AR","LA","NM","OK","TX","IA","KS","MO","NE","CO","MT","ND","SD","UT","WY","AZ","CA","HI","NV",
"AK","ID","OR","WA")]



corrplot(round(A1savarbyregn,1), mar=c(1,1,2,1),  method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="saVAR(2), A_1")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")

corrplot(round(A2savarbyregn,1), mar=c(1,1,2,1),  method="color", is.corr=FALSE, cl.pos=FALSE, addCoef.col=TRUE, tl.cex=.7, cl.cex=0.5, number.cex=0.5,title="saVAR(2), A_1")
abline(h = 0, v = c(6.5,8.5,14.5,22.5,28.5,33.5,37.5,43.5,47.5,51.5), col = "black")
abline(h = c(4.5,8.5,14.5,18.5,23.5,29.5,37.5,43.5,45.5,51.5), v = 0, col = "black")

ff2 <- matrix(0, 51, holdoutsize)
for( i in 1:holdoutsize){
  train <- t(stateflu[1:(trainsize+i-1),])
  mf <- rowMeans(train)
  train <- train - mf;
  ff2[,i] <- VAR.forecast(train, h=1, hatA2) + mf
}

sum((YTF - ff2)^2)

## [1] 7.138015

plot.ts(YTF[1,], col=1)
lines(ff0[1,], col=2)
lines(ff1[1,], col=3)
lines(ff2[1,], col=4)
legend("bottomleft", c("TRUE","hatA0","hatA1","hatA2"), lty=c(1,1,1,1), col= c(1,2,3,4))
title("AL")

Some final notes

The VAR analysis of the Google flu trends above somewhat follows Davis et al. (2016). My own involvment in the analysis was through Baek et al. (2017) where we showed that sparse seasonal/periodic VAR models do better than regular VARs.
Despite some progress in penalized estimation for VARs, the methods are somewhat in their infancy. E.g. no (good) formal significance tests are available, nor confidence intervals, nor satisfactory understanding (practically speaking) when LASSO works in higher dimensions, etc.

Questions/Homework

Q0: Supplement your analysis of multivariate time series with variable selection, LASSO and its variants where seem appropriate.

Q1: Explain how exactly restricted VARs are estimated.

Q2: Compare numerically how the variable selection method described above for smaller dimension compares numerically to stepwise regression methods.

Q3: Look into some of the conditions (Irrepresentable, Restricted Eigenvalue, etc) for LASSO to work.

Q4: Explain how the coordinate descent method is used to obtain the LASSO and related estimators.

References

Baek, C., R. A. Davis, and V. Pipiras. 2017. “Sparse Seasonal and Periodic Vector Autoregressive Modeling.” Computational Statistics & Data Analysis 106: 103–26. doi:10.1016/j.csda.2016.09.005.

Davis, R. A., P. Zang, and T. Zheng. 2016. “Sparse Vector Autoregressive Modeling.” Journal of Computational and Graphical Statistics 25 (4): 1077–96. doi:10.1080/10618600.2015.1092978.

Dugas, A. F., M. Jalalpour, Y. Gel, S. Levin, F. Torcaso, T. Igusa, and R. E. Rothman. 2013. “Influenza Forecasting with Google Flu Trends.” PloS One 8 (2): e56176.

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani. 2004. “Least Angle Regression.” The Annals of Statistics 32 (2): 407–99.

Ginsberg, J., M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, and L. Brilliant. 2009. “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457: 1012–4. http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html.

Hamilton, J. D. 1994. Time Series Analysis. Princeton University Press, Princeton, NJ.

Lazer, D., R. Kennedy, G. King, and A. Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (14 March).

Lütkepohl, H. 2005. New Introduction to Multiple Time Series Analysis. Springer-Verlag, Berlin.

Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B. Methodological 58 (1): 267–88.

Wasserman, L. 2004. All of Statistics. Springer Texts in Statistics. Springer-Verlag, New York. doi:10.1007/978-0-387-21736-9.

Zheng, X., and W.-Y. Loh. 1995. “Consistent Variable Selection in Linear Models.” Journal of the American Statistical Association 90 (429): 151–56.

Zou, H. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American Statistical Association 101 (476): 1418–29.