What is this all about?

Clustering is the process of grouping like objects together so that objects within a cluster are more similar to each other than to objects in other clusters. In general, this can be performed on any kind of data (e.g. non-time series ‘static’ data), but applying it to time series involves several unique challenges. Namely, time series typically exhibit strong dependence within a single series, and time series objects tend to be of much higher dimension than the objects in typical static clustering.

There are three components necessary to most clustering approaches:

  1. A measure of similarity or dissimilarity, i.e. a distance measure
  2. A clustering algorithm
  3. A method to evaluate clusters

This lecture will focus primarily on 1 and 2. For 3, see Liao (2005).

Classification

A problem related to clustering is classification. The difference is that the groups (classes) are known in advance and have an interpretable meaning, and the task is to assign each object to the appropriate class. The first data example below can be readily understood in a classification context.

Setting

Suppose we are given \(N\) time series objects \(\mathbf{X}_{1,t},\ldots,\mathbf{X}_{N,t}\), \(t = 1,\ldots,T\). We will consider both univariate and multivariate time series objects.

Note: Most similarity/distance measures require each time series object to have the same length/dimensionality. However, there are typically ways around this constraint, e.g. most simply by time interpolation.
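For instance, a series can be linearly interpolated onto a common time grid. Below is a minimal sketch in base R (the helper name to_common_length is ours, for illustration; the dtwclust package used later provides reinterpolate() for the same purpose):

# linearly interpolate a series x onto a grid of new_length equally spaced points
to_common_length <- function(x, new_length) {
  approx(x = seq_along(x), y = x, xout = seq(1, length(x), length.out = new_length))$y
}

# e.g. map series of different lengths onto a common length of 100
# to_common_length(rnorm(150), 100); to_common_length(rnorm(80), 100)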

Data for this lecture

We consider two datasets: one in the sphere of signal processing, and the other more in the spirit of traditional time series analysis.

Motivating Example 1: Character Trajectories

The horizontal velocity of the pen tip when writing letters of the alphabet. Source: Lichman (2013)

# install.packages('dtwclust')
library(dtwclust)
CharTrajLabels
##   [1] A A A A A B B B B B C C C C C D D D D D E E E E E G G G G G H H H H H
##  [36] L L L L L M M M M M N N N N N O O O O O P P P P P Q Q Q Q Q R R R R R
##  [71] S S S S S U U U U U V V V V V W W W W W Y Y Y Y Y Z Z Z Z Z
## Levels: A B C D E G H L M N O P Q R S U V W Y Z
par(mfrow=c(2,2))

plot.ts(CharTraj$Z.V1,main='Letter Z',ylab='X Velocity',xlab='Time')
plot.ts(CharTraj$Z.V2,main='Letter Z',ylab='X Velocity',xlab='Time')
plot.ts(CharTraj$O.V1,main='Letter O',ylab='X Velocity',xlab='Time')
plot.ts(CharTraj$O.V2,main='Letter O',ylab='X Velocity',xlab='Time')

Motivating Example 2: Temperature and precipitation in Washington state

After cleaning, these data consist of the daily maximum temperature, daily minimum temperature, and an indicator of daily precipitation at 22 climate stations in Washington for all days from 01/01/2005 to 12/31/2007. Source: NCDC/NOAA.

library(readr)
X901720 <- read_csv("901720.csv")

attach(X901720)

# drop observations with missing-value flags
table2 <- X901720[X901720$PRCP > -1,]
table2 <- table2[table2$TMAX > -1,]
table2 <- table2[table2$TMIN > -1 & table2$TMIN < 1000,]

# keep only stations with a complete record of 365*3 daily observations
table3 = NULL
i = 0
for (c in unique(STATION)) {
  if (dim(table2[table2$STATION == c,])[1] == 365*3) {
    table3 = rbind(table3, table2[table2$STATION == c,])
    i = i+1
    # print(c(c,i))
  }
}
attach(table3)

tmax = TMAX/(max(TMAX) - min(TMAX)) # rescaled to [0,1]
tmax = tmax - min(tmax)
tmin = TMIN/(max(TMIN) - min(TMIN)) # rescaled to [0,1]
tmin = tmin - min(tmin)
prcp = as.numeric(PRCP>0) # binary precipitation indicator

stat = STATION
stat_un = unique(STATION)
lat = LATITUDE
long = LONGITUDE

yr3 = 365*3

par(mfrow=c(2,1))
ts.plot(tmax[1:yr3],ylim=c(0,1),ylab="temp",xlab="Day",col="blue",main=stat_un[1])
points(prcp[1:yr3],cex=0.2)
lines(tmin[1:yr3],lty=2,col="orange")

ts.plot(tmax[(yr3+1):(2*yr3)],ylim=c(0,1),ylab="temp",xlab="Day",col="blue",main=stat_un[2])
points(prcp[(yr3+1):(2*yr3)],cex=0.2)
lines(tmin[(yr3+1):(2*yr3)],lty=2,col="orange")

Similarity/Distance Measures

Clustering involves grouping objects together through a measure of similarity. This need not always be a distance, strictly speaking (e.g. neither of the two similarity measures defined below satisfies the triangle inequality), so long as it can be used to create a pairwise similarity matrix between all objects. A naive approach that often performs decently in practice is to simply use the Euclidean (or \(L_1\), etc.) distance.

We look at two others in particular:

  1. Dynamic Time Warping (DTW) for univariate time series
  2. \(Q^*\), a quasi-distance based on the spectral densities for multivariate time series

Many similarity measures can be found in the literature and can be mixed and matched with different clustering algorithms.

See Aghabozorgi et al. (2015)

DTW

Dynamic Time Warping (DTW) can be thought of as an alignment algorithm. The basic idea is to match the shapes of the time series while allowing for local differences in timing and speed by computing an optimum ‘warping curve’. It is slower and more computationally intensive than Euclidean distance (running in \(O(T^2)\) time) but tends to be more robust. It is most easily understood visually.

See Sardá-Espinosa (2016)
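An alignment can also be computed and plotted directly with the dtw package (Giorgino 2009). A small sketch; the pair of series chosen here is purely illustrative:

library(dtw)
# align two samples of the letter A and plot the point-to-point alignment
ali <- dtw(CharTraj$A.V1, CharTraj$A.V2, keep.internals = TRUE)
plot(ali, type = "twoway", offset = -2)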

To compute a DTW distance \(DTW(X,Y)\) between two univariate time series \(X_{t}\) and \(Y_{t}\), \(t = 1,...,T\), first compute a \(T\times T\) “local cost matrix” (LCM) which stores all pairwise (e.g. Euclidean) distances between \(X_{t_1}\) and \(Y_{t_2}\), \((t_1,t_2) \in \{1,\ldots,T\} \times \{1,\ldots,T\}\).

Then, step through the matrix from \(\text{LCM}_{11}\) to \(\text{LCM}_{TT}\), either diagonally (\((i,j) \to (i+1,j+1)\)) or vertically/horizontally (\((i,j) \to (i+1,j)\) or \((i,j) \to (i,j+1)\)), adding the costs of each step.

At each step, the goal is to find the direction which increases the cost the least. A pre-specified step pattern assigns weights to each direction. Such step patterns may also allow bigger “jumps”, e.g. \((i,j) \to (i+2,j+1)\).

For an optimum path \(\phi = \{(1,1),...,(T,T)\}\), the DTW distance is computed as

\[DTW(X,Y) = \left(\sum_{k \in \phi} \dfrac{m_\phi(k)\text{LCM}(k)^2}{M_\phi}\right)^{1/2},\] where \(m_\phi(k)\) is a per-step weighting coefficient determined by the step pattern, and \(M_\phi\) normalizes the aggregate cost to be average-per-step. Sometimes, global (“window”) constraints are imposed so that certain regions of the LCM are not reachable. See Giorgino (2009).
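As a concrete (if inefficient) illustration, the recursion can be written in a few lines of base R. This is only a sketch with unit step weights, squared Euclidean local costs, no window constraint, and no per-step normalization, so it will not reproduce the dtw_basic results below exactly; the function name dtw_naive is ours.

# local cost matrix + dynamic programming over (diagonal, vertical, horizontal) steps
dtw_naive <- function(x, y) {
  n <- length(x); m <- length(y)
  lcm <- outer(x, y, function(a, b) (a - b)^2)   # squared local costs LCM[i, j]
  D <- matrix(Inf, n, m)
  D[1, 1] <- lcm[1, 1]
  for (i in 1:n) for (j in 1:m) {
    if (i == 1 && j == 1) next
    prev <- c(if (i > 1 && j > 1) D[i - 1, j - 1] else Inf,  # diagonal step
              if (i > 1)          D[i - 1, j]     else Inf,  # vertical step
              if (j > 1)          D[i, j - 1]     else Inf)  # horizontal step
    D[i, j] <- lcm[i, j] + min(prev)             # cheapest way to reach (i, j)
  }
  sqrt(D[n, m])                                  # accumulated cost of an optimal path
}

# e.g. dtw_naive(CharTraj$A.V1, CharTraj$A.V2) vs. dtw_naive(CharTraj$A.V1, CharTraj$H.V1)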

# Series are not all the same length, so they must be reinterpolated:
Chars <- reinterpolate(CharTraj, new.length = max(lengths(CharTraj)))

# Calculate the distance matrices for a few different methods
# (dist() here is proxy::dist, loaded with dtwclust; the "dtw_basic" distance is registered by dtwclust)
dist_DTW <- dist(Chars, Chars, method = "dtw_basic", window.size=20)
dist_Euc <- dist(Chars, Chars, method = "euclidean")

diag(dist_DTW) <- NA
diag(dist_Euc) <- NA

dist_DTW[1:5,6:10]
##          B.V1     B.V2     B.V3     B.V4     B.V5
## A.V1 57.37716 41.68769 53.73263 64.34641 39.68635
## A.V2 57.98981 39.83564 50.59557 58.91198 37.88308
## A.V3 56.62815 41.98002 51.02160 47.35270 26.94774
## A.V4 64.27002 41.05958 48.90285 55.79323 34.70193
## A.V5 70.47508 45.29957 53.48439 64.06912 35.61516
dist_Euc[1:5,6:10]
##          B.V1     B.V2     B.V3     B.V4     B.V5
## A.V1 6.672680 6.891274 7.659604 7.121019 6.061756
## A.V2 6.825624 6.768158 7.608874 6.993452 6.216782
## A.V3 6.619865 6.314050 7.096598 6.274200 5.847371
## A.V4 6.920426 6.865082 7.550229 7.072161 6.224095
## A.V5 7.526249 7.598928 8.393165 7.796458 6.820037
# Nearest neighbors
NN_DTW <- apply(dist_DTW, 1, which.min)
NN_Euc <- apply(dist_Euc, 1, which.min)

NN_DTW #It looks like A.V5 is closest to H.V1!
## A.V1 A.V2 A.V3 A.V4 A.V5 B.V1 B.V2 B.V3 B.V4 B.V5 C.V1 C.V2 C.V3 C.V4 C.V5 
##    2    1    4    5   31    7    8    7    7    7   14   13   12   15   14 
## D.V1 D.V2 D.V3 D.V4 D.V5 E.V1 E.V2 E.V3 E.V4 E.V5 G.V1 G.V2 G.V3 G.V4 G.V5 
##   17   18   20   20   19   25   21   24   23   23   29   52   30   26   28 
## H.V1 H.V2 H.V3 H.V4 H.V5 L.V1 L.V2 L.V3 L.V4 L.V5 M.V1 M.V2 M.V3 M.V4 M.V5 
##    5   34   34   32   33   39   39   39   37   37   87   43   44   43   44 
## N.V1 N.V2 N.V3 N.V4 N.V5 O.V1 O.V2 O.V3 O.V4 O.V5 P.V1 P.V2 P.V3 P.V4 P.V5 
##   49   48   47   48   47   52   51   55   51   52   51   58   57   60   59 
## Q.V1 Q.V2 Q.V3 Q.V4 Q.V5 R.V1 R.V2 R.V3 R.V4 R.V5 S.V1 S.V2 S.V3 S.V4 S.V5 
##   63   64   64   62   97   84   70   70   67   67   72   71   75   71   73 
## U.V1 U.V2 U.V3 U.V4 U.V5 V.V1 V.V2 V.V3 V.V4 V.V5 W.V1 W.V2 W.V3 W.V4 W.V5 
##   32   79   80   78   78   82   81   82   66   82   88   88   87   87   89 
## Y.V1 Y.V2 Y.V3 Y.V4 Y.V5 Z.V1 Z.V2 Z.V3 Z.V4 Z.V5 
##   92   91   91   91   94   99   18   99  100   99
CharTrajLabels[31]
## [1] H
## Levels: A B C D E G H L M N O P Q R S U V W Y Z
par(mfrow=c(2,2))
plot.ts(CharTraj$A.V5,main='Letter A.V5',ylab='X Velocity',xlab='Time')
plot.ts(CharTraj$A.V4,main='Letter A.V4',ylab='X Velocity',xlab='Time')
plot.ts(CharTraj$H.V1,main='Letter H.V1',ylab='X Velocity',xlab='Time')
plot.ts(CharTraj$H.V4,main='Letter H.V4',ylab='X Velocity',xlab='Time')

# U.V1 also looks a lot like H.V2
par(mfrow=c(1,2))
plot.ts(CharTraj$H.V2,main='Letter H.V2',ylab='X Velocity',xlab='Time')
plot.ts(CharTraj$U.V1,main='Letter U.V1',ylab='X Velocity',xlab='Time')

NN_Euc
## A.V1 A.V2 A.V3 A.V4 A.V5 B.V1 B.V2 B.V3 B.V4 B.V5 C.V1 C.V2 C.V3 C.V4 C.V5 
##    2    1    2    5    4    7    8    7    8    9   14   15   12   12   12 
## D.V1 D.V2 D.V3 D.V4 D.V5 E.V1 E.V2 E.V3 E.V4 E.V5 G.V1 G.V2 G.V3 G.V4 G.V5 
##   20   20   20   20   19   22   21   24   23   23   29   29   30   26   28 
## H.V1 H.V2 H.V3 H.V4 H.V5 L.V1 L.V2 L.V3 L.V4 L.V5 M.V1 M.V2 M.V3 M.V4 M.V5 
##   33   34   31   32   33   39   39   37   37   38   90   43   44   43   44 
## N.V1 N.V2 N.V3 N.V4 N.V5 O.V1 O.V2 O.V3 O.V4 O.V5 P.V1 P.V2 P.V3 P.V4 P.V5 
##   49   48   47   48   49   55   54   54   53   51   29   58   60   60   58 
## Q.V1 Q.V2 Q.V3 Q.V4 Q.V5 R.V1 R.V2 R.V3 R.V4 R.V5 S.V1 S.V2 S.V3 S.V4 S.V5 
##   62   64   64   63   85   84   69   67   70   69   75   75   75   75   74 
## U.V1 U.V2 U.V3 U.V4 U.V5 V.V1 V.V2 V.V3 V.V4 V.V5 W.V1 W.V2 W.V3 W.V4 W.V5 
##    5   78   80   80   78   82   81   82   66   68   87   86   89   88   89 
## Y.V1 Y.V2 Y.V3 Y.V4 Y.V5 Z.V1 Z.V2 Z.V3 Z.V4 Z.V5 
##   92   91   91   95   94   99   66   25  100   99
# Now, U.V1 and A.V5 are close
par(mfrow=c(1,2))
plot.ts(CharTraj$A.V5,main='Letter A.V5',ylab='X Velocity',xlab='Time')
plot.ts(CharTraj$U.V1,main='Letter U.V1',ylab='X Velocity',xlab='Time')

Spectrum-based similarity

The main challenge in clustering stationary multivariate time series is defining a suitable similarity measure. Here, we follow the approach of Ravishanker et al. (2010), who define a similarity measure using smoothed periodogram estimates of the spectral density matrices of two stationary multivariate time series.

The approach is based on the following likelihood ratio test statistic. Consider the null hypothesis

\[H_0: \mathbf{\lambda_X}(\omega) = \mathbf{\lambda_Y}(\omega),\]

where \(\mathbf{\lambda_X},\mathbf{\lambda_Y}\) are the spectral density matrices of (multivariate) time series \(\{\mathbf{X}_t\},\{\mathbf{Y}_t\}\). Note: This formulation assumes the variances are the same, so we will standardize the data. It is also possible to consider a likelihood ratio test where \(\lambda_X\) is a constant multiple of \(\lambda_Y\).

The likelihood functions \(L(\mathbf{\lambda_X}(\omega))\), \(L(\mathbf{\lambda_Y}(\omega))\) and \(L_0(\mathbf{\lambda}(\omega))\) (the latter assuming \(H_0\)) can be derived using a smoothed periodogram (Daniell window) \(\mathbf{I^*}(\omega_l)\). For example, the spectral density matrix \(\lambda_X(\omega)\) at \(\omega = \omega_l = \frac{2\pi l}{T}\) is approximated through a smoothed periodogram

\[\mathbf{I^*}_\mathbf{X}(\omega_l) = (2m + 1)^{-1}\sum\limits_{|k| \leq m}\mathbf{I}_\mathbf{X}(\omega_{l+k}),\]

where \((2m+1)\) is a positive integer related to the degrees of freedom of a complex Wishart distribution, and \(\mathbf{I}_\mathbf{X}(\omega)\) is the periodogram of \(\{\mathbf{X}_t\}\) with entry \((j,k)\) given by

\[ I_{\mathbf{X},jk}(\omega_l) = \frac{1}{2\pi T}\left( \sum\limits_{t=1}^{T}x_{tj}\exp(-it\omega_l) \right)\left( \sum\limits_{t=1}^{T}x_{tk}\exp(it\omega_l) \right).\]

It is known that

\[\mathbf{I^*}_\mathbf{X}(\omega) \overset{d}{\approx} \frac{1}{2m+1}W_c(2m+1,p,\lambda_\mathbf{X}(\omega)),\]

and hence the density of \(\mathbf{I^*}_\mathbf{X}(\omega) =: \mathbf{Z}\) is approximately

\[\dfrac{1}{\Gamma_p(2m+1)}|\mathbf{Z}|^{(2m+1)-p}|\mathbf{\lambda_X}(\omega)|^{-(2m+1)}\exp\left\{-\text{tr}\left(\mathbf{\lambda_X}^{-1}(\omega)\mathbf{Z}\right)\right\}.\]

The likelihood is then

\[L(\mathbf{\lambda_X}(\omega)) = \dfrac{1}{\Gamma_p(2m+1)}|\mathbf{I^*}_\mathbf{X}(\omega)|^{(2m+1)-p}|\mathbf{\lambda_X}(\omega)|^{-(2m+1)}\exp\left\{-\text{tr}\left(\mathbf{\lambda_X}^{-1}(\omega)\mathbf{I^*}_\mathbf{X}(\omega)\right)\right\}.\]

Its maximum is

\[L(\mathbf{\hat{\lambda}_X}(\omega)) = \dfrac{1}{\Gamma_p(2m+1)}\bigg|\frac{1}{2m+1}\mathbf{I^*}_\mathbf{X}(\omega)\bigg|^{-(2m+1)}\bigg|\mathbf{I^*}_\mathbf{X}(\omega)\bigg|^{(2m+1)-p}\exp\left\{-(2m+1)p\right\}.\]

See Conradsen et al. (2003). After similar calculations for the likelihood under \(H_0\), the likelihood ratio statistic is

\[Q_{\mathbf{XY}}(\omega) = \frac{L_0(\mathbf{\hat{\lambda}}(\omega))}{L(\mathbf{\hat{\lambda}_X}(\omega))L(\mathbf{\hat{\lambda}_Y}(\omega))} = 2^{2p(2m+1)}\frac{|\mathbf{I^*}_\mathbf{X}(\omega)|^{2m+1}\,|\mathbf{I^*}_\mathbf{Y}(\omega)|^{2m+1}}{|\mathbf{I^*}_\mathbf{X}(\omega)+\mathbf{I^*}_\mathbf{Y}(\omega)|^{2(2m+1)}}.\]

The quasi-distance \(Q_{\mathbf{XY}}^*\) is defined as the average of the \(M\) smallest values of the likelihood ratio

\[Q_{\mathbf{XY}}^* = \frac{1}{M}\sum_{j=1}^M Q_{\mathbf{XY}}(\omega_{(j)}),\]

where \(M\) is chosen large enough to provide some smoothing of \(Q_{\mathbf{XY}}(\omega_l)\), but not so large that details of the spectra are missed. A guideline is to take \(\sqrt{T} \leq M \leq T/10\). The smallest values of the likelihood ratio are the ones providing evidence against \(H_0\), though averaging the \(M\) largest values of \(Q_{\mathbf{XY}}(\omega_l)\) also seems to work well in practice.
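Since the power \(2m+1\) appears in every factor, the statistic can equivalently be written as

\[Q_{\mathbf{XY}}(\omega) = \left(\dfrac{2^{2p}\,|\mathbf{I^*}_\mathbf{X}(\omega)|\,|\mathbf{I^*}_\mathbf{Y}(\omega)|}{|\mathbf{I^*}_\mathbf{X}(\omega)+\mathbf{I^*}_\mathbf{Y}(\omega)|^{2}}\right)^{2m+1}.\]

The quantity inside the parentheses is a monotone (power) transform of \(Q_{\mathbf{XY}}(\omega)\), so it identifies the same \(M\) smallest (or largest) frequencies. The implementation below works with this root form for bivariate spectral matrices (\(p = 2\)), which is where the factor \(2^{2\cdot 2}\) in the code comes from.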

N = length(stat_un)   # number of stations
T = 3*365             # length of each series (days)
P = N                 # number of multivariate objects (one per station)

data = NULL
for (i in 1:N) {
  # each station contributes 3 rows: tmax, tmin, prcp
  X = rbind(tmax[stat==stat_un[i]],tmin[stat==stat_un[i]],prcp[stat==stat_un[i]])
  data = rbind(data,X)
}

# standardizing all series
data2 = t(data)
data2 = (data2 - matrix(rep(1,T),ncol=1)%*%apply(data2,2,mean)) / (matrix(rep(1,T),ncol=1)%*%apply(data2,2,sd)) 

# computing periodograms and crossperiodograms
data2_fft = mvfft(data2)
data2_fft = data2_fft[2:ceiling(T/2),]

data2_per = (1/2/pi/T)*data2_fft*Conj(data2_fft)
data2_per = abs(data2_per)

data2_cper = NULL
for (p in 1:P) {
  # cross-periodogram between tmax and tmin of station p
  # (station p occupies columns 3p-2, 3p-1, 3p of data2: tmax, tmin, prcp)
  cper_new = (1/2/pi/T)*data2_fft[,3*p-2]*Conj(data2_fft[,3*p-1])
  data2_cper = cbind(data2_cper, cper_new)
}

T2 = dim(data2_per)[1]
P2 = dim(data2_per)[2]

T3 = dim(data2_cper)[1]
P3 = dim(data2_cper)[2]

# smoothing periodograms and crossperiodograms
# (Daniell window of width 2m+1: a moving average computed via cumulative sums,
#  with the (cross-)periodograms reflected at the boundaries)
m = 10

data2_per_ud = apply(data2_per,2,rev) ;
data2_per_2 = rbind( rep(0,P2), data2_per_ud[(T2-m):(T2-1),], data2_per, data2_per_ud[2:(m+1),] ) 

data2_per_sm = apply(data2_per_2,2,cumsum) 
data2_per_sm = data2_per_sm[(2*m+2):dim(data2_per_sm)[1],] - data2_per_sm[1:(dim(data2_per_sm)[1]-2*m-1),] 
data2_per_sm = data2_per_sm/(2*m+1) 

data2_cper_ud = apply(data2_cper,2,rev) 
data2_cper_2 = rbind( rep(0,P3), data2_cper_ud[(T3-m):(T3-1),], data2_cper, data2_cper_ud[2:(m+1),] ) 

data2_cper_sm = apply(data2_cper_2,2,cumsum) 
data2_cper_sm = data2_cper_sm[(2*m+2):dim(data2_cper_sm)[1],] - data2_cper_sm[1:(dim(data2_cper_sm)[1]-2*m-1),] 
data2_cper_sm = data2_cper_sm/(2*m+1) 

Q = NULL
for (q in 1:(P-1)) {
  for (p in (q+1):P) {
    # determinants of the 2x2 smoothed spectral matrices (tmax, tmin) for stations p and q
    Q_new1 = abs(data2_per_sm[,3*p-2] * data2_per_sm[,3*p-1] - abs(data2_cper_sm[,p])^2) 
    Q_new2 = abs(data2_per_sm[,3*q-2] * data2_per_sm[,3*q-1] - abs(data2_cper_sm[,q])^2) 
    Q_new12 = abs( (data2_per_sm[,3*p-2] + data2_per_sm[,3*q-2]) * 
                     (data2_per_sm[,3*p-1] + data2_per_sm[,3*q-1]) - 
                     abs(data2_cper_sm[,p] + data2_cper_sm[,q])^2) 
    Q_new = (Q_new1 * Q_new2) / (Q_new12^2)        
    Q_new = 2^(2*2)*Q_new   # 2^(2*dim) with dim = 2; this is Q_XY(omega)^(1/(2m+1))
    Q = cbind(Q, Q_new) 
  }
}
Q_st = Q ;

M = floor(T/10)   # number of frequencies to average over (guideline: sqrt(T) <= M <= T/10)

par(mfrow=c(2,1))
plot(Q_st[,1],main="Q_XY(omega_l) for stations 1,2",ylab="Q_XY(omega)")
lines(Q_st[,1])
smallest1 <- sort(Q_st[,1],index.return=TRUE)$ix[1:M]
points(smallest1,Q_st[smallest1,1],col="orange")

plot(Q_st[,(2*P-2)],main="Q_XY(omega_l) for stations 3,4",ylab="Q_XY(omega)")
lines(Q_st[,(2*P-2)])
smallest2 <- sort(Q_st[,(2*P-2)],index.return=TRUE)$ix[1:M]
points(smallest2,Q_st[smallest2,(2*P-2)],col="orange")

mean(smallest1)
## [1] 311.7706
mean(smallest2)
## [1] 248.7615

Clustering algorithms (for time series objects)

Algorithms for clustering time series can typically be divided into several categories. The two most common approaches are the following:

  1. Partitional clustering
  2. Hierarchical clustering

In partitional clustering, the number of clusters \(k\) must be specified in advance. Additionally, prototyping is required at each step – that is, a prototype is constructed to capture the important features of each cluster. Typically the prototype is a centroid, i.e. the (pointwise) average of all elements currently in that cluster.

At each step in the algorithm, all objects in the original set are matched to the cluster represented by their closest centroid, and the centroids are updated. Initial clusters/centroids can be chosen randomly from \(k\) of the original objects. The algorithm continues until some kind of convergence is achieved.

Each step of the algorithm requires computing the distance from every object to each centroid. Finding a globally optimal partition is computationally infeasible in general, so implementations rely on this kind of greedy algorithm, typically run several times with different initial centroids.
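As an illustration, a partitional run on the (reinterpolated) character data could look like the following sketch. The call follows the dtwclust() interface used later in this lecture, but the choices of \(k\), distance, prototype, and seed here are ours and purely illustrative; a window constraint could also be added, as in the distance computations above.

# Partitional clustering sketch: k clusters, DTW distance, PAM (medoid) prototypes
pc <- dtwclust(Chars, type = "partitional", k = 20,
               distance = "dtw_basic", centroid = "pam", seed = 123)

# cross-tabulate the resulting clusters against the true letters
table(pc@cluster, CharTrajLabels)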

Hierarchical clustering

The idea of hierarchical clustering is to construct a (binary) tree where the leaves are the objects one is clustering, and the children of each node are more similar to each other than to children of any other (non-descendant) node.

Hierarchical clustering can be done “divisively” (all objects start together and get split up) or “agglomeratively” (each object begins on a separate branch, then branches are merged). In the latter case, each step involves computing the similarity between all pairs of current clusters; the two most similar clusters are then merged, with the height of the merge determined by their dissimilarity. The process repeats until the root of the tree is reached. See e.g. Ch. 14 in Hastie et al. (2008)

Similarity between clusters is defined in one of several ways. For instance, single linkage takes the dissimilarity between clusters to be that of the closest pair. For a dissimilarity measure \(d\), this would be

\[ d_{SL}(C_1,C_2) = \underset{X \in C_1, Y \in C_2}{\min}d(X,Y).\]

Complete linkage takes the dissimilarity between clusters to be that of the furthest pair:

\[ d_{CL}(C_1,C_2) = \underset{X \in C_1, Y \in C_2}{\max}d(X,Y).\]
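As a toy illustration (with made-up distances, not the lecture data), the two linkage rules can be compared directly with base R's hclust() on the same dissimilarity matrix. Here both rules merge the same pairs, but at different heights, which changes where the tree would be cut.

# a small 4-object dissimilarity matrix chosen by hand
d_toy <- as.dist(matrix(c(0, 1, 4, 5,
                          1, 0, 3, 6,
                          4, 3, 0, 2,
                          5, 6, 2, 0),
                        nrow = 4, dimnames = list(letters[1:4], letters[1:4])))
par(mfrow = c(1, 2))
plot(hclust(d_toy, method = "single"),   main = "Single linkage")
plot(hclust(d_toy, method = "complete"), main = "Complete linkage")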

Data applications

Example 1: Character Trajectories

# Hierarchical clustering
hc_DTW <- dtwclust(CharTraj, type = "hierarchical", method="complete", k = 20, distance = "dtw")
hc_Euc <- dtwclust(Chars, type = "hierarchical", method="complete", k = 20, distance = "Euclidean")

hc_DTW2 <- dtwclust(CharTraj, type = "hierarchical", method="single", k = 20, distance = "dtw")
hc_Euc2 <- dtwclust(Chars, type = "hierarchical", method="single", k = 20, distance = "Euclidean")

# Results (dendrograms)
hc_DTW@cluster
## A.V1 A.V2 A.V3 A.V4 A.V5 B.V1 B.V2 B.V3 B.V4 B.V5 C.V1 C.V2 C.V3 C.V4 C.V5 
##    1    1    1    1    1    2    2    2    2    2    3    4    4    4    4 
## D.V1 D.V2 D.V3 D.V4 D.V5 E.V1 E.V2 E.V3 E.V4 E.V5 G.V1 G.V2 G.V3 G.V4 G.V5 
##    5    6    6    5    5    7    7    7    7    7    8    9    8    8    8 
## H.V1 H.V2 H.V3 H.V4 H.V5 L.V1 L.V2 L.V3 L.V4 L.V5 M.V1 M.V2 M.V3 M.V4 M.V5 
##    1    1    1    1    1    3    3    3    3    3   10   10   11   11   11 
## N.V1 N.V2 N.V3 N.V4 N.V5 O.V1 O.V2 O.V3 O.V4 O.V5 P.V1 P.V2 P.V3 P.V4 P.V5 
##   12   13   13   12   13    9    9    9    9    9   14    9    9   15   15 
## Q.V1 Q.V2 Q.V3 Q.V4 Q.V5 R.V1 R.V2 R.V3 R.V4 R.V5 S.V1 S.V2 S.V3 S.V4 S.V5 
##   15   15   15   15   15   16   16   16   16   16    9    9    9    9    9 
## U.V1 U.V2 U.V3 U.V4 U.V5 V.V1 V.V2 V.V3 V.V4 V.V5 W.V1 W.V2 W.V3 W.V4 W.V5 
##    1   13   13   13   13   17   17   17   17   17    1    1    1    1    1 
## Y.V1 Y.V2 Y.V3 Y.V4 Y.V5 Z.V1 Z.V2 Z.V3 Z.V4 Z.V5 
##   18   18   18   18   18   19   19   19   20   20
plot(hc_DTW,main="DTW, complete linkage")

plot(hc_DTW, type = "sc")

hc_Euc@cluster
## A.V1 A.V2 A.V3 A.V4 A.V5 B.V1 B.V2 B.V3 B.V4 B.V5 C.V1 C.V2 C.V3 C.V4 C.V5 
##    1    1    1    1    1    2    2    2    2    2    3    3    3    3    3 
## D.V1 D.V2 D.V3 D.V4 D.V5 E.V1 E.V2 E.V3 E.V4 E.V5 G.V1 G.V2 G.V3 G.V4 G.V5 
##    4    4    4    4    4    5    5    5    5    5    6    6    7    6    7 
## H.V1 H.V2 H.V3 H.V4 H.V5 L.V1 L.V2 L.V3 L.V4 L.V5 M.V1 M.V2 M.V3 M.V4 M.V5 
##    1    1    1    1    1    3    8    8    8    8    9   10   10   10   10 
## N.V1 N.V2 N.V3 N.V4 N.V5 O.V1 O.V2 O.V3 O.V4 O.V5 P.V1 P.V2 P.V3 P.V4 P.V5 
##   11   11   11   11   11    6    6    6    6    6    6    2    2    2    2 
## Q.V1 Q.V2 Q.V3 Q.V4 Q.V5 R.V1 R.V2 R.V3 R.V4 R.V5 S.V1 S.V2 S.V3 S.V4 S.V5 
##   12   12   12   12   13   14    9    9    9    9    6    6    6    6    6 
## U.V1 U.V2 U.V3 U.V4 U.V5 V.V1 V.V2 V.V3 V.V4 V.V5 W.V1 W.V2 W.V3 W.V4 W.V5 
##    1    1    1    1    1    9    9    9   14    9    9    9    9    9    9 
## Y.V1 Y.V2 Y.V3 Y.V4 Y.V5 Z.V1 Z.V2 Z.V3 Z.V4 Z.V5 
##   15   15   15   16   16   17   18   19   20   20
plot(hc_Euc,main="Euclidean, complete linkage")

hc_DTW2@cluster
## A.V1 A.V2 A.V3 A.V4 A.V5 B.V1 B.V2 B.V3 B.V4 B.V5 C.V1 C.V2 C.V3 C.V4 C.V5 
##    1    1    1    1    1    2    3    3    3    3    1    1    1    1    1 
## D.V1 D.V2 D.V3 D.V4 D.V5 E.V1 E.V2 E.V3 E.V4 E.V5 G.V1 G.V2 G.V3 G.V4 G.V5 
##    4    5    6    7    7    8    8    8    8    8    3    3    3    3    3 
## H.V1 H.V2 H.V3 H.V4 H.V5 L.V1 L.V2 L.V3 L.V4 L.V5 M.V1 M.V2 M.V3 M.V4 M.V5 
##    1    1    1    1    1    1    1    1    1    1    9   10   11   11   12 
## N.V1 N.V2 N.V3 N.V4 N.V5 O.V1 O.V2 O.V3 O.V4 O.V5 P.V1 P.V2 P.V3 P.V4 P.V5 
##   13    1    1    1    1    3    3    3    3    3   14    3    3   15   15 
## Q.V1 Q.V2 Q.V3 Q.V4 Q.V5 R.V1 R.V2 R.V3 R.V4 R.V5 S.V1 S.V2 S.V3 S.V4 S.V5 
##   16   16   16   16   16    1    1    1    1    1    3    3    3    3    3 
## U.V1 U.V2 U.V3 U.V4 U.V5 V.V1 V.V2 V.V3 V.V4 V.V5 W.V1 W.V2 W.V3 W.V4 W.V5 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## Y.V1 Y.V2 Y.V3 Y.V4 Y.V5 Z.V1 Z.V2 Z.V3 Z.V4 Z.V5 
##   17   17   17   17   17   18   18   19   20   20
plot(hc_DTW2,main="DTW, single linkage")

hc_Euc2@cluster
## A.V1 A.V2 A.V3 A.V4 A.V5 B.V1 B.V2 B.V3 B.V4 B.V5 C.V1 C.V2 C.V3 C.V4 C.V5 
##    1    1    1    1    1    1    1    1    1    1    2    2    2    2    2 
## D.V1 D.V2 D.V3 D.V4 D.V5 E.V1 E.V2 E.V3 E.V4 E.V5 G.V1 G.V2 G.V3 G.V4 G.V5 
##    3    3    3    3    3    4    4    4    4    4    5    5    6    5    6 
## H.V1 H.V2 H.V3 H.V4 H.V5 L.V1 L.V2 L.V3 L.V4 L.V5 M.V1 M.V2 M.V3 M.V4 M.V5 
##    1    1    1    1    1    2    2    2    2    2    7    8    8    8    8 
## N.V1 N.V2 N.V3 N.V4 N.V5 O.V1 O.V2 O.V3 O.V4 O.V5 P.V1 P.V2 P.V3 P.V4 P.V5 
##    9    9    9    9    9    5    5    5    5    5   10    1    1    1    1 
## Q.V1 Q.V2 Q.V3 Q.V4 Q.V5 R.V1 R.V2 R.V3 R.V4 R.V5 S.V1 S.V2 S.V3 S.V4 S.V5 
##   11   11   11   11   12   13    1    1    1    1    5    5    5    5    5 
## U.V1 U.V2 U.V3 U.V4 U.V5 V.V1 V.V2 V.V3 V.V4 V.V5 W.V1 W.V2 W.V3 W.V4 W.V5 
##    1    1    1    1    1    1    1    1   13    1    1    1    1    1    1 
## Y.V1 Y.V2 Y.V3 Y.V4 Y.V5 Z.V1 Z.V2 Z.V3 Z.V4 Z.V5 
##   14   14   14   15   15   16   17   18   19   20
plot(hc_Euc2,main="Euclidean, single linkage")

Example 2: Temperature and precipitation data

Here, we specify the number of clusters to correspond to the 10 climate divisions determined by the NCDC. As in Ravishanker et al. (2010), we have more success separating the western and eastern parts of the state, but near the coast it becomes more difficult to distinguish clusters.

# averaging M smallest for the final statistic

Q_star0 = apply(Q_st, 2, sort) 
Q_star0 = Q_star0[1:M,] 
Q_star = apply(Q_star0, 2, mean) 

# clustering
# turn the similarity Q* into a dissimilarity matrix
# (the entries of Q_star are ordered as the (q,p) pairs in the loop above)
pdist = 0*diag(P)
counter = 0
for (j in 1:(P-1)) {
  for (i in (j+1):P) {
    counter = counter + 1
    pdist[i,j] = (1-Q_star)[counter]
  }
}
pdist = pdist + t(pdist)


Z = hclust(as.dist(pdist),method="complete") ;
Z2 = hclust(as.dist(pdist),method="single") ;

#Cut the tree into 10 clusters
Zcut = cutree(Z, k=10)
Zcut2 = cutree(Z2, k=10)

## Compare to locations
longlat <- data.frame()
for (i in stat_un) {
  longlat <- rbind(longlat, c(long[STATION==i][1],lat[STATION==i][1]))
}
names(longlat) <- c("longitude","latitude")

## Get WA boundary lat/long (map_data() is in ggplot2 and requires the maps package)
library(ggplot2)
WA = map_data('state', region='Washington')

par(mfrow=c(2,2))
plot(longlat,pch='')
text(longlat$longitude,longlat$latitude,labels=Zcut)
points(WA$long,WA$lat,pch=19,cex=0.5,col="gray")
plot(as.dendrogram(Z),main="Clustered by Q*, Complete linkage")

plot(longlat,pch='')
text(longlat$longitude,longlat$latitude,labels=Zcut2)
points(WA$long,WA$lat,pch=19,cex=0.5,col="gray")
plot(as.dendrogram(Z2),main="Clustered by Q*, Single linkage")

Questions/Homework

Q0: Supplement your analysis of multivariate time series with clustering/classification if it seems appropriate.

Q1: In defining the DTW distance, a “symmetric2” step pattern is often used, in which moving along the diagonal costs double compared to moving horizontally or vertically. What is the logic behind this assumption?

Q2: In defining the \(Q^*\) distance, why is it meaningful to average the lowest values of the likelihood ratio statistics? Why not average the highest or average all of the values of the likelihood ratio statistics?

References

Aghabozorgi, S., A. S. Shirkhorshidi, and T. Y. Wah. 2015. “Time-Series Clustering–A Decade Review.” Information Systems 53: 16–38.

Conradsen, K., A. A. Nielsen, J. Schou, and H. Skriver. 2003. “A Test Statistic in the Complex Wishart Distribution and Its Application to Change Detection in Polarimetric SAR Data.” IEEE Transactions on Geoscience and Remote Sensing 41 (1): 4–19.

Giorgino, T. 2009. “Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.” Journal of Statistical Software 31 (7): 1–24.

Hastie, T., R. Tibshirani, and J. H. Friedman. 2008. The Elements of Statistical Learning. Springer.

Liao, T. W. 2005. “Clustering of Time Series Data—a Survey.” Pattern Recognition 38 (11): 1857–74.

Lichman, M. 2013. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

Ravishanker, N., J. R. M. Hosking, and J. Mukhopadhyay. 2010. “Spectrum-Based Comparison of Stationary Multivariate Time Series.” Methodology and Computing in Applied Probability 12 (4): 749–62.

Sardá-Espinosa, A. 2016. “Comparing Time-Series Clustering Algorithms in R Using the dtwclust Package.”