What is this all about?

Simply put, Hidden Markov Models (HMMs) are models in which your data is generated from an underlying Markov process. The theory of HMMs is incredibly rich and has been derived (often independently) in different areas of math and statistics. For example, Bayesian Networks and HMMs overlap quite a bit. These types of models have a plethora of different application areas including genetics, medicine, climate science/weather prediction, and finance. The literature on the theory of these models is incredibly dense. In addition, the computational techniques used to fit these models and make predictions can become quite complicated. One could (and people do) get a PhD in these topics. The pupose of this lecture is to just give you a basic feel for the way these models work and the settings in which they are useful so we will omit most of the mathematical and computational details, mentioning them where we feel appropriate.

Let’s get started!

State-Space Models

Consider a model with the following dependence structure:

I.e. each observation \(y_t\) is generated by an underlying (unobservable) state \(x_t\). State \(x_{t+1}\) is then dependent only on the previous state. Note that, conditional on \(x_{t+1}\), \(y_{t+1}\) is independent of all previous observations. Mathematically, we can write this model as follows: \[X_{t} = AX_{t-1}+W_{t}\] \[Y_{t} = CX_{t}+V_{t}\] where \(W_t\) and \(V_{t}\) are (usually independent) white noise processes. The first equation is known as the State-Space Equation while the second is called the Observation Equation. (Let’s think of some examples!) Note that \(A\) and \(C\) may be nonlinear operators. However, the computation complexity of estimation and inference becomes more complicated given nonlinearity. In addition, some of the most common models assume the white noise processes are Gaussian which immensely simplifies computation.

HMM Case

In the HMM case we assume that \(X_t\) satisfies the Markov property, i.e. \[P(X_t|X_{t-1}, X_{t-2},...,X_1) = P(X_t|X_{t-1})\] and \(X_t\) is causal variable to \(Y_t\). The observant student may notice that the way we have written our state space equation already implies that \(\{X_t\}\) satifies the Markov property. This is true and state space models are, indeed, HMMs. As was mentioned in the introduction these models have been derived in parallel in various different types of research and often can be referred to by many different names. Whether something is called a “Hidden Markov Model” is usually dependent on the background of the author. In the past, the majority of model referred to as HMMs have an underlying Markov chain with discrete state space. However, continuous state processes are becoming more and more in vogue due to the increase in computational power and advances in simulation techniques. Such models are often useful in Speech recognition, Robot localization, User attention, Medical monitoring, to name a few.

One can think of the simple example of a grad student who is stuck in the basement of Hanes Hall for a week at his basement office, and wants to know if it is rainy or sunny outside. The student has no windows so cannot see for himself if its raining but can observe whether his officemate has an umbrella or not. He decides to model his situation with an HMM. The state process \(X_t\) can take values in {rainy, sunny} and the observations \(Y_t\) can take values in {umbrella, no umbrella}. The student assumes that the underlying Markov chain \(\{X_t\}\) has the following transition probabilities:
The observations \(\{Y_t\}\) then have the following conditional probabilities:
Finally the student assumes that \(X_t\) has the following initial distribution:

Prediction, Filtering, and Smoothing

Consider the following conditional probability: \[P(X_{s:s'}|Y_{k:k'}) = P(X_s,\ldots,X_{s'}|Y_k,\ldots,Y_{k'}),\qquad s<s',k<k'\]

  • Prediction if \(k' < s\) (usually \(s=s'\) and \(k=1\))
  • Filtering if \(s = s' = k' = t\) (usually \(k=1\))
  • Smoothing if \(s' < k'\) and \(k < s\) (usually \(k=1\))

For example, consider the simple example above. Suppose the students wants to estimate the probability that it is currently raining. This would correspond to a filtering problem. In this case, given the observations, one can explicitly compute the corresponding filtered probability. This filtering procedure consists of 2 parts: Prediction and Bayesian update.

First, the prediction step is: \[P(X_s|Y_{1:s-1}) = \sum_{i=1}^sP(X_i|X_{i-1})P(X_{i-1}|Y_{1:i-1})\]

Second, the Bayesian update is: \[P(X_s|Y_{1:s}) = \frac{P(X_s|Y_s)P(X_s|Y_{1:s-1})}{\sum_{i=1}^s P(Y_s|X_s)P(X_t|Y_{1:s-1})}\]

By iterating this process for \(s=1,\ldots,t\) one can compute the filtered probability at time \(t\). This is called the Forward algorithm.

Suppose the student wishes to know if it will rain tomorrow or, perhaps, whether it rained each day this week. These would be referred to as Prediction and Smoothing problems, respectively. Using similar techniques to the one described above, one can obtain prediction and smoothing probabilities explicitly as well.

Quick Note on Estimation and Inference

Before we dive into some examples let’s take a minute to gather our thoughts. The first thing that you may ask yourself is “what is the goal of an HMM?”. Well, there are are several potential area’s of interest.

The first is parameter estimation. Suppose you are simply interested in knowing the structure behind your data. The second is forecasting. Suppose you have a sequence of observations \(\{Y_t\}\) and you wish to forecast what future observations will be. Finally you may be interested in estimating the latent variables \(\{X_t\}\) in your model. Your end goal will depend entirely on the data you have and the particular situation you are in. The nice thing is that HMMs are incredibly flexible.

In addition, there are a ton of ways to accomplish these goals, almost all of them computational. Models can be fit, and values/intervals of interest can be estimated using Maximum Likelihood, Expectation Maximization, Importance Sampling, Particle Filtering, and a variety Bayesian Methods (just to name a few). Each of these methods comes with its own advantages and disadvantages but the one thing in common is that they all become complicated very quickly so we will gloss over almost all of the details in the examples to come.

Example 1: Earthquakes

We consider an example from Zucchini and MacDonald (2009) and Douc, Moulines, and Stoffer (2014). Consider the following plot of number of major earthquakes (magnitude 7 or greater) in the world, 1900-2006. Below is a time-series plot and the ACF/PACF plots for the counts.

library(nltsa)
library(ggplot2)

data("EQcount")
par(mar=c(3,3,1,1), mgp=c(1.6,.6,0))
layout(matrix(c(1,2, 1,3), nc=2))
plot(EQcount, type='h')
acf(EQcount)
pacf(EQcount)

The simplest way to model this data might be to assume that each is a random draw from a Poisson distribution. We can already see that there is significant autocorrelation. In addition, we can see a good degree of over dispersion. I.e. the sample variance is much greater than the sample mean.

ggplot()+
  geom_bar(mapping = aes(x=EQcount, y = (..count..)/sum(..count..))) +
  geom_line(mapping = aes(x= 0:40,y=dpois(0:40,lambda = mean(EQcount))),colour = 'red') +
   ylab('Proportion of Earthquakes')

mean(EQcount)
## [1] 19.36449
var(EQcount)
## [1] 51.57344

Based on the time series data it appears that there may be periods with higher rates of earthquakes. Let’s fit a HMM! The underlying state process \(\{X_t, t \in\mathbb{N}\}\) is a two-state Markov chain. Then the obervations (counts ) \(\{Y_t,t\in\mathbb{N}\}\) are generated from a Poisson(\(\lambda_{X_t}\)). I.e. the mean of the observation distribution \(\lambda_{x},x\in\{1,2\}\) is dependent on the state process. The idea is that earthquakes occur more frequently in one of the states than the other (e.g. \(\lambda_1<\lambda_2\)). Now let’s first fit out model.

library(depmixS4)
## estimation ##
set.seed(90210)
model <- depmix(EQcount ~1, nstates=2, data=data.frame(EQcount), family=poisson())

(fm <- fit(model))   # fm contains results from estimation 
## iteration 0 logLik: -384.3701 
## iteration 5 logLik: -342.4233 
## iteration 10 logLik: -341.897 
## iteration 15 logLik: -341.8801 
## iteration 20 logLik: -341.8788 
## iteration 25 logLik: -341.8787 
## converged at iteration 26 with logLik: -341.8787
## Convergence info: Log likelihood converged to within tol. (relative change) 
## 'log Lik.' -341.8787 (df=5)
## AIC:  693.7574 
## BIC:  707.1216
u <-as.vector(getpars(fm))
if (u[7] <= u[8]) { para.mle = c(u[3:6], exp(u[7]), exp(u[8])) 
} else {  para.mle = c(u[6:3], exp(u[8]), exp(u[7])) 
}
mtrans = matrix(para.mle[1:4], byrow=TRUE, nrow=2)
lams = para.mle[5:6]   
lams
## [1] 15.41901 26.01445
mtrans
##           [,1]       [,2]
## [1,] 0.9283489 0.07165115
## [2,] 0.1189548 0.88104517
I.e. when in the first state, earthquakes occur at a rate of 15.4 per year while in the second state earthquakes occur at a rate of 26.0 per year (i.e. \(\lambda_1 = 15.4\) and \(\lambda_2=26.0\)). We can also see the estimated transition rate matrix is: \[\begin{equation} M = \left( \begin{matrix} M(1,1) & M(1,2)\\ M(2,1) & M(2,2) \end{matrix}\right) =\left( \begin{matrix} .93 & .07\\ .12 & .88 \end{matrix}\right) \end{equation}\]

We can now take a look at the ACF/PACF functions by simulating from this model:

# function to generate data
pois.HMM.generate_sample = function(n,m,lambda ,Mtrans,StatDist=NULL){
  # n = data length, m = # of states, Mtrans = transition matrix, StatDist = stationary distn 
  if(is.null(StatDist))StatDist =solve(t(diag(m)-Mtrans +1),rep(1,m))
  mvect = 1:m
  state = numeric(n)
  state [1] = sample(mvect ,1,prob=StatDist)
  for (i in 2:n)
    state[i]=sample(mvect ,1,prob=Mtrans[state[i-1] ,])
  y = rpois(n,lambda=lambda[state ])
  list(y= y, state= state)
}
nobs = length(EQcount)
x.star = pois.HMM.generate_sample(n=nobs, m=2, lambda=lams, Mtrans=mtrans)$y
ACFSim = acf(x.star, plot=FALSE)$acf
PACFSim = pacf(x.star, plot=FALSE)$acf
par(mar=c(3,3,1,1), mgp=c(1.6,.6,0))
layout(matrix(c(1,2, 1,3), nc=2))
plot(x.star, type='h')
acf(EQcount)
  points(0:20,ACFSim, type = 'h', col='red')
pacf(EQcount)
  points(PACFSim, type = 'h', col='red')

Althought it’s not perfect, we are fitting a very basic HMM and we are already capturing a lot of temporal dependance. In addition, we can see below that we are also capturing much of the overdispersion:

mean(x.star)
## [1] 21.1215
var(x.star)
## [1] 58.69265

Finally, we can take a look at our estimated states based on the smoothed probabilities:

pi1 = mtrans[2,1]/(2 - mtrans[1,1] - mtrans[2,2])
pi2 = 1 - pi1

## graphics ##
layout(matrix(c(1,2,1,3), 2))
par(mar=c(3,3,1,1), mgp=c(1.6,.6,0))

# data and states
plot(EQcount, main="", ylab='EQcount', type='h', col='#808080')
text(EQcount, col=posterior(fm)[,1], labels=posterior(fm)[,1], cex=.9)

# prob of state 2
plot(ts(posterior(fm)[,3], start=1900), ylab=expression(hat(pi)[~t]*' (2 | data)'))
abline(h=.5, lty=2)

# histogram
hist(EQcount, breaks=30, prob=TRUE, main="")
xvals = seq(1,45)
u1 = pi1*dpois(xvals, lams[1])
u2 = pi2*dpois(xvals, lams[2])
lines(xvals, u1, col=3)
lines(xvals, u2, col=4)

Now consider the sequence of residuals:

resid <- EQcount - lams[posterior(fm)[,1]]
layout(matrix(c(1,2,1,3), 2))
par(mar=c(3,3,1,1), mgp=c(1.6,.6,0))
plot(resid, main="", ylab='Residuals', type='h', col='#808080')
acf(resid)
pacf(resid)

Example 2: Influenza

We now present an example from Shumway and Stoffer (2010). The time series consists of monthly influenza deaths per 10,000 in the U.S. for the 11 years spanning 1968 to 1978.

library(astsa) 
plot.ts(flu)

The researchers are concerned with predicting when there are flu epidemics. This can be modeled using HMMs with two underlying states, epidemic vs. non-epidemic. Suppose that normally we can model influenza mortality (\(y_t\)) as follows: \[y_t = x_{t1}+x_{t3}+v_t\] where \(x_{t1}\) is an AR(2) processes representing seasonal fluctuations, \(x_{t3}\) represents a fixed trend component, and \(v_t\) is white noise with \(var(v_t)=\sigma^2_v\). More explicitly we write, \[x_{t1} = \alpha_1x_{t-1,1}+\alpha_2x_{t-2,1}+w_{t1}\] and \[x_{t3} = x_{t-1,3}+w_{t3}\] where \(w_{t1}\) is white noise with \(var(w_{t1})=\sigma^2_1\) and \(w_{t3}\) is NOT white noise and \(var(w_{t3})=0\) (this is a fancy way of saying \(w_{t3}\) is a constant).

Now suppose that in an epidemic there is a third component \(x_{t2}\) which represents the sharp rise in motality during an epidemic. We model this component as the AR(1) process \[x_{t2}=\beta_0+\beta_1 x_{t-1,2}+w_{t2}\] where \(w_{t2}\) is white noise with \(var(w_{t2})=\sigma^2_2\). Then, during an epidemic, we express our observation equation as \[y_t = x_{t1}+x_{t2}+x_{t3}+v_t.\]

Great! Now let’s write this model in its full state space representation. The state equation should look like this: \[x_t = \Phi x_{t-1} + \Upsilon u_t + w_t.\] TO THE BOARD!!!!

As we have shown: \[\begin{equation} x_t = \left( \begin{matrix} x_{t1}\\ x_{t-1,1}\\ x_{t2}\\ x_{t3} \end{matrix}\right),\quad \Phi = \left( \begin{matrix} \alpha_1 & \alpha_2 & 0 & 0\\ 0 & 1 & 0 & 0 \\ 0 & 0 & \beta_1 & 0\\ 0 & 0 & 0 & 1 \end{matrix}\right),\quad \Upsilon = \left( \begin{matrix} 0\\ 0\\ \beta_0\\ 0 \end{matrix}\right),\quad u_t=1,\quad w_t = \left( \begin{matrix} w_{t1}\\ 0\\ w_{t2}\\ 0 \end{matrix}\right) \end{equation}\] I.e. the state equation is: \[\begin{equation} \left( \begin{matrix} x_{t1}\\ x_{t-1,1}\\ x_{t2}\\ x_{t3} \end{matrix}\right)= \left( \begin{matrix} \alpha_1 & \alpha_2 & 0 & 0\\ 0 & 1 & 0 & 0 \\ 0 & 0 & \beta_1 & 0\\ 0 & 0 & 0 & 1 \end{matrix}\right) \left( \begin{matrix} x_{t-1,1}\\ x_{t-2,1}\\ x_{t-1,2}\\ x_{t-1,3} \end{matrix}\right)+ \left( \begin{matrix} 0\\ 0\\ \beta_0\\ 0 \end{matrix}\right)+ \left( \begin{matrix} w_{t1}\\ 0\\ w_{t2}\\ 0 \end{matrix}\right) \end{equation}\] Then the observation equation is simply \[y_t=A_tx_t+v_t\] where \(A_t = M_1 = [1,0,0,1]\) normally and \(A_t=M_2 = [1,0,1,1]\) during an epidemic. We will assume this sequence of states is a Markov Chain with transition probability matrix \[\begin{equation} \Pi = \left( \begin{matrix} .75 & .25 \\ .25 & .75 \end{matrix}\right) \end{equation}\]
M1 = as.matrix(cbind(1,0,0,1))  # obs matrix normal
M2 = as.matrix(cbind(1,0,1,1))  # obs matrix flu epi

There are now several goals ahead of us:

  1. Estimate our parameters \(\Theta = (\alpha_1,\alpha_2,\beta_0,\beta_1,\sigma_1,\sigma_2,\sigma_v)\). This is accomplished using computational techniques to maximize the likelihood function (we well not go into detail on this).

  2. Obtain one-step-ahead predictions for when there will be an epidemic. I.e. try to forecast, one month in advance, whether there will be an epidemic.

Step 1: Define your likelihood function. Because of theory, the log-likelihood is of the form: \[\begin{equation} \ln L_Y(\Theta) = \sum_{t=1}^n\ln\left(\sum_{j=1}^m\pi_j(t|t-1)f_j(t|t-1)\right) \end{equation}\]

Note that in this case \(m=2\) because there are two states of the system (normal and epidemic). \(\pi_j(t|t-1)\) will represent the one-step-ahead probabilities. I.e. \(\pi_2(t|t-1)\) represents the probability that the system will be in an epidemic at time \(t\) given data up to time \(t-1\). There is much more structure/theory going on behind the scenes than presented here but the important thing to keep in mind is that \(\pi_j(t|t-1)\) depends on \(\Pi\).

# Function to Calculate Likelihood
y = as.matrix(flu)
num = length(y)
nstate = 4;
prob = matrix(0,num,1); yp = y  # to store pi2(t|t-1) & y(t|t-1)
xfilter = array(0, dim=c(nstate,1,num)) # to store x(t|t)
Linn = function(para){
  alpha1=para[1]; alpha2=para[2]; beta0=para[3]      
  sQ1=para[4];  sQ2=para[5];  like=0
  xf=matrix(0, nstate, 1)  # x filter
  xp=matrix(0, nstate, 1)  # x predict
  Pf=diag(.1, nstate)      # filter covar
  Pp=diag(.1, nstate)      # predict covar
  pi11 <- .75 -> pi22;  pi12 <- .25 -> pi21; pif1 <- .5 -> pif2            
  phi=matrix(0,nstate,nstate)
  phi[1,1]=alpha1; phi[1,2]=alpha2; phi[2,1]=1; phi[4,4]=1 
  Ups = as.matrix(rbind(0,0,beta0,0))
  Q = matrix(0,nstate,nstate)
  Q[1,1]=sQ1^2; Q[3,3]=sQ2^2; R=0  # R=0 in final model
  # begin filtering 
  for(i in 1:num){
    xp = phi%*%xf + Ups; Pp = phi%*%Pf%*%t(phi) + Q
    sig1 = as.numeric(M1%*%Pp%*%t(M1) + R)
    sig2 = as.numeric(M2%*%Pp%*%t(M2) + R)
    k1 = Pp%*%t(M1)/sig1; k2 = Pp%*%t(M2)/sig2 
    e1 = y[i]-M1%*%xp; e2 = y[i]-M2%*%xp
    pip1 = pif1*pi11 + pif2*pi21; pip2 = pif1*pi12 + pif2*pi22;
    den1 = (1/sqrt(sig1))*exp(-.5*e1^2/sig1); 
    den2 = (1/sqrt(sig2))*exp(-.5*e2^2/sig2);
    denom = pip1*den1 + pip2*den2;
    pif1 = pip1*den1/denom; pif2 = pip2*den2/denom;
    pif1=as.numeric(pif1); pif2=as.numeric(pif2)
    e1=as.numeric(e1); e2=as.numeric(e2)
    xf = xp + pif1*k1*e1 + pif2*k2*e2
    eye = diag(1, nstate)
    Pf = pif1*(eye-k1%*%M1)%*%Pp + pif2*(eye-k2%*%M2)%*%Pp 
    like = like - log(pip1*den1 + pip2*den2)
    prob[i]<<-pip2; xfilter[,,i]<<-xf; innov.sig<<-c(sig1,sig2)
    yp[i]<<-ifelse(pip1 > pip2, M1%*%xp, M2%*%xp)  
  }    
  return(like)   
}

Step 2: Estimation. Spoiler alert: when the model was ran for the first time the parameters \(\beta_1\) and \(\sigma_v\) were insignificant so they were removed from the model. Note this means that during an epidemic there is only a level shift in the mortality rate and there is also no measurement error. The likelihood function is maximized using a quasi-Newton-Raphson method:

alpha1=1.4; alpha2=-.5; beta0=.3; sQ1=.1; sQ2=.1
init.par = c(alpha1, alpha2, beta0, sQ1, sQ2)

(est = optim(init.par, Linn, NULL, method="BFGS", hessian=TRUE, control=list(trace=1,REPORT=1)))
## initial  value -236.860522 
## iter   2 value -313.882451
## iter   3 value -320.373891
## iter   4 value -322.538581
## iter   5 value -326.037522
## iter   6 value -326.220434
## iter   7 value -328.165001
## iter   8 value -337.802054
## iter   9 value -339.102455
## iter  10 value -339.278478
## iter  11 value -339.295742
## iter  12 value -339.295935
## iter  13 value -339.295949
## iter  13 value -339.295949
## final  value -339.295949 
## converged
## $par
## [1]  1.40570967 -0.62198715  0.21049042  0.02310306  0.11217287
## 
## $value
## [1] -339.2959
## 
## $counts
## function gradient 
##       69       13 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##           [,1]       [,2]        [,3]       [,4]       [,5]
## [1,]  781.6937  779.66092   112.58312   1956.837  -32.89720
## [2,]  779.6609 1000.99490    82.68173   -107.393  -53.07502
## [3,]  112.5831   82.68173  1786.32460  -3798.730 -594.69751
## [4,] 1956.8370 -107.39297 -3798.72986 418827.665 7721.44658
## [5,]  -32.8972  -53.07502  -594.69751   7721.447 3892.96236
SE = sqrt(diag(solve(est$hessian)))    
u = cbind(estimate=est$par, SE)
rownames(u)=c("alpha1","alpha2","beta0","sQ1","sQ2"); 
u
##           estimate          SE
## alpha1  1.40570967 0.078587727
## alpha2 -0.62198715 0.068733109
## beta0   0.21049042 0.024625302
## sQ1     0.02310306 0.001635291
## sQ2     0.11217287 0.016684663

YAYAYAYAYAY!!!! We’ve fit out model! Note not only have we fit our parameters but also computed our one-step-ahead probabilities. In plot (a) below we have plotted the original data (solid-line) along with the estimated one-step-ahead probabilities. In (b) we round the probabilities to 1 or 0.

predepi = ifelse(prob<.5,0,1); k = 6:length(y)       
Time = time(flu)[k]

par(mfrow=c(2,1), mar=c(2,3,1,1)+.1, cex=.9)
plot(Time, y[k], type="o", ylim=c(0,1),ylab="")
lines(Time, prob[k],  lty="dashed", lwd=1.2)
text(1979,.75,"(a)")

plot(Time, y[k], type="o", ylim=c(0,1),ylab="")
lines(Time, predepi[k],  lty="dashed", lwd=1.2)
text(1979,.75,"(b)")

Below, the dotted lines show confidence intervals for the mortality rates. Note that these are one-step-ahead CI’s.

plot(Time, y[k], type="p", pch=1, ylim=c(.1,.9),ylab="")
prde1 = 2*sqrt(innov.sig[1]); prde2 = 2*sqrt(innov.sig[2])
prde = ifelse(predepi[k]<.5, prde1,prde2)
lines(Time, yp[k]+prde, lty=2, lwd=1.5)
lines(Time, yp[k]-prde, lty=2, lwd=1.5)

Finally we show a decomposition of the three filtered components \(x_{t1},x_{t2},x_{t3}\).

plot(Time, xfilter[1,,k], type="l", ylim=c(-.1,.4), ylab="")
lines(Time, xfilter[3,,k]); lines(Time, xfilter[4,,k])

Just for fun, we refitted the model with the transition probability as parameters. I’m omitting the code but the analogous output is as follows:

## initial  value -236.860522 
## iter   2 value -313.915875
## iter   3 value -320.395110
## iter   4 value -322.589054
## iter   5 value -326.300102
## iter   6 value -326.645930
## iter   7 value -339.498612
## iter   8 value -345.527929
## iter   9 value -350.415329
## iter  10 value -351.176512
## iter  11 value -351.756986
## iter  12 value -353.050129
## iter  13 value -353.113650
## iter  14 value -353.119098
## iter  15 value -353.120880
## iter  16 value -353.121849
## iter  17 value -353.122026
## iter  17 value -353.122026
## final  value -353.122026 
## converged
## $par
## [1]  1.36328558 -0.59576499  0.20542276  0.02326422  0.11489695  2.70259894
## [7]  1.04261253
## 
## $value
## [1] -353.122
## 
## $counts
## function gradient 
##       75       17 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##              [,1]       [,2]         [,3]         [,4]        [,5]
## [1,]  745.2237426 724.365135   -17.926333   1346.41611  155.193104
## [2,]  724.3651354 918.775224     4.365372     15.85268   68.345003
## [3,]  -17.9263332   4.365372  1765.663749  -1811.42447 -465.898955
## [4,] 1346.4161085  15.852675 -1811.424466 397867.00311 6339.956190
## [5,]  155.1931040  68.345003  -465.898955   6339.95619 3115.962601
## [6,]    4.3705429   1.717474     7.594773     37.15773   -4.151962
## [7,]    0.4882843  -6.508209   -10.149621    363.52564   13.917451
##            [,6]        [,7]
## [1,]  4.3705429   0.4882843
## [2,]  1.7174741  -6.5082092
## [3,]  7.5947732 -10.1496209
## [4,] 37.1577327 363.5256372
## [5,] -4.1519625  13.9174511
## [6,]  6.8156936  -0.1866078
## [7,] -0.1866078   3.8459440
##           estimate          SE
## alpha1  1.36328558 0.078832534
## alpha2 -0.59576499 0.070913541
## beta0   0.20542276 0.024454206
## sQ1     0.02326422 0.001685588
## sQ2     0.11489695 0.018704894
## pi11    2.70259894 0.386351399
## pi22    1.04261253 0.550438995
## [1] 0.9371798
## [1] 0.7393538

References

Douc, Randal, Eric Moulines, and David Stoffer. 2014. Nonlinear Time Series: Theory, Methods and Applications with R Examples. CRC Press.

Shumway, Robert, and David Stoffer. 2010. Time Series Analysis and Its Applications: With R Examples. Springer Science & Business Media.

Zucchini, Walter, and Iain MacDonald. 2009. Hidden Markov Models for Time Series: An Introduction Using R. Vol. 22. CRC press Boca Raton.