Nonstationary III: Wavelets

What is this all about?

Recall from the previous lecture the definition of WFT, \[ S_f(u,\xi) = \int_{\mathbb R} f(t) g_{(u,\xi)}(t) dt = \int_{\mathbb R} f(t) e^{-i\xi t} g(t-u) dt \] with atoms \(g_{(u,\xi)}(t) = e^{-i\xi t} g(t-u)\), and the definition of CWT, \[ W_f(u,s) = \int_{\mathbb R} f(t) g_{(u,s)}(t) dt = \int_{\mathbb R} f(t) \frac{1}{\sqrt{s}} \psi\Big( \frac{t-u}{s} \Big) dt \] with atoms \(g_{(u,s)}(t) = \frac{1}{\sqrt{s}} \psi\Big( \frac{t-u}{s} \Big)\). The information in WFT and CWT has lots of redundancy as discussed previously. Could it be that a countable collection \[ \{S_f(u_n,\xi_n)\}\quad \mbox{or}\quad \{W_f(u_n,s_n)\} \] provides as much information as the respective “continuous” collections? Could it be that the collection of atoms \(\{g_{(u_n,\xi_n)}\}\) or \(\{g_{(u_n,s_n)}\}\) forms a suitable, preferably orthogonal, basis?

Data/Theory goals of the lecture

We will see that this involves difficulties for WFT but also that things look rosier for CWT. In particular, we will discuss the so-called orthogonal wavelet basis and discrete wavelet transform (DWT). DWT will then be illustrated on several applications.

No less importantly, the introduction of wavelets has also provided a multiscale perspective in modeling (when a phenomenon is modeled across different scales).

Several \(\verb!R!\) packages focus on DWT and its statistical applications, including \(\verb!wavethresh!\) (associated with the textbook of Nason (2008)), \(\verb!wavelets!\), \(\verb!waveslim!\), and the previously used \(\verb!Rwave!\). Much of this lecture will use \(\verb!wavethresh!\).

Some inspiration

For more info, see the 2017 Abel prize announcement. Announced on March 21.

Discrete bases

To address the question raised above, the following notion is useful. Let \(\|f\|^2 = \int |f(t)|^2 dt\) and \(\langle f,g \rangle = \int f(t) \overline{g(t)} dt\).

Definition 3.1: A collection of functions \(\{\phi_n\}\) is called a frame if, for some constants \(0<A,B<\infty\) and any \(f\), \[ A \|f\|^2 \leq \sum |\langle f,\phi_n\rangle|^2 \leq B \|f\|^2. \] The constants \(A,B\) are called frame bounds. The case \(A=B\) is referred to as a tight frame. The case \(A>1\) (supposing \(\|\phi_n\|=1\)) is referred to as a redundant frame.

Some known useful facts

\(\{\phi_n\}\) is an orthonormal basis iff \(\{\phi_n\}\) is a frame with \(A=B=1\). A frame \(\{\phi_n\}\) with \(\|\phi_n\|=1\) is typically not linearly independent. If a frame is linearly independent, it is referred to as a Riesz basis. A linearly independent frame \(\{\phi_n\}\) with \(\|\phi_n\|=1\) implies \(A\leq 1\leq B\).
Key fact: If \(\{\phi_n\}\) is a frame with bounds \(A,B\), then \(\exists\) a “dual” frame \(\{\widetilde \phi_n\}\) with bounds \(1/B,1/A\) such that \[ f = \sum \langle f,\phi_n\rangle \widetilde \phi_n = \sum \langle f,\widetilde \phi_n\rangle \phi_n. \] In the case of linear independence, \(\{\phi_n\}\) and \(\{\widetilde \phi_n\}\) are biorthogonal, i.e. \(\langle \phi_p, \widetilde \phi_n\rangle = \delta_{p-n}\).

Example 3.1 (Windowed Fourier frames): Consider the atoms \(g_{(u,\xi)}(t) = e^{-i\xi t} g(t-u)\) of WFT as above. A natural discretization is to consider \(u = m u_0\), \(\xi = n \xi_0\) and hence the functions \[ h_{(m,n)}(t) = e^{-in\xi_0 t} g(t-mu_0),\quad m,n\in\mathbb{Z}. \] Their Heisenberg boxes are often represented as:

When is this a frame? Necessary and sufficient conditions are available, the form of a dual frame is known, etc.

Balian-Low Theorem (1980’s): If \(\{h_{m,n}\}\) is orthogonal, then \[ \int t^2 |g(t)|^2 dt = \infty\quad \mbox{or}\quad \int w^2 |\widehat g(w)|^2 dw = \infty. \] That is, there are no orthogonal windowed Fourier basis with good time-frequency localization.

Note: The case of the Gaussian window is discussed in detail in e.g. Daubechies (1992), pp. 84-87, and Mallat (1998), pp. 142-143.

Example 3.2 (Wavelet frames): Consider the atoms \(g_{(u,s)}(t) = \frac{1}{\sqrt{s}} \psi\Big( \frac{t-u}{s} \Big)\) of CWT as above. A natural discretization is to consider \(u = nu_0a^j\), \(s = a^j\) and hence the functions \[ h_{(n,j)}(t) = \frac{1}{\sqrt{a^j}} \psi\Big( \frac{t-nu_0a^j}{a^j} \Big)= \frac{1}{\sqrt{a^j}} \psi\Big( \frac{t}{a^j} -nu_0 \Big),\quad n,j\in\mathbb{Z}. \] Moreover, \(a=2\) (dyadic scales \(a^j = 2^j\)) and \(u_0=1\) are often taken. Their Heisenberg boxes are often represented as:

Note: Here, with \(a=2\), large positive \(j\) corresponds to large scales and low frequencies, and large negative \(j\) corresponds to small scales and small frequencies. This is the preferred convention in the applied literature. In mathematics, \(j\) is often replaced by \((-j)\).

Similarly to windowed Fourier frames above, necessary and sufficient conditions are available, the form of a dual frame is known, etc.

Analogue of Balian-Low Theorem: It does not hold. In fact, Y. Meyer (of Meyer (1992) and the Abel prize above) was trying to prove the analogue for WT frames but in the process came up with an orthogonal wavelet basis with good time-frequency localization. I. Daubechies (of Daubechies (1992)) later constructed such orthogonal wavelet bases with compact time supports.

Example of an orthogonal wavelet basis: Haar basis

Take \[ \psi(t) = \left\{ \begin{array}{cl} 1, & 0\leq t < \frac{1}{2},\\ -1, & \frac{1}{2}\leq t \leq 1,\\ 0, & \mbox{otherwise}. \end{array} \right. \] Then, \(\{2^{-j/2} \psi(2^{-j}t-n)\}\) is an orthonormal basis, so that \[ f(t) = \sum_{n\in\mathbb{Z}} \sum_{j\in\mathbb{Z}} d_{j,n} 2^{-j/2} \psi(2^{-j}t-n), \] where \[ d_{j,n} = \int f(t) 2^{-j/2} \psi(2^{-j}t-n) dt. \]

In fact, with \[ \phi(t) = \left\{ \begin{array}{cl} 1, & 0\leq t < 1,\\ 0, & \mbox{otherwise}, \end{array} \right. \] one can also write: for any \(J\), \[ f(t) = \sum_{n\in\mathbb{Z}} a_{J,n} 2^{-J/2} \phi(2^{-J}t-n) + \sum_{n\in\mathbb{Z}} \sum_{j=-\infty}^{J} d_{j,n} 2^{-j/2} \psi(2^{-j}t-n), \] where \[ a_{J,n} = \int f(t) 2^{-J/2} \phi(2^{-J}t-n) dt. \] Note also that \[ \frac{1}{\sqrt{2}} \psi(\frac{t}{2}) = \frac{1}{\sqrt{2}} \phi(t) - \frac{1}{\sqrt{2}} \phi(t-1),\quad \frac{1}{\sqrt{2}} \phi(\frac{t}{2}) = \frac{1}{\sqrt{2}} \phi(t) + \frac{1}{\sqrt{2}} \phi(t-1) \] and hence e.g. \[ d_{j+1,p} = \frac{1}{\sqrt{2}} a_{j,2p} - \frac{1}{\sqrt{2}} a_{j,2p+1},\quad a_{j+1,p} = \frac{1}{\sqrt{2}} a_{j,2p} + \frac{1}{\sqrt{2}} a_{j,2p+1}. \]

Orthogonal wavelet basis

There are other orthogonal wavelet bases than the Haar bases, many with nicer properties. All of them are similarly characterized by:

Wavelet function \(\psi\) (also known as “mother wavelet”; it looks like a small “wave”, with \(\int \psi = 0\))
Scaling function \(\phi\) (also known as “father wavelet”; it looks like a small “bump”, with \(\int \phi = C> 0\))

so that \(\{2^{-j/2} \psi(2^{-j}t-n), j,n\in\mathbb{Z}\}\) and \(\{2^{-J/2} \phi(2^{-J}t-n),2^{-j/2} \psi(2^{-j}t-n),j\leq J,n\in\mathbb{Z}\}\) are orthogornal bases, \[ f(t) = \sum_{n\in\mathbb{Z}} \sum_{j\in\mathbb{Z}} d_{j,n} 2^{-j/2} \psi(2^{-j}t-n) = \sum_{n\in\mathbb{Z}} a_{J,n} 2^{-J/2} \phi(2^{-J}t-n) + \sum_{n\in\mathbb{Z}} \sum_{j=-\infty}^{J} d_{j,n} 2^{-j/2} \psi(2^{-j}t-n), \] where \[ d_{j,n} = \int f(t) 2^{-j/2} \psi(2^{-j}t-n) dt,\quad a_{J,n} = \int f(t) 2^{-J/2} \phi(2^{-J}t-n) dt. \] The coefficients \(d_{j,n}\) are called detail (wavelet) coefficients at scale \(j\) (rather \(2^j\)) and the coefficients \(a_{J,n}\) are called approximation coefficients at scale \(j\) (rather \(2^j\)). A similar terminology applies to the detail and approximation terms in the expansion of \(f(t)\).

Moreover, there are sequences \(\{g_n\}\) and \(\{h_n\}\) such that \[ \frac{1}{\sqrt{2}} \psi(\frac{t}{2}) = \sum_n g_n \phi(t-n),\quad \frac{1}{\sqrt{2}} \phi(\frac{t}{2}) = \sum_n h_n \phi(t-n). \] The sequences \(\{g_n\}\) and \(\{h_n\}\) are known as conjugate mirror filters (CMFs), and also characterize an orthogonal wavelet basis. We also have \(g_n = (-1)^n \overline{h_{1-n}}\).

An important property of a wavelet is the number of zero moments, that is, \(Q\) such that \[ \int \psi(t) t^q dt = 0,\quad q=0,\ldots,Q-1. \] There are equivalent condition to express this in terms of CMFs.

Some known examples of orthogonal wavelet bases: Meyer; Daubechies; splines; …

Example of an orthogonal wavelet basis: Daubechies bases

Family of wavelets \(_{N}\psi\) with \(N\geq 1\) zero moments, and scaling function \(_{N}\phi\) such that \(\mbox{supp}\{_{N}\psi\} = [-N+1,N]\), \(\mbox{supp}\{_{N}\phi\} = [0,2N-1]\) (minimum sizes to have \(N\) zero moments). The forms of \(_{N}\psi\), \(_{N}\phi\) are NOT known explicitly, but rather characterized through numerical values of the corresponding CMF \(_{N}h_n\), with \(2N\) non-zero \(_{N}h_n\)’s. (The case \(N=1\) corresponds to Haar wavelets.)

These CMFs can be used to compute numerically the corresponding wavelets and scaling functions. Here are the plots of a few of them.

library(wavethresh)

N <- 3

filter.select(N, family="DaubExPhase")

## $H
## [1]  0.33267055  0.80689151  0.45987750 -0.13501102 -0.08544127  0.03522629
## 
## $G
## NULL
## 
## $name
## [1] "Daub cmpct on ext. phase N=3"
## 
## $family
## [1] "DaubExPhase"
## 
## $filter.number
## [1] 3

#support(filter.number=N, family="DaubExPhase")
draw(filter.number=N, family="DaubExPhase",scaling.function = TRUE,enhance = FALSE)

draw(filter.number=N, family="DaubExPhase",scaling.function = FALSE,enhance = FALSE)

Discrete wavelet transform (DWT)

One of the appealing features of orthogonal wavelets is the availability of fast algorithms that relate wavelet and detail coefficients across various scales. The algorithms are known as (fast) discrete wavelet transform (DWT).

At decomposition: \[ a_{j+1,p} = \sum_n h_{n-2p} a_{j,n} = (a_{j,\cdot} * h^{\vee})_{2p} \] or, in short, \[ a_{j+1} = \downarrow_2 (a_{j} * h^{\vee}), \] where \(*\) is the convolution operation, \(^\vee\) is the time reversion operation, and \(\downarrow_2\) is the downsampling by \(2\) operation (i.e. \((\downarrow_2 x)_n = x_{2n}\)). Similarly, \[ d_{j+1} = \downarrow_2 (a_{j} * g^{\vee}). \]

At reconstruction: \[ a_{j,p} = \sum_n h_{p-2n} a_{j+1,n} + \sum_n g_{p-2n} d_{j+1,n} = (h * \uparrow_2 a_{j+1,\cdot})_{p} + (g * \uparrow_2 d_{j+1,\cdot})_{p} \] or, in short, \[ a_{j} = h * \uparrow_2 a_{j+1} + g * \uparrow_2 d_{j+1}, \] where \(\uparrow_2\) is the upsampling by \(2\) operation (i.e. \((\uparrow_2 x)_n = x_p\) if \(n=2p\) and \(=0\) otherwise).

Decomposition and reconstruction steps as a perect reconstruction filter bank

Discrete data and “wavelet crime”

Approach 1: If data \(x_n\), \(n=1,\ldots,2^p\), is available, set \(a_{0,n} = x_n\), , \(n=1,\ldots,J\), from which the following quantities can be computed, after taking care of “boundary effects” in some way: \[ \begin{array}{ccl} d_{1,n} & : & n = 1,\ldots,2^{p-1}\\ d_{2,n} & : & n = 1,\ldots,2^{p-2}\\ \ldots & & \ldots \\ d_{p,n} & : & n = 1 (=2^{p-p}) \\ a_{p,n}& : & n = 1 \end{array} \] So that \(2^p\) points are transformed into \(2^p\) points through a linear, in fact orthogonal, transformation. This is what is usually plotted as the coefficients of DWT. In fact, the whole theory can be built just for discrete sequences using CMFs. For example, this is the approach taken by Percival and Walden (2000). Computational complexity (!): only \(O(2^p)\).

Approach 2: If the signal is thought as \(f(2^J n)\) for large negative \(J\) and smooth \(f\), note that (assuming \(\int \phi(u)du = 1\)) \[ f(2^J n) \approx \int_{\mathbb R} f(t) \frac{1}{2^J}\phi\Big( \frac{t-2^{J}n}{2^J} \Big) dt = 2^{-J/2} a_{J,n,} \] so that this is what is taken for the initialization of DWT. This is sometimes referred to as a “wavelet crime” in the wavelet literature.

Other approaches are taken as well, especially related to Approach 2.

Examples of DWT

Example 1: Signal with point singularity:

test.data <- example.1()$y
length(test.data)

## [1] 512

log2(length(test.data))

## [1] 9

plot(test.data,type="l")

wds <- wd(test.data,filter.number=10,family="DaubLeAsymm",type="wavelet",bc="periodic")
plot(wds)

## [1] 6.287696 6.287696 6.287696 6.287696 6.287696 6.287696 6.287696 6.287696
## [9] 6.287696

wds2 <- wd(test.data,filter.number=10,family="DaubLeAsymm",type="station",bc="periodic")
plot(wds2)

## [1] 6.489179 6.489179 6.489179 6.489179 6.489179 6.489179 6.489179 6.489179
## [9] 6.489179

library(Rwave)

isize <- length(test.data)
noctave <- 5
nvoice <- 10
pp <- noctave * nvoice
cwt2 <- cwt(test.data, noctave, nvoice, plot=FALSE)
image(1:isize, seq(0, pp, length = pp), Mod(cwt2), xlab = "Time", ylab = "log(scale)",main = "Wavelet Transform Modulus",col = grey(seq(0, 1, length = 256)))

Example 2: Chirp:

test.chirp <- simchirp()$y
length(test.chirp)

## [1] 1024

log2(length(test.chirp))

## [1] 10

plot(test.chirp, main="Simulated chirp signal",type="l")

chirpwds <- wd(test.chirp,filter.number=8,family="DaubLeAsymm",type="wavelet",bc="periodic")
plot(chirpwds)

##  [1] 9.521105 9.521105 9.521105 9.521105 9.521105 9.521105 9.521105
##  [8] 9.521105 9.521105 9.521105

chirpwds2 <- wd(test.chirp,filter.number=8,family="DaubLeAsymm",type="station",bc="periodic")
plot(chirpwds2)

##  [1] 12.05649 12.05649 12.05649 12.05649 12.05649 12.05649 12.05649
##  [8] 12.05649 12.05649 12.05649

isize <- length(test.chirp)
noctave <- 5
nvoice <- 10
pp <- noctave * nvoice
chirpcwt <- cwt(test.chirp, noctave, nvoice, plot=FALSE)
image(1:isize, seq(0, pp, length = pp), Mod(chirpcwt), xlab = "Time", ylab = "log(scale)",main = "Wavelet Transform Modulus",col = grey(seq(0, 1, length = 256)))

Application 1: Denoising (wavelet shrinkage) and compression

The setting is \[ y_t = g\Big( \frac{t}{T} \Big) + \epsilon_t,\quad t=1,\ldots,T, \] where \(\{\epsilon_t\} \sim \mbox{IID}(0,\sigma^2)\), normally distributed (though these can be relaxed with suitable modifications).

Compute DWT of the data, “shrink” the wavelet coefficients and take the inverse DWT for an estimate of \(g\). The hard and soft thresholding rules are based on: \[ \eta_H(d,\lambda) = d 1_{\{ |d|>\lambda\} },\quad \eta_S(d,\lambda) = \mbox{sign}(d) (|d|-\lambda) 1_{\{ |d|>\lambda\} }, \] where \(\lambda\) is a threshold.

Universal thresholding: \(\lambda^u = \sigma \sqrt{2\log T}\) (in practice, \(\sigma\) is estimated); SURE (Stein unbiased risk estimator), Cross validation: see Nason (2008), pp. 96-99.

See e.g. Nason (2008), Sections 6-7 for much more information (density estimation, etc).

# Nason (2008), pp. 90-95

v <- DJ.EX()
x <- (1:1024)/1024 # Define X coordinates too
plot(x, v$bumps, type="l", ylab="Bumps")

ssig <- sd(v$bumps) # Bumps sd
SNR <- 2 # Fix our SNR

sigma <- ssig/SNR
e <- rnorm(1024, mean=0, sd=sigma)
y <- v$bumps + e
plot(x, y, type="l", ylab="Noisy bumps")

#
# Plot wd of bumps
#

xlv <- seq(from=0, to=1.0, by=0.2)
bumpswd <- wd(v$bumps)
plot(bumpswd, main="", sub="", xlabvals=xlv*512, xlabchars=as.character(xlv),xlab="x")

##  [1] 73.4135 73.4135 73.4135 73.4135 73.4135 73.4135 73.4135 73.4135
##  [9] 73.4135 73.4135

#
# Plot wd of noisy bumps for comparison
#
ywd <- wd(y)
plot(ywd, main="", sub="",xlabvals=xlv*512, xlabchars=as.character(xlv),xlab="x")

##  [1] 75.66155 75.66155 75.66155 75.66155 75.66155 75.66155 75.66155
##  [8] 75.66155 75.66155 75.66155

FineCoefs <- accessD(ywd, lev=nlevelsWT(ywd)-1)
sigma <- mad(FineCoefs)
utDJ <- sigma*sqrt(2*log(1024))

ywdT <- threshold(ywd, policy="manual", value=utDJ)
ywr <- wr(ywdT)
plot(x, ywr, type="l")
lines(x, v$bumps, lty=2)

ywdcvT <- threshold(ywd, policy="cv", dev=madmad)
ywrcv <- wr(ywdcvT)
plot(x, ywrcv, type="l", xlab="x", ylab="Cross-val.Estimate") 
lines(x, v$bumps, lty=2)

The same principle can be used in e.g. image compression:

data(lennon)
# 256 x 256 image
image(lennon,col = grey(seq(0, 1, length = 256)))

lwd <- imwd(lennon, filter.number=6)
length(lwd$w7L2[lwd$w7L2!=0])

## [1] 16384

plot(lwd,col = grey(seq(0, 1, length = 256)))

lwdT <- threshold.imwd(lwd)
length(lwdT$w7L2[lwdT$w7L2!=0])

## [1] 3

plot(lwdT,col = grey(seq(0, 1, length = 256)))

lennon2 <- imwr(lwdT,filter.number=6)
image(lennon2,col = grey(seq(0, 1, length = 256)))

Application 2: Quantifying smoothness (roughness)

Recall that \(f(t)\) is \(h\)-Holder continuous if \[ |f(t) - f(u)| \simeq |t-u|^h. \] This implies that (as \(j\to-\infty\) or at small scales) \[ |2^{-j/2} d_{j,k} | \simeq 2^{jh}. \] (Why?)

Focus on the interval \(t,u\in[0,1]\) and allow \(h\) to vary with time, i.e. \(h=h(t)\). Let \[ N^j(h) = \# \{k = 1,\ldots,2^{-j}: |2^{-j/2} d_{j,k} | \simeq 2^{jh}\} \] Suppose that (as \(j\to-\infty\) or at small scales) \[ N^j(h) \simeq 2^{-jD(h)}. \] For fixed \(h\), the value \(D(h)\) is related to various fractal dimensions of the sets \(N^j(h)\) and the plot of \(D(h)\) as a function of \(h\) as multifractal spectrum of the signal \(f(t)\).

Now define the so-called structure function \(S_j(q)\) for real \(q\) (usually positive) as \[ S_j(q) = \sum_{k=1}^{2^{-j}} |2^{-j/2} d_{j,k} |^q. \] Then (as \(j\to-\infty\) or at small scales) \[ S_j(q) \simeq 2^{j \tau(q)} \] where \[ \tau(q) = \inf_h \{qh - D(h)\} =: D^*(q). \]

(Why?) The transformation \(D^*(q)\) is known as the Legendre transform of \(D\), and under mild assumptions \[ \tau^*(h) = (D^*)^*(h) = D(h). \] (In fact, under mild assumptions, one can make a converse argument.) In practice, \(\tau(q)\) is estimated from the structure function of the signal wavelet coefficients, and the numeric Legendre transform is computed to get \(D(h)\). The spectrum \(D(h)\) is then used as a measure of smoothness (roughness) of the signal.

For example, Abry et al. (2013) use these measures to identify forgeries/copies of paintings. For example, the following are two paintings of Charlotte Casper, first the original and then a copy produced several weeks later. Source: Machine Learning and Image Processing for Art Investigation Research Group.

Original:

Copy:

Abry et al. (2013) measure smoothness of the corresponding smaller patches of the two paintings, and consistently find the copy to be “smoother” than the original. For example, the following plots are the multifractal spectra of the indicated patches of the paintings.

Some final notes

The presented material is the tip of the iceberg.

Questions/Homework

Q0: Try some of the considered wavelet methods on real time series.

Q1: Summarize what exactly is known about the windowed Fourier and wavelet frames.

Q2: Describe the construction of the Meyer wavelets.

Q3: Write an R (or other software) function to compute approximately the Daubechies wavelet and scaling functions, and to plot them.

Q4: Explain the various choices of the threshold in wavelet shrinkage, and the differences among them.

References

Abry, P., H. Wendt, and S. Jaffard. 2013. “When van Gogh Meets Mandelbrot: Multifractal Classification of Painting’s Texture.” Signal Processing 93 (3): 554–72.

Daubechies, I. 1992. Ten Lectures on Wavelets. Vol. 61. CBMS-Nsf Regional Conference Series in Applied Mathematics. Society for Industrial; Applied Mathematics (SIAM), Philadelphia, PA.

Mallat, S. 1998. A Wavelet Tour of Signal Processing. Academic Press, Inc., San Diego, CA.

Meyer, Y. 1992. Wavelets and Operators. Vol. 37. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge.

Nason, G. P. 2008. Wavelet Methods in Statistics with R. Use R! Springer, New York.

Percival, D. B., and A. T. Walden. 2000. Wavelet Methods for Time Series Analysis. Vol. 4. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.