**Purpose of this blog entry**

During my research on CS, one problem is that I have read several papers related on CS, but different paper have different notations and explanations for CS. Besides, they also propose some new stuff in the paper. So it takes me time to get to know the inner connections among these gorgeous references. As a result, I'd like to give a summary, according to my understanding, for the very basics of CS. Hope it will help you in your research.

**What can CS do?**

Well, intuitively speaking, CS is a way to sample and compress signals. You can use CS as a method to transmit or store a signal. One most immediate benefit of CS is that it samples and compresses signals at one time, thus reduces the complexity of encoder.

**Why we need CS?**

In following scenarios, we may turn to CS.

1) require simple encoder

2) require less bandwidth

3) sample high frequency component of signals, which cannot be sampled using traditional ways, e.g. MRI.

*(Due to the limited knowledge of me, there are lots of scenarios omitted. If you have any suggestions, please be kindly inform me.)*

Of course, if you want, you can use CS in whatever scenario that requires sample/store/transmit a signal. And the CS theory provides ways to reconstruct the sampled signals.

**Signals**

Please be advised that, some definitions used in this blog entry may differ from that in some papers. I use my definitions here because according to my understanding, it's better and easier for you to understand the connections between papers.

Let a signal $\mathbf{x} \in \mathbb{R}^N$ has its projection on an orthonormal basis $\mathbf{\Psi} \in \mathbb{R}^{N \times N}$ be $\mathbf{\theta} \in \mathbb{R}^N$, i.e., $\mathbf{\theta} = \mathbf{\Psi}^T \mathbf{x}$.

- $S$-sparse signal: $\mathbf{x}$ is called $S$-sparse in $\mathbf{\Psi}$ or has sparsity $S$ in $\mathbf{\Psi}$ if $\mathbf{\theta}$ has only $S$ nonzero entries.
- $C$-compressible signal: $\mathbf{x}$ is called $C$-compressible in $\mathbf{\Psi}$ or has compressibility $C$ in $\mathbf{\Psi}$ if $\mathbf{\theta}$ obeys a power law decay with an exponent of $-C-1/2$.
- power law decay: $\mathbf{\theta}$ obeys a power law decay if $|\theta|_{(i)} \leq R \cdot i^{-1/p}$, where $R > 0$ is a constant, $0 < p < \infty$, and $|\theta|_{(i)}$ is the $i$th largest magnitude entry, i.e., $|\theta|_{(1)} \geq |\theta|_{(2)} \geq \cdots \geq |\theta|_{(N)}$.

Then $C = 1/p - 1/2$ for compressible signals.

Mostly, $C$-compressible signals consider only the case $0 < p < 1$, i.e., $C > 1/2$.It should be noticed that $\mathbf{\theta}$ is also an $S$-sparse or $C$-compressible in $\mathbf{\Psi} = \mathbf{I}$, where $\mathbf{I}$ is an identity basis. Thus, to be precise,

**we need to consider/indicate in which basis the signal has what sparsity or compressibility**. For example, a signal may have sparsity 10 in Fourier basis, but may have sparsity 20 in DCT basis. Some papers would not indicate the sparsity

**Best $J$-term approximation**

To approximate a signal $\mathbf{\theta}$, we can use the largest $J$ terms (in magnitude) in it, i.e., $\mathbf{\theta}_J$. Therefore,

- if $\mathbf{\theta}$ is $S$-sparse in $\mathbf{I}$, then the best $J$-term approximation would have an approximation error equal to 0 if $J \geq S$, i.e., $\mathbf{\theta}_J = \mathbf{\theta}$.
- if $\mathbf{\theta}$ is $C$-compressible in $\mathbf{I}$, then the best $S$-term approximation would have an approximation error equal $||\mathbf{\theta} - \mathbf{\theta}_J||_1 \leq G_C \cdot R \cdot J^{-C+1/2}$ and $||\mathbf{\theta} - \mathbf{\theta}_J||_2 \leq G_C \cdot R \cdot J^{-C}$, where $G_C$ is a constant only depend on $C$.

Then, we also say a best $J$-term approximation of $\mathbf{x}$ in $\mathbf{\Psi}$ is $\mathbf{x}_J$, i.e., $\mathbf{\theta}_J = \mathbf{\Psi}^T \mathbf{x}_J$ is the best $J$-term approximation of $\mathbf{\theta = \Psi}^T \mathbf{x}$ in $\mathbf{I}$. Similarly, we need to consider two attributes when saying a best $J$-term approximation: 1)$J$ and 2)$\mathbf{\Psi}$.

Another thing should be noticed is that since $\mathbf{\Psi}$ is an orthonormal basis, $||\mathbf{x} - \mathbf{x}_J||_2 = ||\mathbf{\theta} - \mathbf{\theta}_J||_2 \leq G_C \cdot R \cdot J^{-C}$, i.e., the MSE has a power law decay. With $J$ increasing, the MSE would decrease.

Generally speaking, CS samples a signal $\mathbf{x}$ using $\mathbf{y = \Phi x = \Phi \Psi \theta = A \theta}$.

The foundation of CS theory is to prove that in this sample-reconstruct process, the information in $\mathbf{x}$ can be preserved, and thus $\mathbf{x}$ can be well reconstructed.

In the following demonstration, I will not show you how the theory evolves, rather than how to prove the result. If you like, you can refer to papers I cite to see the details. And a tip for your reading is that please notice what the choice of $\mathbf{A, \Phi, \Psi}$ in CS.

At the very beginning, the idea of CS is presented in [1]. In that paper, it proved that for an $S$-sparse signal in Fourier basis, using CS, the reconstruction would be exact with an overwhelming probability at least $1 - O(N^{-\alpha})$ provided that $K \geq G_\alpha S \log{N}$, where $G_\alpha > 0$ is a constant only depend on the accuracy parameter $\alpha$. That is to say, for a given desired probability of success, a

The choice of $\mathbf{\Psi}$ is Fourier basis. The choice of $\mathbf{A}$ is a random selecting matrix, i.e., randomly selecting $K$ elements from $\mathbf{\theta}$, e.g. $\mathbf{A}$ = [0 1 0 0 0; 0 0 1 0 0] which extracts the second and third elements of $\mathbf{\theta}$. Thus $\mathbf{\Phi} = \mathbf{A \Psi}^T$ can be generated accordingly.

It can be seen that if a signal is $S$-sparse in another basis $\mathbf{\Psi}$, CS can still reconstruct it using this way to generate $\mathbf{A}$ and $\mathbf{\Phi}$ by knowing $\mathbf{\Psi}$.

Furthermore, in [4],they extended these results and showed that exact reconstruction hold for other $(\mathbf{\Phi, \Psi})$ pairs.

Then in [2], it began to consider a $C$-compressible signal. It showed that using CS, the reconstruction is optimal with an overwhelming probability at least $1 - O(N^{-\rho / \alpha})$. By optimal, it means that it is generally impossible to obtain a higher accuracy from any set of $K$ measurements whatsover. The reconstruction error obeys $||\hat{\mathbf{x}} - \mathbf{x}||_2 = ||\hat{\mathbf{\theta}} - \mathbf{\theta}||_2 \leq G_{C, \alpha} \cdot R \cdot (K / \log{N})^{-C}$, where $\hat{\mathbf{x}}$ and $\hat{\mathbf{\theta}}$ is the reconstructed signals, $G_{C, \alpha}$ is a constant depending on $C$ and $\alpha$. This indeed says that if one make $K = O(J\log{N})$ measurements using CS, one still obtains a reconstruction error equally as good as using best $J$-term approximation. In other words, the following two cases have similar reconstruction error: 1) knowing everything about $\mathbf{\theta}$ and selecting $J$ largest entries of it, 2) reconstructing $\mathbf{\theta}$ using $K = O(J\log{N})$ measurements without any prior knowledge on it.

In [2], the choice of $\mathbf{\Psi}$ is can be any orthonormal basis and it provided three ways to generate $\mathbf{A}$: 1) Gaussian ensembles; 2) Binary ensembles; 3) Fourier ensembles.

Before goes go, I'd like to introduce the so-called restricted isometry property (RIP). For each integer $k = 1, 2, \dots$, define the $k$-restricted isometry constant $\delta_k$ of a matrix $\mathbf{A}$ as the

Then a stronger result is introduced in [5] and [6] (and stated in [3]), which says that, if the $k$-restricted isometry constant $\delta_k$ of $\mathbf{A}$ satisfies certain constraints (with different $k$), e.g. $\delta_{2S} \leq \sqrt{2} - 1$ (several other constraints are stated in [5] and [6]), then the reconstructed signal $\hat{\mathbf{\theta}}$ obeys $||\hat{\mathbf{\theta}} - \mathbf{\theta}||_2 \leq G_0 \cdot ||\mathbf{\theta} - \mathbf{\theta}_S||_1 / \sqrt{S}$ and $||\hat{\mathbf{\theta}} - \mathbf{\theta}||_1 \leq G_0 \cdot ||\mathbf{\theta} - \mathbf{\theta}_S||_1$ for some constant $G_0$, where $\mathbf{\theta}_S$ is the best $S$-term approximation of $\mathbf{\theta}$ in $\mathbf{I}$. This is deterministic, without any probability of success.

To go a step further, you will see that the result is in consistent with previous results. If you replace $||\mathbf{\theta} - \mathbf{\theta}_S||_1$ using the result in "Best $J$-term approximation", you will see the magic.

Then the question comes, how to generate a sensing matrix $\mathbf{A}$ satisfies the requirement of RIP?

There are lots of ways to generate a sensing matrix $\mathbf{A}$. I'd rather let you refer section "RANDOM SENSING" in [3] for a review. Here I just want to talk about two most common ways: 1) from $\mathbf{A}$ by sampling $N$ column vectors uniformly at random on the unit sphere of $\mathbb{R}^K$; 2) from $\mathbf{A}$ by sampling i.i.d. entries from the normal distribution with mean 0 and variance 1/K. With an overwhelming probability, these matrices obey the RIP constraints (e.g. $\delta_{2S} \leq \sqrt{2} - 1$) provided that $K \geq G \cdot S \log{(N/S)}$, where $G$ is some constant depending on each instance.

Another interesting property is that the RIP constraints can also hold for sensing matrix $\mathbf{A = \Phi \Psi}$ where $\mathbf{\Psi}$ is arbitrary orthonormal basis and $\mathbf{\Phi}$ is an $K \times N$ measurement matrix drawn randomly from a suitable distribution, e.g., with an overwhelming probability, provided that $K \geq G \cdot S \log{(N/S)}$, $\mathbf{A}$ obeys the RIP constraints. This makes CS sampling easier, because we only need to generate a measurement matrix $\mathbf{\Phi}$ without knowing $\mathbf{\Psi}$ at the encoder side.

**How CS works?**Generally speaking, CS samples a signal $\mathbf{x}$ using $\mathbf{y = \Phi x = \Phi \Psi \theta = A \theta}$.

- $\mathbf{A} \in \mathbb{R}^{K \times N}$: sensing matrix, $\mathbf{A = \Phi \Psi}$.
- $\mathbf{\Phi} \in \mathbb{R}^{K \times N}$: measurement matrix

*(I was ever confused the sensing matrix with measurement matrix...)*
In this way, the signal is compressed in $\mathbf{y} \in \mathbb{R}^K$.

Reconstructing the signal from $\mathbf{y}$ can be achieved by solving a convex optimization problem, i.e., $\min ||\mathbf{\theta}||_1$ s.t. $\mathbf{y = \Phi \theta}$ by knowing $\mathbf{y}$ and $\mathbf{\Phi}$, and then using $\mathbf{x = \Psi\theta}$ to get $\mathbf{x}$ back. In other words, you can solve $\min ||\mathbf{x}||_1$ s.t. $\mathbf{y = A x}$ by knowing $\mathbf{y}$ and $\mathbf{A}$. They are equivalent.

The foundation of CS theory is to prove that in this sample-reconstruct process, the information in $\mathbf{x}$ can be preserved, and thus $\mathbf{x}$ can be well reconstructed.

In the following demonstration, I will not show you how the theory evolves, rather than how to prove the result. If you like, you can refer to papers I cite to see the details. And a tip for your reading is that please notice what the choice of $\mathbf{A, \Phi, \Psi}$ in CS.

At the very beginning, the idea of CS is presented in [1]. In that paper, it proved that for an $S$-sparse signal in Fourier basis, using CS, the reconstruction would be exact with an overwhelming probability at least $1 - O(N^{-\alpha})$ provided that $K \geq G_\alpha S \log{N}$, where $G_\alpha > 0$ is a constant only depend on the accuracy parameter $\alpha$. That is to say, for a given desired probability of success, a

**sufficient condition**for exactly reconstruction is $K \geq G_\alpha S \log{N}$. It should be noticed that in some paper, it also gives a necessary condition with a similar result, please do not mix them up.The choice of $\mathbf{\Psi}$ is Fourier basis. The choice of $\mathbf{A}$ is a random selecting matrix, i.e., randomly selecting $K$ elements from $\mathbf{\theta}$, e.g. $\mathbf{A}$ = [0 1 0 0 0; 0 0 1 0 0] which extracts the second and third elements of $\mathbf{\theta}$. Thus $\mathbf{\Phi} = \mathbf{A \Psi}^T$ can be generated accordingly.

It can be seen that if a signal is $S$-sparse in another basis $\mathbf{\Psi}$, CS can still reconstruct it using this way to generate $\mathbf{A}$ and $\mathbf{\Phi}$ by knowing $\mathbf{\Psi}$.

Furthermore, in [4],they extended these results and showed that exact reconstruction hold for other $(\mathbf{\Phi, \Psi})$ pairs.

Then in [2], it began to consider a $C$-compressible signal. It showed that using CS, the reconstruction is optimal with an overwhelming probability at least $1 - O(N^{-\rho / \alpha})$. By optimal, it means that it is generally impossible to obtain a higher accuracy from any set of $K$ measurements whatsover. The reconstruction error obeys $||\hat{\mathbf{x}} - \mathbf{x}||_2 = ||\hat{\mathbf{\theta}} - \mathbf{\theta}||_2 \leq G_{C, \alpha} \cdot R \cdot (K / \log{N})^{-C}$, where $\hat{\mathbf{x}}$ and $\hat{\mathbf{\theta}}$ is the reconstructed signals, $G_{C, \alpha}$ is a constant depending on $C$ and $\alpha$. This indeed says that if one make $K = O(J\log{N})$ measurements using CS, one still obtains a reconstruction error equally as good as using best $J$-term approximation. In other words, the following two cases have similar reconstruction error: 1) knowing everything about $\mathbf{\theta}$ and selecting $J$ largest entries of it, 2) reconstructing $\mathbf{\theta}$ using $K = O(J\log{N})$ measurements without any prior knowledge on it.

In [2], the choice of $\mathbf{\Psi}$ is can be any orthonormal basis and it provided three ways to generate $\mathbf{A}$: 1) Gaussian ensembles; 2) Binary ensembles; 3) Fourier ensembles.

Before goes go, I'd like to introduce the so-called restricted isometry property (RIP). For each integer $k = 1, 2, \dots$, define the $k$-restricted isometry constant $\delta_k$ of a matrix $\mathbf{A}$ as the

**smallest**number such that $(1-\delta_k)||\mathbf{\theta}||_2^2 \leq ||\mathbf{A \theta}||_2^2 \leq (1 + \delta_k) ||\mathbf{\theta}||_2^2$ holds for all $S$-sparse signal vector $\mathbf{\theta}$ in $I$ with $S \leq k$. Therefore, $\mathbf{A}$ should have more than one restricted isometry constant since we can choose any $k$. Usually, we say a matrix $\mathbf{A}$ obeys RIP of order $S$ if $\delta_S$ is not too close to 1.Then a stronger result is introduced in [5] and [6] (and stated in [3]), which says that, if the $k$-restricted isometry constant $\delta_k$ of $\mathbf{A}$ satisfies certain constraints (with different $k$), e.g. $\delta_{2S} \leq \sqrt{2} - 1$ (several other constraints are stated in [5] and [6]), then the reconstructed signal $\hat{\mathbf{\theta}}$ obeys $||\hat{\mathbf{\theta}} - \mathbf{\theta}||_2 \leq G_0 \cdot ||\mathbf{\theta} - \mathbf{\theta}_S||_1 / \sqrt{S}$ and $||\hat{\mathbf{\theta}} - \mathbf{\theta}||_1 \leq G_0 \cdot ||\mathbf{\theta} - \mathbf{\theta}_S||_1$ for some constant $G_0$, where $\mathbf{\theta}_S$ is the best $S$-term approximation of $\mathbf{\theta}$ in $\mathbf{I}$. This is deterministic, without any probability of success.

To go a step further, you will see that the result is in consistent with previous results. If you replace $||\mathbf{\theta} - \mathbf{\theta}_S||_1$ using the result in "Best $J$-term approximation", you will see the magic.

Then the question comes, how to generate a sensing matrix $\mathbf{A}$ satisfies the requirement of RIP?

**How to generate sensing matrix $\mathbf{A}$?**There are lots of ways to generate a sensing matrix $\mathbf{A}$. I'd rather let you refer section "RANDOM SENSING" in [3] for a review. Here I just want to talk about two most common ways: 1) from $\mathbf{A}$ by sampling $N$ column vectors uniformly at random on the unit sphere of $\mathbb{R}^K$; 2) from $\mathbf{A}$ by sampling i.i.d. entries from the normal distribution with mean 0 and variance 1/K. With an overwhelming probability, these matrices obey the RIP constraints (e.g. $\delta_{2S} \leq \sqrt{2} - 1$) provided that $K \geq G \cdot S \log{(N/S)}$, where $G$ is some constant depending on each instance.

Another interesting property is that the RIP constraints can also hold for sensing matrix $\mathbf{A = \Phi \Psi}$ where $\mathbf{\Psi}$ is arbitrary orthonormal basis and $\mathbf{\Phi}$ is an $K \times N$ measurement matrix drawn randomly from a suitable distribution, e.g., with an overwhelming probability, provided that $K \geq G \cdot S \log{(N/S)}$, $\mathbf{A}$ obeys the RIP constraints. This makes CS sampling easier, because we only need to generate a measurement matrix $\mathbf{\Phi}$ without knowing $\mathbf{\Psi}$ at the encoder side.

**A reading tree/graph for basic understanding of CS theory**
If you read papers from the top layer to bottom layer, you should have deeper and deeper understanding about CS theory.

**References**

[1] Robust Uncertainty Principles: Exact Signal Reconstruction From Highly Incomplete Frequency Information

[2] Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies?

[3] An Introduction To Compressive Sensing

[4] The Role of Sparsity and Incoherence for Exactly Reconstructing a Signal from Limited Measurement

[5] Stable signal recovery from incomplete and inaccurate measurements

[6] Compressed sensing and best k-term approximation

[5] Stable signal recovery from incomplete and inaccurate measurements

[6] Compressed sensing and best k-term approximation

Great, thank you very much

ReplyDelete