Thursday, September 27, 2012

Sampling Rate Is Not Only about Pixels: How to Compare the Sampling Rate between Your Camera and a "Single-Pixel" Camera


To begin, I'd like to clarify two terms: "Nyquist frequency" and "Nyquist rate".
Some of you may take them to be the same thing, and some textbooks even do so. However, although they are closely related, they are actually different.
Reference: http://en.wikipedia.org/wiki/Nyquist_frequency
Nyquist frequency: a property of the system. For a given system, the sampling rate $f_s$ is fixed. The Nyquist frequency of the system is the highest frequency a signal may contain and still be sampled without aliasing, which is $f_s / 2$.
Nyquist rate: a property of the signal. For a given band-limited or band-pass signal, the bandwidth $B$ is fixed. The Nyquist rate of the signal is the lowest sampling frequency a system needs in order to sample the signal without aliasing, which is $2B$.

For an (imaging system, image signal) pair, you can interpret the sampling rate as the number of pixels, and the highest frequency of the image as the resolution of the image.
Nyquist then says: for an imaging system, whose number of pixels (sampling rate) is fixed, the highest allowed resolution is determined by, and proportional to, the number of pixels. For different measures of resolution, the proportionality factor differs.
Conversely, for an image signal, whose resolution (bandwidth) is fixed, the required number of pixels is determined by, and proportional to, that resolution.

So the number of pixels is actually the number of samples a system takes of a given image.
Does "single-pixel" then mean the number of samples is 1? Definitely not.
So how should we interpret the "single" pixel?

It's a little tricky, but still understandable. The only thing you need to do is add a time axis.
The traditional imaging system in your camera takes all its samples at one time, and each sample is a pixel. So the total number of samples is (number of pixels per unit of time) x (units of time). Since (units of time) is 1 here, we usually ignore it.
For the "single-pixel" camera, the number of samples is again (number of pixels per unit of time) x (units of time). Here (number of pixels per unit of time) is 1, whereas (units of time) is not 1, and it is the latter that determines the number of samples.
Image sensors are silicon chips that capture and read light. So a traditional imaging system needs thousands of sensor elements, whereas a single-pixel camera uses only one sensor element to capture and read light.
That's the trade-off: number of pixels (sensor elements) versus units of time (exposure time).

Therefore, the improved interpretation of Nyquist sampling here is to take the sampling rate as (number of pixels per unit of time) x (units of time), i.e., the product of spatial sampling and temporal sampling.
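To make the space-time trade-off concrete, here is a tiny Python sketch (with made-up pixel counts, not taken from any particular camera) that just multiplies the two factors:

```python
# A toy illustration of (number of pixels per unit of time) x (units of time).
# The pixel counts below are hypothetical, for illustration only.

def total_samples(pixels_per_unit_time, units_of_time):
    """Total number of samples = spatial sampling x temporal sampling."""
    return pixels_per_unit_time * units_of_time

# Conventional camera: all pixels captured in one exposure (one unit of time).
conventional = total_samples(pixels_per_unit_time=1024 * 768, units_of_time=1)

# Single-pixel camera: one detector, many sequential exposures.
single_pixel = total_samples(pixels_per_unit_time=1, units_of_time=1024 * 768)

print(conventional == single_pixel)  # True: same budget, space traded for time
```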

How to understand "Nyquist sampling vs. CS"?

In this blog, I will discuss Nyquist sampling and CS theory from a communication-system point of view. A communication system usually samples and processes a 1D signal, so here we talk about the 1D case.


What does Nyquist say for 1D signal?


As shown in the figure, an analog signal must first be sampled into a digital signal, then processed, and finally reconstructed into an analog signal again.
That is why the "process" part is called digital signal processing (DSP).
Nyquist sampling happens in the "sampling" part, before DSP.
Then what does Nyquist say?
Given a signal whose highest frequency component is $f_m$, a sufficient condition for recovering $x(t)$ from the digital signal $x[n]$ is that the sampling frequency $f_s$ is at least $2f_m$.
It should be noticed that this is only a sufficient, not a necessary, condition.
In certain cases the condition is also necessary: for example, when the signal is band-limited and occupies its whole band, or when we don't know which portion of the band the signal occupies, in other words, when the only information we have is the highest frequency of the signal
(we call this kind of signal a strict-sense band-limited signal), then the only option is to sample at the Nyquist rate.
However, you may argue that in practice the signal may not be band-limited; what should we do then?
The answer is to make the signal band-limited before we sample it, using a low-pass filter (LPF). How well the band-limited signal approximates the real signal is another story; here we focus only on signals that are already band-limited.
On the other hand, you may also ask: if we have a band-pass signal (which is band-limited, but not strict-sense band-limited), can we use a lower sampling frequency?
The answer is yes. To see how, imagine moving the band-pass signal down to baseband, where it becomes a strict-sense band-limited signal; then it is obvious how the Nyquist rate applies. Effectively, we are given more information, namely that the signal occupies the band from $f_L$ to $f_H$, so we can sample it at a frequency below $2f_H$.
In this case we can call the sampling process under-sampling; however, it can still be viewed as Nyquist sampling after a suitable conversion.
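To make the statement concrete, here is a minimal Python sketch (with an assumed toy signal, not from any particular system) of Nyquist-rate sampling followed by ideal sinc-interpolation reconstruction:

```python
# Nyquist-rate sampling of a band-limited toy signal and ideal reconstruction.
import numpy as np

f_m = 50.0                 # highest frequency component (Hz), assumed
f_s = 2.5 * f_m            # sampling frequency, above the Nyquist rate 2*f_m
T = 1.0 / f_s

def x(t):
    # band-limited test signal: components at 20 Hz and 50 Hz only
    return np.sin(2 * np.pi * 20 * t) + 0.5 * np.cos(2 * np.pi * f_m * t)

n = np.arange(0, 125)      # sample indices over one second
samples = x(n * T)         # x[n] = x(nT)

def reconstruct(t, samples, T):
    # ideal reconstruction: x(t) = sum_n x[n] * sinc((t - nT) / T)
    n = np.arange(len(samples))
    return np.sum(samples * np.sinc((t - n * T) / T))

t0 = 0.1234
print(x(t0), reconstruct(t0, samples, T))  # close (finite-sum edge effects aside)
```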

  • It should also be noticed that the sampling here is uniform sampling. If you are interested in this topic, you can look into references on non-uniform sampling.
  • Also, when it comes to 2D signals, Nyquist sampling is a little trickier to see. That is beyond this blog.


What does CS say?



Before we go into CS, I'd like to introduce a more complete communication system.
As you can see in the figure, all "information" after the dashed line is in the form of a bit stream. If we ignore the imperfections of everything in between, e.g., assume a perfect channel and a perfect modulation scheme, then the processing after the dashed line is distortion-free.
Moving to the left of the dashed line: as shown in the last section, the sampling and reconstruction processes can also be distortion-free.
What about the quantization? It does introduce distortion, and the precision of the quantizer determines how severe that distortion is.
However, you can set all of this aside to understand CS.

As shown in the figure, the A/D converter is responsible for sampling and quantization, and the D/A converter does the reverse.
The source encoder is responsible for compression.
Now consider: if we can sub-sample the signal, we may not need to compress it at all; the only question is how to reconstruct it. CS theory asserts that the signal can be recovered faithfully if it is sparse in some domain that is highly incoherent with the time domain. For instance, the frequency domain is incoherent with the time domain.
The CS system can be pictured as follows:

$x[n]$ corresponds to $r[n]$; however, in the real system we never actually have $x[n]$. Instead, we directly obtain the already-compressed $p[n]$, which corresponds to $s[n]$.
Going from $r[n]$ to $r(t)$ is not a problem. But how do we get $r[n]$ from $s[n]$? That is where CS comes in.
As you can see, the A/D (D/A) converter becomes an A/I (I/A) converter.
The sampling requirement is reduced, while the reconstruction must be more powerful. This is another kind of asymmetry compared to traditional "source encoding".
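As a toy illustration of this pipeline (not the actual A/I hardware), the sketch below takes random measurements of a signal that is sparse in itself and reconstructs it with orthogonal matching pursuit, used here as a simple greedy stand-in for the $\ell_1$ reconstruction:

```python
# Compressed sampling of an S-sparse signal and greedy reconstruction (OMP).
import numpy as np

rng = np.random.default_rng(0)
N, K, S = 256, 80, 8                      # length, measurements, sparsity (toy sizes)

x = np.zeros(N)
support = rng.choice(N, S, replace=False)
x[support] = rng.standard_normal(S)       # S-sparse signal

Phi = rng.standard_normal((K, N)) / np.sqrt(K)   # random measurement matrix
y = Phi @ x                               # K << N compressed measurements

def omp(Phi, y, S):
    """Greedy recovery: repeatedly pick the column most correlated with the residual."""
    residual, idx = y.copy(), []
    for _ in range(S):
        idx.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(Phi[:, idx], y, rcond=None)
        residual = y - Phi[:, idx] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[idx] = coef
    return x_hat

# for these sizes, recovery is typically exact (with high probability)
print(np.allclose(omp(Phi, y, S), x, atol=1e-6))
```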

How can CS be more powerful?
Say the signal is sparse in the time domain, or in a domain coherent with the time domain. Then, according to CS theory, we should not under-sample in the time domain as before; we need to under-sample in a domain incoherent with the sparsity domain.
How do we design that? The most beautiful part of CS is the restricted isometry property (RIP). Roughly, the RIP says that if the "compression" step under-samples the signal in some domain, and the retained part of that domain satisfies the RIP, then the signal can be reconstructed faithfully.
To my understanding (and you can certainly have your own interpretation), a matrix satisfying the RIP is essentially a subset of a domain, i.e., a few basis vectors of a domain that is incoherent with almost every other domain. That is the "universality" of CS theory.
An example is the Gaussian matrix, which is incoherent with both the time domain and the frequency domain.
It is a little tricky, but if you take into account the role of "overwhelming probability", the inner connection becomes easier to understand.
I won't go into the mathematics of these relationships, but if you treat them as a black box, this interpretation does help.
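For a black-box feel of incoherence, the following sketch computes the mutual coherence $\mu(\Phi, \Psi) = \sqrt{N} \max_{i,j} |\langle \phi_i, \psi_j \rangle|$ between orthonormal bases (1 means maximally incoherent, $\sqrt{N}$ means fully coherent); the specific bases are just examples I picked:

```python
# Mutual coherence between orthonormal bases; values near 1 are highly incoherent.
import numpy as np
from scipy.fft import dct

N = 128
I = np.eye(N)                                   # time (spike) basis
Psi = dct(np.eye(N), norm='ortho', axis=0)      # DCT basis, orthonormal
# a random orthonormal basis, via QR of a Gaussian matrix
G, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((N, N)))

def mu(A, B):
    return np.sqrt(A.shape[0]) * np.max(np.abs(A.T @ B))

print(mu(I, Psi))   # spikes vs. DCT: about sqrt(2), very incoherent
print(mu(I, G))     # random basis vs. spikes: roughly O(sqrt(log N)), much smaller than sqrt(N)
print(mu(Psi, G))   # random basis vs. DCT: also small -- the "universality" at work
```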

The remaining problem is how to first transform the signal into the $\Phi$ domain so that the subsequent under-sampling is possible.
This is the story of the A/I converter. I won't go into it here, but intuitively it relies on convolution/integration.

Again, for 2D signals, as in imaging systems, and even higher-dimensional signals, Nyquist sampling and CS can be interpreted in yet another way. The most vivid example is the "single-pixel camera".
I will discuss 2D signal sampling, especially imaging systems, in the future.

Wednesday, September 26, 2012

Sparse Representation, Compressed Sensing, Single Measurement Vector, Multiple Measurement Vector: connections and differences

Purpose of this blog entry

Sparse representation (SR), compressed sensing (CS), single measurement vector (SMV), multiple measurement vector (MMV): I was once confused by these terms. They seem to be the same, yet the ways they are used to describe problems differ. In this blog, I aim to clarify the relationships among them and the development of these theories.

The Big Picture

To my understanding, the following figure describes the relationship among them.
To expose the inherent relationship, I prefer to break CS into two kinds: 1st-generation CS (CS1G) and 2nd-generation CS (CS2G). The difference will be discussed later. Note, however, that this notation is not used in the literature, so don't be confused when reading papers.
In the literature, "CS" refers to what I call CS2G, and "SMV" refers to CS1G.
The notion of SMV is first mentioned in [10]. (In [11], "one measurement vector" is used to describe the SMV problem.)
We will return to the big picture at the end of this blog.

SMV: SR and CS1G

The SMV model basically poses the following problem: $y = \Phi x$
  •  sparse representation: $y$ is viewed as a real-world signal, $\Phi$ is called an over-complete dictionary, and $x$ is the sparse representation
    •  we want to find a representation of the signal over a dictionary, and we want the representation to be as sparse as possible
    •  $\Phi \in \mathbb{R}^{M \times N}$: if M < N, the dictionary is over-complete; if M = N, it is complete; if M > N, it is under-complete. "Completeness" is viewed from the signal $y$: simply speaking, if the representation is longer than the signal, the signal is "over-represented" and the dictionary is "over-complete".
    •  2D DCT, DFT, etc., give sparse representations of the signal in a complete dictionary
    •  in this blog, we consider sparse representation of the signal in an over-complete dictionary
  • 1st-generation compressed sensing: $x$ is a sparse signal in itself, i.e., it has only a few non-zero entries, $\Phi$ is called the measurement matrix, and $y$ is called the measurements
    • we want to sample a sparse signal using fewer measurements than its length, in other words, to compress the signal
    • here we only require $\Phi$ to be over-complete with respect to $y$
    • note that "$x$ is a sparse signal in itself" is the defining feature of CS1G. For CS2G, as we will discuss in the next section, $x$ is sparse in a domain $\Psi$, which is a generalization of CS1G
Therefore, SMV is actually a single model underlying both SR and CS1G, each describing a different application.
We can also view SR as the reconstruction step of CS1G: in a sampling-reconstruction system, CS1G is the sampling process and SR is the reconstruction process, with the measurements taken as the signal.
With this in mind, you will probably have a better understanding of the references discussed in this section.

The problem discussed in [1] is that a signal can be reconstructed if certain requirements are met, in both of the following cases:
  1. a signal is sparse in the frequency domain and under-sampled (i.e., has missing samples) in the time domain, or
  2. a signal is sparse in the time domain and under-sampled in the frequency domain.
These are actually CS1G problems, i.e., sampling a signal that is sparse in itself and reconstructing it by solving an $\ell_1$ minimization.
For the first case, $x$ is the vector of frequency components of the signal, $\tilde{\Phi}$ is the inverse DFT basis, i.e., $IDFT(x) = \tilde{\Phi} x$, and $\Phi = R\tilde{\Phi}$ is the measurement matrix, where $R$ is a selection matrix selecting $M$ rows of $\tilde{\Phi}$; $y$ is the under-sampled time-domain signal.
For the second case, $x$ is the time-domain signal, $\Phi = R \tilde{\Phi}$ is the measurement matrix, where $\tilde{\Phi}$ is the DFT basis, and $y$ is the under-sampled frequency-domain signal.
Thus, by taking the frequency-domain components (first case) or the time-domain components (second case) as the original signal, the problem discussed is essentially a CS1G problem.
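As a small illustration of the first case (with assumed sizes), the sketch below builds $\Phi = R\tilde{\Phi}$ from the inverse DFT basis and a row-selection matrix and forms the under-sampled time-domain observations $y$; the $\ell_1$ reconstruction step is omitted here:

```python
# Case 1 of [1]: x holds frequency components, we observe only M time samples.
import numpy as np

rng = np.random.default_rng(1)
N, M = 128, 40                                 # toy sizes

Phi_tilde = np.fft.ifft(np.eye(N), axis=0)     # inverse DFT basis (columns)
rows = rng.choice(N, M, replace=False)         # R: keep M of the N time samples
Phi = Phi_tilde[rows, :]                       # measurement matrix R * Phi_tilde

x = np.zeros(N, dtype=complex)                 # sparse frequency-domain signal
x[rng.choice(N, 5, replace=False)] = rng.standard_normal(5)

y = Phi @ x                                    # observed (incomplete) time samples
print(np.allclose(Phi_tilde @ x, np.fft.ifft(x)))  # Phi_tilde really is the IDFT
print(y.shape)                                 # (40,): M of the N time samples
```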

I'd also like to mention the following principles proposed in [1], which, I think, are the basis of the theoretical results that followed.
  1. Classical uncertainty principle: a function $f$ and its Fourier transform $\hat{f}$ cannot both be highly concentrated: $\Delta t \cdot \Delta \omega \geq 1$.
  2. [1] shows that a more general principle holds: it is not necessary to suppose that $f$ and $\hat{f}$ are concentrated on intervals; they can merely be concentrated on measurable sets, $|T| |W| \geq 1-\delta$. It also applies to sequences: $N_t \cdot N_w \geq N$.
    1. CT principle: missing segments of a band-limited function can be restored stably in the presence of noise if (total measure of the missing segments) $\cdot$ (total bandwidth) < 1.
    2. DT principle: a wideband signal can be reconstructed from narrow-band data, provided the wideband signal to be recovered is sparse or "impulsive".
This gives me an intuitive understanding of the SMV/SR/CS1G theory:
  1. a signal cannot be sparse in two incoherent bases at the same time
  2. under-sample the signal in a non-sparse domain, and recover it in the sparse domain
    • from the recovered sparse-domain coefficients, we can then get the original signal back
This is actually the CS2G theory, which does not require $x$ to be sparse in itself. We leave that discussion to the next section.
[2]-[5] form a sub-group of works related to both the CS1G and CS2G theory.
[2] actually discusses the SMV problem from the SR point of view, and proposes more incoherent basis pairs besides (time, frequency). These incoherent basis pairs really belong to CS2G theory.
[3] improved the constraints in [2] related to replacing the $\ell_0$ minimization with $\ell_1$ minimization.
[4] proved that the condition in [3] is both sufficient and necessary, whereas [3] only proved sufficiency.
[5] appears to be an extension of [2]; however, I did not go into its details.

[6] is about CS1G, and can be viewed as a direct extension of [1]. The problem discussed in the paper can be phrased as follows:
  • an $N$-length discrete-time signal $f$ is sparse in the time domain, i.e., consists of a superposition of $|T|$ spikes;
  • sub-sample the signal in the frequency domain, i.e., sample only $|\Omega|$ frequency components instead of $N$;
  • if $|T| \leq C_M \cdot (\log{N})^{-1} \cdot |\Omega|$, then the reconstruction is exact with probability at least $1 - O(N^{-M})$;
  • the reconstruction is obtained by solving an $\ell_1$ minimization problem.
As you can see, in this sense it is almost the same as [1].

In addition, [6] shows that the min-TV (total-variation) reconstruction problem can be cast in the same framework as the problem stated in the paper.


CS2G

As I stated before,
  1. a signal cannot be sparse in two incoherent bases at the same time
  2. under-sample the signal in a non-sparse domain, and recover it in the sparse domain
The classical CS theory expresses exactly this idea:
$y = \Phi x = \Phi \Psi \theta$.
To better see the essence, let $(\tilde{\Phi}, \Psi)$ be a pair of incoherent bases and $\Phi = R \tilde{\Phi}^T$.
Then, in the $\tilde{\Phi}$ domain, $x$ can be expressed as $\gamma = \tilde{\Phi}^T x$; in the $\Psi$ domain, $x$ can be expressed as $\theta = \Psi^T x$.
So $y$ is actually obtained by under-sampling the signal in the non-sparse domain, i.e., the $\tilde{\Phi}$ domain.
In other words, $y = R \gamma$.
And the signal is reconstructed in the sparse domain, i.e., the $\Psi$ domain.
CS1G is then just the special case where $\Psi$ is the identity basis.
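A minimal sketch of this notation (toy dimensions, with the DCT chosen arbitrarily as the sparsity basis $\Psi$ and the time/identity basis as $\tilde{\Phi}$); it only verifies that $y = R\gamma$ and notes how CS1G falls out when $\Psi$ is the identity:

```python
# CS2G notation: under-sample in the non-sparse Phi_tilde domain, sparse in Psi.
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(2)
N, M = 64, 24

Psi = dct(np.eye(N), norm='ortho', axis=0)      # sparsity basis (DCT), orthonormal
Phi_tilde = np.eye(N)                           # non-sparse (time) basis
R = np.eye(N)[rng.choice(N, M, replace=False)]  # row-selection matrix
Phi = R @ Phi_tilde.T                           # measurement matrix Phi = R * Phi_tilde^T

theta = np.zeros(N); theta[:4] = 1.0            # sparse coefficients in the Psi domain
x = Psi @ theta                                 # the actual signal
gamma = Phi_tilde.T @ x                         # coefficients in the Phi_tilde domain

y = Phi @ x
print(np.allclose(y, R @ gamma))                # True: y is the under-sampled gamma

# CS1G is the special case Psi = I, i.e., theta = x is sparse in itself.
```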

[7]-[9], together with [2]-[5], are the foundational theoretical work on the CS2G theory.
I could not find paper [7]. However, [8] states that [7] extended the result in [6] and showed that exact recovery holds for other sparsity/measurement basis pairs. [8] also describes these results, which makes reading [7] less necessary. The results are as follows:
  • an $N$-length signal $f$ is sparse in the domain $\Phi$, i.e., $\theta = \Phi f$ has only a few nonzero entries;
  • sub-sample the signal in the domain $\Psi$, i.e., sample only $|\Omega|$ coefficients instead of $N$;
  • the reconstruction can be exact, via solving an $\ell_1$ minimization problem;
  • the requirement for exact reconstruction concerns the incoherence between $\Phi$ and $\Psi$.
[8] extended the work of [6] and [7], showing that a signal that is compressible, not necessarily strictly sparse, in a certain domain can also be optimally reconstructed, i.e., with overwhelming probability the reconstruction error from $K$ measurements is as good as that of the best $K$-term approximation.
[9] discusses decoding a linear code corrupted by noise. Notably, the RIP condition is proposed in this paper, although the paper does not spell out the relationship between decoding and CS. Besides, Gaussian ensembles are proved to satisfy the RIP with overwhelming probability. It should also be noticed that once the RIP is satisfied, the reconstruction is exact deterministically, whereas in [6], [7], and [8] the reconstruction is exact with overwhelming probability.

Then [13] and [14] organized the related theoretical results and formulated the CS problem.
[12] gives a tutorial on CS theory, which I particularly recommend reading.


MMV

The only thing left is MMV. As shown in the big picture, MMV is a direct extension of SMV, but in a different direction from the evolution of SMV into CS.
[11] formulates the MMV problem. Similar to SMV, MMV can be viewed as either a 2D sparse representation problem or a 2D CS1G problem.
[10] and [15] give theoretical results on MMV, derived from those on SMV, i.e., [1]-[6].


Discussions


  1. In this blog, I didn't discuss the development from the noiseless to the noisy case, which is also an interesting and important evolution.
  2. In this blog, I didn't focus on the development from sparse signals to compressible signals.
  3. In this blog, I only discussed the discrete-time situation. However, CS can, and should, also be addressed for analog signals, which is the real meaning of "sensing".
  4. Since I'm working on CS theory, my knowledge of SR and MMV is limited. Thus, the goal here is to distinguish CS from SR, SMV, and MMV, rather than to introduce SR and MMV, which I believe are far more involved.
  5. Returning to the big picture, it is interesting to ask: can we extend MMV the way SMV was extended to CS? To the best of my knowledge, recent work of Duarte and Baraniuk in [16] discusses high-dimensional CS, of which 2D CS is a special case. However, it differs from MMV in the reconstruction method, since it still solves the reconstruction problem in 1D. I would appreciate it if you could share your opinions and references on this issue.
  6. Finally, I really hope this blog can ease your perplexity when reading references where it is unclear what SR, CS, SMV, and MMV refer to. On the one hand they are near-synonyms and inherently similar; on the other hand they differ slightly in their formulations and applications. With the help of this blog, you may have a better sense of what a reference is discussing as you read it, at least for the references given here.

[1] Uncertainty principles and signal recovery
[2] Uncertainty principles and ideal atomic decomposition
[3] A generalized uncertainty principle and sparse representation
[4] On sparse representation in pairs of bases
[5] Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization
[6] Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information
[7] The role of sparsity and incoherence for exactly reconstructing a signal from limited measurements
[8] Near-optimal signal recovery from random projections: Universal encoding strategies?
[9] Decoding by linear programming
[10] Sparse representation for multiple measurement vectors (MMV) in an over-complete dictionary
[11] Sparse solutions to linear inverse problems with multiple measurement vectors
[12] An introduction to compressive sampling
[13] Compressive sampling
[14] Compressed sensing
[15] Theoretical results on sparse representations of multiple-measurement vectors
[16] Kronecker compressed sensing

Monday, August 13, 2012

Use IEEE Xplore to do a field study

Want to apply for a graduate program?
Want to know who the productive researchers in this area are?
Here are some tips for doing a field study with the help of IEEE Xplore.

1. Use advanced search
Enter keywords for the area, e.g., compressed sensing.
I suggest you use "OR" to add new filter words.
Click search.
2. See filters on the left side
Select "Publication Year" with certain range, e.g. 2007 - 2012
Click search again.

You will then see a list of authors in the left sidebar.
They are ordered by number of publications.
You will quickly see who the productive researchers in your area are.

Go to IEEE Xplore now and check it out~

Friday, August 10, 2012

Basic Understanding of Compressed Sensing

Purpose of this blog entry
During my research on CS, one problem is that I have read several papers related to CS, but different papers have different notations and explanations for it, and each also proposes something new. So it took me time to work out the inner connections among these references. As a result, I'd like to give a summary, according to my understanding, of the very basics of CS. I hope it will help you in your research.

What can CS do?
Well, intuitively speaking, CS is a way to sample and compress signals. You can use CS as a method to transmit or store a signal. The most immediate benefit of CS is that it samples and compresses signals at the same time, thus reducing the complexity of the encoder.

Why do we need CS?
In the following scenarios, we may turn to CS:
1) a simple encoder is required;
2) less bandwidth is required;
3) high-frequency components of a signal must be captured that cannot be sampled in traditional ways, e.g., MRI.
(Due to my limited knowledge, many scenarios are omitted. If you have any suggestions, please kindly let me know.)
Of course, you can use CS in any scenario that requires sampling, storing, or transmitting a signal, and CS theory provides ways to reconstruct the sampled signals.

Signals
Please be advised that some definitions used in this blog entry may differ from those in some papers. I use my own definitions here because, in my understanding, they make it easier to see the connections between papers.

Let a signal $\mathbf{x} \in \mathbb{R}^N$ have as its projection onto an orthonormal basis $\mathbf{\Psi} \in \mathbb{R}^{N \times N}$ the vector $\mathbf{\theta} \in \mathbb{R}^N$, i.e., $\mathbf{\theta} = \mathbf{\Psi}^T \mathbf{x}$.
  • $S$-sparse signal: $\mathbf{x}$ is called $S$-sparse in $\mathbf{\Psi}$ or has sparsity $S$ in $\mathbf{\Psi}$ if $\mathbf{\theta}$ has only $S$ nonzero entries.
  • $C$-compressible signal: $\mathbf{x}$ is called $C$-compressible in $\mathbf{\Psi}$ or has compressibility $C$ in $\mathbf{\Psi}$ if $\mathbf{\theta}$ obeys a power law decay with an exponent of $-C-1/2$.
  • power law decay: $\mathbf{\theta}$ obeys a power law decay if $|\theta|_{(i)} \leq R \cdot i^{-1/p}$, where $R > 0$ is a constant, $0 < p < \infty$, and $|\theta|_{(i)}$ is the $i$th largest magnitude entry, i.e., $|\theta|_{(1)} \geq  |\theta|_{(2)} \geq \cdots \geq |\theta|_{(N)}$.
Then $C = 1/p - 1/2$ for compressible signals.
Mostly, for $C$-compressible signals we consider only the case $0 < p < 1$, i.e., $C > 1/2$.
It should be noticed that $\mathbf{\theta}$ itself is $S$-sparse or $C$-compressible in $\mathbf{\Psi} = \mathbf{I}$, where $\mathbf{I}$ is the identity basis. Thus, to be precise, we need to indicate in which basis the signal has which sparsity or compressibility. For example, a signal may have sparsity 10 in the Fourier basis but sparsity 20 in the DCT basis. Some papers do not indicate the basis explicitly, so be careful when comparing sparsity levels across papers.
Best $J$-term approximation
To approximate a signal $\mathbf{\theta}$, we can use the largest $J$ terms (in magnitude) in it, i.e., $\mathbf{\theta}_J$. Therefore,
  1. if $\mathbf{\theta}$ is $S$-sparse in $\mathbf{I}$, then the best $J$-term approximation would have an approximation error equal to 0 if $J \geq S$, i.e., $\mathbf{\theta}_J = \mathbf{\theta}$.
  2. if $\mathbf{\theta}$ is $C$-compressible in $\mathbf{I}$, then the best $J$-term approximation has an approximation error bounded by $||\mathbf{\theta} - \mathbf{\theta}_J||_1 \leq G_C \cdot R \cdot J^{-C+1/2}$ and $||\mathbf{\theta} - \mathbf{\theta}_J||_2 \leq G_C \cdot R \cdot J^{-C}$, where $G_C$ is a constant depending only on $C$.
Meanwhile, we should indicate which basis the best $J$-term approximation is taken in, e.g., $\mathbf{\theta}_J$ is the best $J$-term approximation in $\mathbf{I}$.
Then, we also say the best $J$-term approximation of $\mathbf{x}$ in $\mathbf{\Psi}$ is $\mathbf{x}_J$, i.e., $\mathbf{\theta}_J = \mathbf{\Psi}^T \mathbf{x}_J$ is the best $J$-term approximation of $\mathbf{\theta = \Psi}^T \mathbf{x}$ in $\mathbf{I}$. Similarly, we need to consider two attributes when speaking of a best $J$-term approximation: 1) $J$ and 2) $\mathbf{\Psi}$.
Another thing to notice is that since $\mathbf{\Psi}$ is an orthonormal basis, $||\mathbf{x} - \mathbf{x}_J||_2 = ||\mathbf{\theta} - \mathbf{\theta}_J||_2 \leq G_C \cdot R \cdot J^{-C}$, i.e., the $\ell_2$ error follows a power-law decay: as $J$ increases, the error decreases.
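The sketch below (using synthetic coefficients that exactly follow the power law, which is an idealization) computes the best $J$-term approximation and shows its $\ell_2$ error shrinking roughly like $J^{-C}$:

```python
# Best J-term approximation of a synthetic C-compressible coefficient vector.
import numpy as np

N, p = 1000, 0.5                      # p < 1, so C = 1/p - 1/2 = 1.5
i = np.arange(1, N + 1)
theta = i ** (-1.0 / p)               # |theta|_(i) = i^{-1/p}: idealized compressible decay

def best_J_term(theta, J):
    """Keep the J largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(theta)
    keep = np.argsort(np.abs(theta))[-J:]
    out[keep] = theta[keep]
    return out

for J in (10, 20, 40):
    err = np.linalg.norm(theta - best_J_term(theta, J))
    print(J, err)                     # the l2 error shrinks roughly like J^{-C}
```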

How does CS work?
Generally speaking, CS samples a signal $\mathbf{x}$ using $\mathbf{y = \Phi x = \Phi \Psi \theta = A \theta}$.
  • $\mathbf{A} \in \mathbb{R}^{K \times N}$: sensing matrix, $\mathbf{A = \Phi \Psi}$.
  • $\mathbf{\Phi} \in \mathbb{R}^{K \times N}$: measurement matrix
(I once confused the sensing matrix with the measurement matrix...)
In this way, the signal is compressed into $\mathbf{y} \in \mathbb{R}^K$.
Reconstructing the signal from $\mathbf{y}$ can be achieved by solving a convex optimization problem, i.e., $\min ||\mathbf{\theta}||_1$ s.t. $\mathbf{y = A \theta}$, knowing $\mathbf{y}$ and $\mathbf{A}$, and then using $\mathbf{x = \Psi\theta}$ to get $\mathbf{x}$ back. When the signal is sparse in itself ($\mathbf{\Psi = I}$, so $\mathbf{A = \Phi}$), this reduces to solving $\min ||\mathbf{x}||_1$ s.t. $\mathbf{y = \Phi x}$, knowing $\mathbf{y}$ and $\mathbf{\Phi}$; the two forms are equivalent.
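A minimal sketch (small toy sizes, using SciPy's generic LP solver rather than a dedicated CS solver) of this $\ell_1$ reconstruction, recast as a linear program with $\mathbf{\theta} = \mathbf{u} - \mathbf{v}$, $\mathbf{u}, \mathbf{v} \geq 0$:

```python
# Basis pursuit  min ||theta||_1  s.t.  y = A theta,  solved as a linear program.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
N, K, S = 60, 30, 4                             # toy sizes

A = rng.standard_normal((K, N)) / np.sqrt(K)    # sensing matrix
theta = np.zeros(N)
theta[rng.choice(N, S, replace=False)] = rng.standard_normal(S)
y = A @ theta                                   # measurements

c = np.ones(2 * N)                              # minimize sum(u) + sum(v) = ||theta||_1
A_eq = np.hstack([A, -A])                       # equality constraint A(u - v) = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method='highs')
theta_hat = res.x[:N] - res.x[N:]

# for these sizes, recovery is typically exact (up to solver tolerance)
print(np.allclose(theta_hat, theta, atol=1e-5))
```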

The foundation of CS theory is to prove that in this sample-reconstruct process, the information in $\mathbf{x}$ can be preserved, and thus $\mathbf{x}$ can be well reconstructed.

In the following demonstration, I will show you how the theory evolves rather than how to prove the results. If you like, you can refer to the cited papers for the details. A tip for your reading: pay attention to the choices of $\mathbf{A}$, $\mathbf{\Phi}$, and $\mathbf{\Psi}$ in each paper.

The idea of CS was first presented in [1]. That paper proved that for an $S$-sparse signal in the Fourier basis, the CS reconstruction is exact with overwhelming probability, at least $1 - O(N^{-\alpha})$, provided that $K \geq G_\alpha S \log{N}$, where $G_\alpha > 0$ is a constant depending only on the accuracy parameter $\alpha$. That is to say, for a given desired probability of success, a sufficient condition for exact reconstruction is $K \geq G_\alpha S \log{N}$. It should be noticed that some papers also give a necessary condition of a similar form; please do not mix them up.

The choice of $\mathbf{\Psi}$ is the Fourier basis. The choice of $\mathbf{A}$ is a random selection matrix, i.e., one that randomly selects $K$ elements of $\mathbf{\theta}$, e.g., $\mathbf{A}$ = [0 1 0 0 0; 0 0 1 0 0], which extracts the second and third elements of $\mathbf{\theta}$. Thus $\mathbf{\Phi} = \mathbf{A \Psi}^T$ can be generated accordingly.

It can be seen that if a signal is $S$-sparse in another basis $\mathbf{\Psi}$, CS can still reconstruct it by generating $\mathbf{A}$ and $\mathbf{\Phi}$ in this way, provided $\mathbf{\Psi}$ is known.

Furthermore, [4] extended these results and showed that exact reconstruction holds for other $(\mathbf{\Phi, \Psi})$ pairs.

Then [2] began to consider $C$-compressible signals. It showed that the CS reconstruction is optimal with overwhelming probability, at least $1 - O(N^{-\rho / \alpha})$. By optimal, it means that it is generally impossible to obtain higher accuracy from any set of $K$ measurements whatsoever. The reconstruction error obeys $||\hat{\mathbf{x}} - \mathbf{x}||_2 = ||\hat{\mathbf{\theta}} - \mathbf{\theta}||_2 \leq G_{C, \alpha} \cdot R \cdot (K / \log{N})^{-C}$, where $\hat{\mathbf{x}}$ and $\hat{\mathbf{\theta}}$ are the reconstructed signals and $G_{C, \alpha}$ is a constant depending on $C$ and $\alpha$. This says that if one makes $K = O(J\log{N})$ measurements using CS, one still obtains a reconstruction error as good as that of the best $J$-term approximation. In other words, the following two cases have similar reconstruction error: 1) knowing everything about $\mathbf{\theta}$ and keeping its $J$ largest entries, and 2) reconstructing $\mathbf{\theta}$ from $K = O(J\log{N})$ measurements without any prior knowledge of it.

In [2], $\mathbf{\Psi}$ can be any orthonormal basis, and three ways to generate $\mathbf{A}$ are provided: 1) Gaussian ensembles; 2) binary ensembles; 3) Fourier ensembles.

Before going further, I'd like to introduce the so-called restricted isometry property (RIP). For each integer $k = 1, 2, \dots$, define the $k$-restricted isometry constant $\delta_k$ of a matrix $\mathbf{A}$ as the smallest number such that $(1-\delta_k)||\mathbf{\theta}||_2^2 \leq ||\mathbf{A \theta}||_2^2 \leq (1 + \delta_k) ||\mathbf{\theta}||_2^2$ holds for every vector $\mathbf{\theta}$ that is $S$-sparse in $\mathbf{I}$ with $S \leq k$. Therefore, $\mathbf{A}$ has a whole family of restricted isometry constants, one for each $k$. Usually, we say a matrix $\mathbf{A}$ obeys the RIP of order $S$ if $\delta_S$ is not too close to 1.
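Computing $\delta_k$ exactly requires checking every size-$k$ support, which is combinatorial; the sketch below (my own Monte-Carlo shortcut, not taken from the papers) only gives a lower bound on $\delta_k$ by testing random supports:

```python
# A Monte-Carlo lower bound on the k-restricted isometry constant delta_k of A.
import numpy as np

def delta_k_lower_bound(A, k, trials=1000, seed=0):
    rng = np.random.default_rng(seed)
    N = A.shape[1]
    worst = 0.0
    for _ in range(trials):
        S = rng.choice(N, k, replace=False)            # a random size-k support
        s = np.linalg.svd(A[:, S], compute_uv=False)   # singular values, descending
        # ||A_S theta||^2 / ||theta||^2 lies in [s_min^2, s_max^2] on this support
        worst = max(worst, abs(s[0] ** 2 - 1), abs(1 - s[-1] ** 2))
    return worst   # a lower bound on delta_k (the true delta_k checks all supports)

rng = np.random.default_rng(0)
K, N = 128, 256
A = rng.standard_normal((K, N)) / np.sqrt(K)   # Gaussian ensemble, variance 1/K
print(delta_k_lower_bound(A, k=5))             # the smaller relative to 1, the better
```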

Then a stronger result was introduced in [5] and [6] (and stated in [3]), which says that if the $k$-restricted isometry constant $\delta_k$ of $\mathbf{A}$ satisfies certain constraints (with different $k$), e.g., $\delta_{2S} \leq \sqrt{2} - 1$ (several other constraints are stated in [5] and [6]), then the reconstructed signal $\hat{\mathbf{\theta}}$ obeys $||\hat{\mathbf{\theta}} - \mathbf{\theta}||_2 \leq G_0 \cdot ||\mathbf{\theta} - \mathbf{\theta}_S||_1 / \sqrt{S}$ and $||\hat{\mathbf{\theta}} - \mathbf{\theta}||_1 \leq G_0 \cdot ||\mathbf{\theta} - \mathbf{\theta}_S||_1$ for some constant $G_0$, where $\mathbf{\theta}_S$ is the best $S$-term approximation of $\mathbf{\theta}$ in $\mathbf{I}$. This bound is deterministic, with no probability of failure.

Going a step further, you will see that this result is consistent with the previous ones: if you bound $||\mathbf{\theta} - \mathbf{\theta}_S||_1$ using the result in "Best $J$-term approximation", you will see the magic.

Then the question arises: how do we generate a sensing matrix $\mathbf{A}$ that satisfies the RIP requirement?

How to generate sensing matrix $\mathbf{A}$?
There are many ways to generate a sensing matrix $\mathbf{A}$; I refer you to the section "RANDOM SENSING" in [3] for a review. Here I just want to mention the two most common ones: 1) form $\mathbf{A}$ by sampling $N$ column vectors uniformly at random on the unit sphere of $\mathbb{R}^K$; 2) form $\mathbf{A}$ by sampling i.i.d. entries from the normal distribution with mean 0 and variance $1/K$. With overwhelming probability, these matrices obey the RIP constraints (e.g., $\delta_{2S} \leq \sqrt{2} - 1$) provided that $K \geq G \cdot S \log{(N/S)}$, where $G$ is a constant depending on the instance.
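Both constructions are a couple of lines of NumPy; the sketch below (sizes are arbitrary) just generates them and sanity-checks the column norms and entry variance:

```python
# The two common random constructions of the sensing matrix A.
import numpy as np

rng = np.random.default_rng(0)
K, N = 64, 256                                  # toy sizes

# 1) columns drawn uniformly at random on the unit sphere of R^K
A1 = rng.standard_normal((K, N))
A1 /= np.linalg.norm(A1, axis=0)                # normalize each column to unit length

# 2) i.i.d. Gaussian entries with mean 0 and variance 1/K
A2 = rng.standard_normal((K, N)) / np.sqrt(K)

print(np.linalg.norm(A1[:, 0]), A2.var())       # ~1.0 and ~1/K
```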

Another interesting property is that the RIP constraints also hold for the sensing matrix $\mathbf{A = \Phi \Psi}$, where $\mathbf{\Psi}$ is an arbitrary orthonormal basis and $\mathbf{\Phi}$ is a $K \times N$ measurement matrix drawn randomly from a suitable distribution: with overwhelming probability, provided that $K \geq G \cdot S \log{(N/S)}$, $\mathbf{A}$ obeys the RIP constraints. This makes CS sampling easier, because at the encoder side we only need to generate a measurement matrix $\mathbf{\Phi}$, without knowing $\mathbf{\Psi}$.

A reading tree/graph for basic understanding of CS theory
If you read the papers from the top layer to the bottom layer, you should gain a deeper and deeper understanding of CS theory.


References
[1] Robust Uncertainty Principles: Exact Signal Reconstruction From Highly Incomplete Frequency Information
[2] Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies?
[3] An Introduction to Compressive Sampling
[4] The Role of Sparsity and Incoherence for Exactly Reconstructing a Signal from Limited Measurements
[5] Stable signal recovery from incomplete and inaccurate measurements
[6] Compressed sensing and best k-term approximation