## The Convergence Guarantees of a Non-convex Approach for Sparse Recovery

1 Introduction

Since the introduction of compressive sensing, sparse recovery has received much attention and becomes a very hot topic these years. Sparse recovery aims to solve the underdetermined linear system ${\bf y}={\bf Ax}$, where ${\bf y}\in\mathbb{R}^M$ denotes the measurement vector, ${\bf A}\in\mathbb{R}^{M\times N}$ is a sensing matrix with more columns than rows, i.e., $M<N$, and ${\bf x}=(x_i)\in\mathbb{R}^N$ is the sparse or compressible signal to be recovered.

Many algorithms have been proposed to solve the sparse recovery problem. If $\bf x$ is sparse, one typical method is to consider the following $\ell_0$-minimization problem

\begin{align}\label{l0opt}\underset{\bf x}{\operatorname{argmin}}\|{\bf x}\|_0\ \ \textrm{subject to}\ \ {\bf y}={\bf Ax}.\end{align}

However, it is not practical to adopt this method since it is usually solved by combinatorial search, which is NP-hard. An alternate method is to replace the $\ell_0$ norm'' with the convex $\ell_1$ norm, i.e.,

\begin{align}\label{l1opt}\underset{\bf x}{\operatorname{argmin}}\|{\bf x}\|_1\ \ \textrm{subject to}\ \ {\bf y}={\bf Ax}.\end{align}

It is certified that under some certain conditions, the optimal solution of $\ell_1$-minimization is identical to that of $\ell_0$-minimization. This conclusion greatly reduces the computational complexity, since $\ell_1$-minimization can be reformulated as a linear program (LP), and be solved by numerous efficient algorithms.

Another family of sparse recovery algorithms is put forward based on non-convex optimization

\begin{align}\label{lFopt}\underset{\bf x}{\operatorname{argmin}} J({\bf x})\ \ \textrm{subject to}\ \ {\bf y}={\bf Ax},\end{align}

where $J(\cdot)$ is a sparsity-inducing penalty. These algorithms include FOCUSS, IRLS, reweighted $\ell_1$-minimization, SL0, DC algorithm, ISL0, and ZAP. It is theoretically proved and experimentally verified that for some certain non-convex penalties, $J$-minimization tends to derive the sparse solution under weaker conditions than $\ell_1$-minimization. However, the inherent deficiency of multiple local minima in non-convex optimization limits its practical usage, where improper initial criteria might cause the solution trapped into the wrong ones.

We aim to provide theoretical convergence guarantees of a non-convex approach for sparse recovery from the initial solution to the global optimum. The question, which naturally appears and mainly motivates our work, is raised as follows.

Does there exist a computationally tractable algorithm that guarantees to find the sparse solution to $J$-minimization? If yes, in what circumstances does this statement hold?

In this work, exploiting the concept of weak convexity to characterize the non-convexity of the penalties, the mentioned question is replied as follows.

A computationally tractable non-convex approach is proposed with guarantees that it converges to the sparse solution provided that the non-convexity of the penalty is below a threshold.

This work has been published at IEEE Transactions on Signal Processing in the paper The Convergence Guarantees of a Non-convex Approach for Sparse Recovery by Laming Chen and Yuantao Gu.

2 Performance Evaluation of $J$-minimization in the Noiseless Scenario

Our work adopts weakly convex sparseness measure to constitute the sparsity-inducing penalty $J(\cdot)$. The definition of weakly convex sparseness measure is proposed as follows.

Definition 1: The weakly convex sparseness measure $F:\mathbb{R}\rightarrow\mathbb{R}$ satisfies

1)       $F(0)=0$, $F(\cdot)$ is even and not identically zero;

2)       $F(\cdot)$ is non-decreasing on $[0,+\infty)$;

3)       The function $t\mapsto F(t)/t$ is non-increasing on $(0,+\infty)$;

4)       $F(\cdot)$ is a weakly convex function on $[0,+\infty)$.

Figure 1: The weakly convex sparseness measures listed in TABLE I are plotted. The parameter $p$ is set to $0.5$. The parameter $\sigma$ is set respectively so that they all contain the point $(0.9,0.9)$.

Definition 1 is essentially a combination of the concepts of sparseness measure and weak convexity. As listed in TABLE I and plotted in Fig. 1, most commonly used non-convex penalties are formed by weakly convex sparseness measures.

Lemma 1: The weakly convex sparseness measure $F(\cdot)$ satisfies the following properties:

1)       $F(\cdot)$ is continuous and there exists $\alpha>0$ such that $F(t)\le\alpha|t|$ holds for all $t\in\mathbb{R}$;

2)       For any constant $\beta>0$, $F(\beta t)$ is also a weakly convex sparseness measure, and its corresponding parameters are $\rho_{\beta}=\beta^2\rho$ and $\alpha_{\beta}=\beta\alpha$.

The following theorem evaluates the performance of $J$-minimization for tuple $(J,{\bf A},{\bf x})$ under certain circumstances.

Theorem 1: Assume the tuple $({\bf A},K)$ satisfies $\gamma(\ell_0,{\bf A},K)<1$ and the vector ${\bf x}^*$ satisfies $\|{\bf x}^*\|_0\le K$. For any penalty $J(\cdot)$ formed by $F(\cdot)$ satisfying Definition 1 and that $F(\cdot)$ is bounded, the global optimum $\hat{\bf x}^{\beta}$ of the problem

\begin{align}\label{opthl0}\underset{\bf x}{\operatorname{argmin}}J(\beta{\bf x})\ \ \operatorname{subject\ to}\ \ {\bf Ax}={\bf Ax}^*\end{align}satisfies\begin{align*}\lim_{\beta\rightarrow+\infty}\|\hat{\bf x}^{\beta}-{\bf x}^*\|_2=0.\end{align*}

Theorem 1 reveals that for a fixed sparse signal ${\bf x}^*$, the performance of $J$-minimization is close to that of $\ell_0$-minimization when the corresponding weakly convex sparseness measure is bounded and its non-convexity is large enough. One may notice that the condition in Theorem 1 is $\gamma(\ell_0,{\bf A},K)<1$ rather than $\gamma(J,{\bf A},K)<1$ or $\gamma(\ell_1,{\bf A},K)<1$. As a matter of fact, $\gamma(\ell_0,{\bf A},K)<1$ is equivalent to the requirement of $M\ge2K+1$ and that any $2K$ column vectors of $\bf A$ are linearly independent, which is a much weaker condition than $\gamma(\ell_1,{\bf A},K)<1$. Therefore, for some $K$-sparse signals, they cannot be recovered by $\ell_1$-minimization, but can be recovered by $J$-minimization as shown in Theorem 1.

Recalling that the null space constant is a tight quantity for tuple $(J,{\bf A},K)$, a result on the performance of $J$-minimization is further derived from another perspective of view.

Theorem 2: For any penalty $J(\cdot)$ formed by weakly convex sparseness measure $F(\cdot)$ satisfying Definition 2, the null space constant satisfies

\begin{align}\gamma(J,{\bf A},K)=\gamma(\ell_1,{\bf A},K).\end{align}

According to Theorem 2, for any tuple $({\bf A},K)$ and penalty $J(\cdot)$ formed by weakly convex sparseness measure, the performance of $J$-minimization is the same as that of $\ell_1$-minimization in the worst case sense.

3 Projected Generalized Gradient Method in the Noisy Scenario

Borrowing the idea of projected subgradient method, we propose a non-convex algorithm to solve the $J$-minimization problem. Mathematically, initialized as the pseudo-inverse solution ${\bf x}(0)={\bf A}^{\dagger}{\bf y}$ where ${\bf A}^{\dagger}={\bf A}^{\rm T}({\bf AA}^{\rm T})^{-1}$ denotes the pseudo-inverse matrix of $\bf A$, the iterative solution ${\bf x}(n)$ obeys

\begin{align}\tilde{\bf x}(n+1)&={\bf x}(n)-\kappa\nabla J({\bf x}(n)),\label{itera1}\\{\bf x}(n+1)&=\tilde{\bf x}(n+1)+{\bf A}^{\dagger}({\bf y}-{\bf A}\tilde{\bf x}(n+1)),\label{itera2}\end{align}

where $\kappa>0$ denotes the step size and $\nabla J({\bf x})$ is a column vector whose $i$th element is $f(x_i)\in\partial F(x_i)$ which denotes the generalized gradient set of $F(\cdot)$ at $x_i$. Since the generalized gradient is adopted to update the iterative solutions, this method is termed projected generalized gradient (PGG) method.

The following theorem reveals the performance of PGG in the noisy scenario ${\bf y}={\bf Ax}^*+{\bf e}$ where ${\bf x}^*$ is the $K$-sparse signal to be recovered and $\bf e$ is the additive noise to the measurement vector.

Theorem 3: (Performance of PGG). For any tuple $(J,{\bf A},K)$ with $J(\cdot)$ formed by weakly convex sparseness measure $F(\cdot)$ and $\gamma(J,{\bf A},K)<1$, and for any positive constant $M_0$, if the non-convexity of $J(\cdot)$ satisfies

\begin{align}\label{theoremfor4}\frac{-\rho}{\alpha}\le\frac{1}{M_0}\frac{1-\gamma(J,{\bf A},K)}{5+3\gamma(J,{\bf A},K)},\end{align}

the recovered solution $\hat{\bf x}$ by PGG satisfies

\begin{align}\|\hat{\bf x}-{\bf x}^*\|_2\le \frac{4\alpha^2N}{C_1}\kappa+8C_2\|{\bf e}\|_2\end{align}

provided that $\|{\bf x}^*\|_0\le K$ and $\|{\bf x}(0)-{\bf x}^*\|_2\le M_0$, where

\begin{align}C_1=&\frac{F(M_0)}{M_0}\frac{1-\gamma(J,{\bf A},K)}{1+\gamma(J,{\bf A},K)},\label{lemmafor4}\\C_2=&\frac{\alpha\sqrt{N}+C_1}{C_1\sigma_{\min}({\bf A})},\label{lemmafor5}\end{align}

where $\sigma_{\min}({\bf A})$ denotes the smallest nonzero singular value of ${\bf A}$.

According to Theorem 3, under some certain conditions, if the non-convexity of the penalty is below a threshold (which is in inverse proportion to the distance between the initial solution and the sparse signal), the recovered solution of PGG will get into the $(O(\kappa)+O(\|{\bf e}\|_2))$-neighborhood of ${\bf x}^*$. By choosing sufficiently small step size $\kappa$, the influence of the $O(\kappa)$ term can be omitted, and the PGG method returns a stably recovered solution. If $\rho=0$, $J(\cdot)$ is just a scaled version of the $\ell_1$ norm, and the condition in Theorem 3 always holds for all $M_0>0$. Therefore, no constraint needs to be imposed on the distance between the initial solution and the sparse signal. This is consistent in the fact that $\ell_1$-minimization is convex and the initial solution can be arbitrary. In addition, larger non-convexity of the penalty induces smaller $M_0$, i.e., stronger constraint on the distance between the initial solution and the sparse signal, which is also an intuitive result.

4 Extension and Discussion

The initialization and the projection step of the PGG method involves the pseudo-inverse matrix ${\bf A}^{\dagger}$, whose exact calculation may be computationally intractable or even impossible because of its large scale in practical applications. To reduce the computational burden, a uniform approximate pseudo-inverse matrix of $\bf A$ is adopted. This method is termed approximate PGG (APGG) method. According to the approximate calculation of the pseudo-inverse matrix, we use ${\bf A}^{\rm T}{\bf B}$ to denote the approximation of ${\bf A}^\dagger$. To characterize the approximate precision of the pseudo-inverse matrix, define

\begin{align*}\|{\bf I}-{\bf A}{\bf A}^{\rm T}{\bf B}\|_2\le\zeta\end{align*}

where $\|\cdot\|_2$ denotes the spectral norm of the matrix, and we assume $\zeta<1$ throughout this paper. Similar to Theorem 3, the following theorem shows the performance of APGG in the noisy scenario.

Theorem 4: (Performance of APGG). For any tuple $(J,{\bf A},K)$ with $J(\cdot)$ formed by weakly convex sparseness measure $F(\cdot)$ and $\gamma(J,{\bf A},K)<1$, and for any positive constant $M_0$, if the non-convexity of $J(\cdot)$ satisfies (\ref{theoremfor4}) and the approximate pseudo-inverse matrix ${\bf A}^{\rm T}{\bf B}$ satisfies $\zeta<1$, the recovered solution $\hat{\bf x}$ by APGG satisfies

\begin{align}\label{theoremfor5}\|\hat{\bf x}-{\bf x}^*\|_2\le 2C_3\kappa+2C_4\|{\bf e}\|_2\end{align}

provided that $\|{\bf x}^*\|_0\le K$ and $\|{\bf x}(0)-{\bf x}^*\|_2\le M_0$, where

\begin{align}C_3&=\max\left\{2C_2C_5,\frac{2d\alpha^2N}{C_1}+C_6\right\},\label{lemmafor6}\\C_4&=\max\left\{2C_2,C_7\right\},\label{lemmafor7}\\C_5&=2\frac{\zeta\alpha\sqrt{N}\|{\bf A}\|_2}{1-\zeta},\label{lemmafor8}\\C_6&=\frac{2\|{\bf B}\|_2C_5}{C_1}\left(2(1+\zeta)\alpha\sqrt{N}\|{\bf A}\|_2+(3+\zeta)C_5\right),\label{lemmafor9}\\C_7&=\frac{4\|{\bf B}\|_2}{C_1}\left(\alpha\sqrt{N}\|{\bf A}\|_2+C_5\right),\label{lemmafor10}\end{align}

$d=\|{\bf I}-{\bf A}^{\rm T}{\bf BA}\|_2^2$, and $C_1$ and $C_2$ are respectively specified as (\ref{lemmafor4}) and (\ref{lemmafor5}).

Theorem 4 reveals that under some certain conditions, if the non-convexity of the penalty is below a threshold, the recovered solution of APGG will get into the $(O(\kappa)+O(\|{\bf e}\|_2))$-neighborhood of ${\bf x}^*$. This result is interesting since the influence of the approximate projection is only reflected on the coefficients instead of an additional error term. In the noiseless scenario with sufficiently small step size $\kappa$, the sparse signal ${\bf x}^*$ can be recovered with any given precision, even when a uniform approximate projection is adopted in this method.

For compressible signal ${\bf x}^*$, assume $\|{\bf x}^*-{\bf x}_T^*\|_2\le\tau$. According to Theorem 4, it is easily calculated that the recovered solution $\hat{\bf x}$ of APGG will get into the $(2C_3\kappa+2C_4(\|{\bf e}\|_2+\|{\bf A}\|_2\tau))$-neighborhood of ${\bf x}_T^*$. Since ${\bf x}_T^*$ lies in the $\tau$-neighborhood of ${\bf x}^*$, the distance between $\hat{\bf x}$ and ${\bf x}^*$ will be no more than

\begin{align*}2C_3\kappa+2C_4\|{\bf e}\|_2+(2C_4\|{\bf A}\|_2+1)\tau.\end{align*}

This reflects the performance degradation due to the noise and non-sparsity of the original signal.

5 Numerical Simulation

Figure 2: The figure shows the recovery performance of the PGG method with different sparsity-inducing penalties and different choices of non-convexity when the nonzero entries of the sparse signal satisfy Gaussian distribution. The corresponding sparseness measures are from TABLE I. The problem dimensions are $M=200$ and $N=1000$, and $K_{\max}$ is the largest integer which guarantees $100\%$ successful recovery.

The first experiment tests the recovery performance of the PGG method in the noiseless scenario with different sparsity-inducing penalties and different choices of non-convexity. The penalties are formed by sparseness measures in TABLE I. The results when the nonzero entries of the sparse signal satisfy Gaussian distribution are presented in Fig. 2. As can be seen from the results, as the non-convexity of the sparsity-inducing penalty increases, the performance of PGG improves at first, and degenerates when the non-convexity continues to grow. When the non-convexity approaches zero, the performances of these penalties are close to that of the $\ell_1$ penalty. The results support that as the non-convexity increases, the performance of $J$-minimization improves, and verify Theorem 3 that the non-convexity should be smaller than a threshold to guarantee the convergence of PGG.

Figure 3: The figure compares the successful recovery probability of different algorithms versus sparsity $K$ with $M=200$ and $N=1000$ when the nonzero entries of the sparse signal satisfy Gaussian distribution. The approximate precision of approximate ${\bf A}^{\dagger}$ is $\zeta=0.91$.

In the second experiment, the recovery performance of (A)PGG is compared in the noiseless scenario with some typical sparse recovery algorithms. The simulation results when the nonzero entries of the sparse signal satisfy Gaussian distribution are demonstrated in Fig. 3. As can be seen, for both distributions, IRLS, PGG, and APGG guarantee successful recovery for larger sparsity $K$ than the other references. It also reveals that in the noiseless scenario with sufficiently small step size, the approximate projection has little influence on the recovery performance of APGG.