## Compressed Subspace Clustering

1 Motivation

Unlike traditional clustering methods based on Euclidean distance, subspace clustering methods assign the same label to data points lying in the same subspace. Since the Euclidean distances between norm-bounded data points become approximately equal when the dimensionality of the ambient space is high, subspace clustering methods can give more reliable results on high-dimensional data. Among subspace clustering methods, Sparse Subspace Clustering (SSC) is a prevalent one; it is based on the property that data points in the same subspace can be linearly represented by one another. This technique has a wide range of applications, including network data analysis, image segmentation, and medical image processing.

Since SSC is most beneficial when dealing with high-dimensional data, a vitally important problem is how to reduce the high computational cost that comes with high dimensionality. Focusing on subspace clustering of high-dimensional data sets, we proposed Compressed Subspace Clustering (CSC), which performs subspace clustering on random measurements of the data set. This work was presented in the paper "Compressed Subspace Clustering: A Case Study" at the 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP'14) by Xianghui Mao and Yuantao Gu.

2 Intuition

The core idea of CSC is to reduce the dimensionality of the original data via random projection, as illustrated in Fig. 1. After the dimensionality reduction, the self-expression step of SSC operates on much shorter vectors, which significantly reduces the computational cost. Meanwhile, CSC saves memory by working only with an adequate number of random measurements instead of the high-dimensional original data points. The only remaining question is whether the compressed data points can still be clustered accurately. Fortunately, we found that when the number of measurements is not too small, the answer is yes!

Figure 1: Flowchart of Compressed subspace clustering (CSC).
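The pipeline in Fig. 1 can be sketched in a few lines, assuming NumPy and scikit-learn are available. The function name `csc` and the parameter `lam` are illustrative, and the Lasso-based self-expression followed by spectral clustering is a simplified stand-in for a full SSC solver, not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def csc(X, n, n_clusters, lam=0.01, seed=0):
    """Sketch of the CSC pipeline: random projection, then SSC-style
    self-expression and spectral clustering on the compressed points."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, M))  # Gaussian sketch
    Y = A @ X                                           # compressed data
    # Self-expression: y_i ~ Y c_i with c_ii = 0, sparse c_i
    C = np.zeros((N, N))
    for i in range(N):
        idx = [j for j in range(N) if j != i]
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        lasso.fit(Y[:, idx], Y[:, i])
        C[idx, i] = lasso.coef_
    W = np.abs(C) + np.abs(C).T                         # symmetric affinity
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed',
                              random_state=seed).fit_predict(W)
```

Note that the self-expression step now solves $N$ Lasso problems with only $n$ equations each, instead of $M$, which is where the computational saving comes from.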

3 Theoretical results

We consider the scenario of $L$ subspaces $\{S_1,\cdots, S_L\}$, each of dimension $d$, lying in the ambient space $\mathbb{R}^M$. In each subspace $S_l$, $N=\rho d+1$ data points are drawn uniformly at random from its unit sphere $S^{d-1}$, yielding a data set $\mathcal{X}_l=\{{\bf{x}}_i^{(l)}\}_{i=1}^N$, where $\rho$ represents the sampling rate per dimension. Given the union $\mathcal{X}=\mathcal{X}_1\cup\cdots\cup\mathcal{X}_L$, we aim to partition $\mathcal{X}$ into $\mathcal{X}_1,\cdots,\mathcal{X}_L$.
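This sampling model is easy to reproduce numerically; the following sketch (helper name `sample_subspace_points` is ours) draws unit-norm points uniformly from the sphere of a random $d$-dimensional subspace of $\mathbb{R}^M$.

```python
import numpy as np

def sample_subspace_points(M, d, N, rng):
    """Draw N points uniformly from the unit sphere of a random
    d-dimensional subspace of R^M (the model of Section 3)."""
    U = np.linalg.qr(rng.normal(size=(M, d)))[0]   # orthonormal basis of S_l
    A = rng.normal(size=(d, N))
    A /= np.linalg.norm(A, axis=0)                 # uniform on S^{d-1}
    return U @ A                                   # unit-norm points in S_l

rng = np.random.default_rng(0)
X_l = sample_subspace_points(M=120, d=10, N=41, rng=rng)  # rho = 4
```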

Suppose a compressed data set $\mathcal{Y}=\mathcal{Y}_1\cup\cdots\cup\mathcal{Y}_L$, with $\mathcal{Y}_l=\{{\bf{y}}_i^{(l)}\}_{i=1}^{N}$, is obtained by ${\bf{y}}_i^{(l)}={\bf{A}}{\bf{x}}_i^{(l)}$, where ${\bf{A}} \in \mathbb{R}^{n\times M}$ is a Gaussian random matrix with entries $a_{i,j}\sim \mathcal{N}\left(0,\frac{1}{n}\right)$. The SSC algorithm is then applied directly to cluster the compressed data set $\mathcal{Y}$. Herein, we define an operator $\mathcal{A}$ mapping from $\mathrm{Gr}(d,M)$ to $\mathrm{Gr}(d,n)$ such that $\mathcal{A}S_l$ is the subspace spanned by $\mathcal{Y}_l$, where $\mathrm{Gr}(d,M)$ denotes the Grassmannian of $d$-dimensional subspaces of an $M$-dimensional vector space.
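The compression step and the operator $\mathcal{A}$ can be checked directly: when $n\ge d$, the compressed points $\mathcal{Y}_l$ almost surely span a $d$-dimensional subspace of $\mathbb{R}^n$, so $\mathcal{A}S_l$ is well defined on $\mathrm{Gr}(d,n)$. A small sketch with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, n, N = 120, 10, 30, 41

U = np.linalg.qr(rng.normal(size=(M, d)))[0]        # basis of S_l
X_l = U @ rng.normal(size=(d, N))
X_l /= np.linalg.norm(X_l, axis=0)                  # points on the unit sphere of S_l

A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, M))  # a_ij ~ N(0, 1/n)
Y_l = A @ X_l                                       # compressed data set Y_l

# A S_l is spanned by the compressed points: rank(Y_l) = d almost surely
assert np.linalg.matrix_rank(Y_l) == d
```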

Theorem 1 When the affinity of two subspaces $S_k$ and $S_l$, $\mathrm{aff}\left(S_k,S_l\right)$, equals zero,  it holds that

$$\begin{split}&P\left(\left|{\mathrm {aff}^2 \left(\mathcal{A}S_k,\mathcal{A}S_l\right)}-\frac{d^2}{n}\right|\leq \delta \right) \geq 1-\min\left(\frac{2d^2}{\left({n}^2-1\right){\delta}^2}, \mathrm{e}^{\frac{-2{\delta}^2}{d^2}}\right).\end{split}$$
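A Monte Carlo sanity check of Theorem 1: project pairs of orthogonal $d$-dimensional subspaces (so $\mathrm{aff}(S_k,S_l)=0$) and verify that the squared compressed affinity concentrates near $d^2/n$. The affinity is computed as the Frobenius norm of $U_k^{\top}U_l$ for orthonormal bases of the projected spans; the parameter values are illustrative.

```python
import numpy as np

def affinity_sq(Y1, Y2):
    """Squared affinity via orthonormal bases of the two spans."""
    U1 = np.linalg.qr(Y1)[0]
    U2 = np.linalg.qr(Y2)[0]
    return np.linalg.norm(U1.T @ U2, 'fro') ** 2

rng = np.random.default_rng(0)
M, d, n, trials = 200, 5, 50, 200

vals = []
for _ in range(trials):
    Q = np.linalg.qr(rng.normal(size=(M, 2 * d)))[0]
    U_k, U_l = Q[:, :d], Q[:, d:]          # orthogonal subspaces: aff = 0
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, M))
    vals.append(affinity_sq(A @ U_k, A @ U_l))

mean = np.mean(vals)                       # should concentrate near d^2/n
```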

Theorem 2  Let $S_k$ and $S_l$ be any two $d$-dimensional subspaces. Suppose $S_k\cap S_l$ is of dimension $m$, and the orthogonal complements of $S_k\cap S_l$ with regard to $S_k$ and $S_l$ are orthogonal to each other. It can be derived that

$$\begin{split}&P\left(\left|\mathrm{aff}^2\left(\mathcal{A}S_k,\mathcal{A}S_l\right)-m-\frac{{\left(d-m\right)}^2}{n-m}\right|\leq \delta \right) \geq 1-\min\left(\frac{2{\left(d-m\right)}^2}{[{\left(n-m\right)}^2-1]{\delta}^2}, \mathrm{e}^{\frac{-2{\delta}^2}{{\left(d-m\right)}^2}}\right).\end{split}$$
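Theorem 2 can be checked the same way: build two subspaces sharing an $m$-dimensional intersection with mutually orthogonal complements, project, and compare the average squared affinity to $m+\frac{(d-m)^2}{n-m}$. All numbers below are illustrative.

```python
import numpy as np

def affinity_sq(Y1, Y2):
    """Squared affinity via orthonormal bases of the two spans."""
    U1 = np.linalg.qr(Y1)[0]
    U2 = np.linalg.qr(Y2)[0]
    return np.linalg.norm(U1.T @ U2, 'fro') ** 2

rng = np.random.default_rng(0)
M, d, m, n, trials = 200, 8, 3, 60, 200

vals = []
for _ in range(trials):
    Q = np.linalg.qr(rng.normal(size=(M, 2 * d - m)))[0]
    W = Q[:, :m]                           # shared m-dimensional part
    U_k = np.hstack([W, Q[:, m:d]])        # basis of S_k
    U_l = np.hstack([W, Q[:, d:]])         # basis of S_l, complement orthogonal
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, M))
    vals.append(affinity_sq(A @ U_k, A @ U_l))

est = m + (d - m) ** 2 / (n - m)           # concentration point of Theorem 2
mean = np.mean(vals)
```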

Theorems 1 and 2 both state the concentration property of the compressed subspace affinity. The estimate of the squared compressed subspace affinity, $m+\frac{{\left(d-m\right)}^2}{n-m}$, is jointly determined by the number of measurements, the subspace dimension, and the dimension of the intersection; Theorem 1 is the special case $m=0$. Based on this understanding of the compressed subspace affinity and existing theoretical results on SSC, we arrive at the following compression bound for CSC.

Corollary 1 Suppose there are $L$ subspaces. For every pair $S_k$ and $S_l$, assume $S_k\cap S_l$ has the same dimension $m$, and the orthogonal complements of $S_k\cap S_l$ with respect to $S_k$ and $S_l$ are orthogonal to each other. Then the subspace detection property holds with probability at least $1-\epsilon_1-\epsilon_2-\epsilon_3$ as long as

$$m+\delta < \mathrm{aff}_0^2,\quad \mathrm{and} \quad n > n_0=m+ \frac{{(d-m)}^2}{\mathrm{aff}_0^2-m-\delta},$$

where

\begin{align} \mathrm{aff}_0&=\frac{c(\rho)\sqrt{\beta d\log{\rho}}}{4(2\log{N}+t)} \thicksim \mathcal{O}\left(\sqrt{d}\right),\nonumber\\ \epsilon_1&=N\mathrm{e}^{-\rho (1-\beta)d}, \nonumber\\ \epsilon_2&=\frac{2}{\rho d(\rho d+1)}\mathrm{e}^{-2t},\nonumber\\  \epsilon_3&={\frac{L^2-L}{2}}\min\left(\frac{2{(d-m)}^2}{[{(n-m)}^2-1]{\delta}^2}, \mathrm{e}^{\frac{-2{\delta}^2}{{(d-m)}^2}}\right).\nonumber  \end{align}
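As a worked example of the compression bound, the following evaluates $n_0$ for hypothetical values of $d$, $m$, $\mathrm{aff}_0^2$, and $\delta$ (these are assumed numbers for illustration, not values from the paper's experiments):

```python
d, m = 10, 3
aff0_sq = 8.0    # assumed value of aff_0^2, for illustration only
delta = 1.0

# First condition of Corollary 1: m + delta < aff_0^2
assert m + delta < aff0_sq

# Compression bound: n_0 = m + (d - m)^2 / (aff_0^2 - m - delta)
n0 = m + (d - m) ** 2 / (aff0_sq - m - delta)   # 3 + 49/4 = 15.25
```

So in this hypothetical setting roughly 16 measurements per point would already satisfy $n>n_0$, far below the ambient dimension.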

Corollary 1 states that, under the above conditions on the subspaces, $n>n_0\sim\mathcal{O}(d)$ random measurements suffice to ensure accurate subspace clustering results with overwhelming probability.

4 Numerical results

We generate $L=8$ subspaces in $\mathbb{R}^{M}$ ($M=120$), each of dimension $d=10$. Specifically, for every $l\in\{1,\cdots,L\}$, we choose the corresponding basis ${\bf{\psi}}_l$ as an orthonormalized Gaussian random matrix in $\mathbb{R}^{M\times d}$. By restricting the first $3$ columns of the different ${\bf{\psi}}_l$'s to be identical, each pairwise intersection has dimension $m=3$. The points in $S_l$, $\forall l\in\{1,\cdots,L\}$, are drawn independently according to ${\bf{x}}_j^{(l)}={\bf{\psi}}_l{\bf{a}}_j^{(l)}$, $j\in\{1,\cdots,N\}$, where ${\bf{a}}_j^{(l)}\sim\mathcal{N}(0, \frac{1}{d}{\bf{I}}_d)$. We evaluate the clustering performance by the accuracy rate $r:=N_s/(LN)$, where $N_s$ is the number of correctly classified data points. The average accuracy of CSC in different settings and a performance comparison between SSC and CSC are shown in Fig. 2 and Fig. 3, respectively. The numerical results reveal that CSC greatly reduces the computational cost with limited loss of clustering accuracy.
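A sketch of this experimental setup, assuming NumPy and SciPy are available; the helper `accuracy` (which resolves the label-permutation ambiguity with a Hungarian matching) and `N=30` are our illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
M, L, d, m, N = 120, 8, 10, 3, 30

# Sharing the first m orthonormal columns makes every pairwise
# intersection (generically) m-dimensional.
shared = np.linalg.qr(rng.normal(size=(M, m)))[0]
X, truth = [], []
for l in range(L):
    G = np.hstack([shared, rng.normal(size=(M, d - m))])
    Psi = np.linalg.qr(G)[0]                   # orthonormalized basis psi_l
    A_l = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, N))   # a_j ~ N(0, I_d/d)
    X.append(Psi @ A_l)
    truth += [l] * N
X = np.hstack(X)                               # full data set, M x (L*N)

def accuracy(labels, truth, L):
    """r = N_s / (L*N), maximized over permutations of cluster labels."""
    C = np.zeros((L, L), dtype=int)
    for a, b in zip(labels, truth):
        C[a, b] += 1
    r, c = linear_sum_assignment(-C)           # best label matching
    return C[r, c].sum() / len(truth)
```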

Figure 2: Average accuracy rate of compressed subspace clustering, where x-axis represents the amount of random measurements per data point, and y-axis represents the amount of data points per subspace.

Figure 3: Performance comparison, with the solid lines with circle markers denoting the performance of SSC, and the dashed lines with square markers denoting that of CSC with $n=22$. The blue lines represent accuracy rates, while red lines represent computation time. As a reference criterion, the black dotted line stands for the 95% accuracy rate.