## Sparse Subspace Clustering with Downsampling

### 1 Motivation

With the rapid growth of data, a dataset is often composed of a huge number of data points. To reduce the computational cost in such big-data problems, it is crucial to pick out, and deal only with, the most informative data points.

Specifically, efficiently downsampling a big dataset while preserving the structure of its subspaces has become an important topic for Sparse Subspace Clustering (SSC), a technique that partitions unlabeled samples according to the subspaces in which they are located.

In order to reduce the computational cost while preserving clustering accuracy, a new approach, SSC with Downsampling (SSCD), is proposed in the paper "Downsampling for Sparse Subspace Clustering" by Xianghui Mao, Xiaohan Wang, and Yuantao Gu, in the 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'15).

### 2 The proposed algorithm

In SSCD, the number of data points lying in the same subspace as a specific data point is first estimated using the $\ell_1$ norm of that data point's sparse representation. A downsampling strategy is then designed that decimates data points with a probability inversely proportional to the number of data points in the same subspace. As a consequence, the data points in different subspaces are expected to be balanced after the downsampling operation. The procedures of SSC and SSCD are compared in Fig. 1. Please refer to our paper for more details.

Figure 1: An intuitive explanation of SSC and SSCD is provided by clustering 16 data points, denoted by circled numbers, into 3 subspaces, distinguished by different colors. (a) SSC consists of two steps. S1: a sparse self-representation is estimated to discover the relations among data points. S2: spectral clustering is applied on the similarity graph to partition the data points, and the structure of the subspaces is extracted. (b) SSCD is proposed to improve the performance on large-scale unbalanced datasets. D1: a sparse self-representation is estimated; this step is exactly the same as S1 in SSC. D2: the probability of each data point being downsampled is determined by its sparse representation, where each circle's area denotes the respective probability. D3: the original dataset is downsampled to produce a small and balanced subset according to the downsampling probabilities of the respective data points. D4: a new sparse self-representation is estimated on the balanced subset. D5: spectral clustering is applied on the new similarity graph to find the structure of the subspaces; this step is similar to S2 in SSC but is executed on a small and balanced dataset.

### 3 Theoretical results

Let $\chi=\{\mathbf{x}_1, \cdots, \mathbf{x}_N\}, \mathbf{x}_i\in \mathbb{R}^n, \Vert \mathbf{x}_i\Vert_2=1, \forall i\in \{1,\cdots,N\}$ denote a normalized dataset. This dataset can be partitioned into $L$ subsets, $\chi_l=\{\mathbf{x}_{l(1)},\cdots,\mathbf{x}_{l(N_l)}\}$, where $N_l$ denotes the number of data points in the $l^{\mathrm{th}}$ subset, i.e., $\sum_{l=1}^{L}N_l=N.$ For each $l\in\{1,\cdots,L\}$, the data points in $\chi_l$ lie in a $d$-dimensional subspace $U_l$, from which they are drawn independently and uniformly. Given the dataset, the problem studied here is to recover the $L$ subspaces. It is assumed that the $N_l$ are large and that all subspaces are well separated. The following mathematical results follow from the properties of the $\ell_1$-norm optimization formulation and its geometric interpretation.

Proposition 1: The number of data points lying in the same subspace as $\mathbf{x}_i$ is approximately inversely proportional to

$$w_i\triangleq\left(1-\frac1{\Vert \hat{\bf{c}}_i\Vert_1^d}\right)^{\frac{d-1}{2}},$$

where $\hat{\mathbf{c}}_i$ is defined as follows:

$$\hat{\mathbf{c}}_i = \arg\min \Vert \mathbf{c}_i\Vert_1,\qquad \mathrm{s.t.}\quad \mathbf{x}_i=\mathbf{Xc}_i,\quad \mathbf{c}_i(i)=0.$$
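The optimization defining $\hat{\mathbf{c}}_i$ is an $\ell_1$ minimization with linear equality constraints, so it can be solved as a linear program by splitting $\mathbf{c}_i$ into nonnegative positive and negative parts. Below is a minimal Python sketch of this standard reformulation; it is not the authors' implementation (the paper's experiments use the $\ell_1$ homotopy method):

```python
import numpy as np
from scipy.optimize import linprog

def sparse_representation(X, i):
    """Solve  min ||c||_1  s.t.  x_i = X c,  c(i) = 0,
    as a linear program with the split c = c_pos - c_neg."""
    n, N = X.shape
    x_i = X[:, i]
    # Variables: [c_pos; c_neg], both nonnegative; objective is their sum.
    cost = np.ones(2 * N)
    A_eq = np.hstack([X, -X])          # X c_pos - X c_neg = x_i
    bounds = [(0, None)] * (2 * N)
    bounds[i] = (0, 0)                 # force c(i) = 0
    bounds[N + i] = (0, 0)
    res = linprog(cost, A_eq=A_eq, b_eq=x_i, bounds=bounds, method="highs")
    return res.x[:N] - res.x[N:]
```

For large $N$, a dedicated solver (homotopy, ADMM) is far more efficient than a generic LP, which is why such methods are used in practice.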

Proposition 2: The number of data points in the $l^{\rm th}$ subspace, $N_l$, can be approximated by

$${\hat N}_l = C_d{\left(1-\mathbb{E}\left\{\frac{1}{\Vert \hat{\bf c}_i\Vert_1^d}\bigg|{\bf x}_i\in U_l\right\}\right)}^{-\frac{d-1}{2}},$$

where

$$C_d=\frac{1}{2}{\left(\frac{\mathrm{Vol}_d(B^{(d)})}{c_{B^{\left(d\right)}}+o(1)}\right)}^{-\frac{d-1}{2}}$$

is a constant depending only on $d$, where $c_{B^{\left(d\right)}}$ is another constant depending on $d$.

Based on Propositions 1 and 2, by setting the decimating probability of each data point ${\bf x}_i$ to $p_i = \textstyle w_i/\textstyle \sum_{j=1}^N w_j$, the downsampling procedure in SSCD balances the sampling rates across subspace clusters. Performing SSC on the downsampled dataset not only guarantees high subspace recovery accuracy but also saves computation time and memory.
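Given the sparse codes, the downsampling step reduces to computing the weights $w_i$ from the $\ell_1$ norms and drawing $M$ indices without replacement. A minimal numpy sketch, assuming the codes are stacked as columns of a matrix `C_hat` and that $\Vert \hat{\mathbf{c}}_i\Vert_1 > 1$ (which holds generically when unit-norm points are represented by other unit-norm points, since $1 = \Vert \mathbf{X}\hat{\mathbf{c}}_i\Vert_2 \le \Vert \hat{\mathbf{c}}_i\Vert_1$):

```python
import numpy as np

def downsample(C_hat, d, M, rng=None):
    """Draw M column indices with probability p_i = w_i / sum_j w_j,
    where w_i = (1 - 1/||c_i||_1^d)^((d-1)/2) as in Proposition 1."""
    rng = np.random.default_rng() if rng is None else rng
    l1 = np.abs(C_hat).sum(axis=0)              # ||c_i||_1 per column
    w = (1.0 - 1.0 / l1**d) ** ((d - 1) / 2.0)  # downsampling weights
    p = w / w.sum()                             # decimating probabilities
    return rng.choice(C_hat.shape[1], size=M, replace=False, p=p)
```

Columns with small $\ell_1$ norm (points from sparsely populated subspaces) get larger $w_i$, so under-represented clusters survive the decimation at a higher rate.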

### 4 Numerical results

In the first experiment, three $5$-dimensional subspaces are generated in a $100$-dimensional ambient space. $500$, $100$, and $30$ data points are drawn independently and uniformly from the respective subspaces and then normalized. According to the left subplot of Fig. 2, the decimating probability of each data point is closely related to the number of data points in the corresponding subspace cluster, which verifies Proposition 1. Meanwhile, the right subplot of Fig. 2 shows the linear relation between the estimated and actual numbers of data points, which coincides with the theoretical result in Proposition 2.
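Synthetic data of this kind can be generated along the following lines (a sketch under our reading of the setup, not the authors' code): a random orthonormal basis defines each subspace, and points are drawn as normalized Gaussian combinations of the basis vectors, which is equivalent to sampling uniformly from the unit sphere of the subspace.

```python
import numpy as np

def sample_subspace_points(n, d, counts, rng=None):
    """Draw counts[l] unit-norm points uniformly from a random
    d-dimensional subspace of R^n, for each cluster l."""
    rng = np.random.default_rng() if rng is None else rng
    points, labels = [], []
    for l, N_l in enumerate(counts):
        # Random orthonormal basis via QR of a Gaussian matrix.
        U, _ = np.linalg.qr(rng.standard_normal((n, d)))
        Y = U @ rng.standard_normal((d, N_l))
        Y /= np.linalg.norm(Y, axis=0)   # normalize columns to the sphere
        points.append(Y)
        labels.extend([l] * N_l)
    return np.hstack(points), np.array(labels)

# Setting of the first experiment: n = 100, d = 5, clusters of 500/100/30.
X, y = sample_subspace_points(100, 5, [500, 100, 30])
```

The unbalanced cluster sizes $(500, 100, 30)$ are exactly what makes uniform downsampling risky: a small uniform sample may miss the $30$-point cluster entirely.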

Figure 2: The left subplot demonstrates $w_i$ in one trial, where the solid lines denote the means of $w_i$ in the respective clusters. The right subplot depicts that the average of $\hat{N}_l/C_d$ is in good linear relation with $N_l$.

In the second experiment, the proposed SSCD method is compared with uniform downsampling (SSCU). The experimental setting is the same as in the first experiment, which verifies Proposition 1. The performance of SSCD is tested by varying the downsampling rate $M/N$ from $1\%$ to $30\%$. $100$ trials are run, and the results are shown in Fig. 3. One may read from Fig. 3 that in this scenario only $10\%$ of the data points of the original dataset are enough for SSCD to recover the structures of all three subspaces, while uniform downsampling requires at least $30\%$. At the same time, the processing time is evidently reduced by SSCD. To illustrate the efficiency of the downsampling step in SSCD, the similarity graphs in one trial are displayed in Fig. 4.

Figure 3: The left subplot compares the recovery accuracy of SSC, SSCD, and SSCU. The right subplot displays the processing times of the three methods. In our experiment, the $\ell_1$ homotopy method is applied in steps S1, D1, and D4, and random-walk spectral clustering with the eigengap heuristic is performed in steps S2 and D5.

Figure 4: The similarity matrices generated by SSC (left), SSCU (middle), and SSCD (right). Both downsampled datasets are of cardinality $45$. For convenience, yellow squares label the edges of the diagonal blocks.