Recovery of Corrupted Low-Rank Tensors

Show more

1. Introduction

The problem of exploiting low-dimensional structures in high-dimensional data is taking on increasing importance in image, text and video processing, and web search, where the observed data lie in very high dimensional spaces. The principal component analysis (PCA) proposed in [1] is the most widely used tool for low-rank component extraction and dimensionality reduction. It is easily implementable and efficient for data mildly corrupted by small noise. However, this PCA method is sensitive to data corrupted by heavy impulse noise or outlying observations. Then, a number of robust PCA methods were proposed. But none of these approaches yield a polynomial-time algorithm with strong performance guarantees under broad conditions. The proposed Robust PCA [2] is one among the earliest polynomial-time algorithms. Assume that a data matrix $X\in {\mathbb{R}}^{{n}_{1}\times {n}_{2}}$ consists of a low-rank matrix ${A}_{0}$ and a sparse matrix ${E}_{0}$ . Then, ${A}_{0}$ and ${E}_{0}$ can be recovered with a high probability by solving the following convex relaxation problem (if ${A}_{0}$ is low-rank and satisfies some incoherent conditions, ${E}_{0}$ is sufficiently sparse):

$\begin{array}{l}\underset{A\mathrm{,}E}{\mathrm{min}}{\Vert A\Vert}_{\mathrm{*}}+\lambda {\Vert E\Vert}_{1}\\ \text{s}\text{.t}\mathrm{.}\text{}X=A+E\end{array}$ (1)

where ${\Vert A\Vert}_{\mathrm{*}}$ denotes the nuclear norm of $A$ and ${\Vert E\Vert}_{1}$ denotes the ${l}_{1}$ -norm of $E$ . Nuclear norm and ${l}_{1}$ -norm are used to induce low rank and sparsity, specifically. $\lambda >0$ is a parameter balancing the low rank and sparsity. Candes et al. [2] prove that $\lambda =1/\sqrt{\mathrm{max}\left({n}_{1},{n}_{2}\right)}$ works with a high probability for recovering any low-rank, incoherent matrix. Notably, Chandrasekaran et al. [3] also consider the problem of decomposing a given data matrix into sparse and low-rank components, and give sufficient conditions for convex programming to succeed. Their work was motivated by applications in system identification and learning of graphical models. In contrast, Candes et al. [2] were motivated by robust principal component computations in high dimensional settings when there were erroneous and missing entries; missing entries were not considered in [3] . In [3] , the parameter $\lambda $ is data-dependent, and may have to be selected by solving a number of convex programs, while Candes et al. [2] provide a universal value of $\lambda $ , namely, $\lambda =1/\sqrt{\mathrm{max}\left({n}_{1},{n}_{2}\right)}$ . The distinction between the two results is a consequence of different assumptions about the origin of the data matrix $X$ .

In many real world applications, we need to consider the model defined in Equation (1) under more complicated circumstance [4] [5] . First, only a fraction of entries of $X$ can be observed. This is the well-known matrix completion problem. If the unknown matrix is known to have low rank or approximately low rank, then accurate and even exact recovery is possible by nuclear norm minimization [6] [7] . Second, the observed data are corrupted by both impulse noise (sparse but large) and Gaussian noise (small but dense). Let $X$ be the superposition of low-rank matrix $A$ , the impulse noise matrix $E$ and the Gaussian noise $F$ , i.e., $X=A+E+F$ . The Gaussian noise of the observed entries is assumed to be small in the sense that ${\Vert {P}_{\Omega}\left(F\right)\Vert}_{F}\le \delta $ , where $\delta $ is the Gaussian noise level and ${\Vert \cdot \Vert}_{F}$ is the Frobenius norm. Then, to be broadly applicable, we consider the following extension of model defined in Equation (1):

$\begin{array}{l}\underset{A\mathrm{,}E}{\mathrm{min}}{\Vert A\Vert}_{\mathrm{*}}+\lambda {\Vert E\Vert}_{1}\\ \text{s}\text{.t}.\text{}{\Vert {\mathcal{P}}_{\Omega}\left(X-A-E\right)\Vert}_{F}\le \delta \end{array}$ (2)

where $\Omega $ is a subset of the index set of entries $\left\{1,2,\cdots ,{n}_{1}\right\}\times \left\{1,2,\cdots ,{n}_{2}\right\}$ . It’s assumed that only these entries $\left\{X-ij\mathrm{,}\left(i\mathrm{,}j\right)\in \Omega \right\}$ can be observed. The operator ${P}_{\Omega}\mathrm{:}{\mathbb{R}}^{{n}_{1}\times {n}_{2}}\to {\mathbb{R}}^{{n}_{1}\times {n}_{2}}$ is a orthogonal projection onto the span of matrices vanishing outside of $\Omega $ so that the ij-th entry of ${P}_{\Omega}\left(X\right)$ is ${X}_{ij}$ if $\left(i\mathrm{,}j\right)\in \Omega $ and zero otherwise. The problem defined in Equation (2) can be solved by the classical Augmented Lagrangian Method (ALM). The separable structure emerging in the objective function and the constraints entails the idea of splitting the corresponding augmented Lagrangian function to derive more efficient numerical algorithms. Tao et al. [5] developed the Alternating Splitting Augmented Lagrangian Method (ASALM) and its variant (VASALM), and the Parallel Splitting Augmented Lagrangian method (PSALM) for solving Equation (2).

One shortcoming of model defined in Equation (2) is that it can only handle matrix (two-way) data. However, the real-world data are ubiquitously in multi-way, also referred to as tensor. For example, a color image is a 3-way object with column, row and color modes; a greyscale video is indexed by two spatial variables and one temporal variable. If we use the model defined in Equation (2) to process the tensor data, we have to unfold the multi-way data into a matrix. Such a preprocessing usually leads to the loss of the inherent structure high-di- mensional information in the original observations. To avoid this negative factor, a common approach is to manipulate the tensor data by taking the advantage of its multi-dimensional structure. Tensor analysis have many applications in computer vision [8] , diffusion Magnetic Resonance Imaging (MRI) [9] [10] [11] , quantum entanglement problem [12] , spectral hypergraph theory [13] and higher-order Markov chains [14] .

The goal of this paper is to study the Tensor Robust PCA which aims to accurately recover a low-rank tensor from impulse and Gaussian noise. The observations can also be incomplete. Tensors of low rank appear in a variety of applications such as video processing (d = 3) [15] , time-dependent 3D imaging (d = 4), ray tracing where the material dependent bidirectional reflection function is an order four tensor that has to be determined from measurements [15] , numerical solution of the electronic Schrödinger equation ( $d=3N$ , where N is the number of particles) [16] [17] [18] , machine learning [19] and more. More specifically, suppose we are given a data tensor $\mathcal{X}\in {\mathbb{R}}^{{n}_{1}\times {n}_{2}\times \cdots \times {n}_{d}}$ ( $d$ is the number of ways), and it can be decomposed as

$\mathcal{X}\mathrm{=}{\mathcal{A}}_{0}+{\mathcal{E}}_{0}+{\mathcal{F}}_{0}$ (3)

where ${\mathcal{A}}_{0}$ is low-rank and ${\mathcal{E}}_{0}$ is sparse. ${\mathcal{F}}_{0}$ is Gaussian noise with the noise level being $\delta $ . Then, we try to recover the low-rank ${\mathcal{A}}_{0}$ through the following convex relaxation problem:

$\begin{array}{l}\underset{A\mathrm{,}\mathcal{E}}{\mathrm{min}}{\Vert \mathcal{A}\Vert}_{*}+\lambda {\Vert \mathcal{E}\Vert}_{1}\\ \text{s}\text{.t}\text{.}{\Vert {\mathcal{P}}_{\Omega}\left(\mathcal{X}-\mathcal{A}-\mathcal{E}\right)\Vert}_{F}\le \delta \end{array}$ (4)

1.1. Related Work

Although the recovery of low-rank matrix has been well studied, the research of low-rank tensor recovery is still lacking. This is mainly because it’s difficult to define a satisfactory tensor rank which enjoys similar good properties as the matrix case. Several different definitions of tensor rank have been proposed but each has its limitation. For example, the CP rank [20] is defined as the smallest number of rank one tensors that sum up to $\mathcal{X}$ , where a rank one tensor is of the form ${u}_{1}\otimes {u}_{2}\otimes \cdots \otimes {u}_{d}$ . Expectedly, the CP rank is NP-hard to compute. Its convex relaxation is also intractable. Another more popular direction is to use the tractable Tucker rank [20] and its convex relaxation. For a d-way tensor $\mathcal{X}$ , the Tucker rank is a vector defined as

${\text{rank}}_{tc}\left(\mathcal{X}\right)\mathrm{:=}\left(\text{rank}\left({X}^{\left(1\right)}\right)\mathrm{,}\text{rank}\left({X}^{\left(2\right)}\right)\mathrm{,}\cdots \mathrm{,}\text{rank}\left({X}^{\left(d\right)}\right)\right)$ ,

where ${X}^{\left(i\right)}$ is the mode-i matricization of $\mathcal{X}$ . Motivated by the fact that the nuclear norm is the convex envelop of the matrix rank within the unit ball of the spectral norm. The Sum of Nuclear Norms (SNN), defined as $\sum}_{i}}{\Vert {X}^{\left(i\right)}\Vert}_{\mathrm{*$ , is used as a convex surrogate of the Tucker rank. This approach is effective, but SNN is not a tight convex relaxation of Tucker rank.

More recently, the work [21] proposed the tensor tubal rank based on a new tensor decomposition scheme denoted as t-SVD. The t-SVD is based on a new tensor-tensor product which enjoys many similar properties as the matrix case. Based on the computable t-SVD, the tensor nuclear norm [22] is used to replace the tubal rank for low-rank tensor recovery (from incomplete/corrupted tensors). The problem of recovering the unknown low-rank tensor $\mathcal{F}$ from incomplete samples ${\mathcal{F}}_{0}$ can be solved by the following convex program [21] ,

$\begin{array}{l}\underset{\mathcal{A}}{\mathrm{min}}{\Vert \mathcal{A}\Vert}_{*}\\ \text{s}\text{.t}.\text{}{\mathcal{P}}_{\Omega}\left(\mathcal{A}\right)={\mathcal{P}}_{\Omega}\left({\mathcal{A}}_{0}\right)\end{array}$ (5)

where ${\Vert \cdot \Vert}_{\mathrm{*}}$ is the nuclear norm of $\mathcal{A}$ , $\Omega $ is the index set of known elements in the original tensor, and ${\mathcal{P}}_{\Omega}$ is the projector onto the span of tensors. Lu et al. [23] extended the work and studied the Tensor Robust Principal Component (TRPCA) problem, as defined in the following equation:

$\begin{array}{l}\underset{A\mathrm{,}\mathcal{E}}{\mathrm{min}}{\Vert \mathcal{A}\Vert}_{*}+\lambda {\Vert \mathcal{E}\Vert}_{1}\\ \text{s}\text{.t}\text{.}\mathcal{X}=\mathcal{A}+\mathcal{E}\end{array}$ (6)

In this work, we go one step further, and consider recovering low-rank and sparse components of tensors from incomplete and noisy observations as defined in Equation (4).

1.2. Paper Contribution

The contributions of this work are two-fold:

・ A unified convex relaxation framework is proposed for the problem of recovering low-rank and sparse components of tensors from incomplete and noisy observations. Three augmented-Lagrangian-based algorithms are developed for the optimization problem.

・ Numerical experiments on synthetical data validate the efficacy of our proposed denoising approach.

1.3. Paper Organization

The rest of the paper is organized as follows. In Section 2, some preliminaries that are useful for the subsequent analysis are provided. In Section 3, three augmented-Lagrangian-based methods are developed for the problem defined in Equation (4). In Section 4, some numerical experiments verify the justification of the model defined in Equation (4) and the efficiency of the proposed algorithms. Finally, in Section 5, we make some conclusions and discuss some topics for future work.

2. Preliminaries

In this section, we list some lemmas concerning the shrinkage operators, which will be used at each iteration of the proposed augmented Lagrangian type methods to solve the generated subproblems.

Lemma 1. For $\tau >0$ , and $T\in {\mathbb{R}}^{{n}_{1}\times {n}_{2}}$ , the solution of the following problem (7) obeys

$\underset{S}{{\displaystyle \mathrm{arg}\mathrm{min}}}\left\{\frac{1}{2}{\Vert S-T\Vert}_{F}^{2}+\tau {\Vert S\Vert}_{1}\right\}$ (7)

is given by $\text{shrink}\left(T\mathrm{,}\tau \right)$ . $\text{shrink}\left(\cdot \mathrm{,}\cdot \right)$ is a soft shrinkage operator and defined as:

$\text{shrink}\left(a\mathrm{,}\kappa \right)\{\begin{array}{l}a-\kappa \text{\hspace{1em}}\text{}a>\kappa \\ 0\text{\hspace{1em}}\text{}\left|a\right|\le \kappa \\ a+\kappa \text{\hspace{1em}}\text{}a<-\kappa \end{array}$ (8)

Lemma 2. Consider the singular value decomposition (SVD) of a matrix $A\in {\mathbb{R}}^{{n}_{1}\times {n}_{2}}$ of rank $r$ .

$A=Q\mathrm{*}S\mathrm{*}V\mathrm{,}\text{\hspace{1em}}S=\text{diag}\left({\left\{{\sigma}_{i}\right\}}_{1\le i\le r}\right)$ (9)

where $Q\in {\mathbb{R}}^{{n}_{1}\times r}$ and $V\in {\mathbb{R}}^{{n}_{2}\times r}$ are orthogonal, and the singular values ${\sigma}_{i}$ are real and positive. Then, for all $\tau >0$ , define the soft-thresholding operator $\mathcal{D}$ ,

${\mathcal{D}}_{\tau}\left(A\right)\mathrm{:}=Q\mathrm{*}{\mathcal{D}}_{\tau}\left(S\right)\mathrm{*}V\mathrm{,}{\mathcal{D}}_{\tau}\left(S\right)=\text{diag}\left({\left\{{\left({\sigma}_{i}-\tau \right)}_{+}\right\}}_{1\le i\le r}\right)$ (10)

where ${x}_{+}$ is the operator that ${x}_{+}=\mathrm{max}\left(0,x\right)$ . Then, for each $\tau >0$ and $B\in {\mathbb{R}}^{{n}_{1}\times {n}_{2}}$ , the singular value shrinkage operator (10) obeys

${\mathcal{D}}_{\tau}\left(A\right)=\underset{B}{{\displaystyle \mathrm{arg}\mathrm{min}}}\left\{\frac{1}{2}{\Vert B-A\Vert}_{F}^{2}+\tau {\Vert B\Vert}_{\mathrm{*}}\right\}$ (11)

3. Algorithm

An alternative model to study the problem defined in Equation (4) is the following nuclear-norm- and ${l}_{1}$ -norm- normalized least squares problem:

$\underset{A\mathrm{,}\mathcal{E}}{\mathrm{min}}{\Vert \mathcal{A}\Vert}_{\mathrm{*}}+{\lambda}_{1}{\Vert \mathcal{E}\Vert}_{1}+{\lambda}_{2}{\Vert {\mathcal{P}}_{\Omega}\left(\mathcal{X}-\mathcal{A}-\mathcal{E}\right)\Vert}_{F}^{2}$ (12)

Equation (12) can be reformulated into the following favourable form:

$\begin{array}{l}\underset{A\mathrm{,}\mathcal{E},\mathcal{F}}{\mathrm{min}}{\Vert \mathcal{A}\Vert}_{\mathrm{*}}+{\lambda}_{1}{\Vert \mathcal{E}\Vert}_{1}+{\lambda}_{2}{\Vert {\mathcal{P}}_{\Omega}\left(\mathcal{F}\right)\Vert}_{F}^{2}\\ \text{s}\text{.t}.\text{}\mathcal{X}=\mathcal{A}+\mathcal{E}+\mathcal{F}\end{array}$ (13)

Alternating Direction Method of Multiplier (ADMM), which is an extension of ALM algorithm, can be used to solve the tensor recovery problem defined in (13). With given $\left({\mathcal{A}}^{k}\mathrm{,}{\mathcal{E}}^{k}\mathrm{,}{\mathcal{F}}^{k}\right)$ , the ADMM generate the new iterates via the following scheme:

$\{\begin{array}{l}{\mathcal{A}}^{k+1}=\underset{\mathcal{A}}{{\displaystyle \mathrm{arg}\mathrm{min}}}\left({\Vert \mathcal{A}\Vert}_{\mathrm{*}}+\frac{\beta}{2}\Vert \mathcal{X}-\mathrm{(}\mathcal{A}+{\mathcal{E}}^{k}+{\mathcal{F}}^{k}\mathrm{)}+\frac{{\Lambda}_{1}^{k}}{\beta}\Vert \right)\\ {\mathcal{E}}^{k+1}=\underset{\mathcal{E}}{{\displaystyle \mathrm{arg}\mathrm{min}}}\left({\lambda}_{1}{\Vert \mathcal{E}\Vert}_{1}+\frac{\beta}{2}\Vert \mathcal{X}-\left({\mathcal{A}}^{k+1}+\mathcal{E}+{\mathcal{F}}^{k}\right)+\frac{{\Lambda}_{1}^{k}}{\beta}\Vert \right)\\ {\mathcal{F}}^{k+1}=\underset{\mathcal{F}\in \mathcal{B}}{{\displaystyle \mathrm{arg}\mathrm{min}}}\left({\lambda}_{2}{\Vert \mathcal{F}\Vert}_{F}^{2}+\frac{\beta}{2}\Vert \mathcal{X}-\left({\mathcal{A}}^{k+1}+{\mathcal{E}}^{k+1}+\mathcal{F}\right)+\frac{{\Lambda}_{1}^{k}}{\beta}\Vert \right)\\ {\Lambda}_{1}^{k+1}={\Lambda}_{1}^{k}+\beta \left(\mathcal{X}-\left({\mathcal{A}}^{k+1}+{\mathcal{E}}^{k+1}+{\mathcal{F}}^{k+1}\right)\right)\end{array}$ (14)

See Algorithm 1 for the optimization details.

3.1. Stopping Criterion

It can be easily verified that the iterates generated by the proposed ADMM algorithm can be characterized by

$\{\begin{array}{l}0\in \partial {\Vert {\mathcal{A}}^{k+1}\Vert}_{\mathrm{*}}-\left[{\Lambda}_{1}^{k}-\beta \left({\mathcal{A}}^{k+1}+{\mathcal{E}}^{k}+{\mathcal{F}}^{k}-\mathcal{X}\right)\right]\\ 0\in \partial \left({\lambda}_{1}{\Vert {\mathcal{E}}^{k+1}\Vert}_{1}\right)-\left[{\Lambda}_{1}^{k}-\beta \left({\mathcal{A}}^{k+1}+{\mathcal{E}}^{k+1}+{\mathcal{F}}^{k}-\mathcal{X}\right)\right]\\ 0\in \partial \left({\lambda}_{2}{\Vert {\mathcal{F}}^{k+1}\Vert}_{F}^{2}\right)-\left[{\Lambda}_{1}^{k}-\beta \left({\mathcal{A}}^{k+1}+{\mathcal{E}}^{k+1}+{\mathcal{F}}^{k+1}-\mathcal{X}\right)\right]\\ {\Lambda}_{1}^{k+1}={\Lambda}_{1}^{k}-\left[\left({\mathcal{A}}^{k+1}+{\mathcal{E}}^{k+1}+{\mathcal{F}}^{k+1}\right)-\mathcal{X}\right]\end{array}$ (15)

which is equivalent to

$\{\begin{array}{l}0\in \partial {\Vert {\mathcal{A}}^{k+1}\Vert}_{\mathrm{*}}-{\Lambda}_{1}^{k+1}+\beta \left({\mathcal{E}}^{k}-{\mathcal{E}}^{k+1}\right)+\beta \left({\mathcal{F}}^{k}-{\mathcal{F}}^{k+1}\right)\\ 0\in \partial \left({\lambda}_{1}{\Vert {\mathcal{E}}^{k+1}\Vert}_{1}\right)-{\Lambda}_{1}^{k+1}+\beta \left({\mathcal{F}}^{k}-{\mathcal{F}}^{k+1}\right)\\ 0\in \partial \left({\lambda}_{2}{\Vert {\mathcal{F}}^{k+1}\Vert}_{F}^{2}\right)-{\Lambda}_{1}^{k+1}\\ {\Lambda}_{1}^{k+1}={\Lambda}_{1}^{k}-\beta \left[\left({\mathcal{A}}^{k+1}+{\mathcal{E}}^{k+1}+{\mathcal{F}}^{k+1}\right)-\mathcal{X}\right]\end{array}$ (16)

Algorithm 1. Optimization framework for problem defined in Equation (13).

Equation (16) shows that the distance of the iterates $\left({\mathcal{A}}^{k+1}\mathrm{,}{\mathcal{E}}^{k+1}\mathrm{,}{\mathcal{F}}^{k+1}\right)$ to the

solution $\left({\mathcal{A}}^{\mathrm{*}}\mathrm{,}{\mathcal{E}}^{\mathrm{*}}\mathrm{,}{\mathcal{F}}^{\mathrm{*}}\mathrm{,}{\Lambda}^{\mathrm{*}}\right)$ can be characterized by $\beta \left(\Vert {\mathcal{E}}^{k}-{\mathcal{E}}^{k+1}\Vert +\Vert {\mathcal{F}}^{k}-{\mathcal{F}}^{k+1}\Vert \right)$ and $\frac{1}{\beta}\Vert {\Lambda}_{1}^{k}-{\Lambda}_{1}^{k+1}\Vert $ . Thus, a straightforward stopping criterion for Algorithm 1 is:

$min\left\{\beta \left(\Vert {\mathcal{E}}^{k}-{\mathcal{E}}^{k+1}\Vert +\Vert {\mathcal{F}}^{k}-{\mathcal{F}}^{k+1}\Vert \right),\frac{1}{\beta}\Vert {\Lambda}_{1}^{k}-{\Lambda}_{1}^{k+1}\Vert \right\}\le \u03f5$ (17)

Here
$\u03f5$ is an infinitesimal number, e.g., 10^{−6}.

3.2. Convergence Analysis

In this subsection, we mainly analyze the convergence of ADMM for solving problem defined in Equation (13). We denote ${f}_{1}(\cdot )={\Vert \cdot \Vert}_{*}$ , ${f}_{2}(\cdot )={\Vert \cdot \Vert}_{F}^{2}$ , and ${f}_{3}(\cdot )={\Vert \cdot \Vert}_{1}$ . ${f}_{2}(\cdot )$ is strongly convex, while ${f}_{1}(\cdot )$ and ${f}_{3}(\cdot )$ are convex terms, but may not be strongly convex. The problem defined in Equation (13) can be reformulated as

$\begin{array}{l}\underset{\mathcal{A}\mathrm{,}\mathcal{E}\mathrm{,}\mathcal{F}}{\mathrm{min}}{f}_{1}\left(\mathcal{A}\right)+{\lambda}_{2}{f}_{2}\left(\mathcal{F}\right)+{\lambda}_{1}{f}_{3}\left(\mathcal{E}\right)\\ \text{s}\text{.t}\text{.}\chi =\mathcal{A}+\mathcal{E}+\mathcal{F}\end{array}$ (18)

Definition 1. (Convex and Strongly Convex) Let $f\mathrm{:}{\mathbb{R}}^{n}\to \left[-\infty \mathrm{,}+\infty \right]$ , if the domain of $f$ denoted by $f\mathrm{:}=\left\{x\in {\mathbb{R}}^{n}\mathrm{,}f\left(x\right)<+\infty \right\}$ is not empty, $f$ is considered to be proper. If for any $x\in {\mathbb{R}}^{n}$ and $y\in {\mathbb{R}}^{n}$ , we always have $f\left(tx+\left(1-t\right)y\right)\le tf\left(x\right)+\left(1-t\right)f\left(y\right)\mathrm{,}\forall t\in \left[\mathrm{0,1}\right]$ , then it is considered that $f$ is convex. Furthermore, $f$ is considered to be strongly convex with the modulus $\mu >0$ , if

$\begin{array}{l}f\left(tx+\left(1-t\right)y\right)\le tf\left(x\right)+\left(1-t\right)f\left(y\right)\\ -\frac{1}{2}\mu t\left(1-t\right){\Vert x-y\Vert}^{2}\mathrm{,}\forall t\in \left[\mathrm{0,1}\right]\end{array}$ (19)

Cai et al. [24] and Li et al. [25] have proved the convergence of Extended Alternating Direction Method of Multipliers (e-ADMM) with only one strongly convex function for the case m = 3.

Assumption 1. In Equation (18), ${f}_{1}$ and ${f}_{3}$ are convex, and ${f}_{2}$ is strongly convex with modulus ${\mu}_{2}>0$ .

Assumption 2. The optimal solution set for the problem defined in Equation (18) is nonempty, i.e., there exist $\left({\mathcal{A}}^{\mathrm{*}}\mathrm{,}{\mathcal{E}}^{\mathrm{*}}\mathrm{,}{\mathcal{F}}^{\mathrm{*}}\mathrm{,}{\Lambda}_{1}^{\mathrm{*}}\right)\in {\Omega}^{\mathrm{*}}$ such that the following requirements can be satisfied:

$\nabla {f}_{1}\left({\mathcal{A}}^{\mathrm{*}}\right)-{\Lambda}^{\mathrm{*}}=0\mathrm{,}$ (20)

${\lambda}_{2}\nabla {f}_{2}\left({\mathcal{F}}^{\mathrm{*}}\right)-{\Lambda}_{1}^{\mathrm{*}}=0\mathrm{,}$ (21)

${\lambda}_{1}\nabla {f}_{3}\left({\mathcal{E}}^{\mathrm{*}}\right)-{\Lambda}_{1}^{\mathrm{*}}=0\mathrm{,}$ (22)

${\mathcal{A}}^{\mathrm{*}}+{\mathcal{E}}^{\mathrm{*}}+{\mathcal{F}}^{\mathrm{*}}-\chi =0\mathrm{,}$ (23)

Theorem 1. Assume that Assumption 1 and Assumption 2 hold. Let $\left({\mathcal{A}}^{k}\mathrm{,}{\mathcal{E}}^{k}\mathrm{,}{\mathcal{F}}^{k}\mathrm{,}{\Lambda}_{1}^{k}\right)$ be the sequence generated by Algorithm 1 for solving the problem

defined in Equation (18). If $\beta \in \left(0,\frac{6{\mu}_{2}}{13}\right)$ , the limit point of $\left({\mathcal{A}}^{k}\mathrm{,}{\mathcal{F}}^{k}\mathrm{,}{\mathcal{E}}^{k}\mathrm{,}{\Lambda}_{1}^{k}\right)$

is an optimal solution to Equation (18). Moreover, the objective function con- verges to the optimal value and the constraint violation converges to zero, i.e.,

$\underset{k\to \infty}{\mathrm{lim}}\Vert {f}_{1}\left({\mathcal{A}}^{\mathrm{*}}\right)+{\lambda}_{2}{f}_{2}\left({\mathcal{F}}^{\mathrm{*}}\right)+{\lambda}_{1}{f}_{3}\left({\mathcal{E}}^{\mathrm{*}}\right)-{f}^{\mathrm{*}}\Vert =0$ (24)

and

$\underset{k\to \infty}{\mathrm{lim}}\Vert \chi -\left(\mathcal{A}+\mathcal{E}+\mathcal{F}\right)\Vert =0$ (25)

where ${f}^{\mathrm{*}}$ denotes the optimal objective value for the problem defined in Equa-

tion (18). In our specific application, $\beta \in \left(0,\frac{6*2{\lambda}_{2}}{13}\right)$ can sufficiently ensure

the convergence [24] .

3.3. Parameter Choice

In our optimization framework given in Equation (13), there are three parameters $\beta $ , ${\lambda}_{1}$ and ${\lambda}_{2}$ . As mentioned in Lu [23] , ${\lambda}_{1}$ does not need any tuning and can be set to $1/\sqrt{{n}_{\left(1\right)}{n}_{3}}$ , where ${n}_{\left(1\right)}=\mathrm{max}\left\{{n}_{1},{n}_{2}\right\}$ . Besides, the value of $\beta $

is limited to the range $\beta \in \left(\mathrm{0,}\frac{\mathrm{6*2}{\lambda}_{2}}{13}\right)$ to ensure the convergence of our algo-

rithm (based on the analysis in Theorem 1). Thus, the value of ${\lambda}_{2}$ is important for the performance of our algorithm. For simplicity, we consider the case when $\mathcal{A}$ is only degraded by Gaussian noise $\mathcal{F}$ without sparse noise $\mathcal{E}$ , that is:

$min\text{\hspace{0.05em}}\frac{1}{2}{\Vert \mathcal{F}\Vert}_{F}^{2}+\frac{1}{2{\lambda}_{2}}{\Vert \mathcal{A}\Vert}_{\mathrm{*}}\text{s}\text{.t}\mathrm{.}\text{}\chi =\mathcal{A}+\mathcal{F}$ (26)

The solution for Equation (26) is equal to $\chi $ but with singular values being shifted towards zero by soft thresholding. ${\lambda}_{2}$ should be set large enough to remove noise (i.e., to keep the variance low), and small to avoid over-shrinking of the original tensor A (i.e., to keep the bias low). For the matrix case (i.e., ${n}_{3}=1$ ), Candes et al. [26] have deduced the proper value of ${\lambda}_{2}$ , as shown in the following theorem.

Theorem 2. Supposing that the Gaussian noise term $F\in {\mathbb{R}}^{n\times n}$ , and each entry ${n}_{i\mathrm{,}j}$ is iid normally distributed, we can have that for $\mathcal{N}\left(\mathrm{0,}{\sigma}^{2}\right)$ ,

${\Vert F\Vert}_{F}^{2}\le \left(n+\sqrt{8n}\right){\sigma}^{2}$ with high probability. Then, $\frac{1}{2{\lambda}_{1}}=\sqrt{\left(n+\sqrt{8n}\right)}\sigma $ . That is, ${\lambda}_{2}=\frac{1}{2\sqrt{\left(n+\sqrt{8n}\right)}\sigma}$ .

Based on this conclusion, we derive the required conditions for convex program defined in Equation (13) to accurately recover the low-rank component $\mathcal{A}$ from corrupted observations. Our derivations are given in the following main result.

Main Result 1. Assume that the low-rank tensor ${\mathcal{A}}_{0}\in {\mathbb{R}}^{{n}_{1}\times {n}_{2}\times {n}_{3}}$ obeys the incoherence conditions [27] . The support set Ω of ${\mathcal{E}}_{0}$ is uniformly distributed among all sets of cardinality m. Then, there exist universal constants ${c}_{1},{c}_{2}>0$ such that with probability at least $1-{c}_{1}{n}^{-{c}_{2}}$ , ${\mathcal{A}}_{0}\mathrm{,}{\mathcal{E}}_{0}$ is the unique minimizer to problem defined in Equation (13). The values of ${\lambda}_{1}$ and ${\lambda}_{2}$ can be determined

as ${\lambda}_{2}=\frac{1}{2\sqrt{{n}_{\left(1\right)}{n}_{3}+\sqrt{8{n}_{\left(1\right)}{n}_{3}}}\sigma}$ and ${\lambda}_{1}=\frac{1}{\sqrt{{n}_{\left(1\right)}{n}_{3}}}$ . In the same time, the rank of

${\mathcal{A}}_{0}$ and the number of non-zero entries of ${\mathcal{E}}_{0}$ should satisfy that

${\text{rank}}_{t}\left(\mathcal{A}\right)\le \frac{{\rho}_{r}{n}_{\left(2\right)}}{{\mu}_{0}{\left(\mathrm{log}\left({n}_{\left(1\right)}{n}_{3}\right)\right)}^{2}}\text{and}m\le {\rho}_{s}{n}_{1}{n}_{2}{n}_{3}^{2}$

where
${n}_{\left(1\right)}=\mathrm{max}\left\{{n}_{1},{n}_{2}\right\}$ and
${n}_{\left(2\right)}=\mathrm{min}\left\{{n}_{1},{n}_{2}\right\}$ . ρ_{r} and ρ_{s} are positive constants.

The value of penalty parameter β should be within the range of $\left(0,\frac{6*2{\lambda}_{2}}{13}\right)$ to

ensure the convergence.

4. Experiments on Synthetic Data

In this section, we conduct synthetic data and real data experiments to corroborate our algorithm. We investigate the ability of our proposed Robust Low Rank Tensor Approximation (denoted as RLRTA) algorithm for recovering low-rank tensors of various tubal rank from noises of various sparsity and random Gaussian noise of different intensity.

4.1. Exact Recovery from Varying Sparsity of $\mathcal{E}$

We first verify the correct recovery performance of our algorithm for different sparsity of $\mathcal{E}$ . Be similar to [23] , we consider the tensors of size $n\times n\times n$ , with varying dimensions $n=100,200$ . We generate a ${\text{rank}}_{t}-r$ tensor $\mathcal{A}=\mathcal{Q}\mathrm{*}\mathcal{V}$ , where the entries of $\mathcal{Q}\in {\mathbb{R}}^{n\times r\times n}$ and $\mathcal{V}\in {\mathbb{R}}^{r\times n\times n}$ are independently sampled from a uniform distribution in interval $\left(\mathrm{0,}1/n\right)$ . The support set $\Omega $ , with size $m={\rho}_{s}{n}^{3}$ of sparse component $\mathcal{E}$ is chosen uniformly at random. For all $\left(i,j,k\right)\in \Omega $ , let ${\mathcal{E}}_{ijk}={\mathcal{M}}_{ijk}$ , where $\mathcal{M}$ is a tensor with independent Bernoulli $\pm 1$ entries. For $\mathcal{E}$ , it can be mathematically expressed as

${\mathcal{E}}_{ijk}=\{\begin{array}{l}\mathrm{1,}\text{w}\text{.p}\mathrm{.}{\rho}_{s}/2\\ 0\text{w}\text{.p}.1-{\rho}_{s}\\ -1\text{w}\text{.p}\mathrm{.}{\rho}_{s}/2\end{array}$ (27)

where $\text{w}\text{.p}\text{.}$ is the abbreviation of “with probability”. We test on two settings: the first scenario with setting $r={\text{rank}}_{t}\left(\mathcal{A}\right)=0.1n$ and ${\rho}_{s}=0.1$ . The second scenario with setting $r={\text{rank}}_{t}\left(\mathcal{A}\right)=0.1n$ and ${\rho}_{s}=0.2$ .

The Gaussian noise $\mathcal{F}$ in each frontal slice is generated independently with each other, i.e.

$\mathcal{F}\left(:,:,{i}_{3}\right)\sim \mathcal{N}\left(0,{\sigma}_{{i}_{3}}^{2}\right)\mathrm{,}\text{}1\le {i}_{3}\le n$ (28)

The variance values ${\sigma}_{{i}_{3}}^{2}$ in each frontal slice are randomly selected from 0 to 0.1. In this sub-subsection 1, our task is to recovery $\mathcal{A}$ from noisy observation $\chi =\mathcal{A}+\mathcal{E}+\mathcal{F}$ with $\mathcal{E}$ of varying sparsity. Table 1 and Table 2 show the

Table 1. Correct recovery for random problems of varying sparsity using RLRTA.

Table 2. Correct recovery for random problems of varying sparsity using TRPCA [23] .

recovery results of algorithm RLRTA and TRPCA. It’s shown that RLRTA can better recover the low-rank compnent $\mathcal{A}$ under different sparse component $\mathcal{E}$ .

4.2. Exact Recovery with $\mathcal{F}$ of Varying Intensity

Now we exam the recovery phenomenon with Gaussian noise of varying variances. The generation of $\mathcal{A}\in {\mathbb{R}}^{n\times n\times n}$ is the same as that in sub-subsection 1 and $r={\text{rank}}_{t}\left(\mathcal{A}\right)=0.1n$ . The sparse component $\mathcal{E}$ has sparsity ${\rho}_{s}=0.1$ . For simplicity, we assume that $\mathcal{F}$ is white Gaussian noise, that is

$\mathcal{F}\left({i}_{1}\mathrm{,}{i}_{2}\mathrm{,}{i}_{3}\right)\sim \mathcal{N}\left(\mathrm{0,}{\sigma}_{w}^{2}\right)$ (29)

where $1\le {i}_{1}\le n,\text{}1\le {i}_{2}\le n,\text{}1\le {i}_{3}\le n$ . The noise variance values ${\sigma}_{w}^{2}$ are 0.02, 0.04, 0.06, 0.08 and 0.1, respectively Table 3 and Table 4 show the recovery results of algorithm RLRTA and TRPCA. It’s shown that RLRTA can better recover the low-rank compnent $\mathcal{A}$ under different Gaussian noise $\mathcal{F}$ .

4.3. Phase Transition in Rank and Sparsity

Now we exam the recovery phenomenon with varying rank of $\mathcal{A}$ and varying sparsity of $\mathcal{E}$ . Similar to [23] , we consider two sizes of $\mathcal{A}\in {\mathbb{R}}^{n\times n\times {n}_{3}}$ : (1)

Table 3. Correct recovery for random problems of varying intensity using RLRTA.

Table 4. Correct recovery for random problems of varying intensity using TRPCA [23] .

$n=100,$ ${n}_{3}=50$ , (2) $n=200,\text{}{n}_{3}=50$ . We generate $\mathcal{A}=\mathcal{Q}\mathrm{*}\mathcal{V}$ , where the entries of $\mathcal{Q}\in {\mathbb{R}}^{n\times r\times {n}_{3}}$ and $\mathcal{V}\in {\mathbb{R}}^{r\times n\times {n}_{3}}$ are independently sampled from a uniform distribution in interval $\left(\mathrm{0,}1/n\right)$ . For $\mathcal{E}$ , we still consider a Bernoulli model for its support and random signs as in Equation (27). The variance values ${\sigma}_{w\mathrm{,}{i}_{3}}^{2}$ in each frontal slice ${i}_{3}\left(1\le {i}_{3}\le {n}_{3}\right)$ are randomly selected from 0 to 0.1, and the mean variance values are both set to be 0.05.

We set
$r/n$ as all the choices in
$\left[\mathrm{0.01:0.01:0.5}\right]$ , and ρ_{s} in
$\left[\mathrm{0.01:0.01:0.5}\right]$ . For each
$\left(r\mathrm{,}{\rho}_{s}\right)$ -pair, we simulate 10 test instances and declare a trial to be successful if the recovered
$\stackrel{\u2323}{\mathcal{A}}$ satisfies
${\Vert \stackrel{\u2323}{\mathcal{A}}-\mathcal{A}\Vert}_{F}/{\Vert \mathcal{A}\Vert}_{F}\le {10}^{-3}$ . Figure 1 plots

(a)(b)

Figure 1. Correct recovery for varying rank and sparsity for RLRTA and TRPCA [23] . Fraction of correct recoveries, as a function of ${\text{rank}}_{t}\left(\mathcal{A}\right)$ ( $x$ -axis) and sparsity of $\mathcal{E}$ ( $y$ -axis).

the fraction of correct recovery for each pair (black = 0% and white = 100%). It can be seen that there is a large region in which the recovery is correct.

4.4. Phase Transition in Rank and Entry-Wise Noise Intensity

Now we exam the recovery phenomenon with varying rank of $\mathcal{A}$ and varying intensity of noise $\mathcal{F}$ . We still consider two sizes of $\mathcal{A}\in {\mathbb{R}}^{n\times n\times {n}_{3}}$ : (1) $n=100,$ ${n}_{3}=50$ , (2) $n=200,\text{}{n}_{3}=50$ . We generate $\mathcal{A}=\mathcal{Q}\mathrm{*}\mathcal{V}$ , where the entries of $\mathcal{Q}\in {\mathbb{R}}^{n\times r\times {n}_{3}}$ and $\mathcal{V}\in {\mathbb{R}}^{r\times n\times {n}_{3}}$ independently sampled from a uniform distribution in interval $\left(\mathrm{0,}1/n\right)$ . For $\mathcal{E}$ , we still consider a Bernoulli model for its

(a)(b)

Figure 2. Correct recovery for varying rank and noise variance for RLRTA and TRPCA [23] . Fraction of correct recoveries, as a function of ${\text{rank}}_{t}\left(\mathcal{A}\right)$ ( $x$ -axis) and variance of $\mathcal{F}$ ( $y$ -axis).

support and random signs as in Equation (27) and sparsity parameter ρ_{s} is fixed at 0.1. The generation of
$\mathcal{F}$ is similar to Equation (29).

We set $r/n$ as all the choices in $\left[\mathrm{0.01:0.05:0.5}\right]$ . The noise variance values ${\sigma}_{w}^{2}$ are in $\left[\mathrm{0.01:0.01:0.1}\right]$ . For each $\left(r\mathrm{,}{\sigma}_{w}^{2}\right)$ -pair, we simulate 10 test instances and declare a trial to be successful if the recovered $\stackrel{\u2323}{\mathcal{A}}$ satisfies ${\Vert \stackrel{\u2323}{\mathcal{A}}-\mathcal{A}\Vert}_{F}/{\Vert \mathcal{A}\Vert}_{F}\le {10}^{-3}$ . Figure 2 plots the fraction of correct recovery for each pair (black = 0% and white = 100%). It can be seen that there is a large region in which the recovery is correct.

5. Conclusion

This work verifies the ability of convex optimization for the recovery of low- rank tensors corrupted by both impulse and Gaussian noise. The problem is tackled by integrating the tensor nuclear norm, ${l}_{1}$ -norm and least square term in a unified convex relaxation framework. Parameters are selected to comprise the low-rank component, the sparse component and the Gaussian-noise term. Besides, the convergence of the proposed algorithm is discussed. Numerical experiments have been conducted to demonstrate the efficacy of our proposed denoising approach.

Acknowledgements

The authors would like to thank Canyi Lu for providing the code for TRPCA algorithm.

References

[1] Jolliffe, I. (2002) Principal Component Analysis. Wiley Online Library.

[2] Candes, E.J., Li, X.D., Ma, Y. and Wright, J. (2011) Robust Principal Component Analysis? Journal of the ACM, 58, Article Number: 11.

https://doi.org/10.1145/1970392.1970395

[3] Chandrasekaran, V., Sanghavi, S., Parrilo, P.A. and Willsky, A.S. (2011) Rank-Sparsity Incoherence for Matrix Decomposition. SIAM Journal on Optimization, 21, 572-596.

https://doi.org/10.1137/090761793

[4] Wright, J., Ganesh, A., Rao, S.Y.P. and Ma, Y. (2009) Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization Neural Information Processing Systems (NIPS).

[5] Tao, M. and Yuan, X. (2011) Recovering Low-Rank and Sparse Components of Matrices from Incomplete and Noisy Observations. SIAM Journal on Optimization, 21, 57-81.

https://doi.org/10.1137/100781894

[6] Candes, E.J. and Recht, B. (2009) Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics, 9, 717-772.

https://doi.org/10.1007/s10208-009-9045-5

[7] Candes, E.J. and Tao, T. (2010) The Power of Convex Relaxation: Near-Optimal Matrix Completion. IEEE Transaction on Information Theory, 56, 2053-2080.

https://doi.org/10.1109/TIT.2010.2044061

[8] Wang, H. and Ahuja, N. (2004) Compact Representation of Multidimensional Data Using Tensor Rank-One Decomposition.

[9] Bloy, L. and Verma, R. (2008) On Computing the Underlying Fiber Directions from the Diffusion Orientation Distribution Function. 11th International Conference on Medical Image Computing and Computer-Assisted Intervention, New York, 6-10 September 2008, 1-8.

[10] Ghosh, A., Tsigaridas, E., Descoteaux, M., Comon, P., Mourrain, B. and Deriche, R. (2008) A Polynomial Based Approach to Extract the Maxima of an Antipodally Symmetric Spherical Function and Its Application to Extract Fiber Directions from the Orientation Distribution Function in Diffusion MRI.

[11] Qi, L., Yu, G. and Wu, E.X. (2010) Higher Order Positive Semi-Definite Diffusion Tensor Imaging. SIAM Journal on Imaging Sciences, 3, 416-433.

https://doi.org/10.1137/090755138

[12] Hilling, J.J. and Sudbery, A. (2010) The Geometric Measure of Multipartite Entanglement and the Singular Values of a Hypermatrix. Journal of Mathematical Physics, 51, Article ID: 072102.

https://doi.org/10.1063/1.3451264

[13] Hu, S. and Qi, L. (2012) Algebraic Connectivity of an Even Uniform Hypergraph. Journal of Combinatorial Optimization, 24, 564-579.

https://doi.org/10.1007/s10878-011-9407-1

[14] Li, W. and Ng, M. (2011) Existence and Uniqueness of Stationary Probability Vector of a Transition Probability Tensor Technical Report. Department of Mathematics, the Hong Kong Baptist University, Hong Kong.

[15] Liu, J., Musialski, P., Wonka, P. and Ye, J. (2009) Tensor Completion for Estimating Missing Values in Visual Data In ICCV.

[16] Lubich, C. (2008) From Quantum to Classical Molecular Dynamics: Reduced Models and Numerical Analysis EMS Zürich.

[17] Beck, M.H., Jackle, A., Worth, G.A. and Meyer, H.D. (1999) The Multiconfiguration Time-Dependent Hartree (MCTDH) Method: A Highly Efficient Algorithm for Propagating Wavepackets. Physics Reports, 324, 1-105.

https://doi.org/10.1016/S0370-1573(99)00047-2

[18] Wang, H. and Thoss, M. (2009) Numerically Exact Quantum Dynamics for Indistinguishable Particles: The Multilayer Multiconfiguration Time-Dependent Hartree Theory in Second Quantization Representation. Journal of Chemical Physics, 131, Article ID: 024114.

[19] Paredes, B.R., Aung, H., Berthouze, N.B. and Pontil, M. (2013) Multilinear Multitask Learning. Journal of Machine Learning Research, 28, 1444-1452.

https://doi.org/10.1063/1.3173823

[20] Kolda, T.G. and Bader, B.W. (2009) Tensor Decompositions and Applications. SIAM Review, 51, 455-500.

https://doi.org/10.1137/07070111X

[21] Zhang, Z., Ely, G., Aeron, S., Hao, N. and Kilmer, M.E. (2014) Novel Methods for Multilinear Data Completion and De-Noising Based on Tensor-SVD. Computer Science, 44, 3842-3849.

https://doi.org/10.1109/cvpr.2014.485

[22] Semerci, O., Hao, N., Kilmer, M.E. and Miller, E.L. (2013) Tensor-Based Formulation and Nuclear Norm Regularization for Multienergy Computed Tomography. IEEE Transactions on Image Processing, 23, 1678-1693.

https://doi.org/10.1109/TIP.2014.2305840

[23] Lu, C., Feng, J., Chen, Y., Liu, W., Lin, Z. and Yan, S. (2016) Tensor Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Tensors via Convex Optimization.

[24] Cai, X., Han, D. and Yuan, X. (2014) The Direct Extension of ADMM for Three-Block Separable Convex Minimization Models Is Convergent When One Function Is Strongly Convex.

[25] Li, M., Sun, D. and Toh, K. (2014) A Convergent 3-Block Semi-Proximal ADMM for Convex Minimization Problems with One Strongly Convex Block. Asia Pacific Journal of Operational Research, 32, 1550024.

https://doi.org/10.1142/S0217595915500244

[26] Candes, E.J. and Plan, Y. (2010) Matrix Completion with Noise. Proceedings of the IEEE, 98, 925-936.

https://doi.org/10.1109/JPROC.2009.2035722

[27] Zhang, Z. and Aeron, S. (2015) Exact Tensor Completion Using t-SVD.