Talk

Optimal CUR Matrix Decompositions
Christos Boutsidis (Yahoo! Labs, New York)
David P. Woodruff (IBM Research, Almaden)
Singular Value Decomposition
m × n matrix A, k < ρ = rank(A).
Low-rank matrix approximation problem:
    min_{X ∈ R^{m×n}, rank(X) ≤ k} ‖A − X‖_F
Singular Value Decomposition (SVD):
    A = U Σ V^T = [ U_k  U_{ρ−k} ] · [ Σ_k  0 ; 0  Σ_{ρ−k} ] · [ V_k^T ; V_{ρ−k}^T ]
                       (m × ρ)              (ρ × ρ)                   (ρ × n)
with U_k ∈ R^{m×k}, Σ_k ∈ R^{k×k}, and V_k ∈ R^{n×k}.
Solution via the Eckart-Young theorem:
    A_k = U_k Σ_k V_k^T = A V_k V_k^T.
Running time: O(mn · min{m, n}).
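A minimal NumPy illustration of this SVD truncation (not part of the talk; names are illustrative):

```python
import numpy as np

def svd_rank_k(A, k):
    """Best rank-k approximation A_k = U_k Sigma_k V_k^T (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The truncation minimizes ||A - X||_F over all X with rank(X) <= k.
A = np.random.default_rng(0).standard_normal((100, 50))
A5 = svd_rank_k(A, k=5)
print(np.linalg.norm(A - A5, 'fro'))
```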
CUR Matrix Decomposition
CUR replaces the left and right singular vectors in the SVD with actual columns and rows of the matrix, respectively:
    A = U_k Σ_k V_k^T + E        (SVD-based rank-k approximation)
    A = C U R + E                (CUR decomposition)
Motivation
(Figure: a users-by-movies ratings matrix approximated as a product of three small matrices, A ≈ C · U · R.)
A: users-by-movies ratings matrix.
C: contains the most “important” users.
R: contains the most “important” movies.
Optimization problem
Definition (The CUR Problem)
Given A ∈ R^{m×n}, k < rank(A), and ε > 0, construct C ∈ R^{m×c}, R ∈ R^{r×n}, and U ∈ R^{c×r} such that
    ‖A − CUR‖²_F ≤ (1 + ε) · ‖A − A_k‖²_F,
with c, r, and rank(U) as small as possible.
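To make the objective concrete, here is a hedged sketch that evaluates the CUR error for given column/row index sets, using the common choice U = C⁺AR⁺; this illustrates the objective only, not the talk's optimal construction:

```python
import numpy as np

def cur_error(A, cols, rows, k):
    """Evaluate the CUR objective for given column/row index sets.

    Uses the common (not the talk's optimal) choice U = pinv(C) A pinv(R),
    and compares against the SVD baseline ||A - A_k||_F^2.
    """
    C = A[:, cols]                                   # actual columns of A
    R = A[rows, :]                                   # actual rows of A
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)    # c x r connecting matrix
    cur_err = np.linalg.norm(A - C @ U @ R, 'fro') ** 2

    s = np.linalg.svd(A, compute_uv=False)
    best_err = np.sum(s[k:] ** 2)                    # ||A - A_k||_F^2
    return cur_err, best_err

# Toy check with uniformly sampled columns/rows (for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100))
err, opt = cur_error(A,
                     cols=rng.choice(100, 20, replace=False),
                     rows=rng.choice(200, 20, replace=False), k=5)
print(err / opt)   # an optimal CUR keeps this ratio at most 1 + ε
```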
Prior art
Sub-optimal and randomized algorithms:

      c               r               rank(U)          ‖A − CUR‖²_F ≤                  Time
1     k/ε²            k/ε             k                ‖A − A_k‖²_F + ε‖A‖²_F          nnz(A)
2     k/ε⁴            k/ε²            k                ‖A − A_k‖²_F + ε‖A‖²_F          nnz(A)
3     (k log k)/ε²    (k log k)/ε⁴    (k log k)/ε²     (1 + ε)‖A − A_k‖²_F             n³
4     (k log k)/ε²    (k log k)/ε²    (k log k)/ε²     (2 + ε)‖A − A_k‖²_F             n³
5     k/ε             k/ε²            k/ε              (1 + ε)‖A − A_k‖²_F             n²k/ε
References:
1 Drineas and Kannan. Symposium on Foundations of Computer Science, 2003.
2 Drineas, Kannan, and Mahoney. SIAM Journal on Computing, 2006.
3 Drineas, Mahoney, and Muthukrishnan. SIAM Journal on Matrix Analysis, 2008.
4 Drineas and Mahoney. Proceedings of the National Academy of Sciences, 2009.
5 Wang and Zhang. Journal of Machine Learning Research, 2013.
Open problems
1. Optimal CUR: Can we find relative-error CUR algorithms selecting the optimal number of columns and rows, together with a matrix U of optimal rank?
2. Input-sparsity-time CUR: Can we find relative-error CUR algorithms running in input-sparsity time (nnz(A) time)?
3. Deterministic CUR: Can we find relative-error CUR algorithms that are deterministic and run in polynomial time?
Contributions
1. Optimal CUR: the first optimal CUR algorithms.
2. Input-sparsity-time CUR: the first CUR algorithm with running time proportional to the number of non-zero entries of A.
3. Deterministic CUR: the first deterministic CUR algorithm that runs in polynomial time.
Lower bound
Theorem
Fix an appropriate matrix A ∈ R^{n×n} and consider a factorization CUR with
    ‖A − CUR‖²_F ≤ (1 + ε)‖A − A_k‖²_F.
Then, for any k ≥ 1 and any ε < 1/3:
    c = Ω(k/ε),  r = Ω(k/ε),  and  rank(U) ≥ k/2.
This extends lower bounds from [Deshpande and Vempala, 2006], [Boutsidis et al., 2011], and [Sinop and Guruswami, 2011].
Input-sparsity-time CUR
Theorem
There exists a randomized algorithm that constructs a CUR decomposition with
    c = O(k/ε),  r = O(k/ε),  and  rank(U) = k
such that, with constant probability of success,
    ‖A − CUR‖²_F ≤ (1 + ε)‖A − A_k‖²_F.
Running time: O(nnz(A) log n + (m + n) · poly(log n, k, 1/ε)).
Deterministic CUR
Theorem
There exists a deterministic algorithm that constructs a CUR decomposition with
    c = O(k/ε),  r = O(k/ε),  and  rank(U) = k
such that
    ‖A − CUR‖²_F ≤ (1 + ε)‖A − A_k‖²_F.
Running time: O(mn³k/ε).
A proto-algorithm (step 1)
1. Construct C with O(k/ε) columns:
   1.1 Compute (or approximate) the top k right singular vectors of A, Z_1 ∈ R^{n×k}, so that
         ‖A − A Z_1 Z_1^T‖²_F ≤ (1 + ε) ‖A − A_k‖²_F.
   1.2 Sample O(k log k) columns from Z_1^T (equivalently, from A) with leverage scores
       (the Euclidean norms of the columns of Z_1^T); see the sketch after this slide.
       Down-sample those columns to c_1 = O(k) columns with Batson/Spielman/Srivastava (BSS)
       sampling, giving C_1 ∈ R^{m×c_1}:
         Z_1^T (n columns) → Z̃_1^T (O(k log k) columns) → Ẑ_1^T (O(k) columns),
         ‖A − C_1 C_1^† A‖²_F ≤ O(1) · ‖A − A Z_1 Z_1^T‖²_F.
   1.3 Adaptively sample c_2 = O(k/ε) additional columns of A, so that
         ‖A − C · D‖²_F ≤ ‖A − A_k‖²_F + (k/c_2) · ‖A − C_1 C_1^† A‖²_F,
       where D is a rank-k matrix.
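A minimal sketch of the leverage-score stage of step 1.2, assuming exact top-k singular vectors and omitting the BSS down-sampling and adaptive stages (function names are illustrative):

```python
import numpy as np

def leverage_score_columns(A, k, s, seed=None):
    """Sample s column indices of A with probability proportional to the
    leverage scores: the squared Euclidean norms of the columns of Z_1^T
    (equivalently, of the rows of Z_1), where Z_1 = top-k right singular vectors.

    Sketch only: exact SVD, no BSS down-sampling, no adaptive sampling.
    """
    rng = np.random.default_rng(seed)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Z1 = Vt[:k, :].T                        # n x k
    scores = np.sum(Z1 ** 2, axis=1)        # leverage scores, sum to k
    return rng.choice(A.shape[1], size=s, replace=True, p=scores / scores.sum())

# Usage: pick O(k log k) columns, then C_1 is built from A[:, cols].
A = np.random.default_rng(1).standard_normal((300, 120))
cols = leverage_score_columns(A, k=10, s=60, seed=1)
C1 = A[:, cols]
```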
A proto-algorithm (step 2)
2. Construct R with O(k/ε) rows:
   2.1 Find Z_2 ∈ R^{m×k} in the span of C such that
         ‖A − Z_2 Z_2^T A‖²_F ≤ (1 + ε) · ‖A − A_k‖²_F.
   2.2 How to do this efficiently? Instead of projecting the columns of A onto C, project
       the columns of AW, where W is a random subspace embedding, and find the best rank-k
       approximation of the columns of AW within the span of C.
   2.3 Sample O(k log k) rows with leverage scores (computed from Z_2). Down-sample those rows
       to r_1 = O(k) rows with Batson/Spielman/Srivastava (BSS) sampling, giving R_1 ∈ R^{r_1×n}:
         ‖A − A R_1^† R_1‖²_F ≤ O(1) · ‖A − Z_2 Z_2^T A‖²_F.
   2.4 Sample r_2 = O(k/ε) additional rows with adaptive sampling++ (sketched after this slide):
         ‖A − Z_2 Z_2^T A R^† R‖²_F ≤ ‖A − Z_2 Z_2^T A‖²_F + (rank(Z_2 Z_2^T A)/r_2) · ‖A − A R_1^† R_1‖²_F.
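For intuition, plain adaptive sampling picks additional rows with probability proportional to their squared residual norms once the rows already chosen are accounted for; a minimal sketch follows (not the strengthened, input-sparsity-time, or derandomized variants used in the talk):

```python
import numpy as np

def adaptive_sample_rows(A, R1, r2, seed=None):
    """Sample r2 extra row indices with probability proportional to the
    squared norm of each row of the residual A - A R1^+ R1.

    Sketch of plain adaptive sampling; the talk uses input-sparsity-time
    and derandomized variants of this step.
    """
    rng = np.random.default_rng(seed)
    residual = A - A @ np.linalg.pinv(R1) @ R1     # part of A not explained by the rows R1
    row_norms = np.sum(residual ** 2, axis=1)
    return rng.choice(A.shape[0], size=r2, replace=True, p=row_norms / row_norms.sum())

# Usage: R is built from the rows chosen by the BSS stage plus these extra rows.
rng = np.random.default_rng(2)
A = rng.standard_normal((300, 120))
R1 = A[rng.choice(300, 20, replace=False), :]
extra_rows = adaptive_sample_rows(A, R1, r2=40, seed=3)
```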
Input-sparsity-time CUR
Everything should run in nnz(A) log n + poly(k, 1/ε) time.
1. Existing tools:
   - Input-sparsity-time SVD [Clarkson and Woodruff, STOC 2013]; see the sparse subspace
     embedding sketch after this slide.
   - Leverage-score sampling [Drineas et al., SIMAX 2008].
   - Input-sparsity-time algorithm to find the “best” rank-k approximation to a matrix in a
     given subspace [Kannan et al., COLT 2014].
2. New tools:
   - Input-sparsity-time version of the BSS sampling method of [Boutsidis et al., 2011].
   - Input-sparsity-time version of the adaptive sampling method of [Deshpande et al., 2006]
     and [Wang and Zhang, 2013].
   - Input-sparsity-time construction of U.
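The workhorse behind nnz(A)-time computations is a sparse subspace embedding W that can be applied in nnz(A) time; a minimal CountSketch-style sketch follows (dimensions such as t are illustrative, not the talk's exact settings):

```python
import numpy as np
from scipy import sparse

def countsketch(n, t, seed=None):
    """Sparse subspace embedding W (n x t): each row of W has a single
    random ±1 entry, so A @ W can be formed in nnz(A) time.

    Standard CountSketch construction; parameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    cols = rng.integers(0, t, size=n)            # hash each of the n coordinates to a bucket
    signs = rng.choice([-1.0, 1.0], size=n)      # random sign per coordinate
    return sparse.csr_matrix((signs, (np.arange(n), cols)), shape=(n, t))

# Usage (as in step 2.2 of the proto-algorithm): replace A by the much thinner A @ W.
A = sparse.random(10000, 5000, density=1e-3, format='csr', random_state=0)
W = countsketch(A.shape[1], t=200, seed=0)
AW = A @ W                                       # 10000 x 200, computed in nnz(A) time
```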
Deterministic CUR
Everything should run in polynomial time and be deterministic.
1. Existing tools:
   - Standard SVD algorithm.
   - Standard method to find the “best” rank-k approximation to a matrix in a given subspace.
   - Batson/Spielman/Srivastava (BSS) sampling as in [Boutsidis et al., FOCS 2011].
2. New tools:
   - Derandomization of the adaptive sampling of [Deshpande et al., RANDOM 2006] and
     [Wang and Zhang, JMLR 2013].
Conclusions
- Optimal, (1 + ε)-error CUR with O(k/ε) columns/rows.
- Input-sparsity-time algorithm.
- Deterministic polynomial-time algorithm.
- Extended abstract appeared in STOC 2014.
- Full version on arXiv, June 2014.