Optimal CUR Matrix Decompositions Christos Boutsidis1 Yahoo! Labs New York David P. Woodruff 2 IBM Research Almaden Singular Value Decomposition m × n matrix A k < ρ = rank(A) Low-rank matrix approximation problem: min X∈Rm×n ,rank(X)≤k ||A − X||F Singular Value Decomposition (SVD): ! Σk VTk 0 T A = U · Σ · V = Uk Uρ−k 0 Σρ−k VTρ−k | {z } | {z }| m×ρ {z } ρ×ρ ρ×n Uk ∈ Rm×k , Σk ∈ Rk×k , and Vk ∈ Rn×k Solution via Eckart-Young Theorem Ak = Uk Σk VkT = AVk VkT . O(mn min{m, n}) time CUR Matrix Decomposition CUR replaces the left and right singular vectors in the SVD with actual columns and rows from the matrix, respectively = C A = Uk A U R Σk Vk + E + E Motivation ≈ users user-by-movie U A: users-by-movies ratings matrix. C: contains the most “important” users. R: contains the most “important” movies. movies Optimization problem Definition (The CUR Problem) Given A ∈ Rm×n k < rank(A) ε>0 construct C ∈ Rm×c R ∈ Rr ×n U ∈ Rc×r such that: kA − CURk2F ≤ (1 + ε) · kA − Ak k2F . with c, r , and rank(U) being as small as possible. Prior art Sub-optimal and randomized algorithms. 1 2 3 4 5 c k/ε2 k/ε4 (k log k)/ε2 (k log k)/ε2 k/ε r k/ε k/ε2 (k log k)/ε4 (k log k)/ε2 k/ε2 rank(U) k k (k log k)/ε2 (k log k)/ε2 k/ε kA − CURk2F ≤ kA − Ak k2F + εkAk2F kA − Ak k2F + εkAk2F (1 + ε)kA − Ak k2F (2 + ε)kA − Ak k2F (1 + ε)kA − Ak k2F Time nnz(A) nnz(A) n3 n3 n2 k/ε References: 1 Drineas and Kannan. Symposium on Foundations of Computer Science, 2003. 2 Drineas, Kannan, and Mahoney. SIAM Journal on Computing, 2006. 3 Drineas, Mahoney, and Muthukrishnan. SIAM Journal on Matrix Analysis, 2008. 4 Drineas and Mahoney. Proceedings of the National Academy of Sciences, 2009. 5 Wang and Zhang. Journal of Machine Learning Research, 2013. Open problems 1 Optimal CUR: Can we find relative-error CUR algorithms selecting the optimal number of columns and rows, together with a matrix U with optimal rank? 2 Input-sparsity-time CUR: Can we find relative-error CUR algorithms running in input-sparsity-time (nnz(A) time)? 3 Deterministic CUR: Can we find relative-error CUR algorithms that are deterministic and run in poly time? Contributions 1 Optimal CUR: First optimal CUR algorithms. 2 Input-sparsity-time CUR: First CUR algorithm with running time proportional to the non-zero entries of A. 3 Deterministic CUR: First deterministic algorithm for CUR that runs in polynomial time. Lower bound Theorem Fix appropriate matrix A ∈ Rn×n . Consider a factorization CUR, kA − CURk2F ≤ (1 + ε)kA − Ak k2F . Then, for any k ≥ 1 and for any ε < 1/3: c = Ω(k /ε), and r = Ω(k /ε), and rank(U) ≥ k/2. Extended lower bound in [Deshpande and Vempala, 2006], [Boutsidis et al, 2011], [Sinop and Guruswami, 2011] Input-sparsity-time CUR Theorem There exists a randomized algorithm to construct a CUR with c = O(k /ε) and r = O(k /ε) and rank (U) = k such that, with constant probability of success, kA − CURk2F ≤ (1 + ε)kA − Ak k2F . Running time: O (nnz (A) log n + (m + n) · poly (log n, k, 1/ε)) . Deterministic CUR Theorem There exists a deterministic algorithm to construct a CUR with c = O(k /ε) and r = O(k /ε) and rank (U) = k such that kA − CURk2F ≤ (1 + ε)kA − Ak k2F . Running time: O(mn3 k/ε). A proto-algorithm (step 1) 1 Construct C with O(k/ε) columns: 1 Compute/approx the top k singular vectors of A: Z1 ∈ Rn×k . kA − AZ1 Z1T k2F ≤ (1 + ε) kA − Ak k2F 2 Sample O(k log k ) columns from Z1T (equivalently from A) with leverage scores (Euclidean norms of columns of Z1T ). Down-sample those columns to c1 = O(k ) columns with Batson/Srivastava/Spielman (BSS) sampling (C1 ∈ Rm×c1 ). T T −→ Z̃1 −→ Z1T Ẑ1 | | {z } {z } | {z } n O(k log k) O(k) kA − C1 C†1 Ak2F ≤ O(1)kA − AZ1 Z1T k2F 3 Adaptively sample c2 = O(k /ε) columns of A: kA − C · Dk2F ≤ kA − Ak k2F + (k /c2 )kA − C1 C1 † Ak2F D is a rank k matrix Step 2 2 Construct R with O(k/ε) rows: 1 Find Z2 ∈ Rm×k in the span of C such that: kA − Z2 Z2T Ak2F ≤ (1 + ε) · kA − Ak k2F . 2 How to do this efficiently? Instead of projecting columns of A onto C, we project the columns of AW, where W is a random subspace embedding Find best rank-k approximation of the columns of AW in C 3 Sample O(k log k ) rows with leverage scores (from Z2 ). Down-sample those rows to r1 = O(k ) rows with Batson/Spielman/Srivastava (BSS) sampling. (R1 ∈ Rr1 ×n ) kA − AR†1 R1 k2F ≤ O(1)kA − Z2 Z2T Ak2F 4 Sample r2 = O(k /ε) rows with adaptive sampling++ kA−Z2 Z2T AR† Rk2F ≤ kA−Z2 Z2T Ak2F + rank(Z2 Z2T A) kA−AR1 † R1 k2F r2 Input-sparsity-time CUR Everything should run in nnz(A) log n + poly(k , 1/ε) time. 1 Existing tools: Input-sparsity-time SVD [Clarkson, W, STOC 2013]. Leverage-scores sampling [Drineas et al, SIMAX 2008]. Input-sparsity-time algorithm to find the “best” rank k approx to a matrix in a given subspace [Kannan et al, COLT 2014]. 2 New tools: Input-sparsity-time version of the BSS sampling method of [Boutsidis et al, 2011]. Input-sparsity-time version of adaptive sampling method [Desphande et al, 2006, Wang and Zhang, 2013]. Input-sparsity-time construction of U. Deterministic CUR Everything should run in polynomial time and be deterministic. 1 Existing tools: Standard SVD algorithm. Standard method to find the “best” rank k approximation to a matrix in a given subspace. Batson/Spielman/Srivastava (BSS) sampling as in [Boutsidis et all, FOCS 2011]. 2 New tools: Derandomization of the adaptive sampling of [Desphande et al, RANDOM 2006] and [Wang and Zhang, JMLR 2013]. Conclusions - Optimal, (1 + ε)-error CUR with O(k/ε) columns/rows. - Input-sparsity-time algorithm. - Deterministic polynomial-time algorithm. - Extended abstract appeared in STOC 2014. - Full version on ArXiv, June 2014.