Coresets and Sketches for High Dimensional Subspace Approximation Problems∗

Dan Feldman†    Morteza Monemizadeh‡    Christian Sohler§    David P. Woodruff¶

Abstract

We consider the problem of approximating a set P of n points in R^d by a j-dimensional subspace under the ℓ_p measure, in which we wish to minimize the sum of ℓ_p distances from each point of P to this subspace. More generally, the F_q(ℓ_p)-subspace approximation problem asks for a j-subspace that minimizes the sum of qth powers of ℓ_p-distances to this subspace, up to a multiplicative factor of (1 + ε).

We develop techniques for subspace approximation, regression, and matrix approximation that can be used to deal with massive data sets in high dimensional spaces. In particular, we develop coresets and sketches, i.e. small space representations that approximate the input point set P with respect to the subspace approximation problem. Our results are:

• A dimensionality reduction method that can be applied to F_q(ℓ_p)-clustering and shape fitting problems, such as those in [8, 15].

• The first strong coreset for F_1(ℓ_2)-subspace approximation in high-dimensional spaces, i.e. of size polynomial in the dimension of the space. This coreset approximates the distances to any j-subspace (not just the optimal one).

• A (1 + ε)-approximation algorithm for the j-dimensional F_1(ℓ_2)-subspace approximation problem with running time nd(j/ε)^{O(1)} + (n + d) · 2^{poly(j/ε)}.

• A streaming algorithm that maintains a coreset for the F_1(ℓ_2)-subspace approximation problem and uses space of d · (j · 2^{√(log n)}/ε²)^{poly(j)} (weighted) input points.

• Streaming algorithms for the above problems with bounded precision in the turnstile model, i.e., when coordinates appear in an arbitrary order and undergo multiple updates. We show that bounded precision can lead to further improvements. We extend results of [7] for approximate linear regression, distances to subspace approximation, and optimal rank-j approximation, to error measures other than the Frobenius norm.

1 Introduction

The analysis of high-dimensional massive data sets is an important task in data mining, machine learning, statistics and clustering. Typical applications include: pattern recognition in computer vision and image processing, bioinformatics, internet traffic analysis, web spam detection, and classification of text documents. In these applications, we often have to process huge data sets that do not fit into main memory. In order to process these very large data sets, we require streaming algorithms that read the data in a single pass and use only a small amount of memory. In other situations, data is collected in a distributed way, and shall be analyzed centrally. In this case, we need to find a way to send a small summary of the data that contains enough information to solve the problem at hand.

The second problem one has to overcome is the dimensionality of the data. High-dimensional data sets are often hard to analyze, and at the same time many data sets have low intrinsic dimension. Therefore, a basic task in data analysis is to find a low dimensional space, such that most input points are close to it.

∗Supported in part by DFG project So 514/4-2.
†School of Computer Science; Tel Aviv University; Tel Aviv 69978, Israel; [email protected]
‡Department of Computer Science, University of Dortmund, Germany; [email protected]
§Department of Computer Science, University of Dortmund, Germany; [email protected]
¶IBM Almaden Research Center, San Jose, CA; [email protected]
A well-known approach to this problem is principal component analysis (PCA) which, for a given set of n points in d-dimensional space, computes a linear j-dimensional subspace, such that the sum of squared distances to this subspace is minimized. Since this subspace is given by certain eigenvectors of the corresponding covariance matrix, one can compute it in O(min{nd², dn²}) time. However, for massive data sets this computation may already be too slow. Therefore, the problem of approximating the optimal subspace in linear time has been studied in the standard and in the streaming model of computation. Depending on the problem, it is also interesting to study other error measures, like the sum of ℓ_p-distances to the subspace or, more generally, the sum of qth powers of ℓ_p-distances, and other related problems like linear regression [8] or low-rank matrix approximation [12]. For example, an advantage of the sum of distances measure is its robustness in the presence of outliers when compared to the sum of squared distances measure. However, unlike for the sum of squared errors, no closed formula exists for the optimal solution, even for the case of j = 1 (a line) in three-dimensional space [20].

In this extended abstract we mainly focus on the problem of approximating the optimal j-dimensional subspace with respect to the sum of distances. That is, we are given a set P of n points with the objective to find a j-space C that minimizes cost(P, C) = Σ_{p∈P} min_{c∈C} ||p − c||_2. We call this problem the F_1(ℓ_2)-subspace approximation problem. Most of our results generalize (in a non-trivial way) to F_q(ℓ_p)-subspace approximations. Details will be given in the full version of this paper. As discussed above, we are interested in developing algorithms for huge high-dimensional point sets. In this case we need to find small representations of the data that approximate the original data, which allows us to solve the problem in a distributed setting or for a data stream.

In this paper, we develop such representations and apply them to develop new approximation and streaming algorithms for subspace approximation. In particular, we develop strong coresets and sketches, where the coresets apply to the case of unbounded and the sketches to bounded precision arithmetic.

A strong coreset [1, 16] is a small weighted set of points such that for every j-subspace of R^d, the cost of the coreset is approximately the same as the cost of the original point set. In contrast, weak coresets [14, 12, 9] are useful only for approximating the optimal solution. One of the benefits of strong coresets is that they are closed under the union operation, which is, for example, desirable in a distributed scenario as sketched below.

Application scenarios. For an application of coresets and/or sketches, consider the following scenario. We are aggregating data at a set of clients and we would like to analyze it at a central server by first reducing its dimensionality via subspace approximation, projecting the data on the subspace and then clustering the projected points. Using such a client-server architecture, it is typically not feasible to send all data to the central server. Instead, we can compute coresets of the data at the clients and collect them centrally. Then we solve the subspace approximation problem on the union of the coresets and send the computed subspace to all clients. Each client projects the points on the subspace and computes a coreset for the clustering problem. Again, this coreset is sent to the server and the clustering problem is solved.

Specific applications for the j-subspace approximation problems include the well known "Latent Semantic Analysis" technique for text mining applications, the PageRank algorithm in the context of web search, or the Eigenvector centrality measure in the field of social network analysis (see [12, 9] and the references therein). These problems also motivate the turnstile streaming model that is defined below, which is useful when new words (in latent semantic analysis) or new connections between nodes (in social network analysis) are updated over time, rather than just the insertion or deletion of entire documents and nodes.

Results and relation to previous work.

• We develop a dimensionality reduction method that can be applied to F_q(ℓ_p)-clustering and shape fitting problems [15]. For example, the cluster centers can be point sets, subspaces, or circles.

• We obtain the first strong coreset for F_1(ℓ_p)-subspace approximation in high-dimensional spaces, i.e. of size d · j^{O(j²)} · ε^{−2} · log n (weighted) points. Previously, only a strong coreset construction with an exponential dependence on the dimension of the input space was known [11]. Other previous research [9, 25, 10, 15] in this area constructed so-called weak coresets. A weak coreset is a small set A of points such that the span of A contains a (1 + ε)-approximation to the optimal j-subspace. The authors of [9, 10, 15] show how to find a weak coreset for the sum of squared distances, and in [10] Deshpande and Varadarajan obtain a weak coreset for the sum of qth powers of distances. All of these algorithms are in fact poly(j, ε^{−1})-pass streaming algorithms.

• Our next result is an improved (1 + ε)-approximation algorithm for the j-dimensional subspace approximation problem under the measure of sum of distances. The running time of our algorithm is nd(j/ε)^{O(1)} + (n + d) · 2^{(j/ε)^{O(1)}}. This improves upon the previously best result of nd · 2^{(j/ε)^{O(1)}} [25, 10].

• We then show that one can maintain a coreset in a data stream storing Õ(d · (j · 2^{O(√(log n))}/ε²)^{poly(j)}) (weighted) points. From this coreset we can extract a (1 + ε)-approximation to the optimal subspace approximation problem. We remark that we do not have a bound on the time required to extract the subspace from the data points. Previously, no 1-pass streaming algorithm for this problem was known, except for the case of the F_2(ℓ_2) objective function [7].

• We also study bounded precision in the turnstile model, i.e., when coordinates are represented with O(log(nd)) bits and encode, w.l.o.g., integers from −(nd)^{O(1)} to (nd)^{O(1)}. The coordinates appear in an arbitrary order and undergo multiple updates. Bounded precision is a practical assumption, and using it we can extract a (1 + ε)-approximation to the optimal j-subspace in a data stream in polynomial time for fixed j/ε. Along the way we extend the results of [7] for linear regression, distance to subspace approximation, and best rank-j approximation, to error measures other than the Frobenius norm.

Techniques. In order to obtain the strong coreset, we first develop a dimensionality reduction technique for subspace approximation. The main idea of the dimensionality reduction is to project the points onto a low-dimensional subspace and approximate the difference between the projected points and the original point set. In order to do so, we need to introduce points with negative weights in the coreset.
While this technique only gives an additive error, we remark that for many applications this additive error can be directly translated into a multiplicative error with respect to cost(P, C). The non-uniform sampling technique we are using to estimate the difference between the projected points and the original point set is similar to that in previous work [9, 10, 14, 25]. Although we apply the dimensionality reduction here in the context of subspaces, we remark that this technique can be easily generalized. In fact, we can replace the subspace by any closed set on which we can efficiently project points. For example, we can easily extend the dimensionality reduction method to geometric clustering problems where the centers are low dimensional objects. We can also use the technique for any problem where we are trying to minimize the sum of distances to manifolds [3, 4] (if the projection on the manifold is well-defined), which might, for example, occur in the context of kernel methods [3, 4, 21].

In order to obtain a strong coreset, we apply our dimensionality reduction recursively using the fact that, for a set P of points that are contained in an (i + 1)-subspace of R^d, a coreset for i-subspaces is also a coreset for j-subspaces, 1 ≤ j ≤ d. This recursion is applied until i = 0 and the points project to the origin. This way, we obtain a small weighted sample set S of size O(log(1/δ) · j^{O(j²)}/ε²) in O(ndj² + |S|) time, such that for an arbitrary query j-subspace C, we have (1 − ε) · cost(P, C) ≤ cost(S, C) ≤ (1 + ε) · cost(P, C). This result is then used to construct a strong coreset by showing that the approximation guarantee holds simultaneously for all solutions from a certain grid near the input points.

Using the (standard) merge-and-reduce technique [1, 16] we use our coresets to obtain a streaming algorithm. However, we remark that the use of negatively weighted points leads to some technical complications that do not allow us to get down to log^{O(1)} n space.

With bounded precision, we use space-efficient sketches of ℓ_p-distances, a search over grid points, and a lower bound on the singular values of A to approximate the ℓ_p-regression problem in low dimensions in a data stream. A consequence is that we can efficiently approximate the sum of qth powers of ℓ_p-distances, denoted F_q(ℓ_p), to any fixed j-subspace. We then approximate the optimal j-subspace for F_q(ℓ_2) distances, 1 ≤ q ≤ 2, using a structural result of Shyamalkumar and Varadarajan [24] that proves that a (1 + ε)-approximate solution is spanned by r = O(j/ε) rows (points) of A. That is, there are two matrices B and C of size j × r and r × n, respectively, such that the rows of B · C · A span a (1 + ε)-approximate solution. It is not clear how to find these matrices in the streaming model, but we can use algebraic methods together with bounded precision to limit the number of candidates. We can test candidates offline using the linearity of our sketch.

1.1 Preliminaries A weighted point is a point r ∈ R^d that is associated with a weight w(r) ∈ R. We consider an (unweighted) point r ∈ R^d as having a weight of one. The (Euclidean) distance of a point r ∈ R^d to a set (usually, a subspace) C ⊆ R^d is dist(r, C) := inf_{c∈C} ||r − c||_2. The set C is called a center. For a closed set C, we define proj(r, C) to be the closest point to r in C, if it exists, where ties are broken arbitrarily. We further define the weight of proj(r, C) as w(proj(r, C)) = w(r). Similarly, proj(P, C) = {proj(r, C) | r ∈ P}.
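To make these definitions concrete, here is a minimal numpy sketch (ours, not from the paper) that represents a j-subspace by an orthonormal basis and computes proj(r, C) and dist(r, C); the function names and the basis representation are our own choices.

```python
import numpy as np

def orthonormal_basis(spanning_vectors):
    """Return an orthonormal basis (as rows) of the subspace spanned by the given vectors."""
    # QR on the transpose yields an orthonormal basis of the column space.
    q, _ = np.linalg.qr(np.asarray(spanning_vectors, dtype=float).T)
    return q.T  # shape (j, d)

def proj(r, basis):
    """Project point r onto the linear subspace with orthonormal basis rows (j x d)."""
    return basis.T @ (basis @ r)

def dist(r, basis):
    """Euclidean distance dist(r, C) of r to the subspace C."""
    return np.linalg.norm(r - proj(r, basis))

# Example: a 1-subspace (line through the origin) in R^3.
C = orthonormal_basis([[1.0, 1.0, 0.0]])
r = np.array([2.0, 0.0, 1.0])
print(proj(r, C), dist(r, C))  # proj is (1, 1, 0); dist is sqrt(3)
```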
We let cost(P, C) = Σ_{r∈P} w(r) · dist(r, C) be the weighted sum of distances from the points of P to C. Note that points with negative weights are also assigned to their closest (and not farthest) point of C. For a specific class of centers C we can now define the F_q(ℓ_p)-clustering problem as the problem to minimize Σ_{r∈P} inf_{c∈C} ||r − c||_p^q. For example, if C is the collection of sets of k points from R^d, then the F_1(ℓ_2)-clustering problem is the standard k-median problem with Euclidean distances, and the F_2(ℓ_2)-clustering problem is the k-means problem.

One specific variant of clustering that we are focusing on is the j-subspace approximation problem with F_1(ℓ_p) objective function. The term j-subspace is used to abbreviate j-dimensional linear subspace of R^d. Given a set P of n points in R^d, the j-subspace approximation problem with F_1(ℓ_p) objective function is to find a j-dimensional subspace C of R^d that minimizes cost(P, C).

DEFINITION 1.1. (CORESET [1, 16]) Let P be a weighted point set in R^d, and ε > 0. A weighted set of points Q is called a strong ε-coreset for the j-dimensional subspace approximation problem, if for every linear j-dimensional subspace C of R^d, we have (1 − ε) · cost(P, C) ≤ cost(Q, C) ≤ (1 + ε) · cost(P, C).

2 Dimensionality Reduction for Clustering Problems

In this section we present our first result, a general dimensionality reduction technique for problems that involve sums of distances as a quality measure. Our result is that for an arbitrary fixed subset C ⊆ R^d, cost(P, C) can be approximated by a small weighted sample and the projection of P onto a low dimensional subspace. This result can be immediately applied to obtain a dimensionality reduction method for a large class of clustering problems, where the cluster centers are objects contained in low-dimensional spaces. Examples include: k-median clustering, subspace approximation under ℓ_1-error, variants of projective clustering, and more specialized problems where cluster centers are, for example, discs or curved surfaces.

For these types of problems, we suggest an algorithm that computes a low dimensional weighted point set Q such that, with probability at least 1 − δ, for any fixed query center C, cost(Q, C) approximates cost(P, C) to within a factor of 1 ± ε. The algorithm is a generalization of a technique developed in [14] to compute coresets for the k-means clustering problem. The main new idea that allows us to handle any type of low dimensional center is the use of points that are associated with negative weights.

To obtain this result, we first define a randomized algorithm DIMREDUCTION; see the figure below. For a given (low-dimensional) subspace C∗ and a parameter ε > 0, the algorithm DIMREDUCTION computes a weighted point set Q, such that most of the points of Q lie on C∗, and for any fixed query center C we have E[cost(Q, C)] = cost(P, C), i.e., cost(Q, C) is an unbiased estimator of the cost of C with respect to P. Then we show that, with probability at least 1 − δ, the estimator has an additive error of at most ε · cost(P, C∗). Thus, if we compute an α-approximation C∗ for the (kj)-dimensional subspace approximation problem, i.e. for the (kj)-subspace C' that minimizes cost(P, C'), and replace ε by ε/α, we obtain the result outlined above.

Analysis of Algorithm DIMREDUCTION. Let us fix an arbitrary set C. Our first step will be the following technical lemma that shows that cost(Q, C) is an unbiased estimator for cost(P, C).
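Before turning to the analysis, the following is a minimal numpy sketch (ours, not the authors' code) of the sampling procedure just described: sample r points with probability proportional to their distance to C∗, give the i-th sample weight 1/(r · Pr[s_i]), and add its projection onto C∗ with the negated weight, alongside proj(P, C∗). The helper proj_to_Cstar and the assumption that P is unweighted are our simplifications; the sample size r in the pseudocode box below is roughly ⌈2 lg(2/δ)/ε²⌉.

```python
import numpy as np

def dim_reduction(P, proj_to_Cstar, r, rng=np.random.default_rng(0)):
    """Sketch of DimReduction for an unweighted point set P (n x d array).

    proj_to_Cstar maps an (m, d) array to its projection onto the subspace C*.
    Returns (points, weights): proj(P, C*) with weight 1, r sampled points s_i with
    weight 1/(r * Pr[s_i]), and proj(s_i, C*) with weight -1/(r * Pr[s_i]).
    """
    n = len(P)
    P_proj = proj_to_Cstar(P)
    d_star = np.linalg.norm(P - P_proj, axis=1)      # dist(p, C*)
    cost_star = d_star.sum()                         # cost(P, C*); assumed nonzero here
    pr = d_star / cost_star                          # Pr[p] = dist(p, C*) / cost(P, C*)
    idx = rng.choice(n, size=r, p=pr)
    w = 1.0 / (r * pr[idx])                          # w(s_i) = 1 / (r * Pr[s_i])
    points = np.vstack([P_proj, P[idx], P_proj[idx]])
    weights = np.concatenate([np.ones(n), w, -w])
    return points, weights
```

For any fixed center C, the weighted sum Σ_i weights_i · dist(points_i, C) is then an unbiased estimator of cost(P, C), which is exactly the statement of Lemma 2.1 below.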
Let Xi denote the random variable for the sum of contributions of the sample points si and proj(s− i , C) to C, i.e. − Xi = w(si ) · dist(si , C) + w(s− i ) · dist(proj(si , C)) = w(si ) · dist(si , C) − dist(proj(si , C∗ ), C) . L EMMA 2.1. Let P be a set of points in Rd . Let > 0, 0 < δ ≤ 1, and Q be the weighted set that is returned by the randomized algorithm D IM R EDUCTION(P, C∗ , δ, ). Then E[cost(Q, C)] = cost(P, C) . Proof. We have E[Xi ] X = Pr[p] · w(p) dist(p, C) − dist(proj(p, C∗ ), C) p∈P = X1 1 · Pr[p] dist(p, C) − dist(proj(p, C∗ ), C) r Pr[p] p∈P = 1 · cost(P, C) − cost(proj(P, C∗ ), C) . r By linearity of expectation we have D IM R EDUCTION (P, C∗ , δ, ) m l points s1 , . . . , sr i.i.d. from 1. Pick r = 2 lg(2/δ) 2 P, s.t. each p ∈ P is chosen with probability Xi ] = cost(P, C) − cost(proj(P, C∗ ), C) . i=1 Since algorithm D IM R EDUCTIONcomputes the union of proj(P, C∗ ) and the points si and s− i , we obtain dist(p, C∗ ) . Pr[p] = cost P, C∗ 2. For i ← 1 to r do w(si ) ← E[ r X E[cost(Q, C)] 1 r · Pr[si ] = cost(proj(P, C∗ ), C) + E[ r X Xi ] i=1 = cost(P, C). ∗ 3. Return the multiset Q = proj(P, C ) ∪ ∗ − ∗ {s1 , . . . , sr } ∪ {proj(s− 1 , C ), . . . , proj(sr , C )}, − where si is the point si with weight −w(si ). The lemma follows. t u♣ Our next step is to show that cost(Q, C) is sharply concentrated about its mean. We can then apply this result to low dimensional clustering problems in two steps. First, we observe that, if T HEOREM 2.1. Let P be a set of n points in Rd , and each center is a low dimensional object, i.e. is contained let C∗ be a j-subspace. Let 0 < δ, ≤ 1, and Q be in a low dimensional j-subspace, then k centers are con- the weighted point set that is returned by the algorithm tained in a (kj)-subspace and so clustering them is at least D IM R EDUCTION(P, C∗ , δ, ). Then for a fixed query set as expensive as cost(P, C 0 ), where C 0 is a (kj)-subspace C ⊆ Rd we have |cost P, C) − cost Q, C)| ≤ · cost(P, C∗ ), O(jj+1 )-approximation to the optimal j-dimensional linear subspace with respect to the `1 -error. Such an apwith probability at least 1 − δ. Moreover, only proximation can be computed in O(ndj) time using the algorithm A PPROXIMATE VOLUME S AMPLING by Desh log(1/δ) pande and Varadarajan [10]. Once we have projected all r=O 2 the points on C∗j , we apply the same procedure using a (j − 1)-dimensional linear subspace C∗j−1 . We continue points of Q are not contained in proj(P, C∗ ). This algo- this process until all the points are projected onto a 0rithm runs in O(ndj + r) time. dimensional linear subspace, i.e. the origin. As we will see, this procedure can be used to approximate the cost of Proof. Let P = {p1 , . . . , pn } be a set of n points in Rd . a fixed j-subspace C. We first prove the concentration bound and then discuss the running time. A DAPTIVE S AMPLING (P, j, δ, ) In order to apply Chernoff-Hoeffding bounds [2] we 1. Pj+1 ← P. need to determine the range of values Xi can attain. By the triangle inequality we have 2. For i = j Downto 0 ∗ ∗ dist(si , C) ≤ dist(si , C ) + dist(proj(si , C ), C) (a) C∗i ← A PPROXIMATE VOLUME S AM PLING (Pi+1 , i). and (b) Qi ← D IM R EDUCTION(Pi+1 , C∗i , δ, ). ∗ ∗ dist(proj(si , C ), C) ≤ dist(si , C) + dist(si , C ). (c) P ← proj(P , C∗ ). i i+1 i (d) Si ← Qi \ Pi , where Si consists of the positively and negatively weighted sample points. Sj 3. Return S = i=0 Si . This implies |dist(si , C) − dist(proj(si , C∗ ), C)| ≤ dist(si , C∗ ). 
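To make the recursion of ADAPTIVESAMPLING above concrete (project onto an approximately optimal i-subspace, sample the residual as in DIMREDUCTION, and recurse on the projected points down to the origin), here is a schematic numpy sketch. It is ours, not the authors' code; it assumes a subroutine approx_volume_sampling(P, i) returning an orthonormal basis of an i-subspace that approximates the best i-subspace of P (as provided by APPROXIMATEVOLUMESAMPLING of Deshpande and Varadarajan [10]), and it keeps only the signed sample points of each level, matching S_i = Q_i \ P_i.

```python
import numpy as np

def adaptive_sampling(P, j, r, approx_volume_sampling, rng=np.random.default_rng(0)):
    """Schematic sketch of AdaptiveSampling; returns the signed sample S = S_j ∪ ... ∪ S_0."""
    pts, wts = [], []
    P_cur = np.asarray(P, dtype=float)
    for i in range(j, -1, -1):
        basis = approx_volume_sampling(P_cur, i)                       # C*_i, shape (i, d)
        project = (lambda X, B=basis: X @ B.T @ B) if i > 0 else (lambda X: np.zeros_like(X))
        d_star = np.linalg.norm(P_cur - project(P_cur), axis=1)        # dist(p, C*_i)
        if d_star.sum() == 0:
            break                                                      # all points already lie on C*_i
        pr = d_star / d_star.sum()
        idx = rng.choice(len(P_cur), size=r, p=pr)
        w = 1.0 / (r * pr[idx])
        pts += [P_cur[idx], project(P_cur[idx])]                       # S_i: +samples and -projections
        wts += [w, -w]
        P_cur = project(P_cur)                                         # P_i = proj(P_{i+1}, C*_i)
    return np.vstack(pts), np.concatenate(wts)
```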
We then have |Xi | = w(si ) · dist(si , C) − dist(proj(si , C∗ ), C) Note that P0 is the origin, and so cost(P0 , C) = 0 for any j-subspace C. Let C∗i be an arbitrary but fixed sequence of linear subspaces as used in the algorithm. cost(P, C∗ ) . ≤ w(si ) · dist(si , C ) = r ∗ Thus, −cost(P, C∗ )/r ≤ Xi ≤ cost(P, C∗ )/r. Using additive Chernoff-Hoeffding bounds [2] the result follows. In order to achieve the stated running time, we proceed as follows. We first compute in O(ndj) time for each point p ∈ P its distance dist(p, C∗ ) to C∗ and store it. This can easily be done by first computing an orthonormal basis of C∗ . We sum these distances in order to obtain cost(P, C∗ ) in O(n) time. From this we can also compute Pr[p] and w(p) for each p ∈ P, in O(n) overall time. We let P be the array of probabilities p1 , . . . , pn . It is well known that one can obtain a set of r samples according to a distribution given as a length-n array in O(n + r) time, see [26]. t u♣ T HEOREM 3.1. Let P be a set of n points in Rd , and 0 , δ 0 > 0. Let C be an arbitrary j-dimensional linear subspace. If we call algorithm A DAPTIVE S AMPLING 2 with the parameters δ = O(δ 0 /(j + 1)) and = 0 /jc·j for a large enough constant c, then we get (1 − 0 ) · cost(P, C) ≤ cost(S, C) ≤ (1 + 0 ) · cost(P, C), with probability at least 1 − δ 0 . The running time of the algorithm is 2 O(ndj2 ) + 3 jO(j ) log(1/δ 0 ) . 02 From Dimensionality Reduction to Adaptive Proof. Let C be an arbitrary j-subspace. We split the Sampling proof of Theorem 3.1 into two parts. The first and easy In this section we show how to use Theorem 2.1 to obtain part is to show that cost(S, C) is an unbiased estimator a small weighted set S that, with probability at least 1 − δ, of cost(P, C). The hard part is to prove that cost(S, C) is approximates the cost to an arbitrary fixed j-subspace. sharply concentrated. The first step of the algorithm is to apply our dimensionWe can apply Lemma 2.1 with C∗ = C∗i to obtain ∗ ality reduction procedure with a j-subspace Cj that is an that for any 1 ≤ i ≤ j we have E[cost(Qi , C)] = 5 cost(Pi+1 , C) and hence E[cost(Si , C)] = cost(Pi+1 , C) − cost(Pi , C) . Therefore, E[cost(S, C)] = j X with probability at least 1 − δ. Hence, E[cost(S, C)] − cost(S, C) ! j−1 X ≤O · ii+1 · cost(Pi+1 , C), i=0 E[cost(Si , C)] i=0 = cost(Pj+1 , C) − cost(P0 , C) = cost(P, C) , where the last equality follows from Pj+1 = P and P0 being a set of n points at the origin. Now we show that cost(S, C) is sharply concentrated. We have E[cost(S, C)] − cost(S, C) with probability at least 1 − j · δ. Therefore, for our choice of δ and , a simple induction gives E[cost(S, C)] − cost(S, C) ≤ · jO(j2 ) · cost(P, C) with probability at least 1 − j · δ. Further, the running time is proven as in the proof of Theorem 2.1. t u♣ 4 Coresets In order to construct a coreset, we only have to run algorithm A DAPTIVE S AMPLING using small enough δ. j X One can compute δ by discretizing the space near the ≤ E[cost(Si , C)] − cost(Si , C) . input points using a sufficiently fine grid. Then snapping i=0 a given subspace to the nearest grid points will not change The following observation was used in [11] for j = 1, and the cost of the subspace significantly. If a subspace does generalized later in [13]. not intersect the space near the input points, its cost will L EMMA 3.1. Let C be a j-subspace, and L be an (i + 1)- be high and the overall error can be easily charged. subspace, such that i + 1 ≤ j. Then there exists an iT HEOREM 4.1. 
Let P denote a set of n points in Rd , j ≥ subspace Ci , and a constant 0 < νL ≤ 1, such that for 0, and 1 > 0 , δ 0 > 0, d ≤ n. Let Q be the weighted set any p ∈ L we have dist(p, C) = νL · dist(p, Ci ) . that is returned by the algorithm A DAPTIVE S AMPLING with the parameters δ = 1j · δ 0 /(10nd)10dj and = Let 0 ≤ i ≤ j. By substituting L = Span {Pi+1 } in Lemma 3.1, there is an i-subspace Ci and a constant νL , 0 /jc·j2 for a large enough constant c. Then, with such that probability at least 1 − δ 0 − 1/n2 , Q is a strong E[cost(Si , C)] − cost(Si , C) coreset. The size of the coreset in terms of the number of (weighted) points saved is = cost(Pi+1 , C) − cost(Pi , C) − cost(Si , C) = νL · |cost(Pi+1 , Ci ) − cost(Pi , Ci ) − cost(Si , Ci )| = νL · |cost(Pi+1 , Ci ) − cost(Qi , Ci )| . 2 O(djO(j ) · 0−2 log n). First we prove some auxiliary lemmata. Here, the second equality follows from the fact that the 4.1. Let P be a set of points in a subspace A of solution computed by approximate volume sampling is L EMMA d R . Let M, > 0, M > , and let G ⊆ A be such that spanned by input points and so Pi ⊆ Span {Pi+1 }. We for every c ∈ A, if dist(c, P) ≤ 2M then dist(c, G) ≤ /2. ∗ ∗ apply Theorem 2.1 with C = Ci and C = Ci to obtain Let C ⊆ A be a 1-subspace (i.e, a line that intersects the origin of Rd ), such that dist(p, C) ≤ M for every p ∈ P. |cost(Pi+1 , Ci ) − cost(Qi , Ci )| ≤ · cost(Pi+1 , C∗i ), Then there is a 1-subspace D that is spanned by a point with probability at least 1 − δ. By our choice of C∗i , we in G, such that, also have |dist(p, C) − dist(p, D)| ≤ for every p ∈ P. cost(Pi+1 , C∗i ) ≤ O(ii+1 ) · cost(Pi+1 , Ci ). Proof. Let g be a point such that the angle between the Combining the last three inequalities yields lines C and Span {g} is minimized over g ∈ G. Let D = Span {g}, and p ∈ P. We prove the lemma using |E[cost(Si , C)] − cost(Si , C)| the following case analysis: (i) dist(p, D) ≥ dist(p, C), ≤ νL · · cost(Pi+1 , C∗i ) and (ii) dist(p, D) < dist(p, C). ≤ O(νL · · ii+1 ) · cost(Pi+1 , Ci ) (i) dist(p, D) ≥ dist(p, C): Let c = proj(p, C). We have dist(c, P) ≤ kc − pk = dist(p, C) ≤ M. By the = O( · ii+1 ) · cost(Pi+1 , C) , assumption of the lemma, we thus have dist(c, G) ≤ . Let q ∈ P such that proj(q, C⊥ ) = q 0 . Let c ∈ Rd , such By the construction of D, we also have dist(c, D) ≤ that proj(c, C⊥ ) = c 0 and kc − c 0 k = kq − q 0 k. Hence, q dist(c, G). Combining the last two inequalities yields 2 2 dist(c, D) ≤ . Hence kq − ck = kq − c 0 k − kc 0 − ck q 2 2 dist(p, D) ≤ kp − ck + dist(c, D) ≤ dist(p, C) + . = kq − c 0 k − kq 0 − qk = kq 0 − c 0 k . By (4.1) and the last equation, kq − ck ≤ 2M, i.e, dist(c, P) ≤ kq − ck ≤ 2M. Using the assumption of this lemma, we thus have dist(c, G) ≤ /2, so, clearly dist c 0 , proj(G, C⊥ ) ≤ /2. From the previous paragraph, we conclude that for every c 0 ∈ C⊥ , if dist(c 0 , P 0 ) ≤ 2M then dist c 0 , proj(G, C⊥ ) ≤ /2. Using this, we apply Lemma 4.1 while replacing A with C⊥ , P with P 0 , C with proj(C, C⊥ ) and G with proj(G, C⊥ ). We obtain that there is a 1-subspace D ⊆ C⊥ that is spanned by a point from proj(G, C⊥ ), such that |dist proj(p, C⊥ ), proj(C, C⊥ ) − dist proj(p, C⊥ ), D | ≤ . Since dist proj(p, C⊥ ), proj(C, C⊥ ) = dist(p, C) by the definition of C⊥ , the last two inequalities imply (4.2) |dist(p, C) − dist proj(p, C⊥ ), D | ≤ . (ii) dist(p, D) < dist(p, C): Let q = proj(p, D), and q 0 = proj(q, C). 
We can assume that dist(q, q 0 ) > since otherwise by the triangle inequality, dist(p, C) ≤ dist(p, q) + dist(q, q 0 ) ≤ dist(p, D) + , and we are done. 0 and ` 0 = `/ k`k2 . Now consider the Define ` = q+q 2 0 point r = ` + 2 ` . We claim that r has distance from C and D more than /2. Assume, this is not the case. Then C (the proof for D is identical) intersects a ball B with center r and radius /2. Let r 0 be an intersection point of C with B. Let r 00 be the projection of r 0 on the span of r. Since, B has radius /2, we have that dist(r 00 , r 0 ) ≤ /2. However, the intercept theorem implies that dist(r 00 , r 0 ) > /2, a contradiction. To finish the proof, we observe that dist(p, r) ≤ dist(p, q) + dist(q, `) + dist(`, r) ≤ dist(p, C) + ≤ M + . Using M > the assumption of lemma implies dist(r, G) < /2, but dist(r, C) > /2 and dist(r, D) > /2, which means there is a grid point g 0 for which ∠(Span {g 0 } , C) < ∠(Span {g} , C), contradicting the choice of g. t u♣ Let E be the j-subspace of Rd that is spanned by D and e1 , · · · , ej−1 . Let D⊥ be the (d − 1)-subspace that d L EMMA 4.2. Let P be a set of n points in R , and is the orthogonal complement of D in Rd . Since D ⊆ E, M, > 0, M > . Let G ⊆ Rd be such that for every we have that proj(E, D⊥ ) is a (j − 1)-subspace of Rd . We c ∈ Rd , if dist(c, P) ≤ 2M then dist(c, G) ≤ /2. Let C thus have be a j-subspace, such that dist(p, C) ≤ M − (j − 1) for (4.3) every p ∈ P. Then there is a j-subspace D that is spanned dist(proj(p, C⊥ ), D) = dist(proj(p, D⊥ ), proj(E, D⊥ )) by j points from G, such that = dist(p, E). |dist(p, C) − dist(p, D)| ≤ j for every p ∈ P. Using (4.2), with the assumption of this lemma that dist(p, C) ≤ M − (j − 1), yields Proof. The proof is by induction on j. The base case of j = 1 is furnished by substituting A = Rd in Lemma 4.1. We now give a proof for the case j ≥ 2. Let e1 , · · · , ej denote a set of orthogonal unit vectors on C. Let C⊥ be the orthogonal complement of the subspace that is spanned by e1 , · · · , ej−1 . Finally, fix p ∈ P. The key observation is that for any j-subspace T in Rd that contains e1 , · · · , ej−1 , we have dist(proj(p, C⊥ ), D) ≤ dist(p, C) + ≤ M − (j − 2). By the last inequality and (4.3), we get dist(proj(p, D⊥ ), proj(E, D⊥ )) ≤ M − (j − 2). Using the last equation, by applying this lemma inductively with C as proj(E, D⊥ ), G as proj(G, D⊥ ) and P as proj(P, D⊥ ), we obtain a (j − 1)-subspace F that is dist(p, T ) = dist(proj(p, C⊥ ), proj(T, C⊥ )). spanned by j − 1 points from proj(G, D⊥ ), such that |dist(proj(p, D⊥ ), proj(E, D⊥ )) − dist(proj(P, D⊥ ), F)| ≤ Note that for such a j-subspace T , proj(T, C⊥ ) is a 1(j − 1). Hence, subspace. Let P 0 = proj(P, C⊥ ), and let c 0 ∈ C⊥ such that (4.4) dist(c 0 , P 0 ) ≤ 2M. Hence, there is a point q 0 ∈ P 0 such |dist(p, E) − dist(proj(p, D⊥ ), F)| = that |dist(proj(p, D⊥ ), proj(E, D⊥ )) − dist(proj(p, D⊥ ), F)| 0 0 0 0 (4.1) kq − c k = dist(c , P ) ≤ 2M. ≤ (j − 1). 7 Let R be the j-subspace of Rd that is spanned by D and F. Hence, R is spanned by j points of G. We have |dist(p, C) − dist(p, R)| Let S = Q \ proj(P, C∗ ). Hence, (4.5) |cost(P, C) − cost(Q, C)| = |dist(p, C) − dist(proj(p, D⊥ ), F)| = |cost(P, C) − cost(proj(P, C∗ ), C) − cost(S, C)| ≤ |dist(p, C) − dist(p, E)| ≤ |cost(P, C) − cost(proj(P, C∗ ), C)| + |cost(S, C)|. + |dist(p, E) − dist(proj(p, D⊥ ), F)|. By (4.3), we have dist(p, E) = dist(proj(p, C⊥ ), D). 
Together with the previous inequality, we obtain |dist(p, C) − dist(p, R)| ≤ |dist(p, C) − dist(proj(p, C⊥ ), D)| + |dist(p, E) − dist(proj(p, D⊥ ), F)|. Combining (4.2) and (4.4) in the last inequality proves the lemma. t u♣ N ET (P, M, ) 1. G ← ∅. We now bound each term in the right hand side of (4.5). Let si denote the ith point of S, 1 ≤ i ≤ |S|. By the triangle inequality, |dist(si , C) − dist(proj(si , C∗ ), C)| ≤ dist(si , C∗ ), for every 1 ≤ i ≤ |S|. Hence, |cost(S, C)| X ∗ = w(si )(dist(si , C) − dist(proj(si , C ), C)) 1≤i≤|S| X ≤ w(si )|dist(si , C∗ )| = cost(P, C∗ ). 1≤i≤|S| 2. For each p ∈ P Do (a) Gp ← vertex set of a d-dimensional grid that is centered at p. The side length of the grid is 2M, √ and the side length of each cell is /(2 d). (b) G ← G ∪ Gp . 3. Return G. Similarly, |cost(P, C) − cost(proj(P, C∗ ), C)| X X = dist(p, C) − dist(proj(p, C∗ ), C) p∈P p∈P X ≤ dist(p, C∗ ) p∈P 0 L EMMA 4.3. Let 0 < , δ < 1, and P be a set of n points in Rd with d ≤ n. Let C∗ be a jsubspace, and Q be the weighted set that is returned by the algorithm D IM R EDUCTION with the parameter δ = δ 0 /(10nd)10jd . Then, with probability at least 1 − δ 0 − 1/n2 , for every j-subspace C ⊆ Rd we have (simultaneously) = cost(P, C∗ ). Combining the last two inequalities in (4.5) yields |cost(P, C) − cost(Q, C)| ≤ |cost(P, C) − cost(proj(P, C∗ ), C)| + |cost(S, C)| ≤ 2cost(P, C∗ ) ≤ · cost(P, C). |cost P, C)−cost Q, C)| ≤ ·cost(P, C∗ )+·cost(P, C). The following two propositions prove the lemma. P ROPOSITION 4.1. For every j-subspace C of Rd such that cost(P, C) > 2cost(P, C∗ )/, t u♣ P ROPOSITION 4.2. Let 0 < < 1 and d ≤ n. With probability at least 1 − δ 0 − 1/n2 , for every j-subspace C such that we have |cost(P, C) − cost(Q, C)| ≤ · cost(P, C). Proof. Let C be a j-subspace such that cost(P, C) > 2cost(P, C∗ )/. cost(P, C) ≤ 2cost(P, C∗ )/, we have (simultaneously) |cost(P, C) − cost(Q, C)| ≤ · cost(P, C) + cost(P, C∗ ). Proof. Let G denote the set that is returned by the with (4.6) yields algorithm N ET(P ∪ proj(P, C∗ ), M, 0 ), where M = 10cost(P, C∗ )/, and 0 = cost(P, C∗ )/n10 . Note that (4.7) G is used only for the proof of this proposition. |cost(P, C) − cost(Q, C)| By Theorem 2.1, for a fixed center D ∈ G we have ≤ |cost(P, C) − cost(P, D)| + |cost(P, D) − cost(Q, D)| + |cost(Q, D) − cost(Q, C)| |cost(P, D) − cost(Q, D)| (4.6) ≤ (1 + )|cost(P, C) − cost(P, D)| + cost(P, C) ≤ · cost(P, D) + |cost(Q, D) − cost(Q, C)| ≤ · cost(P, C) + · |cost(P, C) − cost(P, D)|, ≤ cost(P, C) X |w(p)| · |dist(p, C) − dist(p, D)| +3 with probability at least p∈P∪Q ≤ cost(P, C) + 3j 0 δ0 δ0 ≥1− . 1−δ≥1− 10jd (10nd) |G|j X |w(p)|, p∈P∪Q with probability at least 1 − δ 0 . Using the union bound, (4.6) holds simultaneously Let s ∈ S be such that w(s) > 0. By the construction for every j-subspace D that is spanned by j points from of S, we have G, with probability at least 1 − δ 0 . Let p ∈ P. By the assumption of this claim, we have dist(s, C∗ ) ≥ cost(P, C∗ )/(n2 |S|) with probability at least 1 − 1/(n2 |S|). Hence, with probability at least 1 − 1/n2 , for every s ∈ S we have dist(p, C) ≤ cost(P, C) ≤ 2cost(P, C∗ )/, and also |w(s)| = dist(proj(p, C∗ ), C) ≤ kproj(p, C∗ ) − pk + dist(p, C) ≤ dist(p, C∗ ) + Combining the last two equations with (4.7) yields ∗ 2cost(P, C ) |cost(P, C) − cost(Q, C)| X ≤ cost(P, C) + 3j 0 |w(p)| ∗ ≤ cost(P, C∗ ) ≤ n2 . |S|dist(s, C∗ ) 3cost(P, C ) . 
p∈P∪Q ≤ cost(P, C) + cost(P, C∗ ), By the last two inequalities, for every p ∈ P ∪ proj(P, C∗ ) we have with probability at least 1 − 1/n2 − δ 0 , as desired. t u♣ 3cost(P, C∗ ) 10cost(P, C∗ ) cost(P, C∗ ) Proof. [of Theorem 4.1] Let Pi , Si , Qi and C∗i de≤ − note the set that are defined in the ith iteration of 0 A DAPTIVE S AMPLING, for every 0 ≤ i ≤ j. Us≤ M − (j − 1) , ing Theorem 2.1, the ith iterations of the algorithm A DAPTIVE S AMPLING takes O(ndj + |Si |) time. Hence, where in the last deviation we used the assumption j ≤ the algorithm takes O(ndj2 + |Q|) time. For every i, d ≤ n and 0 ≤ ≤ 1. By the construction of G, for 0 ≤ i ≤ j, we have |Si | = O(log(1/δ)/2 ). Hence, every c ∈ Rd , if dist(c, P) ≤ 2M, then dist(c, G) ≤ 0 /2. Using this, applying Lemma 4.2 with P ∪ proj(P, C∗ ) [ j log(1/δ) yields that there is a j-subspace D that is spanned by j |Q| = Si = O 2 points from G, such that 0≤i≤j 0 O(j2 ) log(1/δ ) ≤ j · . 02 |dist(p, C) − dist(p, D)| ≤ j · 0 , dist(p, C) ≤ This bounds the size of Q and its construction time. For the correctness, let 0 ≤ i ≤ j. for every p ∈ P ∪ proj(P, C∗ ). Using the last equation 9 Fix 0 ≤ i ≤ j. By the previous lemma and our for every 0 ≤ i ≤ j. choice of δ, we conclude that, with probability at least Combining the last inequalities together yields, 1 − δ 0 /j − 1/n2 , for any j-subspace C we have Pr[|cost(P, C) − cost(Q, C)| ≤ 0 cost(P, C)] |cost(Pi+1 , C) − cost(Qi , C)| ≥ 1 − δ 0 − 1/n2 . ≤ cost(Pi+1 , C) + cost(Pi+1 , C∗i ) ≤ ( 0 /2)cost(Pi+1 , C) + By construction of C∗i , t u♣ 0 j cost(Pi+1 , C∗i ). 4j+4 we have 5 Subspace Approximation In this section we show how to construct in cost(Pi+1 , C∗i ) ≤ j2j+2 min cost(Pi+1 , C 0 ) 0 O(nd · poly(j/) + (n + d) · 2poly(j/) ) C ≤ j2j+2 cost(Pi+1 , C). Combining the last two inequalities yields |cost(Pi+1 , C) − cost(Qi , C)| ≤ 0 j2j+2 · cost(Pi+1 , C), with probability at least 1 − δ 0 /j − 1/n2 . Summing the last equation over all the j iterations of A DAPTIVE S AMPLING yields |cost(P, C) − cost(Q, C)| [ = |cost(P, C) − cost(Si , C)| ≤| 0≤i≤j X cost(Pi+1 , C) − cost(Pi , C) − cost(Si , C) | 0≤i≤j =| X cost(Pi+1 , C) − cost(Qi , C) | 0≤i≤j ≤ X |cost(Pi+1 , C) − cost(Qi , C)| 0≤i≤j 0 ≤ j2j+2 X time, a small set C of candidate solutions (i.e., jsubspaces) such that C contains a (1+/3)-approximation to the subspace approximation problem, i.e., for the point set P, one of the j-subspaces in C is a (1 + /3)approximation to the optimal j-subspace. Given such a candidate set C, we run the algorithm A DAPTIVE S AM PLING with parameters δ/|C| and /6. By the union bound it follows that every C ∈ C is approximated by a factor of (1 ± /6) with probability at least 1 − δ. It follows that the cost of the optimal candidate solution in C is a 1 + O()-approximation to the cost of the optimal j-subspace of the original set of points P. The intuition behind the algorithm and the analysis. The first step of the algorithm is to invoke approximate volume sampling due to Deshpande and Varadarae 4+ jan [10] to obtain in O(nd · poly(j/)) time, an O(j 3 (j/) )-dimensional subspace A that contains a (1+/6)approximation j-subspace. We use C0 to denote a linear j-dimensional subspace of A with cost(P, C0 ) ≤ (1 + /6) · Opt. cost(Pi+1 , C), ≤ cost(Pi+1 , C) + cost(Pi , C) − cost(Pi+1 , C) Our candidate set C will consist of subspaces of A. Then the algorithm proceeds in j phases. In phase i, the algorithm S computes a set Gi of points in A. We define G≤i = 1≤l≤i Gl . 
The algorithm maintains, with probability at least 1 − i·δ j , the invariant that i points from G≤i span an i-subspace Hi such that there exists another j-subspace Ci , Hi ⊆ Ci ⊆ A, with ≤ cost(Pi+1 , C) + |cost(Pi , C) − cost(Pi+1 , C)| cost(P, Ci ) ≤ (1 + /6) · (1 + γ)i · Opt 0≤i≤j with probability at least 1 − δ 0 − 1/n2 . We also have cost(Pi , C) = cost(Pi+1 , C) + νL · |cost(Pi , Ci ) − cost(Pi+1 , Ci )| ≤ cost(Pi+1 , C) + νL · cost(Pi+1 , C∗i ) ≤ cost(Pi+1 , C) + νL · ii+1 · cost(Pi+1 , Ci ) = cost(Pi+1 , C) + ii+1 · cost(Pi+1 , C) = (ii+1 + 1) · cost(Pi+1 , C) Hence, cost(Pi+1 , C) ≤ j2j+2 cost(P, C) ≤ (1 + /3) · Opt, where Opt is the cost of an optimal subspace (not necessarily contained in A) and γ = /(12j) is an approximation parameter. The candidate set C would be the spans of every j points from G≤j . 5.1 The algorithm. In the following, we present our algorithm to compute the candidate set. We use H⊥ i to denote the orthogonal complement of a linear subspace Notation C A Opt Ci Hi H⊥ i C∗i Ni rl q in case 1 q in case 2 q0 ` `0 C⊥ i Li Ci+1 N (p, R, A, γ) Meaning The set of candidate solutions (i.e., j-subspaces) The poly(j/)-subspace that contains a (1 + /6)-approximation The cost of an optimal subspace (not necessarily contained in A) A j-subspace which is a (1 + γ)i -approximation to the optimal j-subspace of P An i-subspace which is the span of i points from G≤i where Hi ⊆ Ci The orthogonal complement of the linear subspace Hi in Rd The projection of Ci on H⊥ i {p ∈ Pi+1 : dist(p, Ci ) ≤ 2 · cost(P, Ci ) · Pr[p]} A point in Ni ⊆ H⊥ i that has Pr[rl ] > 0 proj(rl , C∗i ) A point in C∗i ∩ B(proj(rl , C∗i ), 5 · Opt · Pr[rl ], A ∩ H⊥ i ) s.t. dist(q, 0) ≥ 5 · Opt · Pr[rl ] A point in N (rl , 10 · Opt · Pr[rl ], A ∩ H⊥ , γ/20) s.t dist(q, q 0 ) ≤ γ2 · Opt · Pr[rl ] i Span {q} Span {q 0 } The orthogonal complement of ` in Ci d The orthogonal complement of C⊥ i in R ⊥ A j-subspace which is the span of Ci and ` 0 A γ-net of a ball B(p, R, A) in the subspace A with radius R centered at p Table 1: Notation in Section 5. Hi in Rd . We use N (p, R, A, γ) to denote a γ-net of a ball B(p, R, A) in the subspace A with radius R centered at p, i.e. a set of points such that for every point t ∈ B(p, R, A) there exists a point q in N (p, R, A, γ) with dist(t, q) ≤ γR. It is√easy to see that a γ-net of a ball 0 B(p, R, A) of size O( d 0 /γd ) (See [2]) exists, where d 0 is the dimension of A. The input to the algorithm is the point set P 0 = proj(P, A) in the space A, an idimensional linear subspace Hi and the parameters i and γ. The algorithm is invoked with i = 0 and Hi = 0 and j being the dimension of the subspace that is sought. Notice that the algorithm can be carried out in the space A since Hi ⊆ A and so the projection of P 0 to H⊥ i will be inside A. Note, that although the algorithm doesn’t know the cost Opt of an optimal solution, it is easy to compute the cost of an O(jj+1 )-approximation using approximate volume sampling. From this approximation we can generate O(j log j) guesses for Opt, one of which includes a constant factor approximation. C ANDIDATE S ET (P 0 , Hi , i, j, γ) 1. if i = j then return Hi . 2. Pi+1 ← proj(P 0 , H⊥ i ). 3. Sample s = dlog(j/δ)e points r1 , . . . , rs i.i.d. from Pi+1 s.t. each p ∈ Pi+1 is chosen with probability X Pr[p] = dist(p, 0)/ dist(q, 0) q∈Pi+1 4. G ← Si+1 s N (rl , 10 · Opt · Pr[rl ], A ∩ H⊥ i , γ/20). l=1 S 5. return q∈Gi+1 C ANDIDATE S ET(P 0 , Span {Hi ∪ q} , i + 1, j, γ). 5.2 Invariant of algorithm C ANDIDATE S ET. 
We will prove that the algorithm satisfies the following lemma. L EMMA 5.1. Let Ci ⊆ A be a subspace that contains Hi . Assume that Ci is a (1 + γ)i -approximation to the optimal j-subspace of P. Then, with probability at least 1 − δ/j, there is a j-subspace Ci+1 ⊆ A containing Hi and a point from Gi+1 , such that Ci+1 is a (1 + γ)i+1 approximation to the optimal j-subspace of P. Once the lemma is proved, we can apply it inductively to show that with probability at least 1 − δ we have 11 a subspace Cj that is spanned by j points from G≤j and We use C∗i to denote the projection of Ci on H⊥ i . Note that has that C∗i has j − i dimensions as Hi ⊆ Ci . The idea is to find a point q from Gi+1 ⊆ H⊥ i ∩ A such that we can cost(P, Cj ) ≤ (1 + γ)j · cost(P, C0 ) rotate C∗i in a certain way to contain q and this rotation ≤ (1 + /6) · (1 + γ)j · Opt will not change the cost with respect to P significantly. Let ≤ (1 + /6) · (1 + /12) · Opt ≤ (1 + /3) · Opt. The running time of the algorithm is dominated by the projections in line 2, j of which are carried out for each element of the candidate set. Since the input P 0 to the algorithm is in the subspace A, its running time is n · 2poly(j/) . To initialize the algorithm, we have to compute space A and project all points on A. This can be done in O(nd · poly(j/)) time [10]. Finally, we run algorithm A DAPTIVE S AMPLING to approximate the cost for every candidate solution generated by algorithm C ANDIDATE S ET. For each candidate solution, we have to project all points on its span. This can be done in O(d · 2poly(j/) ) time, since the number of candidate solutions is 2poly(j/) and the size of the sample is poly(j/). Thus we can summarize the result in the following theorem setting δ = 1/6 in the approximate volume sampling and in our algorithm. T HEOREM 5.1. Let P be a set of n points in Rd , 0 < < 1 and 1 ≤ j ≤ d. A (1 + )-approximation for the j-subspace approximation problem can be computed, with probability at least 2/3, in time O(nd · poly(j/) + (n + d) · 2poly(j/) ). Ni = {p ∈ Pi+1 : dist(p, Ci ) ≤ 2 · cost(P, Ci ) · Pr[p]}. Ni contains all points that are close to the subspace Ci , where closeness is defined relative to the distance from the origin. We will first show that by sampling points with probability proportional to their distance from the origin, we are likely to find a point from Ni . P ROPOSITION 5.1. Pr[∃rl , 1 ≤ l ≤ s : rl ∈ Ni ] ≥ 1 − δ/j . Proof. We first prove by contradiction that the probability to sample a point from Ni is at least 1/2. Assume that X Pr[p] > 1/2. p∈Pi+1 \Ni Observe that cost(P, Ci ) ≥ cost(P 0 , Ci ) since Ci ⊆ A and P 0 = proj(P, A). Further, cost(P 0 , Ci ) = cost(Pi+1 , Ci ) since Pi+1 = proj(P 0 , H⊥ i ) and Hi ⊆ Ci . It follows that cost(P, Ci ) ≥ cost(P 0 , Ci ) = cost(Pi+1 , Ci ) X ≥ dist(p, Ci ) 5.3 Overview of the proof of Lemma 5.1. The basic p∈Pi+1 \Ni idea of the proof follows earlier results of [25]. We X show that by sampling with probability proportional to Pr[p] > 2 · cost(P, Ci ) · the distance from the origin, we can find a point p p∈Pi+1 \Ni whose distance to the optimal solution is only a constant > cost(P, Ci ), factor more than the weighted average distance (where the weighting is done according to the distance from the which is a contradiction. Hence, origin). If we then consider a ball with radius a constant X times the average weighted distance and that is centered at Pr[rl ∈ Ni ] = Pr[p] ≥ 1/2. 
p, then this ball must intersect the projection of the current p∈Ni space Ci solution on H⊥ i . If we now place a fine enough net on this ball, then there must be a point q of this net It follows that that is close to the projection. We can then define a certain Pr[∃l, 1 ≤ l ≤ s : rl ∈ Ni ] ≥ 1 − (1 − 1/2)s rotation of the current subspace to contain q to obtain the ≥ 1 − δ/j. new subspace Ci+1 . This rotation increases the cost only slightly and Ci+1 contains Span {Hi ∪ {q}}. 5.4 The complete proof of Lemma 5.1. We assume that there is a j-subspace Ci , Hi ⊆ Ci ⊆ A, with cost(P, Ci ) ≤ (1 + γ)i · cost(P, C0 ) ≤ (1 + /3) · Opt. t u♣ We now make a case distinction in order to prove Lemma 5.1. Case 1: Points are on average much closer to Ci than to the origin. We first consider the case that X X dist(p, Ci ). dist(p, 0) ≥ 4 This implies dist(proj(p, `), ` 00 ) = dist(proj(p, `), 0) · sin β ≤ dist(p, 0) · sin α. p∈Pi+1 p∈Pi+1 We need the following claim that the distance of q to the origin is not much smaller than the distance of rl to the origin. In this case, the points in Ni are much closer to Ci than to the origin. Now let rl be a point from Ni ⊆ H⊥ i that has Pr[rl ] > 0. Since Ci ⊆ A and dist(rl , C∗i ) P ROPOSITION 5.2. If X X dist(p, Ci ) dist(p, 0) ≥ 4 = dist(rl , Ci ) ≤ 2 · cost(P, Ci ) · Pr[rl ] we know that B(rl , 10 · Opt · Pr[rl ], A ∩ H⊥ i ) intersects C∗i . This also implies that q := proj(rl , C∗i ) lies in then B(rl , 10 · Opt · Pr[rl ], A ∩ H⊥ i ). Hence, there is a point 0 q ∈ N (rl , 10 · Opt · Pr[rl ], A ∩ p∈Pi+1 p∈Pi+1 dist(q, 0) ≥ H⊥ i , γ/20) 1 dist(rl , 0). 2 Proof. Since rl ∈ Ni we have with dist(q, q 0 ) ≤ γ2 · Opt · Pr[rl ]. 0 Let ` be the line through q and let ` be the line dist(rl , 0) through q 0 . Let C⊥ . dist(rl , Ci ) ≤ 2Opt P i denote the orthogonal complement p∈Pi+1 dist(p, 0) of ` in Ci . Define the subspace Ci+1 as the span of C⊥ i and ` 0 . Since q lies in C∗i (and hence in H⊥ i ) we have By our assumption we have that C⊥ i contains Hi . Hence, Ci+1 also contains Hi . It X X remains to show that dist(p, 0) ≥ 4 dist(p, Ci ), p∈P p∈P i+1 i+1 cost(P, Ci+1 ) ≤ (1 + γ) · cost(P, Ci ). which implies dist(rl , Ci ) ≤ 21 dist(rl , 0) by plugging in into the previous inequality. We further have dist(rl , Ci ) = dist(rl , C∗i ) and so We have (5.8) (5.9) cost(P, Ci+1 ) − cost(P, Ci ) X ≤ dist(proj(p, Ci ), Ci+1 ) dist(q, 0) ≥ dist(rl , 0) − dist(rl , C∗i ) ≥ p∈P (5.10) = X dist(proj(proj(p, A), Ci ), Ci+1 ) (5.11) = dist(proj(p, Ci ), Ci+1 ) We get p∈P 0 (5.12) = t u♣ by the triangle inequality. p∈P X 1 dist(rl , 0) 2 X dist(proj(p, Ci ), Ci+1 ) sin α p∈Pi+1 ≤ dist(q, q 0 ) dist(q, 0) 1/2 · γ · Opt · Pr[rl ] 1/2 · dist(rl , 0) γ · Opt · dist(rl , 0) P dist(rl , 0) · p∈Pi+1 dist(p, 0) where Step 5.10 follows from the fact that Ci ⊆ A and ≤ so proj(proj(p, A), Ci ) = proj(p, Ci ) for all p ∈ Rd and Step 5.12 follows from Hi ⊆ Ci , Ci+1 . = Now define Li to be the orthogonal complement of d d C⊥ γ · Opt i in R . Note that for any p ∈ R and its projection 0 = P . p = proj(p, Li ) we have dist(p, Ci ) = dist(p 0 , Ci ) and p∈Pi+1 dist(p, 0) dist(p, Ci+1 ) = dist(p 0 , Ci+1 ). Further observe that Ci corresponds to the line ` in Li and Ci+1 corresponds to a The latter implies line ` 00 = proj(` 0 , Li ). Define α to be the angle between ` X and ` 0 and β the angle between ` and ` 00 . Note that α ≥ β. cost(P, Ci+1 ) − cost(P, Ci ) ≤ dist(p, 0) · sin α Then p∈Pi+1 dist(proj(p, Ci ), Ci+1 ) = dist(proj(proj(p, Ci ), Li ), ` 00 ) ≤ γ · Opt = dist(proj(p, `), ` 00 ). 
≤ γ · cost(P, Ci ) 13 Now define Li to be the orthogonal complement of C⊥ Note that for any p ∈ Rd and its projection i . 0 Case 2: Points are on average much closer to the p = proj(p, Li ) we have dist(p, Ci ) = dist(p 0 , Ci ) and dist(p, Ci+1 ) = dist(p 0 , Ci+1 ). Further observe that Ci origin than to Ci . corresponds to the line ` in Li and Ci+1 corresponds to a Now we consider the case that line ` 00 = proj(` 0 , Li ). X X dist(p, Ci ). dist(p, 0) < 4 Define α to be the angle between ` and ` 0 and β the p∈Pi+1 p∈Pi+1 angle between ` and ` 00 . Note that α ≥ β. Then which implies the lemma in Case 1. Let rl be a point from Pi+1 ⊆ H⊥ i that is in Ni and that has Pr[rl ] > 0. Since Ci ⊆ A and dist(rl , C∗i ) = dist(rl , Ci ) ≤ 2 · cost(P, Ci ) · Pr[rl ], we know that B(rl , 10 · Opt · Pr[rl ], A ∩ H⊥ i ) intersects C∗i . This implies that proj(rl , C∗i ) lies also in B(rl , 10 · Opt · Pr[rl ], A ∩ H⊥ i ). In fact, dist(proj(p, Ci ), Ci+1 ) = dist(proj(proj(p, Ci ), Li ), ` 00 ) = dist(proj(p, `), ` 00 ). This implies dist(proj(p, `), ` 00 ) = dist(proj(p, `), 0) · sin β ≤ dist(p, 0) · sin α. We have sin α ≤ 2 · cost(P, Ci ) · Pr[rl ] ≤ 5 · Opt · Pr[rl ] Similar to the first case it follows that X cost(P, Ci+1 ) − cost(P, Ci ) ≤ dist(p, 0) · sin α implies that B(proj(rl , C∗i ), 5 · Opt · Pr[rl ], A ∩ H⊥ i ) p∈Pi+1 ⊆ B(rl , 10 · Opt · Pr[rl ], A ∩ H⊥ i ). ≤ Since C∗i ⊆ A ∩ H⊥ i we also have that there is a point X γ · dist(p, 0). 5 p∈Pi+1 q ∈ C∗i ∩ B(proj(rl , C∗i ), 5 · Opt · Pr[rl ], A ∩ H⊥ i ) Since we are in Case 2 we have X dist(p, 0) < 4 · cost(Pi+1 , Ci ), with dist(q, 0) ≥ 5 · Opt · Pr[rl ]. Now consider the set which is the intersection of N (rl , 10 · Opt · Pr[rl ], A ∩ H⊥ i , γ/20) γ γ · Opt · Pr[rl ] ≤ . 5 · Opt · Pr[rl ] 5 p∈Pi+1 which implies cost(P, Ci+1 ) − cost(P, Ci ) ≤ with X γ · dist(p, 0) 5 p∈Pi+1 B(proj(rl , Ci ), 5 · Opt · Pr[rl ], A ∩ H⊥ i ), ≤ γ · cost(Pi+1 , Ci ) ≤ γ · cost(P, Ci ). which is a (γ/10)-net of B(proj(rl , Ci ), 5 · Opt · Pr[rl ], A ∩ H⊥ i ). Hence, there is a point This concludes the proof of Lemma 5.1. 6 Streaming Algorithms in the Read-Only Model We can maintain our coreset with √ !poly(j) O( log n) j2 γ · 5 · Opt · Pr[rl ] ≤ γ · Opt · Pr[rl ]. with dist(q, q 0 ) ≤ 10 Õ d 2 0 Let ` be the line through q and let ` be the line through q 0 . Let C⊥ i denote the orthogonal complement (weighted) points via known merge and reduce technique of ` in Ci . Define the subspace Ci+1 as the span of C⊥ i 0 ∗ ⊥ and ` . Since q lies in Ci we have that Ci contains Hi . [1, 16] in the read-only streaming model where only insertion of a point is allowed. The presence of negative Hence, Ci+1 also contains Hi . points makes the process of maintaining a coreset harder. It remains to show that The problem is that the sum of the absolute weights of the cost(P, Ci+1 ) ≤ (1 + γ) · cost(P, Ci ). coreset is about three times the size of the input point set. q 0 ∈ N (rl , 10 · Opt · Pr[rl ], A ∩ H⊥ i , γ/20) If we now apply our coreset construction several times (as is required during merge and reduce), we blow up the sum of absolute weights with each application by a constant factor. This blow-up, together with the fact that we have to estimate the difference between positively and negatively weighted points, cannot be controlled as well as in the case of a standard merge and reduce approach, and requires taking larger sample sizes with every merge step. The proof of the following theorem will appear in the full version of this paper. Proof. 
Let σj+1 denote the (j + 1)-th singular value of A. By Lemma 7.1, we have ||A − Aj ||2 ≥ σj+1 ≥ 1/∆5j/2 , where the first inequality follows by standard properties of the singular values. t u♣ T HEOREM 6.1. Let C be a fixed j-subspace of Rd . Let P be a set of n points in Rd , j ≥ 0, and , δ > 0. In the read-only streaming model we can maintain two sets S 00 and Q using √ !poly(j) log n j · 2 Õ d 2 C OROLLARY 7.2. Let A be an n × d matrix, and p, q = O(1). If rank(A) ≥ j + 1 then an n × d matrix A, let Fq (`p )(A) = Pn For (i) q i ||A ||q p , where A is the i-th row of A. Let Aj,p i=1 be the matrix which minimizes Fq (`p )(B − A) over every n × d matrix B of rank j. The next corollary follows from relations between norms and singular values. O(j) Fq (`p )(A − Aq . j,p ) ≥ 1/∆ Proof. By Corollary 7.1, F2 (`2 )(Aj − A) = n X ||(Aj )i − Ai ||22 ≥ 1/∆5j . i=1 weighted points such that, with probability at least 1 − δ, We use the following relations between norms. Let x be a d-dimensional vector. For any a ≥ b, |cost(P, C) − cost(S , C) − cost(Q, C)| ≤ · cost(P, C). 00 Moreover, Õ() notation hides poly(log n) factors. 7 ||x||b (a−b)/ab d ≤ ||x||a ≤ ||x||b . Streaming Algorithms with Bounded Precision in It follows that for any p ≤ 2, the Turnstile Model (7.13) In this section, we consider the problems of previous n X q i i 2 sections in the 1-pass turnstile streaming model. In this F2 (`p )(Aj,p − A) = ||(Aq j,p ) − A ||p model, coordinates of points may arrive in an arbitrary i=1 n order and undergo multiple updates. We shall assume X i i 2 ≥ ||(Aq that matrices and vectors are represented with bounded j,p ) − A ||2 i=1 precision, that is, their entries are integers between −∆ n X and ∆, where ∆ ≥ (nd)B , and B > 1 is a constant. We ≥ ||(Aj )i − Ai ||22 also assume that n ≥ d. i=1 The rank of a matrix A is denoted by rank(A). ≥ 1/∆5j , The best rank-j approximation of A is the matrix Aj such that kA − Aj kF is minimized over every matrix of On the other hand, if p > 2, rank at most j, and k·kF is the Frobenius norm (sum of n squares of the entries). Recall that using the singular X q i i 2 F2 (`p )(Aj,p − A) = ||(Aq value decomposition every matrix A can be expressed as j,p ) − A ||p T i=1 UΣV , where the columns of U and V are orthonormal, !2 n i i and Σ is a diagonal matrix with the singular values along X ||(Aq j,p ) − A ||2 ≥ the diagonal (which are all positive). d(p−2)/(2p) ) i=1 n L EMMA 7.1. ([7]) Let A be an n × d integer matrix X ||(Aj )i − Ai ||22 represented with bounded precision. Then for every w, ≥ d(p−2)/p 1 ≤ w ≤ rank(A), the w-th largest singular value of A is i=1 at least 1/∆5(w−1)/2 . 1 ≥ 5j+1 , ∆ C OROLLARY 7.1. Let A be an n × d integer matrix represented with bounded precision. Let Aj be the best where we use ∆ ≥ nd. Hence, in either case, 1/2 rank-j approximation of A. If rank(A) ≥ j + 1 then F2 (`p )(Aq ≥ 1/∆5j/2+1/2 . ||A − Aj ||F ≥ 1/∆5j/2 . j,p − A) 15 Again, appealing to the right part of (7.13), if q ≤ 2, then 1/q 1/2 q Fq (`p )(Aq − A) ≥ F (` )(A − A) 2 p j,p j,p ≥ L EMMA 7.2. Suppose A is an n × d matrix, and b is an n × 1 column vector with bounded precision. Then, min ||Ax − b||p ≤ x∈Rd 1 ∆5j/2+1/2 . x∈Rd Fq (`p )(Aq j,p − A) ||Ax − b||p ≤ min ||Ax − b||p + γ. If instead q > 2, min x∈Gp,γ,d,∆Θ(d) 1/q ≥ ≥ − A) 1/2 n(q−2)/(2q) ≥ F2 (`p )(Aq j,p F2 (`p )(Aq j,p − A) 1/2 Proof. Let x∗ = argminx∈Rd ||Ax − b||p . We first argue that the entries of x∗ cannot be too large. We can suppose x∗ 6= 0d , as otherwise the entries are all bounded in absolute value by 0. 
By the triangle inequality, n1/2 1 ∆5j/2+1 using that ∆ ≥ nd. Hence, in all cases, ||Ax∗ − b||p + ||b||p ≥ ||Ax∗ ||p . , Now, ||Ax∗ − b||p + ||b||p ≤ 2||b||p ≤ 2n∆. 5jq/2+q Fq (`p )(Aq = 1/∆Θ(j) . j,p − A) ≥ 1/∆ Also, ||Ax∗ ||p ≥ ||Ax∗ ||2 . Since x∗ 6= 0d , it holds that ||Ax∗ ||2 ≥ σr ||x∗ ||2 , where r = rank(A) and σr is the In the remainder of the section, we shall assume that smallest singular value of A. Hence, p and q are in the interval [1, 2]. It is known [5, 6] that ||x∗ ||2 ≤ 2n∆/σr . estimating ||x||r for any r > 2 requires polynomial space in the turnstile model, so this assumption is needed. Also, By Lemma 7.1, σr ≥ ∆−5d/2 , and so p, q ≥ 1 in order to be norms. We start by solving approximate linear regression. ||x∗ ||∞ ≤ ||x∗ ||2 ≤ 2n∆ · ∆5d/2 ≤ ∆6d . We use this to efficiently solve distance to subspace approximation. Finally, we show how to efficiently (1 + )- Put G = Gp,γ,d,∆6d . Then approximate the best rank-j approximation via a certain min ||Ax − b||p ≤ min ||Ax − b||p x∈G discretization of subspaces. Our space is significantly less x∈Rd than the input matrix description O(nd log(nd)). ≤ max min ||Ax + Ay − b||p t u♣ y∈{0,γ/(d1+1/p ∆)}d x∈G 7.1 Approximate Linear Regression ≤ min ||Ax − b||p + x∈G max y∈{0,γ/(d1+1/p ∆)}d ||Ay||p D EFINITION 7.1. (A PPROXIMATE LINEAR REGRESSION ) ≤ min ||Ax − b||p + (d(d∆γ/(d1+1/p ∆)p )1/p Let A be an n × d matrix, and b be an n × 1 vector. x∈G Assume that A and b are represented with bounded ≤ min ||Ax − b||p + γ. x∈G precision, and given in a stream. The approximate linear 0 d regression problem is to output a vector x ∈ R so that t u♣ with probability at least 2/3, We use the following sketching result (see also [17, 23]). ||Ax 0 − b||p − min ||Ax − b||p x∈Rd T HEOREM 7.1. ([19]) For 1 ≤ p ≤ 2, if one chooses the entries of an (log 1/δ)/2 × n matrix S with en≤ min ||Ax − b||p . d x∈R tries that are p-stable O(log 1/)-wise independent rand dom variables rounded to the nearest integer multiple Let Gp,γ,d,Z be the d-dimensional grid in R −2 = (nd)−2B and bounded in absolute value by of all points whose entries are integer multiples of of ∆ 2 2B n γ/(d1+1/p ∆), and bounded in absolute value by Z. We ∆ = (nd) , then for any fixed x ∈ R , with integer show that if we restrict the solution space to the grid entries bounded in absolute value by ∆, there is an effiGp,γ,d,∆Θ(d) , then we can minimize ||Ax − b||p up to cient algorithm A which, given Sx, outputs a (1 ± /3)a small additive error γ. The proof follows by bounding approximation to ||x||p with probability at least 1 − δ. The algorithm can be implemented in O(log(nd) log 1/δ)/2 the entries of an optimal solution x∗ ∈ Rd , where bits of space. Moreover, it can be assumed to output an x∗ = argminx∈Rd ||Ax − b||p . implicit representation of S. T HEOREM 7.2. There is a 1-pass algorithm, which given Moreover, for ≤ 1, (1 + /3)/3 ≤ 2/3. Hence, the entries of A and b in a stream, solves the Approximate min ||Ax − b||p ≤ A(SAx 0 − Sb) 1− Linear Regression Problem with O(d3 log2 (nd))/2 bits 3 x∈Rd Θ(d2 ) of space and ∆ time (we will only invoke this later ≤ (1 + ) min ||Ax − b||p . as a subroutine for small values of d). x∈Rd Proof. [of Theorem 7.2] We first consider the case that b By a union bound, the algorithms succeed with probis in the columnspace of A. In this case, we have ability ≥ 3/4 − 1/12 = 2/3. The time complexity is dominated by enumerating grid points, which can be 0 = min ||Ax − b||p = min ||Ax − b||2 . 2 2 d d done in ∆Θ(d ) = (nd)Θ(j ) time. 
We use the following sketching result (see also [17, 23]).

THEOREM 7.1. ([19]) For 1 ≤ p ≤ 2, suppose one chooses the entries of an O(ε^{−2} log(1/δ)) × n matrix S to be p-stable, O(log(1/ε))-wise independent random variables, rounded to the nearest integer multiple of ∆^{−2} = (nd)^{−2B} and bounded in absolute value by ∆^2 = (nd)^{2B}. Then for any fixed x ∈ R^n with integer entries bounded in absolute value by ∆, there is an efficient algorithm A which, given Sx, outputs a (1 ± ε/3)-approximation to ||x||_p with probability at least 1 − δ. The algorithm can be implemented with O(ε^{−2} log(nd) log(1/δ)) bits of space. Moreover, it can be assumed to output an implicit representation of S.

THEOREM 7.2. There is a 1-pass algorithm which, given the entries of A and b in a stream, solves the Approximate Linear Regression Problem with O(d³ log²(nd)/ε²) bits of space and ∆^{Θ(d²)} time (we will only invoke it later as a subroutine for small values of d).

Proof. [of Theorem 7.2] We first consider the case that b is in the columnspace of A. In this case we have

  0 = min_{x∈R^d} ||Ax − b||_p = min_{x∈R^d} ||Ax − b||_2.

Let y be such that Ay = b. In [24] it is shown how to recover y with O(d² log(nd)) bits of space with probability at least 11/12, and to simultaneously report that b is in the columnspace of A.

We now consider the case that b is not in the columnspace of A, and we seek to lower bound min_{x∈R^d} ||Ax − b||_p. Consider the n × (d+1) matrix A' whose columns are the columns a_1, ..., a_d of A adjoined to b, and consider any n × (d+1) matrix T whose columns are in the columnspace of A. Then

  min_{x∈R^d} ||Ax − b||_p = min_T ||T − A'||_p.

Since b is not in the columnspace of A, rank(A') = rank(A) + 1 > rank(T). By Corollary 7.2, it follows that

  min_{x∈R^d} ||Ax − b||_p ≥ 1/∆^{Θ(d)}.

Put γ = ε/(3∆^{Θ(d)}), and let G = G_{p,γ,d,∆^{Θ(d)}} as defined above, so that

  |G| ≤ (3d^{1+1/p}∆^{Θ(d)}/ε)^d.

For 1 ≤ p ≤ 2, let S be a random O(ε^{−2} log(1/δ')) × n matrix as in Theorem 7.1, where δ' = Θ(1/|G|). The algorithm maintains S·A and S·b in the data stream, which can be done with O(d³ log²(nd)/ε²) bits of space. Let A be the efficient algorithm of Theorem 7.1. Then, for a sufficiently small δ', with probability at least 3/4, for every x ∈ G,

(7.14)  | A(SAx − Sb) − ||Ax − b||_p | ≤ (ε/3) · ||Ax − b||_p.

By Lemma 7.2, there is an x' ∈ G for which

  min_{x∈R^d} ||Ax − b||_p ≤ ||Ax' − b||_p ≤ min_{x∈R^d} ||Ax − b||_p + ε/(3∆^{Θ(d)}).

Moreover, by (7.14),

  (1 − ε/3) · ||Ax' − b||_p ≤ A(SAx' − Sb) ≤ (1 + ε/3) · ||Ax' − b||_p.

Since min_{x∈R^d} ||Ax − b||_p ≥ 1/∆^{Θ(d)}, the additive term ε/(3∆^{Θ(d)}) is at most (ε/3) · min_{x∈R^d} ||Ax − b||_p, and for ε ≤ 1 we have (1 + ε/3)(ε/3) ≤ 2ε/3. Hence

  (1 − ε/3) · min_{x∈R^d} ||Ax − b||_p ≤ A(SAx' − Sb) ≤ (1 + ε) · min_{x∈R^d} ||Ax − b||_p,

so the algorithm can output the grid point minimizing the estimate A(SAx − Sb) over x ∈ G. By a union bound, the algorithms succeed with probability at least 3/4 − 1/12 = 2/3. The time complexity is dominated by enumerating the grid points, which can be done in ∆^{Θ(d²)} time. This assumes that ε > ∆^{−Θ(d)}; when ε ≤ ∆^{−Θ(d)} the problem reduces to exact computation. The theorem follows. □
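For p = 1, the sketching step behind Theorems 7.1 and 7.2 can be illustrated with dense i.i.d. standard Cauchy (1-stable) entries and the classical median estimator, in place of the rounded, O(log(1/ε))-wise independent entries and the estimation procedure of [19]. The sizes, the candidate set, and all names below are illustrative assumptions only.

```python
import numpy as np

def cauchy_sketch(n, m, rng):
    """m x n sketch with i.i.d. standard Cauchy (1-stable) entries."""
    return rng.standard_cauchy(size=(m, n))

def estimate_l1(sketched_vec):
    """median(|S v|) estimates ||v||_1, since the entries of S v are
    distributed as ||v||_1 times a standard Cauchy, whose absolute value
    has median 1."""
    return np.median(np.abs(sketched_vec))

rng = np.random.default_rng(2)
n, d, m = 500, 3, 400                  # m plays the role of O(eps^-2 log 1/delta)
A = rng.integers(-5, 6, size=(n, d)).astype(float)
b = rng.integers(-5, 6, size=n).astype(float)

S = cauchy_sketch(n, m, rng)
SA, Sb = S @ A, S @ b                  # maintainable by linearity under updates

# Offline: evaluate candidate solutions x using only the sketches.
candidates = [rng.normal(size=d) for _ in range(10)]   # stand-in for the grid G
for x in candidates:
    est = estimate_l1(SA @ x - Sb)     # estimate of ||Ax - b||_1
    true = np.linalg.norm(A @ x - b, ord=1)
    print(f"true {true:10.2f}   sketch estimate {est:10.2f}")
```

Because the sketch is linear, SA and Sb can be maintained under arbitrary turnstile updates to individual entries of A and b, which is exactly what the streaming algorithm exploits.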
7.2 Distance to Subspace Approximation

Given an n × d integer matrix A in a stream with bounded precision, we consider the problem of maintaining a sketch of A so that, from the sketch, for any subspace F of R^d of dimension j, represented by a j × d matrix of bounded precision, with probability at least 2/3 one can output a (1 ± ε)-approximation to

  F_q(ℓ_p)(proj_F(A) − A),

where proj_F(A) is the projection of A onto F.

THEOREM 7.3. For p, q ∈ [1, 2], there is a 1-pass algorithm which solves the Distance to Subspace Approximation Problem with O(nj³ log³(nd)/ε²) bits of space and ∆^{O(j²)} time. If p = 2, this can be improved to O(nj log(nd)/ε²) bits of space and poly(j log(nd)/ε) time.

Proof. Set δ = 1/(3n). We sketch each of the n rows of A as in the algorithm of Theorem 7.2. That algorithm also outputs a representation of its sketching matrix S. In the offline phase, we are given F, and we compute F · S. We independently solve the ℓ_p-regression problem with matrix F and each of the n rows of A; for each row A^i, we approximate min_{x∈R^j} ||xF − A^i||_p. By a union bound, from the estimated costs for the rows we obtain a (1 ± ε)-approximation to F_q(ℓ_p)(proj_F(A) − A). The value of d in the invocation of Theorem 7.2 is j. Using results in [7], for p = 2 this can be somewhat improved. □

7.3 Best Rank-j Approximation

Given an n × d matrix A with bounded precision in a stream, we consider the problem of maintaining a sketch of A so that one can (1 + ε)-approximate the value F_q(ℓ_p)(A^q_{j,p} − A) with probability at least 2/3. We shall only consider p = 2. We first give a 1-pass algorithm that is near-optimal in space but has poor running time, using the sketches of [18]. We then improve this to achieve polynomial running time. Note that the case (q, p) = (2, 2) was solved in [7].

For the time-inefficient solution, the work of [18] gives a sketch SA of the n × d input matrix A so that F_q(ℓ_2)(A) (recall q ≤ 2) can be estimated to within (1 + ε) with probability 1 − δ using poly(log(nd)/ε) · log(1/δ) bits of space, where all entries are integer multiples of 1/∆ and bounded in magnitude by ∆. In the offline phase, we enumerate all rank-j matrices F with a certain precision. In Lemma 7.3 we show that we can consider only ∆^{O(j²(d+n))} different F. We compute SF − SA for each F by linearity, and we choose the F minimizing the estimate. Setting δ = Θ(∆^{−O(j²(d+n))}), we obtain a 1-pass algorithm using (n + d) · poly(log(nd)/ε) bits of space, though its time complexity is poor.

LEMMA 7.3. There is a set of ∆^{O(j²(d+n))} matrices F containing a (1 + ε)-approximation to the best rank-j approximation to A.

Proof. By Corollary 7.2, we can restrict to F with entries that are integer multiples of ∆^{−Θ(j)} and bounded in magnitude by poly(∆). Choose a subset of j rows of F to be linearly independent. There are at most n^j choices of these rows and ∆^{O(dj²)} assignments to them. These rows contain j linearly independent columns, and once we fix the values of the other rows on these columns, the other rows are determined, since each remaining row is a linear combination of the chosen rows and its values on the j independent columns determine the coefficients. In total, the number of F is ∆^{O(j²(d+n))}. □

We now show how to achieve polynomial time complexity. We need a theorem of Shyamalkumar and Varadarajan (stated here for subspaces instead of flats).

THEOREM 7.4. ([25]) Let A be an n × d matrix. There is a subset Q of O((j/ε) log(1/ε)) rows of A so that span(Q) contains a j-dimensional subspace F with

  F_q(ℓ_2)(proj_F(A) − A) ≤ (1 + ε) · F_q(ℓ_2)(A^q_{j,2} − A).
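The quantity controlled by Theorem 7.4 is the projection cost F_q(ℓ_2)(proj_F(A) − A) for a subspace F spanned by a few rows of A. The helper below computes this cost exactly for a given row basis and, on a tiny instance, brute-forces over the spans of j input rows; this is a simplification (the theorem allows any j-subspace inside the span of O((j/ε) log(1/ε)) rows), and the instance sizes and the comparison with the SVD subspace are illustrative assumptions only.

```python
import itertools
import numpy as np

def fq_l2_proj_cost(A, F_rows, q):
    """F_q(l2) cost of projecting the rows of A onto the subspace spanned by
    the rows of F_rows, i.e. sum_i ||proj_F(A^i) - A^i||_2^q.
    Assumes the rows of F_rows are linearly independent."""
    Q, _ = np.linalg.qr(F_rows.T)          # d x dim(F), orthonormal columns
    residual = A - (A @ Q) @ Q.T           # each row minus its projection
    return float(np.sum(np.linalg.norm(residual, axis=1) ** q))

rng = np.random.default_rng(3)
n, d, j, q = 8, 5, 2, 1                    # tiny, brute-force-able instance
A = rng.integers(-4, 5, size=(n, d)).astype(float)

# Brute force over the spans of all j-subsets of input rows.
best_row_span = min(
    fq_l2_proj_cost(A, A[list(idx)], q)
    for idx in itertools.combinations(range(n), j)
)

# Reference point: the j-subspace given by the SVD (optimal for q = 2 only).
_, _, Vt = np.linalg.svd(A)
svd_cost = fq_l2_proj_cost(A, Vt[:j], q)
print(f"best span of {j} input rows: {best_row_span:.3f}   SVD subspace: {svd_cost:.3f}")
```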
Let r = O((j/ε) log(1/ε)). Theorem 7.4 says that, given a matrix A, there is a j × r matrix B and an r × n subset-matrix C (i.e., a binary matrix with one 1 in each row and at most one 1 in each column) with

  F_q(ℓ_2)(proj_{B·C·A}(A) − A) ≤ (1 + ε) · F_q(ℓ_2)(A^q_{j,2} − A).

We enumerate all possible C in n^r time, which is polynomial for j/ε = O(1), but we need a technical lemma to discretize the possible B.

LEMMA 7.4. Suppose rank(A) > j. Then there is a discrete set of ∆^{O(j³ log²(1/ε)/ε²)} different B, each with entries that are integer multiples of ∆^{−Θ(j)} and bounded in magnitude by ∆^{O((j/ε) log(1/ε))}, so that for every A there is a B in the set and a subset-matrix C with

  F_q(ℓ_2)(proj_{B·C·A}(A) − A) ≤ (1 + ε) · F_q(ℓ_2)(A^q_{j,2} − A).

Proof. We can assume rank(A) ≥ r > j. If rank(A) < r but rank(A) > j, we can repeat the process below for each value ℓ between j and r, replacing r in the analysis with ℓ, and take the union of the resulting sets of matrices; this multiplies the number of sets by a negligible factor of r.

From Theorem 7.4, there is a j × r matrix B for which BCA has orthonormal rows and for which

  F_q(ℓ_2)(proj_{B·C·A}(A) − A) ≤ (1 + ε) · F_q(ℓ_2)(A^q_{j,2} − A).

There may be multiple such B; for fixed C and A we let B be the matrix that minimizes F_q(ℓ_2)(proj_{B·C·A}(A) − A). Note that one can find such a B, w.l.o.g., since rank(A) > r. Furthermore, we can assume CA has full row rank since rank(A) ≥ r; if CA does not have this property, there is some C'A with this property whose rowspace contains the rowspace of CA, so this is without loss of generality.

Fix any row A^i of A, and consider the y minimizing min_{y∈R^j} ||yBCA − A^i||_2. It is well known that

  y = A^i (BCA)^T [ (BCA)(BCA)^T ]^{−1},

but since BCA has orthonormal rows, y = A^i (BCA)^T. For conforming matrices we have ||y||_2 ≤ ||A^i||_2 · ||BCA||_F. Since BCA has orthonormal rows, ||BCA||_F = √j. Moreover, ||A^i||_2 ≤ √d · ∆. It follows that

  ||y||_∞ ≤ ||y||_2 ≤ √(dj) · ∆ ≤ ∆².

Consider the expression ||yBH − a||_2, where a = A^i for some i and H = CA. The entries of both a and H are integers bounded in magnitude by ∆. Let the r rows of H be H_1, ..., H_r, so that the j rows of BH are

  Σ_{ℓ=1}^r B_{1,ℓ} H_ℓ,  Σ_{ℓ=1}^r B_{2,ℓ} H_ℓ,  ...,  Σ_{ℓ=1}^r B_{j,ℓ} H_ℓ.

Then the expression ||yBH − a||_2² has the form

(7.15)  Σ_{v=1}^d ( a_v − Σ_{u=1}^j Σ_{ℓ=1}^r y_u B_{u,ℓ} H_{ℓ,v} )².

Notice that |y_u H_{ℓ,v}| ≤ ∆³ for every u, ℓ, and v. It follows that if we replace B with the matrix B', in which the entries are rounded down to the nearest multiple of ∆^{−cj} for a constant c > 0, then a routine calculation shows that expression (7.15) changes by at most ∆^{−Θ(cj)}, where the constant in the Θ(·) does not depend on c. As this was for one particular row a = A^i, it follows by another routine calculation that

(7.16)  F_q(ℓ_2)(proj_{B·C·A}(A) − A) ≤ F_q(ℓ_2)(proj_{B'·C·A}(A) − A) ≤ F_q(ℓ_2)(proj_{B·C·A}(A) − A) + ∆^{−Θ(cj)}.

We would like to argue that the right-hand side of (7.16) can be turned into a relative error. For this we appeal to Corollary 7.2, which shows that if rank(A) ≥ j + 1, the error incurred must be at least ∆^{−Θ(j)}. Since ε can be assumed to be at least ∆^{−Θ(j)}, as otherwise the problem reduces to exact computation, it follows that if rank(A) ≥ j + 1, then for a large enough constant c > 0,

  F_q(ℓ_2)(proj_{B·C·A}(A) − A) ≤ F_q(ℓ_2)(proj_{B'·C·A}(A) − A) ≤ (1 + ε) · F_q(ℓ_2)(proj_{B·C·A}(A) − A).

It remains to bound the number of different B'. Observe that ||B'||_F ≤ ||B||_F. Now, B = B(CA)(CA)^−, where M^− denotes the Moore-Penrose inverse of a matrix M (that is, if we write M = UΣV^T using the singular value decomposition, then M^− = VΣ^{−1}U^T); here we use the fact that CA has full row rank. Hence,

  ||B||_F ≤ ||BCA||_F · ||(CA)^−||_F = √j · ||(CA)^−||_F,

since the rows of BCA are orthonormal. Let e = rank(CA). Now,

  ||(CA)^−||_F² = Σ_{i=1}^e σ_i^{−2},

where σ_1 ≥ σ_2 ≥ ··· ≥ σ_e > 0 are the non-zero singular values of CA. Since CA is an integer matrix with entries bounded in magnitude by ∆, by Lemma 7.1 all of its non-zero singular values are at least 1/∆^{Θ(e)}, and thus ||(CA)^−||_F² ≤ e · ∆^{Θ(e)} ≤ ∆^{O((j/ε) log(1/ε))}. In summary, ||B'||_F ≤ ||B||_F ≤ ∆^{O((j/ε) log(1/ε))}. As B' contains entries that are integer multiples of ∆^{−cj}, the number of different values of an entry of B' is ∆^{O((j/ε) log(1/ε))}. Since B' is a j × r matrix, where r = O((j/ε) log(1/ε)), it follows that the number of different B' is ∆^{O(j³ log²(1/ε)/ε²)}, which completes the proof. □

We sketch each row A^i of A independently, treating it as the vector b in Theorem 7.2 with the d there equal to the j here, thereby obtaining A^i S for a sketching matrix S and each i ∈ [n]. Offline, we guess each of the n^r · ∆^{O(j³ log²(1/ε)/ε²)} matrix products BC, and by linearity compute BCAS. We can then (1 ± ε)-approximate

  min_{x∈R^j} ||xBCA − A^i||_p

for each i, B, C, provided S is a d × O(j³ log²(1/ε)/ε²) matrix. Finally, by Theorem 7.1, we obtain:

THEOREM 7.5. There is a 1-pass algorithm for Best Rank-j Approximation using O(nj⁴ log(nd) log³(1/ε)/ε⁵) bits of space and ∆^{poly(j/ε)} time. The algorithm also obtains the n × j basis representation of the rows of A in BC for the choice of B and C resulting in a (1 + ε)-approximation. In another pass we can obtain the subspace BCA using O(jd log(nd)) additional bits of space.
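As a rough illustration of the two phases behind Theorem 7.5, the sketch below maintains one linear (Gaussian, i.e. 2-stable) sketch per row and then, offline, scores a small family of candidate subspaces from the sketches alone, keeping the cheapest. The candidate generator, the sketch dimension, and the instance sizes are hypothetical stand-ins, not the construction analyzed above.

```python
import itertools
import numpy as np

def candidate_subspaces(guessed_rows, j):
    """Stand-in for the enumeration in Theorem 7.5: instead of all pairs
    (B, C) with B from the discretization of Lemma 7.4, we simply try the
    spans of all j-subsets of a few guessed rows."""
    r = guessed_rows.shape[0]
    for idx in itertools.combinations(range(r), j):
        yield guessed_rows[list(idx)]            # a j x d row basis

rng = np.random.default_rng(4)
n, d, j, m, q = 40, 10, 2, 60, 1
A = rng.integers(-5, 6, size=(n, d)).astype(float)

# Streaming phase: one shared linear (Gaussian, 2-stable) sketch per row.
S = rng.normal(size=(m, d)) / np.sqrt(m)         # stand-in for the sketch of Theorem 7.1
SA = A @ S.T                                     # n x m, maintainable under turnstile updates

# Offline phase: score each candidate subspace from the sketches alone by
# sketch-and-solve least squares per row, and keep the cheapest candidate.
best_cost, best_F = np.inf, None
for F in candidate_subspaces(A[:6], j):          # the first 6 rows play the role of CA
    SF = F @ S.T                                 # j x m sketched basis
    cost = 0.0
    for i in range(n):
        x, *_ = np.linalg.lstsq(SF.T, SA[i], rcond=None)
        cost += np.linalg.norm(SA[i] - SF.T @ x) ** q
    if cost < best_cost:
        best_cost, best_F = cost, F
print(f"estimated F_q(l2) cost of the chosen subspace: {best_cost:.3f}")
```

In the actual algorithm, the per-row minimization is carried out over the grid of Theorem 7.2 using the p-stable estimator, and the candidate family comes from enumerating the subset-matrices C together with the discretized matrices B' of Lemma 7.4.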
References

[1] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures of points. J. ACM, 51(4), pp. 606–635, 2004.
[2] N. Alon and J. Spencer. The Probabilistic Method. J. Wiley & Sons, New York, 2nd edition, 2000.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), pp. 1373–1396, 2003.
[4] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning Journal, 56, 2004.
[5] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. In Proc. 43rd Annu. IEEE Symp. Found. Comput. Sci. (FOCS), pp. 209–218, 2002.
[6] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In Proc. 18th Ann. IEEE Conf. on Computational Complexity (CCC), pp. 107–117, 2003.
[7] K. Clarkson and D. Woodruff. Numerical linear algebra in the streaming model. In Proc. 41st Annu. ACM Symp. Theory Comput. (STOC), 2009.
[8] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM J. Comput., 38(5), pp. 2060–2078, 2009.
[9] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2(12), pp. 225–247, 2006.
[10] A. Deshpande and K. Varadarajan. Sampling-based dimension reduction for subspace approximation. In Proc. 39th Annu. ACM Symp. Theory Comput. (STOC), 2007.
[11] D. Feldman, A. Fiat, and M. Sharir. Coresets for weighted facilities and their applications. In Proc. 47th Annu. IEEE Symp. Found. Comput. Sci. (FOCS), 2006.
[12] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM, 32(2), pp. 269–288, 2004.
[13] D. Feldman and M. Landberg. Algorithms for approximating subspaces by subspaces. 2007.
[14] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In Proc. 23rd Annu. ACM Sympos. Comput. Geom. (SoCG), pp. 11–18, 2007.
[15] S. Har-Peled. How to get close to the median shape. CGTA, 36(1), pp. 39–51, 2007.
[16] S. Har-Peled and S. Mazumdar. Coresets for k-means and k-median clustering and their applications. In Proc. 36th Annu. ACM Sympos. Theory Comput. (STOC), pp. 291–300, 2004.
[17] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM, 53(3), pp. 307–323, 2006.
[18] T. S. Jayram and D. Woodruff. The data stream space complexity of cascaded norms. In Proc. 50th Annu. IEEE Symp. Found. Comput. Sci. (FOCS), 2009.
[19] D. Kane, J. Nelson, and D. Woodruff. On the exact space complexity of sketching and streaming small norms. In Proc. 21st ACM-SIAM Symp. Discrete Algorithms (SODA), 2010.
[20] J. Kleinberg and M. Sandler. Using mixture models for collaborative filtering. In Proc. 36th Annu. ACM Sympos. Theory Comput. (STOC), pp. 569–578, 2004.
[21] D. Kuzmin and M. K. Warmuth. Online kernel PCA with entropic matrix updates. In Proc. 24th Intl. Conf. on Machine Learning (ICML), pp. 465–472, 2007.
[22] A. Lakhina, M. Crovella, and C. Diot. Characterization of network-wide anomalies in traffic flows. In Proc. 4th ACM SIGCOMM Conf. on Internet Measurement, pp. 201–206, 2004.
[23] P. Li. Estimators and tail bounds for dimension reduction in ℓα (0 < α ≤ 2) using stable random projections. In Proc. ACM-SIAM Symp. Discrete Algorithms (SODA), pp. 10–19, 2008.
[24] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In Proc. 47th Annu. IEEE Symp. Found. Comput. Sci. (FOCS), pp. 143–152, 2006.
[25] N. D. Shyamalkumar and K. Varadarajan. Efficient subspace approximation algorithms. In Proc. 18th ACM-SIAM Symp. Discrete Algorithms (SODA), pp. 532–540, 2007.
[26] M. Vose. A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Software Eng., 17(9), pp. 972–975, 1991.