Lower Bounds for Sparse Recovery∗

Khanh Do Ba (MIT CSAIL), Piotr Indyk (MIT CSAIL), Eric Price (MIT CSAIL), David P. Woodruff (IBM Almaden)

∗ This research has been supported in part by the David and Lucille Packard Fellowship, MADALGO (Center for Massive Data Algorithmics, funded by the Danish National Research Foundation) and NSF grant CCF-0728645. E. Price has been supported in part by a Cisco Fellowship.

Abstract

We consider the following k-sparse recovery problem: design an m × n matrix A, such that for any signal x, given Ax we can efficiently recover x̂ satisfying ‖x − x̂‖_1 ≤ C min_{k-sparse x′} ‖x − x′‖_1. It is known that there exist matrices A with this property that have only O(k log(n/k)) rows. In this paper we show that this bound is tight. Our bound holds even for the more general randomized version of the problem, where A is a random variable, and the recovery algorithm is required to work for any fixed x with constant probability (over A).

1 Introduction

In recent years, a new "linear" approach for obtaining a succinct approximate representation of n-dimensional vectors (or signals) has been discovered. For any signal x, the representation is equal to Ax, where A is an m × n matrix. The vector Ax is often referred to as the measurement vector or sketch of x. Although m is typically much smaller than n, the sketch Ax contains plenty of useful information about the signal x.

A particularly useful and well-studied problem is that of stable sparse recovery: given Ax, recover a k-sparse vector x̂ (i.e., one having at most k non-zero components) such that

(1.1)    ‖x − x̂‖_p ≤ C · min_{k-sparse x′} ‖x − x′‖_q

for some norm parameters p and q and an approximation factor C = C(k). Sparse recovery has applications to numerous areas such as data stream computing [Mut03, Ind07] and compressed sensing [CRT06, Don06, DDT+08].

It is known that there exist matrices A and associated recovery algorithms that produce approximations x̂ satisfying Equation (1.1) with p = q = 1 (i.e., the "ℓ1/ℓ1 guarantee"), constant C, and sketch length m = O(k log(n/k)). In particular, a random Gaussian matrix [CRT06]¹ or a random sparse binary matrix ([BGI+08], building on [CCFC04, CM05]) has this property with overwhelming probability. In comparison, using a non-linear approach, one can obtain a shorter sketch of length O(k): it suffices to store the k coefficients with the largest absolute values, together with their indices.

¹ In fact, they even achieve a somewhat stronger ℓ2/ℓ1 guarantee; see Section 1.2.

Surprisingly, it was not known if the O(k log(n/k)) bound for linear sketching could be improved upon², although an O(k) sketch length was known to suffice if the signal vectors x are required to be exactly k-sparse. This raised hope that the O(k) bound might be achievable even for general vectors x. Such a scheme would have been of major practical interest, since the sketch length determines the compression ratio, and for large n any extra log n factor worsens that ratio tenfold.

² The lower bound of Ω(k log(n/k)) was known to hold for specific recovery algorithms, specific matrix types, or other recovery scenarios. See Section 1.2 for an overview.

In this paper we show that, unfortunately, such an improvement is not possible. We address two types of recovery schemes:

• A deterministic one, which involves a fixed matrix A and a recovery algorithm that work for all signals x. The aforementioned results of [CRT06] and others are examples of such schemes.

• A randomized one, where the matrix A is chosen at random from some distribution, and for each signal x the recovery procedure is correct with constant probability. Some of the early schemes proposed in the data stream literature (e.g., [CCFC04, CM05]) belong to this category.

Our main result is that, even in the randomized case, the sketch length m must be at least Ω(k log(n/k)). By the aforementioned result of [CRT06] this bound is tight.
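For concreteness, the non-linear O(k) baseline mentioned above is simply the top-k projection. The following minimal sketch (ours, not from the paper; the function name and values are illustrative) computes it and attains Equation (1.1) with p = q = 1 and C = 1.

    import numpy as np

    def top_k(x, k):
        """Best k-sparse approximation of x: keep the k largest-magnitude entries."""
        out = np.zeros_like(x)
        idx = np.argsort(-np.abs(x))[:k]
        out[idx] = x[idx]
        return out

    x = np.array([5.0, -0.1, 3.0, 0.2, 0.0, -4.0, 0.05, 0.3])
    x_hat = top_k(x, k=3)
    # ||x - x_hat||_1 equals the minimum over 3-sparse x' of ||x - x'||_1
    print(np.linalg.norm(x - x_hat, 1))

Storing x_hat takes only O(k) numbers (values plus indices), which is exactly the compression that, as shown below, no linear sketch can match.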
Thus, our results show that linear compression is inherently more costly than the simple non-linear approach.

1.1 Our techniques

On a high level, our approach is simple and natural, and utilizes the packing approach: we show that any two "sufficiently different" vectors x and x′ are mapped to images Ax and Ax′ that are "sufficiently different" themselves, which requires that the image space is "sufficiently" high-dimensional. However, the actual arguments are somewhat subtle.

Consider first the (simpler) deterministic case. We focus on signals x = y + z, where y can be thought of as the "head" of the signal and z as the "tail". The "head" vectors y come from a set Y that is a binary error-correcting code, with minimum distance Ω(k), where each codeword has weight k. On the other hand, the "tail" vectors z come from an ℓ1 ball (say B) with a radius that is a small fraction of k. It can be seen that for any two elements y, y′ ∈ Y, the balls y + B and y′ + B, as well as their images, must be disjoint. At the same time, since all vectors x live in a "large" ℓ1 ball B′ of radius O(k), all images Ax must live in the set AB′. The key observation is that the set AB′ is a scaled version of A(y + B), and therefore the ratio of their volumes can be bounded by the scaling factor to the power of the dimension m. Since the number of elements of Y is large, this gives a lower bound on m.

Unfortunately, the aforementioned approach does not seem to extend to the randomized case. A natural approach would be to use Yao's principle, and focus on showing a lower bound for a scenario where the matrix A is fixed while the vectors x = y + z are "random". However, this approach fails, in a very strong sense. Specifically, we are able to show that there is a distribution over matrices A with only O(k) rows so that for a fixed y ∈ Y and z chosen uniformly at random from the small ball B, we can recover y from A(y + z) with high probability. In a nutshell, the reason is that a random vector from B has an ℓ2 norm that is much smaller than the ℓ2 norm of elements of Y (even though the ℓ1 norms are comparable). This means that the vector x is "almost" k-sparse (in the ℓ2 norm), which enables us to achieve the O(k) measurement bound.

Instead, we resort to an altogether different approach, via communication complexity [KN97]. We start by considering a "discrete" scenario where both the matrix A and the vectors x have entries restricted to the polynomial range {−n^c, ..., n^c} for some c = O(1). In other words, we assume that the matrix and vector entries can be represented using O(log n) bits. In this setting we show the following: there is a method for encoding a sequence of d = O(k log(n/k) log n) bits into a vector x, so that any sparse recovery algorithm can recover that sequence given Ax. Since each entry of Ax conveys only O(log n) bits, it follows that the number m of rows of A must be Ω(k log(n/k)).

The encoding is performed by taking

    x = Σ_{j=1}^{log n} D^j x_j,

where D = O(1) and the x_j's are chosen from the error-correcting code Y defined as in the deterministic case. The intuition behind this approach is that a good ℓ1/ℓ1 approximation to x reveals most of the bits of x_{log n}. This enables us to identify x_{log n} exactly using error correction. We could then compute Ax − Ax_{log n} = A(Σ_{j=1}^{log n − 1} D^j x_j), and identify x_{log n − 1}, ..., x_1 in a recursive manner. The only obstacle to completing this argument is that we would need the recovery algorithm to work for all x_i, which would require a lower probability of algorithm failure (roughly 1/log n). To overcome this problem, we replace the encoding argument by a reduction from a related communication complexity problem called Augmented Indexing. This problem has been used in the data stream literature [CW09, KNW10] to prove lower bounds for linear algebra and norm estimation problems. Since the problem has communication complexity Ω(d), the conclusion follows.

We apply the argument to arbitrary matrices A by representing them as a sum A′ + A″, where A′ has O(log n) bits of precision and A″ has "small" entries. We then show that A′x = A(x + s) for some s with ‖s‖_1 < n^{−Ω(1)} ‖x‖_1. In the communication game, this means we can transmit A′x and recover x_{log n} from A′(Σ_{j=1}^{log n} D^j x_j) = A(Σ_{j=1}^{log n} D^j x_j + s). This means that the Augmented Indexing reduction applies to arbitrary matrices as well.
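To make the recursive-identification intuition concrete, here is a toy sketch (ours; an idealized, noiseless stand-in for the paper's argument, with a tiny codebook in place of Y) of the encoding x = Σ_j D^j x_j and the peel-off decoding: the top-scale term dominates the ℓ1 mass of the lower-order terms, so nearest-codeword rounding recovers the top block, which is then subtracted and the process repeated.

    import itertools
    import numpy as np

    def make_code(n, k):
        """Toy stand-in for Y: all k-sparse binary vectors of length n."""
        code = []
        for support in itertools.combinations(range(n), k):
            v = np.zeros(n)
            v[list(support)] = 1.0
            code.append(v)
        return code

    def encode(blocks, D):
        """blocks = [x_1, ..., x_L]; returns x = sum_j D^j x_j."""
        return sum(D**j * xj for j, xj in enumerate(blocks, start=1))

    def decode(x, code, D, L):
        """Peel off x_L, x_{L-1}, ..., x_1 by nearest-codeword rounding."""
        blocks = []
        for j in range(L, 0, -1):
            # x / D^j = x_j + (lower-order terms of small l1 norm)
            xj = min(code, key=lambda c: np.linalg.norm(x / D**j - c, 1))
            blocks.append(xj)
            x = x - D**j * xj
        return blocks[::-1]

    n, k, D, L = 8, 2, 7, 3
    code = make_code(n, k)
    rng = np.random.default_rng(0)
    blocks = [code[rng.integers(len(code))] for _ in range(L)]
    recovered = decode(encode(blocks, D), code, D, L)
    print(all((a == b).all() for a, b in zip(recovered, blocks)))   # True

In the paper the decoder only sees an ℓ1/ℓ1-approximate recovery of x from Ax, which is why the minimum distance Ω(k) of the code, rather than exactness, is what makes each rounding step succeed.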
1.2 Related Work

There have been a number of earlier works that have, directly or indirectly, shown lower bounds for various models of sparse recovery and certain classes of matrices and algorithms. Specifically, one of the most well-known recovery algorithms used in compressed sensing is ℓ1-minimization, where a signal x ∈ R^n measured by a matrix A is reconstructed as

    x̂ := arg min_{x′ : Ax′ = Ax} ‖x′‖_1.

Kashin and Temlyakov [KT07] gave a characterization of matrices A for which the above recovery algorithm yields the ℓ2/ℓ1 guarantee, i.e.,

    ‖x − x̂‖_2 ≤ C k^{−1/2} min_{k-sparse x′} ‖x − x′‖_1

for some constant C, from which it can be shown that such an A must have m = Ω(k log(n/k)) rows. Note that the ℓ2/ℓ1 guarantee is somewhat stronger than the ℓ1/ℓ1 guarantee investigated in this paper. Specifically, it is easy to observe that if the approximation x̂ itself is required to be O(k)-sparse, then the ℓ2/ℓ1 guarantee implies the ℓ1/ℓ1 guarantee (with a somewhat higher approximation constant). For the sake of simplicity, in this paper we focus mostly on the ℓ1/ℓ1 guarantee. However, our lower bounds apply to the ℓ2/ℓ1 guarantee as well; see the footnote in Section 5.3.
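For readers who want to experiment with the ℓ1-minimization decoder above, it can be written as a linear program. The sketch below (ours, not from [KT07]; the helper name and parameter values are illustrative) uses the standard reformulation with auxiliary variables t_i bounding |x_i|.

    import numpy as np
    from scipy.optimize import linprog

    def l1_min_decode(A, b):
        """Solve min ||x||_1 subject to A x = b via the standard LP reformulation."""
        m, n = A.shape
        c = np.concatenate([np.zeros(n), np.ones(n)])    # minimize sum(t)
        A_ub = np.block([[np.eye(n), -np.eye(n)],        #  x - t <= 0
                         [-np.eye(n), -np.eye(n)]])      # -x - t <= 0
        b_ub = np.zeros(2 * n)
        A_eq = np.hstack([A, np.zeros((m, n))])          # A x = b
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
                      bounds=[(None, None)] * n + [(0, None)] * n)
        return res.x[:n]

    rng = np.random.default_rng(1)
    n, m, k = 50, 25, 3
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    x_hat = l1_min_decode(A, A @ x)
    print(np.linalg.norm(x - x_hat, 1))    # near zero for an exactly k-sparse x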
On the other hand, instead of assuming a specific recovery algorithm, Wainwright [Wai07] assumes a specific (randomized) measurement matrix. More specifically, the author assumes a k-sparse binary signal x ∈ {0, α}^n, for some α > 0, to which is added i.i.d. standard Gaussian noise in each component. The author then shows that with a random Gaussian matrix A, with each entry also drawn i.i.d. from the standard Gaussian, we cannot hope to recover x from Ax with any sub-constant probability of error unless A has m = Ω((1/α²) log(n/k)) rows. The author also shows that for α = √(1/k) this is tight, i.e., that m = Θ(k log(n/k)) is both necessary and sufficient. Although this is only a lower bound for a specific (random) matrix, it is a fairly powerful one and provides evidence that the often observed upper bound of O(k log(n/k)) is likely tight.

More recently, Dai and Milenkovic [DM08], extending [EG88] and [FR99], showed an upper bound on superimposed codes that translates to a lower bound on the number of rows in a compressed sensing matrix that deals only with k-sparse signals but can tolerate measurement noise. Specifically, if we assume a k-sparse signal x ∈ ([−t, t] ∩ Z)^n, and that arbitrary noise µ ∈ R^n with ‖µ‖_1 < d is added to the measurement vector Ax, then if exact recovery is still possible, A must have had m ≥ C k log n / log k rows, for some constant C = C(t, d) and sufficiently large n and k.³

³ Here A is assumed to have its columns normalized to have ℓ1-norm 1. This is natural since otherwise we could simply scale A up to make the image points Ax arbitrarily far apart, effectively nullifying the noise.

2 Preliminaries

In this paper we focus on recovering sparse approximations x̂ that satisfy the following C-approximate ℓ1/ℓ1 guarantee with sparsity parameter k:

(2.2)    ‖x − x̂‖_1 ≤ C · min_{k-sparse x′} ‖x − x′‖_1.

We define a C-approximate deterministic ℓ1/ℓ1 recovery algorithm to be a pair (A, 𝒜) where A is an m × n observation matrix and 𝒜 is an algorithm that, for any x, maps Ax (called the sketch of x) to some x̂ that satisfies Equation (2.2).

We define a C-approximate randomized ℓ1/ℓ1 recovery algorithm to be a pair (A, 𝒜) where A is a random variable chosen from some distribution over m × n measurement matrices, and 𝒜 is an algorithm which, for any x, maps a pair (A, Ax) to some x̂ that satisfies Equation (2.2) with probability at least 3/4.

We use B_p^n(r) to denote the ℓ_p ball of radius r in R^n; we skip the superscript n if it is clear from the context. For any vector x, we use ‖x‖_0 to denote the "ℓ0 norm" of x, i.e., the number of non-zero entries of x.

3 Deterministic Lower Bound

We will prove a lower bound on m for any C-approximate deterministic recovery algorithm. First we use a discrete volume bound (Lemma 3.1) to find a large set Y of points that are at least k apart from each other. Then we use another volume bound (Lemma 3.2) on the images of small ℓ1 balls around each point in Y. If m is too small, some two images collide. But the recovery algorithm, applied to a point in the collision, must yield an answer close to two points in Y. This is impossible, so m must be large.

Lemma 3.1. (Gilbert-Varshamov) For any q, k ∈ Z⁺ and ε ∈ R⁺ with ε < 1 − 1/q, there exists a set Y ⊂ {0, 1}^{qk} of binary vectors with exactly k ones, such that Y has minimum Hamming distance 2εk and

    log |Y| > (1 − H_q(ε)) k log q,

where H_q is the q-ary entropy function H_q(x) = −x log_q(x/(q − 1)) − (1 − x) log_q(1 − x).

See the appendix for the proof.

Lemma 3.2. Take an m × n real matrix A, positive reals ε, p, λ, and Y ⊂ B_p^n(λ). If |Y| > (1 + 1/ε)^m, then there exist z, z′ ∈ B_p^n(ελ) and y, y′ ∈ Y with y ≠ y′ and A(y + z) = A(y′ + z′).

Proof. If the statement is false, then the images of all |Y| balls {y + B_p^n(ελ) | y ∈ Y} are disjoint. However, those balls all lie within B_p^n((1 + ε)λ), by the bound on the norm of Y. A volume argument gives the result, as follows.

Let S = A B_p^n(1) be the image of the n-dimensional ball of radius 1 in m-dimensional space. This is a polytope with some volume V. The image of B_p^n(ελ) is a linearly scaled S with volume (ελ)^m V, and similarly the image of B_p^n((1 + ε)λ) has volume ((1 + ε)λ)^m V. If the images of the former are all disjoint and lie inside the latter, we have |Y| (ελ)^m V ≤ ((1 + ε)λ)^m V, or |Y| ≤ (1 + 1/ε)^m. If Y has more elements than this, the images of some two balls y + B_p^n(ελ) and y′ + B_p^n(ελ) must intersect, implying the lemma. □
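As a quick illustration of the codebooks promised by Lemma 3.1 (and used in Theorem 3.1 below), the following toy sketch (ours; a greedy construction rather than the counting argument given in the appendix) builds a set Y of weight-k binary vectors with pairwise Hamming distance at least k.

    import itertools
    import numpy as np

    def greedy_codebook(n, k):
        """Greedily collect weight-k binary vectors with pairwise Hamming distance >= k."""
        Y = []
        for support in itertools.combinations(range(n), k):
            v = np.zeros(n, dtype=int)
            v[list(support)] = 1
            # Hamming distance between weight-k vectors is 2*(k - |support overlap|)
            if all(np.sum(np.abs(v - y)) >= k for y in Y):
                Y.append(v)
        return Y

    n, k = 16, 4
    Y = greedy_codebook(n, k)
    # Lemma 3.1 (with q = n/k, eps = 1/2) guarantees codebooks of size (n/k)^{Omega(k)};
    # the greedy set here is just one concrete valid Y.
    print(len(Y))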
Theorem 3.1. Any C-approximate deterministic recovery algorithm must have

    m ≥ [(1 − H_{⌊n/k⌋}(1/2)) / log(4 + 2C)] · k log⌊n/k⌋.

Proof. Let Y be a maximal set of k-sparse n-dimensional binary vectors with minimum Hamming distance k, and let γ = 1/(3 + 2C). By Lemma 3.1 with q = ⌊n/k⌋ and ε = 1/2 we have log |Y| > (1 − H_{⌊n/k⌋}(1/2)) k log⌊n/k⌋.

Suppose that the theorem is not true; then m < log |Y| / log(4 + 2C) = log |Y| / log(1 + 1/γ), or |Y| > (1 + 1/γ)^m. Hence Lemma 3.2 gives us some y, y′ ∈ Y and z, z′ ∈ B_1(γk) with A(y + z) = A(y′ + z′).

Let w be the result of running the recovery algorithm on A(y + z). By the definition of a deterministic recovery algorithm, and since y itself is k-sparse, we have

    ‖y + z − w‖_1 ≤ C · min_{k-sparse y″} ‖y + z − y″‖_1 ≤ C ‖z‖_1
    ‖y − w‖_1 − ‖z‖_1 ≤ C ‖z‖_1
    ‖y − w‖_1 ≤ (1 + C) ‖z‖_1 ≤ (1 + C) γ k = [(1 + C)/(3 + 2C)] k,

and similarly ‖y′ − w‖_1 ≤ [(1 + C)/(3 + 2C)] k, so

    ‖y − y′‖_1 ≤ ‖y − w‖_1 + ‖y′ − w‖_1 = [(2 + 2C)/(3 + 2C)] k < k.

But this contradicts the definition of Y, so m must be large enough for the guarantee to hold. □

Corollary 3.1. If C is a constant bounded away from zero, then m = Ω(k log(n/k)).

4 Randomized Upper Bound for Uniform Noise

The standard way to prove a randomized lower bound is to find a distribution of hard inputs, and to show that any deterministic algorithm is likely to fail on that distribution. In our context, we would like to define a "head" random variable y from a distribution Y and a "tail" random variable z from a distribution Z, such that any algorithm given the sketch of y + z must recover an incorrect y with non-negligible probability.

Using our deterministic bound as inspiration, we could take Y to be uniform over a set of k-sparse binary vectors of minimum Hamming distance k and Z to be uniform over the ball B_1(γk) for some constant γ > 0. Unfortunately, as the following theorem shows, one can actually perform a recovery of such vectors using only O(k) measurements; this is because ‖z‖_2 is very small (namely, Õ(k/√n)) with high probability.

Theorem 4.1. Let Y ⊂ R^n be a set of signals with the property that for every distinct y_1, y_2 ∈ Y, ‖y_1 − y_2‖_2 ≥ r, for some parameter r > 0. Consider "noisy signals" x = y + z, where y ∈ Y and z is a "noise vector" chosen uniformly at random from B_1(s), for another parameter s > 0. Then using an m × n Gaussian measurement matrix A = (1/√m)(g_ij), where the g_ij's are i.i.d. standard Gaussians, we can recover y ∈ Y from A(y + z) with probability 1 − 1/n (where the probability is over both A and z), as long as

    s ≤ O( r m^{1/2} n^{1/2 − 1/m} / (|Y|^{1/m} log^{3/2} n) ).

To prove the theorem we will need the following two lemmas.

Lemma 4.1. For any δ > 0, y_1, y_2 ∈ Y with y_1 ≠ y_2, and z ∈ R^n, each of the following holds with probability at least 1 − δ:

• ‖A(y_1 − y_2)‖_2 ≥ (δ^{1/m}/3) ‖y_1 − y_2‖_2, and

• ‖Az‖_2 ≤ (√((8/m) log(1/δ)) + 1) ‖z‖_2.

See the appendix for the proof.

Lemma 4.2. A random vector z chosen uniformly from B_1(s) satisfies

    Pr[ ‖z‖_2 > α s log n / √n ] < 1/n^{α−1}.

See the appendix for the proof.

Proof of Theorem 4.1. In words, Lemma 4.1 says that A cannot bring faraway signal points too close together, and cannot blow up a small noise vector too much. Now, we already assumed the signals to be far apart, and Lemma 4.2 tells us that the noise is indeed small (in ℓ2 distance). The result is that in the image space, the noise is not enough to confuse different signals.

Quantitatively, applying the second part of Lemma 4.1 with δ = 1/n², and Lemma 4.2 with α = 3, gives us

(4.3)    ‖Az‖_2 ≤ O( (log^{1/2} n / m^{1/2}) ‖z‖_2 ) ≤ O( s log^{3/2} n / (mn)^{1/2} )

with probability ≥ 1 − 2/n². On the other hand, given a signal y_1 ∈ Y, we know that every other signal y_2 ∈ Y satisfies ‖y_1 − y_2‖_2 ≥ r, so by the first part of Lemma 4.1 with δ = 1/(2n|Y|), together with a union bound over every y_2 ∈ Y,

(4.4)    ‖A(y_1 − y_2)‖_2 ≥ ‖y_1 − y_2‖_2 / (3(2n|Y|)^{1/m}) ≥ r / (3(2n|Y|)^{1/m})

holds for every y_2 ∈ Y, y_2 ≠ y_1, simultaneously with probability 1 − 1/(2n).

Finally, observe that as long as ‖Az‖_2 < ‖A(y_1 − y_2)‖_2 / 2 for every competing signal y_2 ∈ Y, we are guaranteed that

    ‖A(y_1 + z) − Ay_1‖_2 = ‖Az‖_2 < ‖A(y_1 − y_2)‖_2 − ‖Az‖_2 ≤ ‖A(y_1 + z) − Ay_2‖_2

for every y_2 ≠ y_1, so we can recover y_1 by simply returning the signal whose image is closest to our measurement point A(y_1 + z) in ℓ2 distance. To achieve this, we can chain Equations (4.3) and (4.4) together (with a factor of 2), to see that

    s ≤ O( r m^{1/2} n^{1/2 − 1/m} / (|Y|^{1/m} log^{3/2} n) )

suffices. Our total probability of failure is at most 2/n² + 1/(2n) < 1/n. □
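The following small simulation (ours, illustrating Theorem 4.1 rather than reproducing its proof; all parameter values and helper names are arbitrary) checks that nearest-image decoding with a Gaussian sketch recovers y from A(y + z) when z is drawn uniformly from an ℓ1 ball of modest radius.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k, m = 256, 4, 40

    # Y: a few k-sparse binary signals with disjoint supports, so ||y1 - y2||_2 = sqrt(2k)
    Y = []
    for t in range(8):
        y = np.zeros(n)
        y[t * k:(t + 1) * k] = 1.0
        Y.append(y)

    def uniform_l1_ball(n, s, rng):
        # uniform sample from B_1(s): Laplace direction (exponential magnitudes,
        # random signs) scaled by a radius with density proportional to r^{n-1}
        g = rng.exponential(size=n)
        radius = s * rng.random() ** (1.0 / n)
        return rng.choice([-1.0, 1.0], size=n) * g / g.sum() * radius

    A = rng.standard_normal((m, n)) / np.sqrt(m)
    errors = 0
    for _ in range(200):
        y = Y[rng.integers(len(Y))]
        z = uniform_l1_ball(n, s=0.3 * k, rng=rng)
        sketch = A @ (y + z)
        y_hat = min(Y, key=lambda c: np.linalg.norm(sketch - A @ c))
        errors += not np.array_equal(y_hat, y)
    print("decoding errors:", errors, "out of 200")   # expected: essentially none

With m = 40 = O(k) measurements and n = 256, the image of the noise stays far below half the separation between distinct signal images, which is exactly the condition used in the proof.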
The main consequence of this theorem is that for the setup we used in Section 3 to prove a deterministic lower bound of Ω(k log(n/k)), if we simply draw the noise uniformly at random from the same ℓ1 ball (in fact, even one with a much larger radius, namely, polynomial in n), this "hard distribution" can be defeated with just O(k) measurements:

Corollary 4.1. If Y is a set of binary k-sparse vectors, as in Section 3, and the noise z is drawn uniformly at random from B_1(s), then for any constant ε > 0, m = O(k/ε) measurements suffice to recover any signal in Y with probability 1 − 1/n, as long as

    s ≤ O( k^{3/2 + ε} n^{1/2 − ε} / log^{3/2} n ).

Proof. The parameters in this case are r = √k and |Y| ≤ (n choose k) ≤ (ne/k)^k, so by Theorem 4.1 it suffices to have

    s ≤ O( k^{3/2 + k/m} n^{1/2 − (k+1)/m} / log^{3/2} n ).

Choosing m = (k + 1)/ε yields the corollary. □

5 Randomized Lower Bound

Although it is possible to partially circumvent this obstacle by concentrating the noise distribution on sparse vectors with "high" ℓ2 norm, we are able to obtain stronger results via a reduction from a communication game and the corresponding lower bound.

The communication game will show that a message Ax must have a large number of bits. To show that this implies a lower bound on the number of rows of A, we will need A to be discrete. Hence we first show that discretizing A does not change its recovery characteristics by much.

5.1 Discretizing Matrices

Before we discretize by rounding, we need to ensure that the matrix is well conditioned. We show that, without loss of generality, the rows of A are orthonormal.

We can multiply A on the left by any invertible matrix to get another measurement matrix with the same recovery characteristics. If we consider the singular value decomposition A = U Σ V*, where U and V are orthonormal and Σ is 0 off the diagonal, this means that we can eliminate U and make the entries of Σ be either 0 or 1. The result is a matrix consisting of m orthonormal rows. For such matrices, we prove the following:

Lemma 5.1. Consider any m × n matrix A with orthonormal rows. Let A′ be the result of rounding A to b bits per entry. Then for any v ∈ R^n there exists an s ∈ R^n with A′v = A(v − s) and ‖s‖_1 < n² 2^{−b} ‖v‖_1.

Proof. Let A″ = A − A′ be the roundoff error when discretizing A to b bits, so each entry of A″ is less than 2^{−b}. Then for any v and s = A^T A″ v, we have As = A A^T A″ v = A″ v, so A(v − s) = Av − A″v = A′v, and

    ‖s‖_1 = ‖A^T A″ v‖_1 ≤ √n ‖A″ v‖_1 ≤ m √n 2^{−b} ‖v‖_1 ≤ n² 2^{−b} ‖v‖_1. □
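A quick numeric check of Lemma 5.1 (ours; the matrix sizes and bit budget are arbitrary) rounds an orthonormal-row matrix to b bits, forms the correction vector s = A^T(A − A′)v from the proof, and verifies both the identity A′v = A(v − s) and the ℓ1 bound.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, b = 10, 50, 16

    # a random m x n matrix with orthonormal rows, via QR of a Gaussian matrix
    Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
    A = Q.T
    A_round = np.round(A * 2**b) / 2**b        # A': b bits of precision per entry
    A_err = A - A_round                        # A'': entries smaller than 2^{-b}

    v = rng.standard_normal(n)
    s = A.T @ (A_err @ v)                      # the correction vector from the proof
    print(np.allclose(A_round @ v, A @ (v - s)))                           # True
    print(np.linalg.norm(s, 1) < n**2 * 2.0**(-b) * np.linalg.norm(v, 1))  # True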
5.2 Communication Complexity

We use a few definitions and results from two-party communication complexity. For further background see the book by Kushilevitz and Nisan [KN97].

Consider the following communication game. There are two parties, Alice and Bob. Alice is given a string y ∈ {0, 1}^d. Bob is given an index i ∈ [d], together with y_{i+1}, y_{i+2}, ..., y_d. The parties also share an arbitrarily long common random string r. Alice sends a single message M(y, r) to Bob, who must output y_i with probability at least 3/4, where the probability is taken over r. We refer to this problem as Augmented Indexing. The communication cost of Augmented Indexing is the minimum, over all correct protocols, of the length of the message M(y, r) on the worst-case choice of r and y.

The next theorem is well-known and follows from Lemma 13 of [MNSW98] (see also Lemma 2 of [BYJKK04]).

Theorem 5.1. The communication cost of Augmented Indexing is Ω(d).

Proof. First, consider the private-coin version of the problem, in which both parties can toss coins, but do not share a random string r (i.e., there is no public coin). Consider any correct protocol for this problem. We can assume the probability of error of the protocol is an arbitrarily small positive constant by increasing the length of Alice's message by a constant factor (e.g., by independent repetition and a majority vote). Applying Lemma 13 of [MNSW98] (with, in their notation, t = 1 and a = c′ · d for a sufficiently small constant c′ > 0), the communication cost of such a protocol must be Ω(d). Indeed, otherwise there would be a protocol in which Bob could output y_i with probability greater than 1/2 without any interaction with Alice, contradicting that Pr[y_i = 1] = 1/2 and that Bob has no information about y_i. Our theorem now follows from Newman's theorem (see, e.g., Theorem 2.4 of [KNR99]), which shows that the communication cost of the best public-coin protocol is at least that of the private-coin protocol minus O(log d) (and this also holds for one-round protocols). □

5.3 Randomized Lower Bound Theorem

Theorem 5.2. For any randomized ℓ1/ℓ1 recovery algorithm (A, 𝒜) with approximation factor C = O(1), A must have m = Ω(k log(n/k)) rows.

Proof. We shall assume, without loss of generality, that n and k are powers of 2, that k divides n, and that the rows of A are orthonormal. The proof for the general case follows with minor modifications.

Let (A, 𝒜) be such a recovery algorithm. We will show how to solve the Augmented Indexing problem on instances of size d = Ω(k log(n/k) log n) with communication cost O(m log n). The theorem will then follow by Theorem 5.1.

Let X be a maximal set of k-sparse n-dimensional binary vectors with minimum Hamming distance k. From Lemma 3.1 we have log |X| = Ω(k log(n/k)). Let d = ⌊log |X|⌋ log n, and define D = 2C + 3.

Alice is given a string y ∈ {0, 1}^d, and Bob is given i ∈ [d] together with y_{i+1}, y_{i+2}, ..., y_d, as in the setup for Augmented Indexing. Alice splits her string y into log n contiguous chunks y¹, y², ..., y^{log n}, each containing ⌊log |X|⌋ bits. She uses y^j as an index into X to choose x_j. Alice defines

    x = D¹ x_1 + D² x_2 + ··· + D^{log n} x_{log n}.

Alice and Bob use the common randomness r to agree upon a random matrix A with orthonormal rows. Both Alice and Bob round A to form A′ with b = ⌈2(1 + log D) log n⌉ = O(log n) bits per entry. Alice computes A′x and transmits it to Bob.

From Bob's input i, he can compute the value j = j(i) for which the bit y_i occurs in y^j. Bob's input also contains y_{i+1}, ..., y_d, from which he can reconstruct x_{j+1}, ..., x_{log n}, and in particular can compute

    z = D^{j+1} x_{j+1} + D^{j+2} x_{j+2} + ··· + D^{log n} x_{log n}.

Bob then computes A′z, and using A′x and linearity, A′(x − z). Then

    ‖x − z‖_1 ≤ Σ_{i=1}^{j} k D^i < k D^{1 + log n}/(D − 1) < k D^{2 log n}.

So from Lemma 5.1, there exists some s with A′(x − z) = A(x − z − s) and

    ‖s‖_1 < n² 2^{−2 log n − 2(log D) log n} ‖x − z‖_1 < k.

Set w = x − z − s. Bob then runs the recovery algorithm 𝒜 on (A, Aw), obtaining ŵ with the property that, with probability at least 3/4,

    ‖w − ŵ‖_1 ≤ C · min_{k-sparse w′} ‖w − w′‖_1.

Now,

    min_{k-sparse w′} ‖w − w′‖_1 ≤ ‖w − D^j x_j‖_1 ≤ ‖s‖_1 + ‖Σ_{i=1}^{j−1} D^i x_i‖_1
                                 < k (1 + D + D² + ··· + D^{j−1}) < k D^j/(D − 1).

Hence

    ‖D^j x_j − ŵ‖_1 ≤ ‖D^j x_j − w‖_1 + ‖w − ŵ‖_1 ≤ (1 + C) ‖D^j x_j − w‖_1 < (1 + C) k D^j/(D − 1) = k D^j / 2,

using D = 2C + 3. And since the minimum Hamming distance in X is k, any other codeword x′ ∈ X satisfies ‖D^j x′ − D^j x_j‖_1 ≥ k D^j, so

    ‖D^j x_j − ŵ‖_1 < ‖D^j x′ − ŵ‖_1 for all x′ ∈ X, x′ ≠ x_j.⁴

So Bob can correctly identify x_j with probability at least 3/4. From x_j he can recover y^j, and hence the bit y_i that occurs in y^j.

Hence, Bob solves Augmented Indexing with probability at least 3/4 given the message A′x. The entries in A′ and x are polynomially bounded integers (up to scaling of A′), so each entry of A′x takes O(log n) bits to describe. Hence, the communication cost of this protocol is O(m log n). By Theorem 5.1, m log n = Ω(k log(n/k) log n), or m = Ω(k log(n/k)). □

⁴ Note that these bounds would still hold with minor modification if we replaced the ℓ1/ℓ1 guarantee with the ℓ2/ℓ1 guarantee, so the same result holds in that case.
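The following toy run-through (ours; it idealizes the recovery step with a stand-in oracle that returns the best k-sparse approximation, omits the sketching matrix and rounding, and uses tiny illustrative parameters) traces the structure of the reduction: Alice encodes her bit chunks as codewords x_j, Bob subtracts the suffix he already knows, runs recovery, and identifies x_j by nearest-codeword rounding.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k, m, C = 64, 2, 32, 1.0
    D = int(2 * C + 3)
    L = 3                                   # number of chunks (log n in the proof)

    # X: a small codebook of k-sparse binary vectors with disjoint supports
    X = [np.zeros(n) for _ in range(n // k)]
    for t, v in enumerate(X):
        v[t * k:(t + 1) * k] = 1.0
    bits_per_chunk = int(np.floor(np.log2(len(X))))

    def oracle_recover(w):
        """Idealized stand-in for a C-approximate l1/l1 recovery algorithm."""
        out = np.zeros_like(w)
        idx = np.argsort(-np.abs(w))[:k]
        out[idx] = w[idx]
        return out

    # Alice: split her bits into chunks, map each chunk to a codeword, form x
    y_bits = rng.integers(0, 2, size=L * bits_per_chunk)
    chunks = [y_bits[j * bits_per_chunk:(j + 1) * bits_per_chunk] for j in range(L)]
    xs = [X[int("".join(map(str, c)), 2)] for c in chunks]
    x = sum(D**(j + 1) * xs[j] for j in range(L))

    # Bob: wants bit i; he knows all later bits, hence x_{j+1}, ..., x_L
    i = 4
    j = i // bits_per_chunk                 # chunk containing bit i
    z = sum((D**(t + 1) * xs[t] for t in range(j + 1, L)), np.zeros(n))
    w_hat = oracle_recover(x - z)           # in the proof: run the algorithm on A(x - z - s)
    idx_j = min(range(len(X)),
                key=lambda t: np.linalg.norm(D**(j + 1) * X[t] - w_hat, 1))
    decoded = np.array(list(np.binary_repr(idx_j, width=bits_per_chunk)), dtype=int)
    print(decoded[i % bits_per_chunk] == y_bits[i])   # True

The choice D = 2C + 3 is what makes the nearest-codeword step unambiguous: the recovered vector lands within kD^j/2 of D^j x_j, while every other scaled codeword is at ℓ1 distance at least kD^j.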
References

[BGI+08] R. Berinde, A. Gilbert, P. Indyk, H. Karloff, and M. Strauss. Combining geometry and combinatorics: a unified approach to sparse signal recovery. Allerton, 2008.

[BYJKK04] Z. Bar-Yossef, T. S. Jayram, R. Krauthgamer, and R. Kumar. The sketching complexity of pattern matching. RANDOM, 2004.

[CCFC04] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3–15, 2004.

[CM05] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58–75, 2005.

[CRT06] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1208–1223, 2006.

[CW09] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. In STOC, pages 205–214, 2009.

[DDT+08] M. Duarte, M. Davenport, D. Takhar, J. Laska, T. Sun, K. Kelly, and R. Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 2008.

[DM08] W. Dai and O. Milenkovic. Weighted superimposed codes and constrained integer compressed sensing. Preprint, 2008.

[Don06] D. L. Donoho. Compressed sensing. IEEE Trans. Info. Theory, 52(4):1289–1306, Apr. 2006.

[EG88] T. Ericson and L. Györfi. Superimposed codes in R^n. IEEE Trans. on Information Theory, 34(4):877–880, 1988.

[FR99] Z. Füredi and M. Ruszinkó. An improved upper bound of the rate of Euclidean superimposed codes. IEEE Trans. on Information Theory, 45(2):799–802, 1999.

[GSS08] S. Ganguly, A. N. Singh, and S. Shankar. Finding frequent items over general update streams. In SSDBM, pages 204–221, 2008.

[IN07] P. Indyk and A. Naor. Nearest neighbor preserving embeddings. ACM Trans. on Algorithms, 3(3), Aug. 2007.

[Ind07] P. Indyk. Sketching, streaming and sublinear-space algorithms. Graduate course notes, available at http://stellar.mit.edu/S/course/6/fa07/6.895/, 2007.

[KN97] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.

[KNR99] I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication complexity. Computational Complexity, 8(1):21–49, 1999.

[KNW10] D. Kane, J. Nelson, and D. Woodruff. On the exact space complexity of sketching and streaming small norms. In SODA, 2010.

[KT07] B. S. Kashin and V. N. Temlyakov. A remark on compressed sensing. Preprint, 2007.

[MNSW98] P. B. Miltersen, N. Nisan, S. Safra, and A. Wigderson. On data structures and asymmetric communication complexity. J. Comput. Syst. Sci., 57(1):37–49, 1998.

[Mut03] S. Muthukrishnan. Data streams: Algorithms and applications (invited talk at SODA'03). Available at http://athos.rutgers.edu/~muthu/stream-1-1.ps, 2003.

[vL98] J. H. van Lint. Introduction to Coding Theory. Springer, 1998.

[Wai07] M. Wainwright. Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting. IEEE Int'l Symp. on Information Theory, 2007.
A Proof of Lemma 3.1

Proof. We will construct a codebook T of block length k, alphabet size q, and minimum Hamming distance εk. Replacing each character i with the q-long standard basis vector e_i will create a binary qk-dimensional codebook S of the same size as T with minimum Hamming distance 2εk, where each element of S has exactly k ones. The Gilbert-Varshamov bound, based on volumes of Hamming balls, states that a codebook of size L exists for some

    L ≥ q^k / Σ_{i=0}^{εk−1} (k choose i) (q − 1)^i.

Using the claim below (analogous to [vL98], p. 21) that for ε < 1 − 1/q

    Σ_{i=0}^{εk} (k choose i) (q − 1)^i < q^{H_q(ε) k},

we have that log L > (1 − H_q(ε)) k log q, as desired. □

Claim A.1. For 0 < ε < 1 − 1/q,

    Σ_{i=0}^{εk} (k choose i) (q − 1)^i < q^{H_q(ε) k}.

Proof. Note that ε/((q − 1)(1 − ε)) < 1 (this is equivalent to ε < 1 − 1/q) and that q^{−H_q(ε)} = (ε/((q − 1)(1 − ε)))^ε (1 − ε). Then

    1 = (ε + (1 − ε))^k
      > Σ_{i=0}^{εk} (k choose i) ε^i (1 − ε)^{k−i}
      = Σ_{i=0}^{εk} (k choose i) (q − 1)^i (ε/((q − 1)(1 − ε)))^i (1 − ε)^k
      > Σ_{i=0}^{εk} (k choose i) (q − 1)^i (ε/((q − 1)(1 − ε)))^{εk} (1 − ε)^k
      = q^{−H_q(ε) k} Σ_{i=0}^{εk} (k choose i) (q − 1)^i,

and dividing through by q^{−H_q(ε) k} gives the claim. □

B Proof of Lemma 4.1

Proof. By standard arguments (see, e.g., [IN07]), for any D > 0 we have

    Pr[ ‖A(y_1 − y_2)‖_2 ≤ (1/D) ‖y_1 − y_2‖_2 ] ≤ (3/D)^m

and

    Pr[ ‖Az‖_2 ≥ D ‖z‖_2 ] ≤ e^{−m(D−1)²/8}.

Setting both right-hand sides to δ yields the lemma. □

C Proof of Lemma 4.2

Proof. Consider the distribution of a single coordinate of z, say z_1. The probability density of |z_1| taking value t ∈ [0, s] is proportional to the (n−1)-dimensional volume of B_1^{n−1}(s − t), which in turn is proportional to (s − t)^{n−1}. Normalizing so that the probability integrates to 1, we derive this density as

    p(|z_1| = t) = (n/s^n) (s − t)^{n−1}.

It follows that, for any D ∈ [0, s],

    Pr[ |z_1| > D ] = ∫_D^s (n/s^n) (s − t)^{n−1} dt = (1 − D/s)^n.

In particular, for any α > 1,

    Pr[ |z_1| > α s log n / n ] = (1 − α log n / n)^n < e^{−α log n} = 1/n^α.

Now, by symmetry this holds for every other coordinate z_i of z as well, so by the union bound

    Pr[ ‖z‖_∞ > α s log n / n ] < 1/n^{α−1},

and since ‖z‖_2 ≤ √n ‖z‖_∞ for any vector z, the lemma follows. □
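As a final sanity check (ours; a simulation, not part of the proof), the following Monte Carlo snippet samples from B_1(s) using the standard Laplace-plus-radius construction and confirms that exceedances of the threshold in Lemma 4.2 are essentially never observed at these parameters.

    import numpy as np

    rng = np.random.default_rng(0)
    n, s, alpha, trials = 10_000, 1.0, 3.0, 200

    def uniform_l1_ball(n, s, rng):
        g = rng.exponential(size=n)                  # Laplace magnitudes
        radius = s * rng.random() ** (1.0 / n)       # radial density proportional to r^{n-1}
        return rng.choice([-1.0, 1.0], size=n) * g / g.sum() * radius

    threshold = alpha * s * np.log(n) / np.sqrt(n)
    exceed = sum(np.linalg.norm(uniform_l1_ball(n, s, rng)) > threshold
                 for _ in range(trials))
    print(exceed, "exceedances out of", trials)      # Lemma 4.2 predicts a 1/n^2 rate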