Nonnegative Local Coordinate Factorization for Image Representation

Yan Chen, Jiemi Zhang, Deng Cai, Member, IEEE, Wei Liu,
and Xiaofei He, Senior Member, IEEE
Abstract— Recently, nonnegative matrix factorization (NMF)
has become increasingly popular for feature extraction in computer vision and pattern recognition. NMF seeks two nonnegative matrices whose product can best approximate the original
matrix. The nonnegativity constraints lead to sparse parts-based
representations that can be more robust than nonsparse global
features. To obtain more accurate control over the sparseness,
in this paper, we propose a novel method called nonnegative
local coordinate factorization (NLCF) for feature extraction.
NLCF adds a local coordinate constraint into the standard NMF
objective function. Specifically, we require that the learned basis
vectors be as close to the original data points as possible. In this
way, each data point can be represented by a linear combination
of only a few nearby basis vectors, which naturally leads to sparse
representation. Extensive experimental results suggest that the
proposed approach provides a better representation and achieves
higher accuracy in image clustering.
Index Terms— Local coordinate coding, nonnegative matrix
factorization, sparse learning.
I. INTRODUCTION

DATA representation has multiple meanings and goals, arising from many applications such as computer vision and pattern recognition. In these fields, the input data matrix is often of very high dimension, which may make learning from examples infeasible [1]. One hopes to reduce the dimension by using feature extraction techniques, such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Non-negative Matrix Factorization (NMF) [2], and other methods [3]–[5].
PCA is one of the most popular dimensionality reduction methods. Its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data. The basis vectors of PCA are orthogonal and
have a statistical interpretation as the directions of largest variance. Recently, matrix factorization techniques have become increasingly popular for feature extraction. One of the most
frequently used matrix factorization techniques is SVD, which
provides a low-rank approximation to the original matrix.
This approximation is optimal in the sense of reconstruction
error and thus optimal for data representation when Euclidean
structure is concerned.
Unlike PCA and SVD, NMF seeks two non-negative matrices whose product can best approximate the original matrix. Previous studies have shown that there is psychological and physiological evidence for parts-based representation in the human brain [6]–[8]. The NMF codes naturally favor sparse, parts-based representations, which in the context of classification and regression can be more robust than non-sparse, global representations [9]. Due to the non-negativity constraints of NMF, it models each data point as an additive, not subtractive, combination of the underlying clusters. However, NMF does not always result in a sparse representation [10]. Hoyer extended NMF with the option to control sparseness explicitly by adding an L1-norm minimization on the factor matrices, which allows one to discover sparser representations than those given by standard NMF [11]. It is important to note, however, that Hoyer's approach does not directly ensure the sparseness of the new representation of a data point. Instead, it ensures the sparseness of a new feature corresponding to a basis vector. Thus, although the new representations of the data points can be very sparse on average, it is theoretically possible that the new representations of some points are highly sparse while those of other points are highly dense.
In this paper, we propose a novel matrix factorization algorithm, called Non-negative Local Coordinate Factorization (NLCF), which adds a local coordinate constraint to ensure the sparseness of the obtained representations. Our algorithm is motivated by much recent progress on sparse coding, and particularly by Local Coordinate Coding proposed by Yu et al. [12], [13]. Specifically, we require that the learned basis vectors be as close to the data points as possible. In this way, each data point can be represented by a linear combination of only a few nearby basis vectors and, thus, the sparseness of the obtained representations can be guaranteed. An optimization scheme has been developed to solve the objective function based on iterative updates of the two factor matrices. It is important to note that NLCF is unsupervised, which is fundamentally different from [14], which is supervised.
The rest of the paper is organized as follows. Section II gives a brief review of NMF. In Section III, we introduce our NLCF algorithm, as well as the optimization scheme, a convergence study, and a computational complexity analysis. Extensive experimental results are presented in Section IV. Finally, we conclude in Section V.
II. BRIEF REVIEW OF NMF

NMF tries to decompose a non-negative M × N matrix X into two non-negative factor matrices U = (u_1, ..., u_K) ∈ R^{M×K} and V = (v_1, ..., v_N) ∈ R^{K×N}. There are different criteria to measure the quality of the decomposition. Lee et al. proposed two objective functions in [15]: the Euclidean distance between X and UV [16] and the KL divergence [7]. The Euclidean distance based objective function is expressed as

O = ||X − UV||_F^2    (1)

where ||·||_F denotes the matrix Frobenius norm.
Because the objective function O is not convex in U and V jointly, it is infeasible to find a global minimum of O [17]. The following iterative update rules provided by Lee et al. [15] can obtain a local minimum of O:

u_{jk}^{t+1} = u_{jk}^{t} (X V^T)_{jk} / (U V V^T)_{jk},    v_{ki}^{t+1} = v_{ki}^{t} (U^T X)_{ki} / (U^T U V)_{ki}.
By NMF, each data point x_i is approximated by a linear combination of the columns of U, weighted by the elements of the i-th column of V. Please see [18]–[24] for various NMF extensions. NMF has been successfully used in many multimedia applications [25], [26], and the closely related technique of probabilistic latent semantic analysis [27] has also been widely discussed [28].
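To make the update rules above concrete, the following minimal NumPy sketch implements the standard multiplicative updates; the random initialization, iteration count, and the small constant eps used to avoid division by zero are illustrative assumptions rather than part of the original algorithm description.

import numpy as np

def nmf_multiplicative(X, K, n_iter=200, eps=1e-10, seed=0):
    """Standard NMF via the Lee-Seung multiplicative updates (Frobenius norm)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K)) + eps          # non-negative random initialization (assumption)
    V = rng.random((K, N)) + eps
    for _ in range(n_iter):
        # u_jk <- u_jk (X V^T)_jk / (U V V^T)_jk
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        # v_ki <- v_ki (U^T X)_ki / (U^T U V)_ki
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U, V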
III. NONNEGATIVE LOCAL COORDINATE FACTORIZATION

In this section, we introduce our NLCF algorithm for obtaining sparse representations.
A. Objective Function

We first introduce the concept of coordinate coding [12].

Definition: A coordinate coding is a pair (γ, C), where C ⊂ R^d is a set of anchor points, and γ is a map of x ∈ R^d to [γ_v(x)]_{v∈C} ∈ R^{|C|}. It induces the following physical approximation of x in R^d: γ(x) = Σ_{v∈C} γ_v(x) v.

By this definition, the columns of the basis matrix U can be considered as a set of anchor points, and each data point in the original space can be approximated by a linear combination of the anchor points. The columns of V contain the coordinates of the data points with respect to the anchor points.

In order to obtain sparse codings, each data point should be represented as a linear combination of only a few nearby anchor points. In other words, each data point should be sufficiently close to only a few anchor points. This can be achieved by introducing the local coordinate constraint [12]:

Q = Σ_{k=1}^{K} |v_{ki}| ||u_k − x_i||^2.    (2)

The above constraint incurs a heavy penalty if x_i is far away from the anchor point u_k while its new coordinate v_{ki} with respect to u_k is large. Therefore, minimizing it is an attempt to ensure that if x_i is sufficiently close to the anchor point u_k, then its new coordinate with respect to u_k tends to be one.

By incorporating the local coordinate constraint Q into the standard NMF objective function, we get the following minimization problem:

O = Σ_{i=1}^{N} ||x_i − U v_i||^2 + μ Σ_{i=1}^{N} Σ_{k=1}^{K} |v_{ki}| ||u_k − x_i||^2
s.t.  U = [u_1, ..., u_K] ∈ R^{M×K} ≥ 0,  V = [v_1, ..., v_N] ∈ R^{K×N} ≥ 0    (3)

where μ ≥ 0 is a regularization parameter.
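The following short sketch, under the same notation (X of size M × N, U of size M × K, V of size K × N), shows one way to evaluate the objective in Eq. (3); it is an illustration rather than part of the proposed algorithm, and the function name is our own.

import numpy as np

def nlcf_objective(X, U, V, mu):
    """Value of the NLCF objective in Eq. (3): reconstruction error plus
    the local coordinate penalty summed over all data points."""
    recon = np.sum((X - U @ V) ** 2)            # sum_i ||x_i - U v_i||^2
    # penalty: sum_i sum_k |v_ki| * ||u_k - x_i||^2,
    # with ||u_k - x_i||^2 laid out as a K x N matrix of squared distances
    dist2 = (np.sum(U ** 2, axis=0)[:, None]    # ||u_k||^2
             + np.sum(X ** 2, axis=0)[None, :]  # ||x_i||^2
             - 2.0 * U.T @ X)                   # -2 u_k^T x_i
    penalty = np.sum(np.abs(V) * dist2)
    return recon + mu * penalty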
B. Update Rules

Following some simple algebraic steps, we can rewrite the objective function as

O = ||X − UV||^2 + μ Σ_{i=1}^{N} Σ_{k=1}^{K} |v_{ki}| ||u_k − x_i||^2
  = ||X − UV||^2 + μ Σ_{i=1}^{N} ||(x_i 1^T − U) Λ_i^{1/2}||^2

where Λ_i = diag(|v_i|) ∈ R^{K×K}.

Noticing that ||A||^2 = Tr(A A^T), we have

O = Tr[(X − UV)(X − UV)^T] + μ Σ_{i=1}^{N} Tr[(x_i 1^T − U) Λ_i (x_i 1^T − U)^T]
  = Tr[X X^T + U V V^T U^T − 2 X V^T U^T + μ Σ_{i=1}^{N} (x_i 1^T Λ_i 1 x_i^T − 2 x_i 1^T Λ_i U^T + U Λ_i U^T)].    (4)

Let ψ_{jk} and φ_{ki} be the Lagrange multipliers for the constraints u_{jk} ≥ 0 and v_{ki} ≥ 0, respectively. Defining the matrices Ψ = [ψ_{jk}] and Φ = [φ_{ki}], the Lagrangian L is

L = Tr[X X^T + U V V^T U^T − 2 X V^T U^T + μ Σ_{i=1}^{N} (x_i 1^T Λ_i 1 x_i^T − 2 x_i 1^T Λ_i U^T + U Λ_i U^T)] + Tr(Ψ U^T) + Tr(Φ V^T).

The partial derivatives of L with respect to U and V are

∂L/∂U = 2 U V V^T − 2 X V^T + μ Σ_{i=1}^{N} (−2 x_i 1^T Λ_i + 2 U Λ_i) + Ψ    (5)

∂L/∂V = 2 U^T U V − 2 U^T X + μ (C − 2 U^T X + D) + Φ.    (6)
Define the column vector c = diag(X^T X) ∈ R^N and let C = (c, ..., c)^T be the K × N matrix whose rows are all c^T. Define the column vector d = diag(U^T U) ∈ R^K and let D = (d, ..., d) be the K × N matrix whose columns are all d.

Using the KKT conditions ψ_{jk} u_{jk} = 0 and φ_{ki} v_{ki} = 0, we get the following equations:

(U V V^T)_{jk} u_{jk} − (X V^T)_{jk} u_{jk} + μ (Σ_{i=1}^{N} U Λ_i)_{jk} u_{jk} − μ (Σ_{i=1}^{N} x_i 1^T Λ_i)_{jk} u_{jk} = 0

2 (U^T U V)_{ki} v_{ki} − 2 (U^T X)_{ki} v_{ki} + μ (C − 2 U^T X + D)_{ki} v_{ki} = 0.

The above equations lead to the following update rules:

u_{jk} ← u_{jk} (X V^T + μ Σ_{i=1}^{N} x_i 1^T Λ_i)_{jk} / (U V V^T + μ Σ_{i=1}^{N} U Λ_i)_{jk}    (7)

v_{ki} ← v_{ki} 2 (μ + 1) (U^T X)_{ki} / (2 U^T U V + μ C + μ D)_{ki}.    (8)

In the Appendix, we prove that the update rules of U and V in Eq. (7) and Eq. (8) converge and that the final solution is a local optimum.
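A minimal NumPy sketch of the updates in Eq. (7) and Eq. (8) is given below. It uses the simplifications Σ_i x_i 1^T Λ_i = X V^T and Σ_i U Λ_i = U H (with H diagonal, holding the row sums of V), which the paper derives in Section III-D; the initialization, iteration count, and eps are again assumptions.

import numpy as np

def nlcf_updates(X, K, mu, n_iter=300, eps=1e-10, seed=0):
    """NLCF via the multiplicative updates of Eqs. (7)-(8).

    Relies on sum_i x_i 1^T Lambda_i = X V^T and sum_i U Lambda_i = U H,
    which hold because V stays non-negative under the updates.
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K)) + eps
    V = rng.random((K, N)) + eps
    c = np.sum(X * X, axis=0)                  # c = diag(X^T X), length N
    for _ in range(n_iter):
        H = np.diag(V.sum(axis=1))             # row sums of V on the diagonal
        # Eq. (7): u_jk <- u_jk ((mu+1) X V^T)_jk / (U V V^T + mu U H)_jk
        U *= ((mu + 1.0) * (X @ V.T)) / (U @ V @ V.T + mu * U @ H + eps)
        d = np.sum(U * U, axis=0)              # d = diag(U^T U), length K
        C = np.tile(c, (K, 1))                 # K x N, rows equal to c^T
        D = np.tile(d[:, None], (1, N))        # K x N, columns equal to d
        # Eq. (8): v_ki <- v_ki 2(mu+1)(U^T X)_ki / (2 U^T U V + mu C + mu D)_ki
        V *= (2.0 * (mu + 1.0) * (U.T @ X)) / (2.0 * U.T @ U @ V + mu * C + mu * D + eps)
    return U, V

A typical call would be U, V = nlcf_updates(X, K=40, mu=0.5), with the columns of V giving the new sparse representations.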
C. Connection With Gradient Method

Here we reveal the connection between the gradient descent method [29] and our multiplicative updating rules in Eq. (7) and Eq. (8). Let η_{jk} = −u_{jk} / [2 (U V V^T + μ Σ_{i=1}^{N} U Λ_i)_{jk}]. We have

u_{jk} + η_{jk} ∂O/∂u_{jk}
  = u_{jk} − [u_{jk} / (2 (U V V^T + μ Σ_{i=1}^{N} U Λ_i)_{jk})] (2 U V V^T − 2 X V^T + μ Σ_{i=1}^{N} (−2 x_i 1^T Λ_i + 2 U Λ_i))_{jk}
  = u_{jk} (X V^T + μ Σ_{i=1}^{N} x_i 1^T Λ_i)_{jk} / (U V V^T + μ Σ_{i=1}^{N} U Λ_i)_{jk}.

Similarly, let δ_{ki} = −v_{ki} / (2 U^T U V + μ C + μ D)_{ki}. We have

v_{ki} + δ_{ki} ∂O/∂v_{ki}
  = v_{ki} − [v_{ki} / (2 U^T U V + μ C + μ D)_{ki}] (2 U^T U V − 2 U^T X + μ (C − 2 U^T X + D))_{ki}
  = v_{ki} 2 (μ + 1) (U^T X)_{ki} / (2 U^T U V + μ C + μ D)_{ki}.
Now it is clear that the multiplicative updating rules in Eq. (7) and Eq. (8) are special cases of gradient descent with an automatically selected step size. The advantage of the multiplicative updating rules is that they guarantee the non-negativity of U and V.
D. Computational Complexity Analysis
In this section, we present a computational complexity analysis of our proposed algorithm in comparison with NMF.

The common way to express the complexity of an algorithm is the big O notation [30]. However, it is not an appropriate way to analyze algorithms that contain many matrix computations, such as NLCF and NMF. Instead, we count the arithmetic operations of each algorithm. The four operation abbreviations used in this paper are summarized in Table I. Please see [31] for more details about these operation abbreviations.

TABLE I
ABBREVIATIONS FOR REPORTING OPERATION COUNTS

Abbreviation   Description
fladd          a floating-point addition
flmlt          a floating-point multiplication
fldiv          a floating-point division
flam           an addition and multiplication
Based on the updating rules, we count the arithmetic operations of each iteration in NMF and summarize the result in
Table III.
For NLCF, note that

Σ_{i=1}^{N} x_i 1^T Λ_i = X V^T,    Σ_{i=1}^{N} U Λ_i = U H,

where H is a diagonal matrix whose entries are the row sums of V. So we can rewrite Eq. (7) as

u_{jk} ← u_{jk} ((μ + 1) X V^T)_{jk} / (U V V^T + μ U H)_{jk}.
TABLE III
COMPUTATIONAL OPERATION COUNTS FOR EACH ITERATION IN NMF AND NLCF

                          Fladd                                 Flmlt                                 Fldiv      Overall
NMF (F-norm formulation)  2MNK + 2(M+N)K^2                      2MNK + 2(M+N)K^2 + (M+N)K             (M+N)K     O(MNK)
NLCF                      2MNK + (3M+2N)K^2 + MN + 2MK + 3NK    2MNK + (3M+2N)K^2 + MN + 4MK + 4NK    (M+N)K     O(MNK)

N: the number of sample points;  M: the number of features;  K: the number of factors.
There is no difficulty in counting the computational operation counts for each matrix multiplication; they are presented in Table II. The computational operation counts of C and D need more explanation. From the definition of C, we would need to compute X^T X, which costs N^2 M flam. In practice, there is no need to perform this matrix multiplication, since we only need its diagonal entries. Note that c_i = Σ_{j=1}^{M} x_{ji}^2, where i = 1, ..., N, so it only costs NM flam to compute C. Similarly, the computation of D needs MK flam.

TABLE II
COMPUTATIONAL OPERATION COUNTS FOR EACH MATRIX MULTIPLICATION

           Fladd         Flmlt
X V^T      MNK           MNK
U V V^T    (M+N)K^2      (M+N)K^2
U H        MK^2 + NK     MK^2
U^T X      MNK           MNK
U^T U V    (M+N)K^2      (M+N)K^2
C          MN            MN
D          MK            MK

N: the number of sample points;  M: the number of features;  K: the number of factors.

We also summarize the computational operation counts for each iteration of NLCF in Table III. Suppose the multiplicative updates stop after t iterations; then the overall cost for both NMF (F-norm formulation) and NLCF is O(tMNK).
E. Incorporating a Geometrically Based Regularizer

Another limitation of NMF is that it fails to discover the intrinsic geometrical and discriminative structure of the data space, which is essential in real-world applications [32], [33]. Cai et al. proposed a variant of NMF called Graph regularized Non-negative Matrix Factorization (GNMF) [18], [20], [23], which adds a geometrically based regularizer to the original NMF objective function. In this section, we incorporate this regularizer into our NLCF algorithm.

The geometrically based regularizer is

R = Tr(V L V^T).    (9)
Tr(·) denotes the trace of a matrix. L = E − W is the graph Laplacian, where W is the weight matrix whose entry W_{ij} measures the closeness of the two points x_i and x_j (please see [23] for the details of how to build the weight matrix), and E is a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W.
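As an illustration, the sketch below builds one common choice of weight matrix, a p-nearest-neighbor graph with 0-1 weighting as in [23]; the neighborhood size p and the weighting scheme are assumptions here, and [23] should be consulted for the exact construction used.

import numpy as np

def knn_graph_laplacian(X, p=5):
    """Build a p-nearest-neighbor 0-1 weight matrix W, the degree matrix E,
    and the graph Laplacian L = E - W for data X (features x samples)."""
    N = X.shape[1]
    # pairwise squared Euclidean distances between the columns of X
    sq = np.sum(X * X, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    W = np.zeros((N, N))
    for i in range(N):
        idx = np.argsort(dist2[i])[1:p + 1]   # skip the point itself
        W[i, idx] = 1.0                       # 0-1 weighting (assumption)
    W = np.maximum(W, W.T)                    # symmetrize
    E = np.diag(W.sum(axis=1))                # degree matrix (row/column sums of W)
    L = E - W
    return W, E, L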
We now add this regularizer to our NLCF objective function:

O = ||X − UV||^2 + μ Σ_{i=1}^{N} Σ_{k=1}^{K} |v_{ki}| ||u_k − x_i||^2 + λ Tr(V L V^T).    (10)
Following the updating rules of GNMF [23], we obtain the following updating rules:

u_{jk} ← u_{jk} (X V^T + μ Σ_{i=1}^{N} x_i 1^T Λ_i)_{jk} / (U V V^T + μ Σ_{i=1}^{N} U Λ_i)_{jk}    (11)

v_{ki} ← v_{ki} 2 ((μ + 1) (U^T X) + λ V W)_{ki} / (2 U^T U V + μ C + μ D + 2 λ V E)_{ki}.    (12)
In the Appendix, we prove that the update rules of U and V in Eq. (11) and Eq. (12) converge and that the final solution is a local optimum.
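A sketch of NLCF-G obtained by adding the graph terms of Eq. (12) to the earlier NLCF update sketch is shown below; W and E are assumed to come from a construction such as the one above, and the remaining conventions (initialization, iteration count, eps) are again illustrative.

import numpy as np

def nlcf_g_updates(X, K, mu, lam, W, E, n_iter=300, eps=1e-10, seed=0):
    """Graph-regularized NLCF (NLCF-G) via the updates of Eqs. (11)-(12)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, K)) + eps
    V = rng.random((K, N)) + eps
    c = np.sum(X * X, axis=0)                              # diag(X^T X)
    for _ in range(n_iter):
        H = np.diag(V.sum(axis=1))
        # Eq. (11) is the same U update as plain NLCF
        U *= ((mu + 1.0) * (X @ V.T)) / (U @ V @ V.T + mu * U @ H + eps)
        d = np.sum(U * U, axis=0)                          # diag(U^T U)
        C = np.tile(c, (K, 1))
        D = np.tile(d[:, None], (1, N))
        # Eq. (12): graph terms lambda*V*W (numerator) and 2*lambda*V*E (denominator)
        num = 2.0 * ((mu + 1.0) * (U.T @ X) + lam * V @ W)
        den = 2.0 * U.T @ U @ V + mu * C + mu * D + 2.0 * lam * V @ E + eps
        V *= num / den
    return U, V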
IV. EXPERIMENTAL RESULTS

In this section, various experiments are performed to demonstrate the effectiveness of our proposed Non-negative Local Coordinate Factorization method.
A. Data Corpora

Three image data sets are used in the experiments. The important statistics of these data sets are summarized below (see also Table IV).

1) The first one is the Cambridge ORL face database.^1 There are ten different images of each of 40 distinct subjects. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position. We crop the original 112 × 92 images into 64 × 64 gray-scale images.
2) The second one is the MNIST database of handwritten digits.^2 We use a test set of 4,000 examples for clustering, which contains 28 × 28 gray-scale images of the 10 digits.
3) The third one is the Yale Face Database,^3 containing 32 × 32 gray-scale images of 15 individuals. There are 11 images per subject, one per facial expression or configuration.

TABLE IV
STATISTICS OF THE THREE DATA SETS

Dataset   Size (N)   Dimensionality (M)   # of Classes (K)
ORL       400        1024                 40
MNIST     4000       784                  10
Yale      165        1024                 15
B. Clustering Evaluation
Previous studies show that NMF is very powerful for data
clustering. It is superior to the Latent Semantic Indexing
method (LSI) [34] and several popular spectral clustering
methods [35]. In this experiment, we investigate the effectiveness of our proposed algorithm on image clustering.
We set the parameter K to be the number of clusters and
use the obtained coefficient matrix V to determine the cluster
1 http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
2 http://yann.lecun.com/exdb/mnist/.
3 http://cvc.yale.edu/projects/yalefaces/yalefaces.html.
TABLE V
CLUSTERING PERFORMANCE ON THE ORL DATABASE

Accuracy (Mean ± Std %)
Cluster Number   NMF           NLCF          NLCF-G        Kmeans        NMF-SC
2                98.3 ± 3.2    98.5 ± 3.2    98.5 ± 4.5    92.5 ± 10.6   98.0 ± 2.0
4                79.3 ± 15.0   82.0 ± 12.7   83.5 ± 12.7   72.0 ± 13.7   81.0 ± 13.4
8                62.3 ± 5.5    76.6 ± 13.2   69.3 ± 9.3    62.1 ± 7.5    72.5 ± 8.7
12               56.3 ± 3.4    70.8 ± 7.3    65.0 ± 4.2    55.3 ± 9.4    63.0 ± 6.6
16               53.5 ± 3.4    67.4 ± 4.8    63.5 ± 7.5    57.5 ± 6.8    58.3 ± 3.7
20               48.1 ± 4.5    64.7 ± 7.2    60.4 ± 5.1    55.4 ± 4.0    50.8 ± 3.2
25               46.0 ± 3.2    63.5 ± 2.1    57.4 ± 2.8    56.7 ± 1.5    50.2 ± 3.8
30               41.9 ± 3.1    60.4 ± 3.1    54.6 ± 3.1    52.3 ± 2.6    45.7 ± 2.1
40               39.5          61.8          53.8          47.5          41.0
Avg              58.4          71.7          67.3          61.3          62.3

Normalized Mutual Information (Mean ± Std %)
Cluster Number   NMF           NLCF          NLCF-G        Kmeans        NMF-SC
2                92.9 ± 15.1   93.7 ± 13.1   94.9 ± 15.2   76.6 ± 31.7   95.2 ± 9.7
4                70.6 ± 17.9   75.5 ± 15.9   78.3 ± 15.4   64.7 ± 15.1   75.2 ± 17.2
8                64.4 ± 5.2    78.8 ± 12.0   73.5 ± 10.0   67.6 ± 9.4    73.1 ± 6.8
12               64.1 ± 2.9    77.3 ± 5.0    73.7 ± 3.4    66.4 ± 9.4    70.0 ± 4.8
16               65.0 ± 3.0    77.1 ± 2.9    74.2 ± 6.5    70.9 ± 5.5    68.1 ± 2.5
20               62.4 ± 3.0    75.4 ± 3.7    73.4 ± 3.4    70.0 ± 3.4    64.3 ± 1.7
25               62.9 ± 2.1    76.5 ± 1.7    72.4 ± 2.4    72.5 ± 1.9    65.3 ± 2.6
30               61.2 ± 2.1    75.4 ± 2.4    72.4 ± 2.6    70.6 ± 2.1    63.4 ± 1.6
40               61.6          76.5          73.4          68.7          61.4
Avg              67.2          78.5          76.3          69.8          70.7
TABLE VI
CLUSTERING PERFORMANCE ON THE MNIST DATABASE

Accuracy (Mean ± Std %)
Cluster Number   NMF           NLCF          NLCF-G        Kmeans        NMF-SC
2                88.0 ± 13.5   89.0 ± 14.5   89.1 ± 14.8   88.8 ± 13.7   85.4 ± 14.2
3                78.4 ± 11.5   82.9 ± 9.2    84.2 ± 9.5    77.1 ± 13.5   82.1 ± 8.8
4                75.6 ± 10.5   82.0 ± 8.6    79.9 ± 10.6   73.2 ± 10.1   74.4 ± 11.9
5                63.5 ± 10.0   71.5 ± 10.3   70.8 ± 11.6   68.0 ± 7.8    62.4 ± 7.2
6                52.9 ± 4.3    63.9 ± 6.0    62.8 ± 4.3    58.7 ± 6.4    57.4 ± 3.7
7                52.8 ± 5.9    63.1 ± 8.1    61.5 ± 8.8    60.9 ± 10.6   56.1 ± 5.2
8                51.4 ± 5.5    62.4 ± 5.8    59.8 ± 6.8    58.1 ± 5.5    56.6 ± 5.2
9                45.3 ± 2.7    53.9 ± 4.2    55.9 ± 4.7    52.0 ± 4.0    46.2 ± 4.0
10               43.8          57.0          55.0          50.4          48.0
Avg              61.3          69.5          68.8          65.2          63.2

Normalized Mutual Information (Mean ± Std %)
Cluster Number   NMF           NLCF          NLCF-G        Kmeans        NMF-SC
2                58.8 ± 29.0   63.2 ± 30.0   63.8 ± 30.6   61.8 ± 29.2   51.0 ± 28.7
3                52.0 ± 12.0   59.0 ± 11.2   60.9 ± 11.9   54.6 ± 13.1   57.1 ± 10.7
4                51.7 ± 12.6   60.5 ± 10.8   59.6 ± 10.8   55.4 ± 8.9    51.7 ± 12.9
5                45.7 ± 10.5   55.6 ± 9.5    56.4 ± 10.8   52.9 ± 7.0    45.2 ± 7.0
6                40.0 ± 4.4    50.3 ± 5.7    49.6 ± 5.7    48.2 ± 7.5    43.0 ± 3.0
7                40.6 ± 4.4    49.2 ± 5.3    51.2 ± 5.0    49.2 ± 7.1    42.9 ± 4.4
8                40.8 ± 4.6    50.4 ± 3.4    50.3 ± 3.8    48.3 ± 3.3    44.6 ± 4.0
9                36.8 ± 1.7    46.6 ± 2.4    47.7 ± 2.5    46.3 ± 1.9    37.0 ± 3.8
10               35.6          47.0          48.1          45.4          41.5
Avg              44.7          53.5          54.2          51.3          46.0
label of each data point. More precisely, we examine each column of the matrix V and assign the sample x_i to cluster c if c = arg max_k v_{ki}.

1) Evaluation Metric: The clustering result is evaluated by comparing the obtained label of each sample with the label provided by the data set. Three metrics are used in our experiments. The accuracy (AC) [36] and the normalized mutual information (MI) [36] measure the clustering performance, while the sparseness (SP) [11] measures the sparseness of the coefficient matrix.

Given a data point x_i, let r_i and s_i be the obtained cluster label and the label provided by the corpus, respectively. The AC is defined as follows:

AC = Σ_{i=1}^{N} δ(s_i, map(r_i)) / N

where N is the total number of samples, δ(x, y) is the delta function that equals one if x = y and zero otherwise, and map(r_i) is the permutation mapping function that maps each cluster label r_i to the equivalent label from the data corpus. The best mapping can be found by using the Kuhn-Munkres algorithm [37].
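For reference, a small sketch of the assignment step and the AC metric is given below; it uses SciPy's linear_sum_assignment as a Kuhn-Munkres implementation, and the helper names are our own.

import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_labels(V):
    """Assign sample i to cluster argmax_k v_ki (columns of V are the encodings)."""
    return np.argmax(V, axis=0)

def clustering_accuracy(true_labels, pred_labels):
    """AC metric: best one-to-one mapping between predicted clusters and true
    classes, found with the Kuhn-Munkres (Hungarian) algorithm."""
    true_ids = np.unique(true_labels)
    pred_ids = np.unique(pred_labels)
    # contingency table: rows = predicted clusters, columns = true classes
    cont = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    for p, t in zip(pred_labels, true_labels):
        cont[np.where(pred_ids == p)[0][0], np.where(true_ids == t)[0][0]] += 1
    row, col = linear_sum_assignment(-cont)   # maximize the matched counts
    return cont[row, col].sum() / len(true_labels)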
On the other hand, let C denote the set of clusters obtained from the ground truth and C' the set obtained from our algorithm. Their mutual information MI(C, C') is defined as follows:

MI(C, C') = Σ_{c_i ∈ C, c'_j ∈ C'} p(c_i, c'_j) · log_2 [ p(c_i, c'_j) / (p(c_i) · p(c'_j)) ]
where p(c_i) and p(c'_j) are the probabilities that a sample arbitrarily selected from the data set belongs to the clusters c_i and c'_j, respectively, and p(c_i, c'_j) is the joint probability that the arbitrarily selected sample belongs to cluster c_i as well as c'_j at the same time. In our experiments, we use the normalized mutual information

MI_norm(C, C') = MI(C, C') / max(H(C), H(C'))

where H(C) and H(C') are the entropies of C and C', respectively. It is easy to check that MI_norm(C, C') ranges from 0 to 1. MI_norm = 1 if the two sets of clusters are identical, and MI_norm = 0 if the two sets are independent.
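A corresponding sketch of the normalized mutual information, computed directly from the definitions above, is given below; the helper name is our own.

import numpy as np

def normalized_mutual_information(true_labels, pred_labels):
    """NMI as defined above: MI(C, C') / max(H(C), H(C'))."""
    n = len(true_labels)
    t_ids, p_ids = np.unique(true_labels), np.unique(pred_labels)
    # joint and marginal probabilities from the contingency counts
    joint = np.zeros((len(t_ids), len(p_ids)))
    for t, p in zip(true_labels, pred_labels):
        joint[np.where(t_ids == t)[0][0], np.where(p_ids == p)[0][0]] += 1.0
    joint /= n
    pt, pp = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log2(joint[nz] / np.outer(pt, pp)[nz]))
    h_t = -np.sum(pt[pt > 0] * np.log2(pt[pt > 0]))
    h_p = -np.sum(pp[pp > 0] * np.log2(pp[pp > 0]))
    return mi / max(h_t, h_p)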
The sparseness measure [11], based on the relationship between the L1 norm and the L2 norm, is as follows:

SP(v) = ( √K − (Σ_k |v_k|) / √(Σ_k v_k^2) ) / ( √K − 1 )

where v ∈ R^K is a column vector of V. If v contains only a single non-zero component, SP(v) evaluates to unity, and if all components of v are equal, SP(v) evaluates to zero. In our experiments, we take the average sparseness over all the new representations (the column vectors of V).
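The sparseness measure can be computed as in the following sketch; the eps guard against a zero vector is an added assumption.

import numpy as np

def hoyer_sparseness(v, eps=1e-12):
    """Sparseness of a vector per Hoyer's L1/L2 measure: 1 for a single
    non-zero entry, 0 when all entries are equal."""
    k = len(v)
    l1 = np.sum(np.abs(v))
    l2 = np.sqrt(np.sum(v ** 2)) + eps
    return (np.sqrt(k) - l1 / l2) / (np.sqrt(k) - 1.0)

def average_sparseness(V):
    """Average sparseness over the columns of the coefficient matrix V."""
    return float(np.mean([hoyer_sparseness(V[:, i]) for i in range(V.shape[1])]))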
2) Clustering Results: To show the improvement in clustering performance achieved by our method, we compared NLCF and NLCF-G (NLCF with the graph regularizer) with the following three popular algorithms:
1) Non-negative Matrix Factorization based clustering (NMF in short).
2) Canonical K-means clustering method (Kmeans in short).
3) Non-negative Matrix Factorization with Sparseness Constraints (NMF-SC in short) [11].

TABLE VII
CLUSTERING PERFORMANCE ON THE YALE DATABASE

Accuracy (Mean ± Std %)
Cluster Number   NMF           NLCF          NLCF-G        Kmeans        NMF-SC
2                72.3 ± 11.2   85.0 ± 11.0   84.1 ± 10.6   71.4 ± 11.3   65.5 ± 8.9
3                60.6 ± 8.1    68.5 ± 15.3   69.7 ± 17.9   60.9 ± 12.1   66.7 ± 15.4
4                61.4 ± 9.7    63.9 ± 8.7    63.4 ± 8.5    51.1 ± 7.6    59.8 ± 8.9
5                52.0 ± 5.0    55.6 ± 9.1    54.7 ± 7.9    46.6 ± 11.4   54.6 ± 4.5
6                51.4 ± 10.1   52.6 ± 8.1    52.3 ± 10.2   47.1 ± 6.8    54.9 ± 6.7
7                50.8 ± 6.9    50.8 ± 5.9    51.6 ± 8.9    48.2 ± 7.9    50.0 ± 4.5
8                43.0 ± 4.9    49.6 ± 5.6    46.5 ± 5.9    46.0 ± 7.1    47.3 ± 4.9
9                43.8 ± 4.0    47.5 ± 4.8    47.2 ± 3.8    41.5 ± 5.2    45.6 ± 7.0
10               40.2 ± 3.6    48.3 ± 4.9    47.7 ± 5.7    40.7 ± 7.4    45.8 ± 4.9
11               39.7 ± 3.4    46.0 ± 3.4    45.5 ± 4.9    42.3 ± 4.2    43.6 ± 5.2
12               36.7 ± 2.7    43.4 ± 4.0    42.1 ± 3.1    38.9 ± 5.2    44.6 ± 4.3
13               37.0 ± 3.6    45.0 ± 2.4    42.2 ± 3.5    42.2 ± 3.7    44.8 ± 2.0
14               35.0 ± 2.0    42.3 ± 3.0    41.6 ± 2.7    37.7 ± 4.5    42.1 ± 2.8
15               36.4          49.1          42.3          37.0          40.0
Avg              47.2          53.4          52.2          46.5          50.4

Normalized Mutual Information (Mean ± Std %)
Cluster Number   NMF           NLCF          NLCF-G        Kmeans        NMF-SC
2                20.6 ± 14.9   48.1 ± 23.7   44.1 ± 22.4   21.1 ± 18.6   15.1 ± 12.6
3                31.1 ± 12.3   43.8 ± 22.5   44.7 ± 24.9   30.2 ± 16.6   34.5 ± 20.6
4                43.8 ± 12.7   46.3 ± 12.3   46.3 ± 12.3   33.0 ± 9.5    40.3 ± 11.0
5                36.3 ± 8.6    40.0 ± 11.3   39.0 ± 10.1   28.3 ± 11.9   39.7 ± 6.7
6                40.6 ± 11.5   42.3 ± 10.6   43.9 ± 11.6   37.5 ± 5.7    43.6 ± 6.9
7                41.9 ± 6.6    43.5 ± 7.1    44.2 ± 9.0    40.9 ± 9.8    42.1 ± 6.1
8                38.0 ± 5.2    43.2 ± 7.0    41.4 ± 6.7    40.7 ± 8.3    42.6 ± 6.1
9                40.4 ± 4.2    43.0 ± 4.4    45.6 ± 4.2    40.1 ± 6.3    41.8 ± 7.2
10               39.5 ± 2.6    47.0 ± 4.6    47.5 ± 5.0    41.5 ± 6.3    45.2 ± 4.2
11               40.2 ± 2.2    46.9 ± 2.6    47.0 ± 4.1    41.9 ± 5.3    45.0 ± 4.6
12               38.7 ± 2.2    46.1 ± 3.2    45.8 ± 2.0    40.5 ± 4.3    46.8 ± 4.0
13               41.0 ± 3.2    49.0 ± 2.5    47.9 ± 2.5    45.6 ± 4.0    48.3 ± 2.7
14               40.3 ± 1.7    47.4 ± 2.7    47.8 ± 2.1    42.8 ± 4.0    46.5 ± 1.6
15               41.0          53.7          48.5          40.2          48.8
Avg              38.1          45.7          45.3          37.5          41.5

Fig. 1. Accuracy versus the number of clusters on (a) ORL, (b) MNIST, and (c) Yale databases. The normalized mutual information versus the number of clusters on (d) ORL, (e) MNIST, and (f) Yale databases.
The evaluations were conducted with different numbers of clusters. On the ORL data set, the cluster number ranges from 2 to 40; on the MNIST data set, from 2 to 10; and on the Yale data set, from 2 to 15. For each given cluster number, 10 test runs were conducted on different randomly chosen clusters, and the final performance is reported by averaging over the 10 runs.
Tables V, VI, and VII and Fig. 1 show the clustering results on the ORL, MNIST, and Yale data sets. The average sparseness of the coefficient matrix is reported in Table VIII.

On the ORL data set, the average clustering accuracies obtained by NLCF, NLCF-G, NMF, NMF-SC, and Kmeans are 71.7%, 67.3%, 58.4%, 62.3%, and 61.3%, respectively. Compared to the third best approach, NMF-SC, NLCF achieves a 9.4% accuracy improvement and NLCF-G a 5.0% improvement. For mutual information, NLCF achieves a 7.8% and NLCF-G a 5.6% improvement over NMF-SC. On the MNIST data set, the average clustering accuracies obtained by NLCF, NLCF-G, NMF, NMF-SC, and Kmeans are 69.5%, 68.8%, 61.3%, 63.2%, and 65.2%, respectively. Again, NLCF and NLCF-G significantly outperform the other three algorithms in terms of both accuracy and mutual information. On the Yale data set, the average clustering accuracies obtained by NLCF, NLCF-G, NMF, NMF-SC, and Kmeans are 53.4%, 52.2%, 47.2%, 50.4%, and 46.5%, respectively.
Fig. 2. Accuracy of NLCF versus the parameter μ on (a) ORL, (d) MNIST, and (g) Yale databases. The normalized mutual information of NLCF versus
the parameter μ on (b) ORL, (e) MNIST, and (h) Yale databases. The sparseness of NLCF versus the parameter μ on (c) ORL, (f) MNIST, and (i) Yale
databases.
TABLE VIII
AVERAGE SPARSENESS (%) OF THE COEFFICIENT MATRIX

Database   NMF    NLCF
ORL        34.4   84.3
MNIST      59.3   96.5
Yale       40.5   93.3
On this data set, NLCF, NLCF-G, and NMF-SC achieve similar performance and are superior to NMF and Kmeans, but NLCF and NLCF-G still narrowly beat NMF-SC in both accuracy and mutual information.

The sparseness of the encodings obtained by NLCF is greater than 80% on all three data sets (see Table VIII). This indicates that our proposed approach can indeed obtain highly sparse representations, which in turn improves the clustering performance.
3) Parameter Selection: Our NLCF model has only one essential parameter: the regularization parameter μ. NLCF boils down to the original NMF when μ = 0. As μ increases, we expect the learned encodings to become more sparse.

Fig. 2 shows how the average clustering performance and the sparseness of the learned encodings vary with the parameter μ. As we can see, NLCF achieves good performance with μ varying from 0.1 to 1, and the sparseness of the encodings increases as μ increases.
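A possible way to carry out such a parameter study is sketched below; it simply scans a grid of μ values and records the average sparseness of the learned encodings, relying on the nlcf_updates and average_sparseness sketches given earlier. The grid values and the monitored quantity are assumptions, not the exact protocol of our experiments.

import numpy as np

def scan_mu(X, K, mu_grid=(0.01, 0.1, 0.5, 1.0, 10.0)):
    """Run NLCF for several values of mu and report the average sparseness
    of the learned encodings (uses nlcf_updates and average_sparseness
    from the earlier sketches)."""
    results = {}
    for mu in mu_grid:
        U, V = nlcf_updates(X, K, mu)
        results[mu] = average_sparseness(V)
    return results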
C. Basis Vectors and Image Encodings

In this test, we randomly select 25 subjects from the ORL database, and for each subject we randomly select 5 face images. Fig. 3 shows the sample images from the ORL database, together with the basis vectors and image encodings obtained by NMF and NLCF.

Fig. 3. 5 × 5 montages. (a) Original images. (b) and (d) Basis vectors learned by NMF and NLCF. (c) and (e) Image encodings (i.e., the obtained new representations) of the two methods. Positive values are illustrated with white pixels. The encodings learned by NLCF are much more sparse than those learned by NMF.

Comparing the basis images obtained by NLCF with the original face images, we find that the basis images closely resemble the original face images. This shows that our local coordinate constraint can indeed generate basis images (i.e., the anchor points) that are sufficiently close to the original images.

Compared with the image encodings obtained by NMF, the image encodings obtained by NLCF are much more sparse. For NLCF, more than half of the image encodings have only one nonzero element, and that nonzero element is exactly the coordinate coefficient with respect to the basis image closest to this face image.
D. Learning an Overcomplete Basis

Usually, the parameter K (the dimension of the new representations) is set to be less than the dimension of the original data space. However, in some cases, it is desirable to learn an overcomplete basis [38]–[41], where K is set to be larger than the original dimension. This problem has received considerable attention since the work of Olshausen and Field [39], who suggest that this is the strategy used by the visual cortex for representing images. The implication is that a sparse, overcomplete representation is especially suitable for visual tasks such as object detection and recognition that occur in higher regions of the cortex [40]. We give a simple synthetic example to show how our proposed algorithm performs when learning an overcomplete basis.
Fig. 4. Toy example of learning an overcomplete basis. The black circles denote the learned basis vectors. Each color represents a cluster obtained by (a) NMF or (b) NLCF.
We randomly generate 180 points from a mixture of four Gaussians in a two-dimensional space. NMF and NLCF are performed to cluster these data points into four clusters, as shown in Fig. 4. As we can see, NLCF performs much better than NMF. Note that the dimension of the input data is 2, but we use 4 basis vectors. The four basis vectors obtained by NLCF reside exactly at the centers of the four clusters, one for each. However, the basis vectors obtained by NMF are far away from the data points. The reason is that the four basis vectors (in fact, two are sufficient) span the two-dimensional space; thus, there are infinitely many solutions for NMF leading to zero reconstruction error. Our algorithm, in contrast, introduces a local coordinate constraint which requires the basis vectors to be sufficiently close to the data points.
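A sketch in the spirit of this toy experiment is given below; the Gaussian centers and spread are assumed values (the exact parameters of the synthetic data are not specified above), and it reuses the nmf_multiplicative and nlcf_updates sketches from earlier sections.

import numpy as np

def toy_overcomplete_demo(mu=0.5, seed=0):
    """Generate 180 points from four 2-D Gaussians (assumed centers/spread),
    then learn K = 4 basis vectors with NMF and with NLCF for comparison."""
    rng = np.random.default_rng(seed)
    centers = np.array([[1.0, 1.0], [1.0, 4.0], [4.0, 1.0], [4.0, 4.0]])
    pts = np.vstack([c + 0.3 * rng.standard_normal((45, 2)) for c in centers])
    X = np.clip(pts.T, 0.0, None)               # 2 x 180, kept non-negative for NMF/NLCF
    U_nmf, V_nmf = nmf_multiplicative(X, K=4)   # from the earlier sketch
    U_nlcf, V_nlcf = nlcf_updates(X, K=4, mu=mu)
    # cluster assignment: argmax over the K encoding coefficients
    return U_nmf, U_nlcf, np.argmax(V_nmf, axis=0), np.argmax(V_nlcf, axis=0)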
E. Convergence Study

We have proved that the updating rules for minimizing the objective function of NLCF are convergent. Here we investigate how fast the algorithm converges and compare it with NMF.

Fig. 5 shows the convergence curves of NLCF and NMF on the three data sets. For each figure, the y-axis is the value of the objective function and the x-axis is the iteration number. As can be seen, NLCF converges within 20 iterations on the ORL database, within 100 iterations on the MNIST database, and within 50 iterations on the Yale database. NLCF converges faster than NMF on both the ORL and Yale databases but more slowly on the MNIST database.

Fig. 5. Convergence curves of NLCF on (a) ORL, (c) MNIST, and (e) Yale databases. The convergence curves of NMF on (b) ORL, (d) MNIST, and (f) Yale databases.
V. CONCLUSION

We have presented a novel method for matrix factorization, called Non-negative Local Coordinate Factorization (NLCF). NLCF aims to ensure the sparseness of the new representations by adding a local coordinate constraint. The learned basis vectors are close to the cluster centers; thus, each data point can be represented by a linear combination of only a few basis vectors, yielding a sparse representation. This property also makes the algorithm particularly suitable for data clustering, as demonstrated in our experiments. We have also shown that NLCF is more effective than NMF for learning an overcomplete basis.

One question remains to be investigated in our future work: NMF has another objective function, the "divergence" formulation. How to incorporate the local coordinate constraint into the divergence objective function is an open problem.
APPENDIX
PROOF OF CONVERGENCE

In this section, we show that the iteration steps in Eq. (7), Eq. (8) and the iteration steps in Eq. (11), Eq. (12) are convergent. Since Eq. (10) is a general version of Eq. (3) — setting the parameter λ to zero makes Eq. (10) boil down to Eq. (3) — we only give the proof for the general version.

We have the following theorem:

Theorem 1: The objective function O in Eq. (10) is nonincreasing under the update rules in Eq. (11) and Eq. (12). The objective function is invariant under these updates if and only if U and V are at a stationary point.

To prove Theorem 1, we need to show that the objective function in Eq. (10) is bounded from below and nonincreasing under the update steps in Eq. (11) and Eq. (12). Since the objective function O is non-negative, we only need to verify that O is nonincreasing under the update steps in Eq. (11) and Eq. (12).

Our proof will make use of an auxiliary function similar to that used in the Expectation-Maximization algorithm [42].

Definition: G(u, u') is an auxiliary function for F(u) if the conditions

G(u, u') ≥ F(u),    G(u, u) = F(u)

are satisfied.

The auxiliary function is useful because of the following lemma.

Lemma 1: If G is an auxiliary function of F, then F is nonincreasing under the update

u^{(t+1)} = arg min_u G(u, u^{(t)}).    (13)

Proof: F(u^{(t+1)}) ≤ G(u^{(t+1)}, u^{(t)}) ≤ G(u^{(t)}, u^{(t)}) = F(u^{(t)}).

Now we show that the update step for U in Eq. (11) is exactly the update in Eq. (13) with a proper auxiliary function. We rewrite the objective function O in Eq. (10) as follows:

O = Σ_{j=1}^{M} Σ_{i=1}^{N} (x_{ji} − Σ_{k=1}^{K} u_{jk} v_{ki})^2 + μ Σ_{i=1}^{N} Σ_{k=1}^{K} |v_{ki}| Σ_{j=1}^{M} (u_{jk} − x_{ji})^2 + λ Tr(V L V^T).

Considering any element u_{ab} in U, we use F_{ab} to denote the part of O that is only relevant to u_{ab}. From Eq. (4), it is easy to check that

F'_{ab} = (∂O/∂U)_{ab} = (2 U V V^T − 2 X V^T)_{ab} + μ Σ_{i=1}^{N} (−2 x_i 1^T Λ_i + 2 U Λ_i)_{ab}

F''_{ab} = 2 (V V^T)_{bb} + 2 μ Σ_{i=1}^{N} (Λ_i)_{bb}.

Since our update is essentially element-wise, it is sufficient to show that each F_{ab} is nonincreasing under the update step of Eq. (11).

Lemma 2: The function

G(u, u^{(t)}_{ab}) = F_{ab}(u^{(t)}_{ab}) + F'_{ab}(u^{(t)}_{ab}) (u − u^{(t)}_{ab}) + [(U V V^T)_{ab} + μ Σ_{i=1}^{N} (U Λ_i)_{ab}] / u^{(t)}_{ab} · (u − u^{(t)}_{ab})^2    (14)

is an auxiliary function for F_{ab}, the part of O that is only relevant to u_{ab}.

Proof: Since G(u, u) = F_{ab}(u) is obvious, we only need to show that G(u, u^{(t)}_{ab}) ≥ F_{ab}(u). To do this, we compare the Taylor series expansion of F_{ab}(u),

F_{ab}(u) = F_{ab}(u^{(t)}_{ab}) + F'_{ab}(u^{(t)}_{ab}) (u − u^{(t)}_{ab}) + [(V V^T)_{bb} + μ Σ_{i=1}^{N} (Λ_i)_{bb}] (u − u^{(t)}_{ab})^2,

with Eq. (14) to find that G(u, u^{(t)}_{ab}) ≥ F_{ab}(u) is equivalent to

(U V V^T)_{ab} + μ Σ_{i=1}^{N} (U Λ_i)_{ab} ≥ (V V^T)_{bb} u^{(t)}_{ab} + μ u^{(t)}_{ab} Σ_{i=1}^{N} (Λ_i)_{bb}.    (15)

We have

(U V V^T)_{ab} = Σ_{l=1}^{K} u^{(t)}_{al} (V V^T)_{lb} ≥ (V V^T)_{bb} u^{(t)}_{ab}

and

μ Σ_{i=1}^{N} (U Λ_i)_{ab} = μ Σ_{i=1}^{N} Σ_{l=1}^{K} u^{(t)}_{al} (Λ_i)_{lb} = μ Σ_{i=1}^{N} u^{(t)}_{ab} (Λ_i)_{bb} = μ u^{(t)}_{ab} Σ_{i=1}^{N} (Λ_i)_{bb}.

Thus, Eq. (15) holds and G(u, u^{(t)}_{ab}) ≥ F_{ab}(u).

Next we show that the update step for V in Eq. (12) is exactly the update in Eq. (13) with a proper auxiliary function, just as for U. Considering any element v_{ab} in V, we use H_{ab} to denote the part of O that is only relevant to v_{ab}. It is easy to check that

H'_{ab} = (∂O/∂V)_{ab} = (2 U^T U V − 2 U^T X + μ (C − 2 U^T X + D) + 2 λ V L)_{ab}

H''_{ab} = 2 (U^T U)_{aa} + 2 λ L_{bb}.

Now we have the following lemma.

Lemma 3: The function

G(v, v^{(t)}_{ab}) = H_{ab}(v^{(t)}_{ab}) + H'_{ab}(v^{(t)}_{ab}) (v − v^{(t)}_{ab}) + (U^T U V + ½ μ C + ½ μ D + λ V E)_{ab} / v^{(t)}_{ab} · (v − v^{(t)}_{ab})^2    (16)

is an auxiliary function for H_{ab}, the part of O that is only relevant to v_{ab}.

Proof: Since G(v, v) = H_{ab}(v), we only need to show that G(v, v^{(t)}_{ab}) ≥ H_{ab}(v). We compare the Taylor series expansion of H_{ab}(v),

H_{ab}(v) = H_{ab}(v^{(t)}_{ab}) + H'_{ab}(v^{(t)}_{ab}) (v − v^{(t)}_{ab}) + [(U^T U)_{aa} + λ L_{bb}] (v − v^{(t)}_{ab})^2,

with Eq. (16) to find that G(v, v^{(t)}_{ab}) ≥ H_{ab}(v) is equivalent to

(U^T U V + ½ μ C + ½ μ D + λ V E)_{ab} ≥ v^{(t)}_{ab} ((U^T U)_{aa} + λ L_{bb}).    (17)

We have

(U^T U V)_{ab} = Σ_{l=1}^{K} (U^T U)_{al} v^{(t)}_{lb} ≥ (U^T U)_{aa} v^{(t)}_{ab}

and

λ (V E)_{ab} = λ Σ_{i=1}^{N} v^{(t)}_{ai} E_{ib} ≥ λ v^{(t)}_{ab} E_{bb} ≥ λ v^{(t)}_{ab} (E − W)_{bb} = λ v^{(t)}_{ab} L_{bb}.

Thus, Eq. (17) holds and G(v, v^{(t)}_{ab}) ≥ H_{ab}(v).

We can now prove Theorem 1.

Proof of Theorem 1: Replacing G(u, u^{(t)}_{ab}) in Eq. (13) by Eq. (14) and replacing G(v, v^{(t)}_{ab}) in Eq. (13) by Eq. (16) results in the update rules

u^{(t+1)}_{ab} = u^{(t)}_{ab} − u^{(t)}_{ab} F'_{ab}(u^{(t)}_{ab}) / [2 (U V V^T)_{ab} + 2 μ Σ_{i=1}^{N} (U Λ_i)_{ab}]
             = u^{(t)}_{ab} (X V^T + μ Σ_{i=1}^{N} x_i 1^T Λ_i)_{ab} / (U V V^T + μ Σ_{i=1}^{N} U Λ_i)_{ab}

v^{(t+1)}_{ab} = v^{(t)}_{ab} − v^{(t)}_{ab} H'_{ab}(v^{(t)}_{ab}) / (2 U^T U V + μ C + μ D + 2 λ V E)_{ab}
             = v^{(t)}_{ab} 2 ((μ + 1) (U^T X) + λ V W)_{ab} / (2 U^T U V + μ C + μ D + 2 λ V E)_{ab}.

Since Eq. (14) and Eq. (16) are auxiliary functions, F_{ab} and H_{ab} are nonincreasing under these update rules.

Theorem 1 guarantees that the update rules of U and V in Eq. (11) and Eq. (12) converge and that the final solution is a local optimum.
REFERENCES
[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed.
New York: Wiley, 2000.
[2] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, pp. 788–791, Oct. 1999.
[3] X. Li, Y. Pang, and Y. Yuan, “L1-norm-based 2DPCA,” IEEE Trans.
Syst., Man, Cybern. B, Cybern., vol. 40, no. 4, pp. 1170–1175, Aug.
2010.
[4] Y. Pang, X. Li, and Y. Yuan, “Robust tensor analysis with L1-norm,”
IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 2, pp. 172–178,
Feb. 2010.
[5] C. Wu, J. Zhu, D. Cai, C. Chen, and J. Bu, “Semi-supervised nonlinear
hashing using bootstrap sequential projection learning,” IEEE Trans.
Knowl. Data Eng., Mar. 2012.
[6] N. K. Logothetis and D. L. Sheinberg, “Visual object recognition,” Annu.
Rev. Neurosci., vol. 19, pp. 577–621, Mar. 1996.
[7] S. E. Palmer, “Hierarchical structure in perceptual representation,”
Cognit. Psychol., vol. 9, no. 4, pp. 441–474, 1977.
[8] E. Wachsmuth, M. W. Oram, and D. I. Perrett, “Recognition of objects and
their component parts: Responses of single units in the temporal cortex
of the macaque,” Cerebral Cortex, vol. 4, no. 5, pp. 509–522, 1994.
[9] M. Heiler and C. Schnörr, “Learning sparse representations by nonnegative matrix factorization and sequential cone programming,” J.
Mach. Learn. Res., vol. 7, pp. 1385–1407, Jul. 2006.
[10] P. O. Hoyer, “Non-negative sparse coding,” in Proc. 12th IEEE Workshop
Neural Netw. Signal Process., Mar. 2002, pp. 557–565.
[11] P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” J. Mach. Learn. Res., vol. 5, pp. 1457–1469, Dec. 2004.
[12] K. Yu, T. Zhang, and Y. Gong, “Nonlinear learning using local coordinate coding,” in Proc. 23rd Annu. Conf. Neural Inf. Process. Syst., 2009,
pp. 1–9.
[13] J. Wang, J. Yang, K. Yu, F. Lv, T. S. Huang, and Y. Gong, “Localityconstrained linear coding for image classification,” in Proc. Conf.
Comput. Vis. Pattern Recognit., 2010, pp. 3360–3367.
[14] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face
recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[15] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix
factorization,” in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 556–
562.
[16] P. Paatero and U. Tapper, “Positive matrix factorization: A non-negative
factor model with optimal utilization of error estimates of data values,”
Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[17] C.-J. Lin, “On the convergence of multiplicative update algorithms for
nonnegative matrix factorization,” IEEE Trans. Neural Netw., vol. 18,
no. 6, pp. 1589–1596, Nov. 2007.
CHEN et al.: NLCF FOR IMAGE REPRESENTATION
[18] D. Cai, X. He, X. Wu, and J. Han, “Non-negative matrix factorization
on manifold,” in Proc. Int. Conf. Data Mining, 2008, pp. 1–10.
[19] H. Kim and H. Park, “Sparse non-negative matrix factorizations via
alternating non-negativity-constrained least squares for microarray data
analysis,” Bioinformatics, vol. 23, no. 12, pp. 1495–1502, 2007.
[20] D. Cai, X. He, X. Wang, H. Bao, and J. Han, “Locality preserving
nonnegative matrix factorization,” in Proc. 21st Int. Joint Conf. Artif.
Intell., 2009, pp. 1010–1015.
[21] X. Li and Y. Pang, “Deterministic column-based matrix decomposition,”
IEEE Trans. Knowl. Data Eng., vol. 22, no. 1, pp. 145–149, Jan. 2010.
[22] Y. Yuan, X. Li, Y. Pang, X. Lu, and D. Tao, “Binary sparse nonnegative
matrix factorization,” IEEE Trans. Circuits Syst. for Video Technol.,
vol. 19, no. 5, pp. 772–777, May 2009.
[23] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative
matrix factorization for data representation,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[24] D. Cai, X. He, and J. Han, “Locally consistent concept factorization for
document clustering,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 6,
pp. 902–913, Jun. 2011.
[25] V. Monga and M. K. Mihçak, “Robust and secure image hashing via
non-negative matrix factorizations,” IEEE Trans. Inf. Forensics Security,
vol. 2, no. 3, pp. 376–390, Sep. 2007.
[26] Y. Xu, Z. Zhang, P. Yu, and B. Long, “Pattern change discovery between
high dimensional data sets,” in Proc. 20th ACM Int. Conf. Inf. Knowl.
Manage., 2011, pp. 1097–1106.
[27] T. Hofmann, “Probabilistic latent semantic indexing,” in Proc. 22nd
Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 1999, pp. 50–
57.
[28] Y. Pang, Q. Hao, Y. Yuan, T. Hu, R. Cai, and L. Zhang, “Summarizing
tourist destinations by mining user-generated travelogues and photos,”
Comput. Vis. Image Understand., vol. 115, no. 3, pp. 352–363, 2011.
[29] J. Kivinen and M. K. Warmuth, “Additive versus exponentiated gradient
updates for linear prediction,” in Proc. 27th Annu. ACM Symp. Theory
Comput., 1995, pp. 209–218.
[30] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction
to Algorithms, 2nd ed. Cambridge, MA: MIT Press, 2001.
[31] G. W. Stewart, Matrix Algorithms, vol. 1. Philadelphia, PA: SIAM, 1998.
[32] D. Cai, X. He, and J. Han, “Using graph model for face analysis,” Dept.
Comput. Sci., UIUC, Urbana, Tech. Rep. UIUCDCS-R-2005-2636, Sep.
2005.
[33] D. Cai, X. Wang, and X. He, “Probabilistic dyadic data analysis with
local and global consistency,” in Proc. 26th Annu. Int. Conf. Mach.
Learn., 2009, pp. 105–112.
[34] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and
R. Harshman, “Indexing by latent semantic analysis,” J. Amer. Soc. Inf.
Sci., vol. 41, no. 6, pp. 391–407, 1990.
[35] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis
and an algorithm,” in Advances in Neural Information Processing
Systems. Cambridge, MA: MIT Press, 2001, pp. 849–856.
[36] D. Cai, X. He, and J. Han, “Document clustering using locality preserving indexing,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 12, pp.
1624–1637, Dec. 2005.
[37] L. Lovasz and M. Plummer, Matching Theory. New York: Akadémiai
Kiadó, 1986.
[38] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive
field properties by learning a sparse code for natural images,” Nature,
vol. 381, no. 6583, pp. 607–609, 1996.
[39] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete
basis set: A strategy employed by V1?” Vis. Res., vol. 37, no. 23, pp.
3311–3325, 1997.
[40] J. F. Murray and K. Kreutz-Delgado, “Learning sparse overcomplete
codes for images,” J. VLSI Signal Process. Syst., vol. 46, no. 1, pp.
97–110, 2006.
[41] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix
factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–
60, Dec. 2010.
[42] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood
from incomplete data via the EM algorithm,” J. Royal Stat. Soc., Ser. B,
vol. 39, no. 1, pp. 1–38, 1977.
Yan Chen received the Bachelor’s degree in mathematics from Jiangnan University, Wuxi, China,
and the Master’s degree in computer science from
Zhejiang University, Hangzhou, China, in 2009 and
2012, respectively.
He is currently with the State Key Laboratory of
CAD&CG, College of Computer Science, Zhejiang
University. His current research interests include
machine learning and computer vision.
Jiemi Zhang received the B.S. degree in mathematics from Southeast University, Nanjing, China.
She is currently pursuing the Master’s degree with
the State Key Laboratory of CAD&CG, College of
Computer Science, Zhejiang University, Hangzhou,
China.
Her current research interests include machine
learning, computer vision, and multimedia information retrieval.
Deng Cai (M’09) received the Ph.D. degree in
computer science from the University of Illinois at
Urbana-Champaign, Urbana, in 2009.
He is currently an Associate Professor with
the State Key Laboratory of CAD&CG, College of Computer Science, Zhejiang University,
Hangzhou, China. His current research interests
include machine learning, data mining, computer
vision, and information retrieval.
Wei Liu received the M.Phil. and Ph.D. degrees
in electrical engineering from Columbia University,
New York, NY, in 2012.
He was an Intern with the Kodak Research Laboratories and IBM Thomas J. Watson Research Center,
in 2010 and 2011, respectively. He is currently the
Josef Raviv Memorial Post-Doctoral Fellow with the
IBM Thomas J. Watson Research Center, Yorktown
Heights, NY. His current research interests include
machine learning, computer vision, pattern recognition, and information retrieval.
Dr. Liu was a recipient of the 2011-2012 Facebook Fellowship.
Xiaofei He (SM’10) received the B.S. degree
from Zhejiang University, Hangzhou, China, and
the Ph.D. degree from the University of Chicago,
Chicago, IL, in 2000 and 2005, respectively, both in
computer science.
He is currently a Professor with the State Key
Laboratory of CAD&CG, Zhejiang University. He
was a Research Scientist with Yahoo! Research
Labs, Burbank, CA. His current research interests
include machine learning, information retrieval, and
computer vision.