Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence

Low-Rank Similarity Metric Learning in High Dimensions

Wei Liu†   Cun Mu‡   Rongrong Ji♮   Shiqian Ma§   John R. Smith†   Shih-Fu Chang‡
†IBM T. J. Watson Research Center   ‡Columbia University   ♮Xiamen University   §The Chinese University of Hong Kong
{weiliu,jsmith}@us.ibm.com   [email protected]   [email protected]   [email protected]   [email protected]
Abstract
Metric learning has become a widely used tool in machine learning. To reduce the expensive costs brought by increasing dimensionality, low-rank metric learning arises as a more economical alternative in both storage and computation. However, existing low-rank metric learning algorithms usually adopt nonconvex objectives and are hence sensitive to the choice of a heuristic low-rank basis. In this paper, we propose a novel low-rank metric learning algorithm that yields bilinear similarity functions. The algorithm scales linearly with the input dimensionality in both space and time, and is therefore applicable to high-dimensional data domains. A convex objective free of heuristics is formulated by leveraging trace norm regularization to promote low-rankness. Crucially, we prove that all globally optimal metric solutions must retain a certain low-rank structure, which enables our algorithm to decompose the high-dimensional learning task into two steps: an SVD-based projection and a metric learning problem with reduced dimensionality. The latter step can be tackled efficiently by a linearized Alternating Direction Method of Multipliers. The efficacy of the proposed algorithm is demonstrated through experiments performed on four benchmark datasets with tens of thousands of dimensions.
Introduction

During the past decade, distance metric learning, typically referring to learning a Mahalanobis distance metric, has become ubiquitous in a variety of applications stemming from machine learning, data mining, information retrieval, and computer vision. Many fundamental machine learning algorithms, such as K-means clustering and k-nearest neighbor classification, need an appropriate distance function to measure pairwise affinity (or closeness) between data examples. Beyond the naive Euclidean distance, the distance metric learning problem seeks a better metric, learned from a training dataset subject to particular constraints arising from fully supervised or semi-supervised information, such that the learned metric reflects domain-specific characteristics of data affinities or relationships. A vast number of distance metric learning approaches have been proposed; the representatives include supervised metric learning, which is the main focus in the literature, semi-supervised metric learning (Wu et al. 2009; Hoi, Liu, and Chang 2010; Liu et al. 2010a; 2010b; Niu et al. 2012), and so on. Supervised metric learning can further be divided into three categories in terms of the form of supervision: methods supervised by instance-level labels (Goldberger, Roweis, and Salakhutdinov 2004; Globerson and Roweis 2005; Weinberger and Saul 2009), methods supervised by pair-level labels (Xing et al. 2002; Bar-Hillel et al. 2005; Davis et al. 2007; Ying and Li 2012), and methods supervised by triplet-level ranks (Ying, Huang, and Campbell 2009; McFee and Lanckriet 2010; Shen et al. 2012; Lim, McFee, and Lanckriet 2013). More methods can be found in two surveys (Kulis 2012) and (Bellet, Habrard, and Sebban 2013).

Despite the large amount of literature, relatively little work has dealt with the setting where the dimensionality of the input data is extremely high. This setting is relevant to a wide variety of practical data collections, e.g., images, videos, documents, time-series, and genomes, which frequently involve tens to hundreds of thousands of dimensions. Unfortunately, learning full-rank metrics in a high-dimensional input space ℝ^d quickly becomes computationally prohibitive, with complexities ranging from O(d^2) to O(d^6.5). Additionally, limited memory makes storing a huge metric matrix M ∈ ℝ^{d×d} a heavy burden.

This paper concentrates on a more reasonable learning paradigm, low-rank metric learning, also known as structural metric learning. (Davis and Dhillon 2008) demonstrated that in the small-sample setting n ≪ d (n is the number of training samples), learning a low-rank metric in a high-dimensional input space is tractable, achieving linear space and time complexities with respect to the dimensionality d. We follow this learning setting n ≪ d throughout the paper. Given a rank r ≪ d, a distance metric can be expressed in the low-rank form M = LL^⊤, which is parameterized by a rectangular matrix L ∈ ℝ^{d×r}. L constitutes a low-rank basis in ℝ^d and also acts as a dimensionality-reducing linear transformation, since any distance between two inputs x and x' under such a low-rank metric M can be equivalently viewed as a Euclidean distance between the transformed inputs L^⊤x and L^⊤x'. Therefore, the use of low-rank metrics reduces the storage of metric matrices (only L needs to be saved) and simultaneously enables efficient distance computation in O(dr) time.
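To make the storage and time savings concrete, here is a minimal NumPy sketch (the array names and sizes are illustrative, not from the authors' code) showing that the distance induced by M = LL^⊤ can be evaluated from L alone in O(dr) time, without ever forming the d × d matrix M:

```python
import numpy as np

d, r = 50_000, 100                       # high input dimensionality, low rank (illustrative sizes)
rng = np.random.default_rng(0)
L = rng.standard_normal((d, r))          # low-rank basis L in R^{d x r}; only d*r entries stored
x, x_prime = rng.standard_normal(d), rng.standard_normal(d)

# Distance under M = L L^T, computed as the Euclidean distance between L^T x and L^T x'.
# The cost is O(dr) for the two projections plus O(r) for the difference.
diff = L.T @ x - L.T @ x_prime           # r-dimensional vector
dist = np.sqrt(diff @ diff)
```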
Despite the benefits of using low-rank metrics, most existing methods adopt nonconvex optimization objectives, e.g., (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; Mensink et al. 2013; Lim and Lanckriet 2014), and are thus confined to the subspace spanned by a heuristic low-rank basis L_0; this even includes the convex method of (Davis and Dhillon 2008). In terms of training cost, most low-rank metric learning approaches, including (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; McFee and Lanckriet 2010; Shen et al. 2012; Kunapuli and Shavlik 2012; Lim, McFee, and Lanckriet 2013), still incur quadratic or cubic time complexity in the dimensionality d. A few methods, such as (Davis and Dhillon 2008) and (Mensink et al. 2013), are known to provide efficient O(d) procedures, but they both rely on, and are sensitive to, the initialization of the heuristic low-rank basis L_0.
This work pursues a convex low-rank metric learning approach without resorting to any heuristics. Recently, a trend in similarity learning (Kar and Jain 2011; 2012; Chechik et al. 2010; Crammer and Chechik 2012; Cheng 2013) has emerged which attempts to learn similarity functions of the form S(x, x') = x^⊤Mx' and appears to be more flexible than distance metric learning. In light of this trend, we incorporate the benefits of both low-rank metrics and similarity functions to learn a low-rank metric M that yields a discriminative similarity function S(·). Specifically, we propose a novel low-rank metric learning algorithm built on a convex objective which consists of a hinge loss enforced over pairwise supervised information and a trace norm regularization penalty promoting low-rankness of the desired metric. Significantly, we show that minimizing such an objective is guaranteed to produce an optimal solution with a certain low-rank structure M⋆ = UW⋆U^⊤. The low-rank basis U included in any optimal metric M⋆ turns out to lie in the column (or range) space of the input data matrix X ∈ ℝ^{d×n}, and W⋆ can thus be sought within a fixed subspace whose dimensionality is much lower than the original dimensionality.
By virtue of this key finding, our algorithm decomposes the challenging high-dimensional learning task into two steps which are both computationally tractable and cost O(d) time in total. The first step is dimensionality reduction via an SVD-based projection, and the second step is lower dimensional metric learning by applying a linearized modification of the Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011). The linearized ADMM maintains analytic-form updates across all iterations, works efficiently, and converges rapidly. We evaluate the proposed algorithm on four benchmark datasets with tens of thousands of dimensions, demonstrating that the low-rank metrics obtained by our algorithm achieve prominent performance gains in terms of kNN classification.
Low-Rank Similarity Metrics

[Figure 1: The effect of a desired low-rank similarity metric that induces a linear transformation L^⊤. Points with the same color denote similar data examples, while points with different colors denote dissimilar data examples.]

Different from most previous metric learning approaches that use the learned metric in a distance function of the form √((x − x')^⊤ M (x − x')), in this paper we pursue a low-rank metric M = LL^⊤ for a bilinear similarity function that measures the similarity between two arbitrary inputs x and x' in ℝ^d:

$$
S_{\mathbf{M}}(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{M} \mathbf{x}' = \mathbf{x}^\top \mathbf{L}\mathbf{L}^\top \mathbf{x}' = \left(\mathbf{L}^\top \mathbf{x}\right)^\top \left(\mathbf{L}^\top \mathbf{x}'\right), \qquad (1)
$$

which is parameterized by a low-rank basis L ∈ ℝ^{d×r} with rank r ≪ d. Naturally, such a similarity function characterized by a low-rank metric essentially measures the inner product between the two transformed inputs L^⊤x and L^⊤x' in a lower dimensional space ℝ^r.
Motivated by (Kriegel, Schubert, and Zimek 2008), which presented a statistical means based on angles rather than distances to identify outliers scattered in a high-dimensional space, we posit that angles have a stronger discriminating power than distances in capturing affinities among data points in high dimensions. Suppose that we are provided with a set of pairwise labels y_ij ∈ {1, −1}, where y_ij = 1 stands for a similar data pair (x_i, x_j) and y_ij = −1 for a dissimilar pair (x_i, x_j). The purpose of our supervised similarity metric learning is to obtain an acute angle between any similar pair and an obtuse angle between any dissimilar pair after applying the learned metric M, or equivalently after mapping data points from x to L^⊤x. Formally, this purpose is expressed as

$$
S_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j) \;
\begin{cases}
> 0, & y_{ij} = 1, \\
< 0, & y_{ij} = -1.
\end{cases}
\qquad (2)
$$

Fig. 1 exemplifies our notion of the angles between similar and dissimilar data pairs.
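As an illustration of Eqs. (1) and (2), the following NumPy sketch (with made-up data; the variable and function names are ours, not the paper's) checks the sign of the bilinear similarity for a pair under a given low-rank basis L:

```python
import numpy as np

def bilinear_similarity(L: np.ndarray, x: np.ndarray, x_prime: np.ndarray) -> float:
    """S_M(x, x') = x^T (L L^T) x', computed via the r-dimensional projections L^T x and L^T x'."""
    return float((L.T @ x) @ (L.T @ x_prime))

rng = np.random.default_rng(0)
d, r = 10_000, 50                          # illustrative sizes
L = rng.standard_normal((d, r))
xi, xj = rng.standard_normal(d), rng.standard_normal(d)

s = bilinear_similarity(L, xi, xj)
# Eq. (2): a well-learned metric should give s > 0 for similar pairs (acute angle between
# L^T xi and L^T xj) and s < 0 for dissimilar pairs (obtuse angle).
predicted_label = 1 if s > 0 else -1
```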
Learning Framework

Let us inherit the conventional form of pairwise supervision adopted by most metric learning methods, e.g., (Xing et al. 2002; Bar-Hillel et al. 2005; Davis et al. 2007; Davis and Dhillon 2008; Wu et al. 2009; Hoi, Liu, and Chang 2010; Liu et al. 2010a; 2010b; Niu et al. 2012; Ying and Li 2012). This paper tackles a supervised similarity metric learning problem provided with a set of labeled pairs I = {(x_i, x_j, y_ij)}. Suppose that n (≪ d) data examples X = {x_i ∈ ℝ^d}_{i=1}^n are gathered in the set I. Inspired by the angle-based learning principle in Eq. (2), we design a margin criterion over the similarity outputs of the supervised pairs, that is, S_M(x_i, x_j) ≥ 1 for y_ij = 1 and S_M(x_i, x_j) ≤ −ε (ε is a small positive value) for y_ij = −1.
Note that setting the margin values to 1 and −1 for positive and negative pairs in the context of similarity learning may be infeasible for multi-class data. For example, take data points x_1, x_2, and x_3 from three different classes. Imposing S_M(x_1, x_2) ≤ −1, S_M(x_1, x_3) ≤ −1, and S_M(x_2, x_3) ≤ −1 is problematic, since the former two constraints likely lead to S_M(x_2, x_3) > 0.

This similarity margin criterion ensures that after applying the linear transformation L^⊤ induced by a desired metric M, the angle between a similar pair (x_i, x_j) is acute, since their inner product (L^⊤x_i)^⊤(L^⊤x_j) = S_M(x_i, x_j) is large enough, while the angle between a dissimilar pair is obtuse, since their inner product is negative.

In order to promote low-rankness of the target metric, we leverage the nuclear norm ‖M‖_*, the sum of all singular values of M, as a regularization penalty in our learning framework. The nuclear norm has been corroborated to encourage low-rank matrix structure in the matrix completion literature (Fazel, Hindi, and Boyd 2001; Candès and Recht 2009; Recht, Fazel, and Parrilo 2010). Moreover, since our framework constrains M to lie in the positive semidefinite (PSD) cone S^d_+ so that M is a valid metric, the nuclear norm ‖M‖_* can be replaced by the trace norm tr(M).

Integrating the angle principle and the trace norm regularizer, the objective of our proposed Low-Rank Similarity Metric Learning (LRSML) framework is formulated as

$$
\min_{\mathbf{M} \in \mathbb{S}^d_+} \; f(\mathbf{M}) := \sum_{i,j=1}^{n} \left[ \tilde{y}_{ij} - y_{ij}\, S_{\mathbf{M}}(\mathbf{x}_i, \mathbf{x}_j) \right]_+ + \alpha\, \mathrm{tr}(\mathbf{M}), \qquad (3)
$$

where ỹ_ij = −ε if y_ij = −1 and ỹ_ij = y_ij otherwise, [x]_+ = max(0, x), and α > 0 is the regularization parameter. Above all, the objective in Eq. (3) is convex and does not introduce any heuristic low-rank basis, which was required by the prior convex method (Davis and Dhillon 2008) and by the nonconvex methods (Goldberger, Roweis, and Salakhutdinov 2004; Torresani and Lee 2006; Mensink et al. 2013; Lim and Lanckriet 2014) in low-rank distance metric learning. Second, minimizing the hinge loss in Eq. (3) enforces the margin criterion on the similarity function S_M so that dissimilar pairs can be clearly distinguished from similar ones. It is worth clarifying that we do not pursue S_M(x_i, x_j) = 0 (i.e., ε = 0) in the case of y_ij = −1 in Eq. (2), because it may lead to the trivial solution M = 0 to Eq. (3).¹

Due to the space limit, all the proofs of lemmas and theorems presented in this section are placed in the supplemental material.

¹ We observed the trivial solution M = 0 in the case of ε = 0 when the training examples are few and the regularization parameter α is relatively large.

Optimality Characterization

Despite being convex, the framework in Eq. (3) is not easy to tackle directly when the data points of X live in a space of high dimensions, i.e., d is very large. In the high-dimensional scenario, basic matrix manipulations of M are too expensive to execute; for example, projecting M onto S^d_+ costs O(d^3) time. Nonetheless, by delving into problem (3) we can characterize a certain low-rank structure shared by all of its optimal solutions. In particular, such a low-rank structure can be proven to be of the form UWU^⊤, in which U is a low-rank basis lying in the column space of the data matrix X.

Let us perform the singular value decomposition (SVD) of the data matrix X = [x_1, ..., x_n] ∈ ℝ^{d×n}, obtaining

$$
\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top = \sum_{i=1}^{m} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top,
$$

where m (≤ n ≪ d) is the rank of X, σ_1, ..., σ_m are the positive singular values, Σ = diag(σ_1, ..., σ_m), and U = [u_1, ..., u_m] ∈ ℝ^{d×m} and V = [v_1, ..., v_m] ∈ ℝ^{n×m} are the matrices composed of left- and right-singular vectors, respectively. It is obvious that range(X) ≡ range(U), where we write range(A) = span{a_i}_i for the column space of a given matrix A. Significantly, the optimality of problem (3) can be characterized as follows.

Lemma 1. For any optimal solution M⋆ to problem (3), we have M⋆ ∈ {UWU^⊤ | W ∈ S^m_+}.

Lemma 1 reveals that although we are seeking M⋆ in S^d_+, any optimal solution M⋆ must obey the low-rank structure UWU^⊤ because of the nature of problem (3). This important discovery enables us to first project the data onto the low-rank basis U and then seek W by optimizing a new problem in Eq. (4), which is at a much smaller scale than the raw problem in Eq. (3).

Theorem 1. For any i ∈ [1 : n], denote x̃_i = U^⊤x_i ∈ ℝ^m. Then M⋆ is an optimal solution to problem (3) if and only if M⋆ = UW⋆U^⊤, in which W⋆ belongs to the set

$$
\arg\min_{\mathbf{W} \in \mathbb{S}^m_+} \; \tilde{f}(\mathbf{W}) := \sum_{i,j=1}^{n} \left[ \tilde{y}_{ij} - y_{ij}\, \tilde{\mathbf{x}}_i^\top \mathbf{W} \tilde{\mathbf{x}}_j \right]_+ + \alpha\, \mathrm{tr}(\mathbf{W}). \qquad (4)
$$

Theorem 1 indicates that through a simple projection step (x_i → U^⊤x_i), we are able to reduce the high-dimensional metric learning task in Eq. (3) to the lower dimensional metric learning problem in Eq. (4) with guarantees, thereby greatly reducing the computational costs. However, the scale of problem (4) can still be considerable, which severely limits the use of off-the-shelf interior-point solvers such as SDPT3 (Toh, Todd, and Tütüncü 1999). In the next subsection, we devise an efficient first-order method to solve problem (4).

Similar projection tricks have recently been exploited and proven for convex distance metric learning problems regularized by the squared Frobenius norm (Chatpatanasiri et al. 2010) and by several other strictly convex functions (Kulis 2012); those proof strategies, however, cannot carry over to the trace norm, since the trace norm is not strictly convex. To the best of our knowledge, the structural representation UW⋆U^⊤ characterizing optimal metric solutions is explored here for the first time for trace norm regularized convex metric learning.
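To illustrate Theorem 1's two-step reduction, here is a hedged NumPy sketch (not the authors' code; names, shapes, and parameter values are illustrative) that projects the data onto the basis U obtained from an economical SVD and evaluates the reduced objective f̃(W) of Eq. (4):

```python
import numpy as np

def reduced_objective(W, X_tilde, Y, Y_tilde, alpha):
    """f~(W) from Eq. (4): hinge loss over all pairs plus trace regularization."""
    S = X_tilde.T @ W @ X_tilde                 # S[i, j] = x~_i^T W x~_j, an n x n matrix
    hinge = np.maximum(0.0, Y_tilde - Y * S)    # [y~_ij - y_ij * S_ij]_+
    return hinge.sum() + alpha * np.trace(W)

rng = np.random.default_rng(0)
d, n, eps, alpha = 20_000, 200, 0.1, 1.0        # illustrative sizes and parameters
X = rng.standard_normal((d, n))
labels = rng.integers(0, 5, size=n)
Y = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)   # pairwise labels y_ij
Y_tilde = np.where(Y == -1.0, -eps, Y)                        # y~_ij as defined under Eq. (3)

# Economical SVD: U spans range(X), so x~_i = U^T x_i lives in R^m with m <= n << d.
U, sigma, _ = np.linalg.svd(X, full_matrices=False)
m = int((sigma > 1e-10).sum())
U, sigma = U[:, :m], sigma[:m]
X_tilde = U.T @ X                               # projected data, m x n

W = np.eye(m)                                   # any feasible W in S^m_+
print(reduced_objective(W, X_tilde, Y, Y_tilde, alpha))
```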
Linearized ADMM

We now develop an efficient optimization method to exactly solve Eq. (4). To facilitate the derivations, we use three matrix operators: the Hadamard product ◦, the inner product ⟨·, ·⟩, and the Frobenius norm ‖·‖_F.

We first introduce a dummy variable matrix Z = [z_ij] ∈ ℝ^{n×n} to equivalently rewrite problem (4) as

$$
\min_{\mathbf{W} \in \mathbb{S}^m_+,\, \mathbf{Z}} \; \sum_{i,j=1}^{n} [z_{ij}]_+ + \alpha\, \mathrm{tr}(\mathbf{W}) \quad \text{s.t.} \quad \mathbf{Z} - \tilde{\mathbf{Y}} + \mathbf{Y} \circ \left( \tilde{\mathbf{X}}^\top \mathbf{W} \tilde{\mathbf{X}} \right) = \mathbf{0}, \qquad (5)
$$

in which three constant matrices X̃ = [x̃_1, ..., x̃_n] = U^⊤X ∈ ℝ^{m×n}, Y = [y_ij] ∈ ℝ^{n×n}, and Ỹ = [ỹ_ij] ∈ ℝ^{n×n} are involved. Both the objective function and the constraint in Eq. (5) are separable in the two variables Z and W, which naturally suggests the use of alternating direction methods. To this end, we employ the increasingly popular Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011) to cope with Eq. (5). ADMM works on the following augmented Lagrangian function:

$$
\mathcal{L}_\rho(\mathbf{Z}, \mathbf{W}; \boldsymbol{\Lambda}) := \sum_{i,j=1}^{n} [z_{ij}]_+ + \alpha\, \mathrm{tr}(\mathbf{W}) + \left\langle \boldsymbol{\Lambda},\, \mathbf{Z} - \tilde{\mathbf{Y}} + \mathbf{Y} \circ \big( \tilde{\mathbf{X}}^\top \mathbf{W} \tilde{\mathbf{X}} \big) \right\rangle + \frac{\rho}{2} \left\| \mathbf{Z} - \tilde{\mathbf{Y}} + \mathbf{Y} \circ \big( \tilde{\mathbf{X}}^\top \mathbf{W} \tilde{\mathbf{X}} \big) \right\|_F^2, \qquad (6)
$$

where Λ = [λ_ij] ∈ ℝ^{n×n} is the multiplier for the linear constraint in Eq. (5) (a.k.a. the dual variable), and ρ > 0 is the penalty parameter. ADMM proceeds by successively and alternately updating the three variable matrices Z^k, W^k, and Λ^k (k = 0, 1, ...).

Z, Λ-Updates

We first present the updates for Z and Λ, which are easier to derive than the update for W. The Z-update is obtained by minimizing L_ρ(Z, W; Λ) with respect to Z while keeping W and Λ fixed, that is,

$$
\mathbf{Z}^{k+1} := \arg\min_{\mathbf{Z}} \mathcal{L}_\rho\big(\mathbf{Z}, \mathbf{W}^k; \boldsymbol{\Lambda}^k\big) = \arg\min_{\mathbf{Z}} \sum_{i,j=1}^{n} [z_{ij}]_+ + \frac{\rho}{2} \left\| \mathbf{Z} - \tilde{\mathbf{Y}} + \mathbf{Y} \circ \big( \tilde{\mathbf{X}}^\top \mathbf{W}^k \tilde{\mathbf{X}} \big) + \boldsymbol{\Lambda}^k / \rho \right\|_F^2.
$$

To solve this Z subproblem, we utilize the proximal operator T_θ(a) = argmin_{z∈ℝ} θ[z]_+ + (1/2)(z − a)^2 (θ > 0) to handle [z_ij]_+. T_θ(a) is the proximal mapping of θ[z]_+ and was shown in (Ye, Chen, and Xie 2009) to evaluate as

$$
T_\theta(a) = \begin{cases} a - \theta, & a > \theta, \\ 0, & 0 \le a \le \theta, \\ a, & a < 0. \end{cases} \qquad (7)
$$

By applying T_{1/ρ} elementwise, we obtain a closed-form solution to the Z subproblem:

$$
\mathbf{Z}^{k+1} := T_{1/\rho}\!\left( \tilde{\mathbf{Y}} - \mathbf{Y} \circ \big( \tilde{\mathbf{X}}^\top \mathbf{W}^k \tilde{\mathbf{X}} \big) - \boldsymbol{\Lambda}^k / \rho \right). \qquad (8)
$$

The Λ-update is prescribed to be

$$
\boldsymbol{\Lambda}^{k+1} := \boldsymbol{\Lambda}^k + \rho \left( \mathbf{Z}^{k+1} - \tilde{\mathbf{Y}} + \mathbf{Y} \circ \big( \tilde{\mathbf{X}}^\top \mathbf{W}^{k+1} \tilde{\mathbf{X}} \big) \right). \qquad (9)
$$
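A compact NumPy rendering of the proximal operator in Eq. (7) and the closed-form updates (8)–(9) might look as follows (a sketch under our own naming; it is not the authors' released implementation):

```python
import numpy as np

def prox_hinge(A: np.ndarray, theta: float) -> np.ndarray:
    """Elementwise T_theta from Eq. (7): the prox of theta * [z]_+ applied to each entry of A."""
    return np.where(A > theta, A - theta, np.where(A < 0.0, A, 0.0))

def z_update(W, X_tilde, Y, Y_tilde, Lam, rho):
    """Eq. (8): Z^{k+1} = T_{1/rho}(Y~ - Y o (X~^T W X~) - Lam / rho)."""
    S = X_tilde.T @ W @ X_tilde
    return prox_hinge(Y_tilde - Y * S - Lam / rho, 1.0 / rho)

def lambda_update(Z_new, W_new, X_tilde, Y, Y_tilde, Lam, rho):
    """Eq. (9): dual ascent on the linear constraint of Eq. (5), using the freshly updated W."""
    S_new = X_tilde.T @ W_new @ X_tilde
    return Lam + rho * (Z_new - Y_tilde + Y * S_new)
```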
W-Update

In the standard ADMM framework, W is updated by minimizing L_ρ(Z, W; Λ) with respect to W while Z and Λ are fixed. Noticing that X̃X̃^⊤ = ΣV^⊤VΣ = Σ^2, we derive the W-update as follows:

$$
\begin{aligned}
\mathbf{W}^{k+1} :=\; & \arg\min_{\mathbf{W} \in \mathbb{S}^m_+} \mathcal{L}_\rho\big(\mathbf{Z}^{k+1}, \mathbf{W}; \boldsymbol{\Lambda}^k\big) \\
=\; & \arg\min_{\mathbf{W} \in \mathbb{S}^m_+} \alpha\, \mathrm{tr}(\mathbf{W}) + \left\langle \boldsymbol{\Lambda}^k,\, \mathbf{Y} \circ \big( \tilde{\mathbf{X}}^\top \mathbf{W} \tilde{\mathbf{X}} \big) \right\rangle + \frac{\rho}{2} \left\| \mathbf{Z}^{k+1} - \tilde{\mathbf{Y}} + \mathbf{Y} \circ \big( \tilde{\mathbf{X}}^\top \mathbf{W} \tilde{\mathbf{X}} \big) \right\|_F^2 \\
=\; & \arg\min_{\mathbf{W} \in \mathbb{S}^m_+} \frac{\rho}{2} \left\| \tilde{\mathbf{X}}^\top \mathbf{W} \tilde{\mathbf{X}} \right\|_F^2 + \left\langle \alpha \mathbf{I}, \mathbf{W} \right\rangle + \left\langle \boldsymbol{\Lambda}^k + \rho \mathbf{Z}^{k+1} - \rho \tilde{\mathbf{Y}},\, \mathbf{Y} \circ \big( \tilde{\mathbf{X}}^\top \mathbf{W} \tilde{\mathbf{X}} \big) \right\rangle \\
=\; & \arg\min_{\mathbf{W} \in \mathbb{S}^m_+} \frac{\rho}{2} \left\langle \mathbf{W} \boldsymbol{\Sigma}^2,\, \boldsymbol{\Sigma}^2 \mathbf{W} \right\rangle + \left\langle \alpha \mathbf{I}, \mathbf{W} \right\rangle + \left\langle \tilde{\mathbf{X}} \big( \mathbf{Y} \circ \big( \boldsymbol{\Lambda}^k + \rho \mathbf{Z}^{k+1} - \rho \tilde{\mathbf{Y}} \big) \big) \tilde{\mathbf{X}}^\top,\, \mathbf{W} \right\rangle \\
=\; & \arg\min_{\mathbf{W} \in \mathbb{S}^m_+} \frac{1}{2} \left\langle \mathbf{W} \boldsymbol{\Sigma}^2,\, \boldsymbol{\Sigma}^2 \mathbf{W} \right\rangle + \left\langle \alpha \mathbf{I} / \rho + \tilde{\mathbf{X}} \big( \mathbf{Y} \circ \big( \mathbf{Z}^{k+1} - \tilde{\mathbf{Y}} + \boldsymbol{\Lambda}^k / \rho \big) \big) \tilde{\mathbf{X}}^\top,\, \mathbf{W} \right\rangle \\
=\; & \arg\min_{\mathbf{W} \in \mathbb{S}^m_+} g\big( \mathbf{W}; \mathbf{Z}^{k+1}, \boldsymbol{\Lambda}^k \big), \qquad (10)
\end{aligned}
$$

where I denotes the m × m identity matrix. Unfortunately, no analytical solution is available for the W subproblem in Eq. (10). Resorting to iterative optimization methods to solve Eq. (10) would, however, limit the efficiency of ADMM. In what follows, we develop a linearized modification of the original ADMM framework in Eqs. (8)(10)(9) by rectifying the W-update in Eq. (10) into an easier one.

Linearization. Rather than dealing with the intrinsically quadratic g(W; Z^{k+1}, Λ^k), we resort to optimizing a surrogate function ĝ_τ(W; Z^{k+1}, Λ^k), which is the linear approximation to g augmented by a quadratic proximal term.
Concretely, the rectified W-update is

$$
\begin{aligned}
\mathbf{W}^{k+1} :=\; & \arg\min_{\mathbf{W} \in \mathbb{S}^m_+} \hat{g}_\tau\big( \mathbf{W}; \mathbf{Z}^{k+1}, \boldsymbol{\Lambda}^k \big) \\
=\; & \arg\min_{\mathbf{W} \in \mathbb{S}^m_+} g\big( \mathbf{W}^k; \mathbf{Z}^{k+1}, \boldsymbol{\Lambda}^k \big) + \left\langle \nabla_{\mathbf{W}}\, g\big( \mathbf{W}; \mathbf{Z}^{k+1}, \boldsymbol{\Lambda}^k \big)\big|_{\mathbf{W} = \mathbf{W}^k},\, \mathbf{W} - \mathbf{W}^k \right\rangle + \frac{1}{2\tau} \left\| \mathbf{W} - \mathbf{W}^k \right\|_F^2 \\
=\; & \arg\min_{\mathbf{W} \in \mathbb{S}^m_+} \left\langle \mathbf{G}^k,\, \mathbf{W} - \mathbf{W}^k \right\rangle + \frac{1}{2\tau} \left\| \mathbf{W} - \mathbf{W}^k \right\|_F^2 \\
=\; & \arg\min_{\mathbf{W} \in \mathbb{S}^m_+} \frac{1}{2\tau} \left\| \mathbf{W} - \big( \mathbf{W}^k - \tau \mathbf{G}^k \big) \right\|_F^2 \\
=\; & P_{\mathbb{S}^m_+}\!\big( \mathbf{W}^k - \tau \mathbf{G}^k \big). \qquad (11)
\end{aligned}
$$

In Eq. (11) we have introduced the matrix

$$
\mathbf{G}^k = \nabla_{\mathbf{W}}\, g\big( \mathbf{W}; \mathbf{Z}^{k+1}, \boldsymbol{\Lambda}^k \big)\Big|_{\mathbf{W} = \mathbf{W}^k} = \alpha \mathbf{I} / \rho + \tilde{\mathbf{X}} \big( \mathbf{Y} \circ \big( \mathbf{Z}^{k+1} - \tilde{\mathbf{Y}} + \boldsymbol{\Lambda}^k / \rho \big) \big) \tilde{\mathbf{X}}^\top + \boldsymbol{\Sigma}^2 \mathbf{W}^k \boldsymbol{\Sigma}^2. \qquad (12)
$$

The operator P_{S^m_+}(·) represents the projection onto the PSD cone S^m_+; for example, P_{S^m_+}(A) = Σ_{i=1}^m [γ_i]_+ p_i p_i^⊤ for any symmetric matrix A ∈ ℝ^{m×m} whose eigenvector-eigenvalue pairs are (p_i, γ_i)_{i=1}^m. The parameter τ > 0 is the step size, which will be specified later.

The advantage of linearizing g(W; Z^{k+1}, Λ^k) is that the resulting W-update in Eq. (11) enjoys an analytical expression and consequently works very efficiently. This modification of the standard ADMM, i.e., optimizing ĝ_τ instead of g, is the core ingredient of our framework, namely the linearized ADMM. The following theorem certifies that the developed linearized ADMM is theoretically sound, producing a globally optimal solution to the raw problem in Eq. (5).

Theorem 2. Given 0 < τ < 1/‖X̃‖_op^4 = 1/‖X‖_op^4, the sequence {Z^k, W^k, Λ^k}_k generated by the linearized ADMM in Eqs. (8)(11)(9), starting with any symmetric Z^0, W^0, Λ^0, converges to an optimal solution of problem (5).

Note that ‖·‖_op denotes the operator norm of matrices. For simplicity, we initialize W^0 = I_{m×m} and Λ^0 = 0_{n×n} to launch the linearized ADMM.

Algorithm

We summarize the proposed Low-Rank Similarity Metric Learning (LRSML) approach in Algorithm 1, which consists of two steps: 1) SVD-based projection, and 2) lower dimensional metric learning via the linearized ADMM. In the projection step, the SVD of X can be performed efficiently in O(dn^2) time by running the eigen-decomposition of the matrix X^⊤X. If the data vectors in X are zero-centered, such a decomposition is exactly PCA. In the linearized ADMM step, we use the scaled dual variable matrix Λ̃^k = Λ^k/ρ. Importantly, there is no need to explicitly compute or even store the large d × d metric matrix M; instead, the algorithm only needs to store the low-rank basis U and update the much smaller m × m metric matrix W^k, as indicated by Theorem 1. In doing so, LRSML circumvents expensive matrix manipulations such as projecting M onto S^d_+, thereby accomplishing considerable savings in both storage and computation. The total time complexity of LRSML is bounded by O(dn^2 + Tn^3), which scales linearly with the input dimensionality d. In practice, we find that the linearized ADMM converges rapidly within T = 1,000 iterations under the setting ρ = 1, τ = 0.01 (see the convergence curves in the supplemental material).

Algorithm 1: Low-Rank Similarity Metric Learning
Input: the data matrix X ∈ ℝ^{d×n} (d ≫ n), the pairwise label matrix Y ∈ {1, −1}^{n×n}, the regularization parameter α > 0, three positive constants ε, ρ, τ, and the budget iteration number T.
SVD Projection: perform the SVD of X = UΣV^⊤ to obtain U ∈ ℝ^{d×m} and Σ ∈ ℝ^{m×m}_+, and compute the projected data X̃ = U^⊤X ∈ ℝ^{m×n}.
Linearized ADMM:
  form the matrix Ỹ ∈ {1, −ε}^{n×n} according to Y;
  initialize W^0 = I_{m×m}, Λ̃^0 = 0_{n×n}, S^0 = X̃^⊤X̃;
  for k = 1, ..., T do
    Z^k ← T_{1/ρ}(Ỹ − Y ◦ S^{k−1} − Λ̃^{k−1}),
    G^k ← (α/ρ) I_{m×m} + X̃ (Y ◦ (Z^k − Ỹ + Λ̃^{k−1})) X̃^⊤ + Σ^2 W^{k−1} Σ^2,
    W^k ← P_{S^m_+}(W^{k−1} − τ G^k),   S^k ← X̃^⊤ W^k X̃,
    Λ̃^k ← Λ̃^{k−1} + Z^k − Ỹ + Y ◦ S^k,
    if ‖Z^k − Ỹ + Y ◦ S^k‖_F^2 / n^2 ≤ 10^{−12} then break, end if
  end for.
Output: run the eigen-decomposition W^k = HEH^⊤, in which E ∈ ℝ^{r×r} retains the positive eigenvalues, and output the optimal low-rank basis L⋆ = UHE^{1/2} ∈ ℝ^{d×r} (the corresponding low-rank metric is M⋆ = L⋆L⋆^⊤).
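For readers who prefer code to pseudocode, the following NumPy sketch mirrors Algorithm 1 under our own naming conventions; it is a self-contained illustration, not the authors' released implementation, and the thresholds used to truncate small eigenvalues/singular values are our own choices:

```python
import numpy as np

def psd_project(A: np.ndarray) -> np.ndarray:
    """P_{S^m_+}(A): keep the nonnegative-eigenvalue part of a symmetric matrix."""
    A = (A + A.T) / 2.0
    gamma, P = np.linalg.eigh(A)
    return (P * np.maximum(gamma, 0.0)) @ P.T

def lrsml(X, Y, alpha=1.0, eps=0.1, rho=1.0, tau=0.01, T=1000, tol=1e-12):
    """Two-step LRSML sketch: SVD projection, then linearized ADMM on W (cf. Algorithm 1).
    X is d x n, Y is an n x n float matrix with entries +1 / -1."""
    U, sigma, _ = np.linalg.svd(X, full_matrices=False)         # economical SVD of X
    m = int((sigma > 1e-10).sum())
    U, sigma = U[:, :m], sigma[:m]
    X_t = U.T @ X                                                # projected data, m x n
    Sig2 = np.diag(sigma**2)                                     # Sigma^2
    Y_t = np.where(Y == -1.0, -eps, Y)                           # Y~ with entries in {1, -eps}
    W = np.eye(m); Lam = np.zeros_like(Y); S = X_t.T @ W @ X_t   # Lam is the scaled dual Lam~
    for _ in range(T):
        A = Y_t - Y * S - Lam
        Z = np.where(A > 1/rho, A - 1/rho, np.where(A < 0.0, A, 0.0))   # Eqs. (7)-(8)
        G = (alpha / rho) * np.eye(m) + X_t @ (Y * (Z - Y_t + Lam)) @ X_t.T + Sig2 @ W @ Sig2
        W = psd_project(W - tau * G)                             # Eq. (11)
        S = X_t.T @ W @ X_t
        Lam = Lam + Z - Y_t + Y * S                              # Eq. (9), scaled form
        if np.linalg.norm(Z - Y_t + Y * S)**2 / Y.shape[0]**2 <= tol:
            break
    gam, H = np.linalg.eigh(W)                                   # output step: W = H E H^T
    keep = gam > 1e-10
    return U @ H[:, keep] @ np.diag(np.sqrt(gam[keep]))          # L* = U H E^{1/2}
```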
Discussion

A primary advantage of our analysis in Lemma 1 and Theorem 1 is avoiding the expensive projection onto the high-dimensional PSD cone S^d_+, which was required by previous trace norm regularized metric learning methods such as (McFee and Lanckriet 2010; Lim, McFee, and Lanckriet 2013). In the supplemental material, we further provide an in-depth theoretical analysis (see Lemma 3 and Theorem 3) to comprehensively justify the low-rank solution structure M⋆ = UW⋆U^⊤ for any convex loss function in terms of x_i^⊤Mx_j regularized by the trace norm tr(M) or the squared Frobenius norm ‖M‖_F^2. As a result, our analysis directly leads to scalable O(d) algorithms for low-rank distance or similarity metric learning supervised by instance-level, pairwise, or listwise label information. For example, our analysis would give an O(d)-time algorithm for optimizing the low-rank distance metric learning objective (a hinge loss based on listwise supervision plus a trace norm regularizer) of (McFee and Lanckriet 2010) by following our proposed two-step scheme: SVD projection followed by lower dimensional metric learning.

We remark that linearized ADMM techniques have been investigated in the optimization community and applied to problems arising from compressed sensing (Yang and Zhang 2011), image processing (Zhang et al. 2010), and nuclear norm regularized convex optimization (Yang and Yuan 2013). Nevertheless, the linearized ADMM method developed in this paper is self-motivated in the sense that it was derived for solving a particular optimization problem of low-rank metric learning. To the best of our knowledge, linearized ADMM has not been applied to metric learning before, and the analytic-form updates produced by our linearized ADMM are quite advantageous. Note that (Lim, McFee, and Lanckriet 2013) applied the standard ADMM to solve a subproblem in every iteration of their optimization method, but even a single run of ADMM costs cubic time O(T'd^3) (T' is the iteration number of ADMM), which is computationally infeasible in high dimensions. By contrast, our linearized ADMM tackles the entire metric optimization in O(Tn^3) time, performing at least one order of magnitude faster than the standard ADMM by virtue of the elegant W-update.
Experiments

We carry out experiments on four benchmark datasets: two document datasets, Reuters-28 and TDT2-30 (Cai, He, and Han 2011), and two image datasets, UIUC-Sports (Li and Fei-Fei 2007) and UIUC-Scene (Lazebnik, Schmid, and Ponce 2006). Note that we choose 28 categories from the raw Reuters-21578 corpus (Cai, He, and Han 2011) such that each category has no fewer than 20 examples. In Reuters-28 and TDT2-30, each document is represented by an ℓ2-normalized TF-IDF feature vector; in UIUC-Sports and UIUC-Scene, each image is represented by an ℓ2-normalized sparse-coding feature vector (Wang et al. 2010). The basic information about these four datasets is shown in Table 1.

Table 1: Basic descriptions of the four datasets used in the experiments.

  Dataset      | # Classes | # Samples | # Dimensions
  -------------|-----------|-----------|-------------
  Reuters-28   | 28        | 8,030     | 18,933
  TDT2-30      | 30        | 9,394     | 36,771
  UIUC-Sports  | 8         | 1,579     | 87,040
  UIUC-Scene   | 15        | 4,485     | 21,504

The baseline method, named "Original", takes the original feature vectors for 1NN classification; a linear SVM is also included in the comparison. We evaluate and compare eight metric and similarity learning methods: two classical linear dimensionality reduction methods, Latent Semantic Analysis (LSA) (Deerwester et al. 1990) and Fisher Linear Discriminant Analysis (FLDA) (Hastie, Tibshirani, and Friedman 2009); two low-rank variants, LSA-ITML and CM-ITML (Davis and Dhillon 2008), of the representative distance metric learning method Information-Theoretic Metric Learning (ITML) (Davis et al. 2007) (LSA-ITML uses LSA's output as its heuristic low-rank basis, while CM-ITML uses the class means as its heuristic low-rank basis); two recent low-rank distance metric learning methods, Metric Learning to Rank (MLR) (McFee and Lanckriet 2010) and Metric Learning for the Nearest Class Mean classifier (MLNCM) (Mensink et al. 2013) (MLNCM also uses LSA's output as its heuristic low-rank basis); a similarity learning method, Adaptive Regularization Of MAtrix models (AROMA) (Crammer and Chechik 2012); and the Low-Rank Similarity Metric Learning (LRSML) method proposed in this paper. Except for MLNCM, which requires the Nearest Class Mean classifier, all compared metric/similarity learning methods are paired with the 1NN classifier. Note that we run LSA instead of PCA because of the nonnegative and sparse nature of the feature vectors. LSA-ITML, CM-ITML, MLR, and LRSML are convex methods, while MLNCM is a nonconvex method. We choose the most efficient version of AROMA, which restricts the matrix M used in the similarity function to be diagonal. To implement MLR efficiently, we follow the two-step scheme justified in this paper: SVD projection followed by MLR in the lower dimensional space.

Except for AROMA, which does not impose the PSD constraint on M, each of the referred methods yields a low-rank basis L. For them, we try three measures between two inputs x and x': (i) distance ‖L^⊤x − L^⊤x'‖, (ii) inner-product (L^⊤x)^⊤(L^⊤x'), and (iii) cosine (L^⊤x)^⊤(L^⊤x') / (‖L^⊤x‖ ‖L^⊤x'‖). For LSA, FLDA, LSA-ITML, CM-ITML, MLR, and MLNCM, the distance and cosine measures are tried; for LRSML, the inner-product and cosine measures are tried, since the inner-product measure (ii) is exactly the bilinear similarity function shown in Eq. (1). The baseline "Original" gives the same results under the three measures, as the feature vectors are already ℓ2 normalized.
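As a small illustration, the three measures used in the comparison can be computed from a learned basis L as follows (a sketch with our own function name, not tied to any particular method's code):

```python
import numpy as np

def pairwise_measures(L: np.ndarray, x: np.ndarray, x_prime: np.ndarray):
    """Distance, inner-product, and cosine measures in the space transformed by L^T."""
    u, v = L.T @ x, L.T @ x_prime
    distance = np.linalg.norm(u - v)
    inner_product = float(u @ v)                       # the bilinear similarity of Eq. (1)
    cosine = inner_product / (np.linalg.norm(u) * np.linalg.norm(v))
    return distance, inner_product, cosine
```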
Let C be the number of classes in each dataset. On Reuters-28 and TDT2-30, we select 5×C up to 30×C samples for training such that each category covers at least one sample; we pick the same number of samples for cross-validation; the remaining samples are used for testing. We repeat 20 random training/validation/test splits and report the average classification error rates and training time of all competing methods in Tables 2 and 3. On the two image datasets, we follow the commonly used evaluation protocols of (Lazebnik, Schmid, and Ponce 2006; Wang et al. 2010). On UIUC-Sports, we select 10 to 70 samples per class for training, half that number of samples for validation, and the remaining ones for testing; on UIUC-Scene, we follow a similar setting with the training samples ranging from 10 to 100 per class. Over 20 runs, the averages of the per-class recognition rates as well as the training time are reported in the tables of the supplemental material. To run our proposed method LRSML, we fix ε = 0.1 and ρ = 1, and find that τ = 0.01 makes the linearized ADMM converge within T = 1,000 iterations on all datasets. The results in all the referred tables demonstrate that LRSML consistently achieves the highest accuracy in terms of 1NN classification. Across all datasets, the best two results are achieved by LRSML with its two measures; the cosine measure almost always yields slightly higher classification accuracy than the inner-product measure. More experimental results are presented in the supplemental material.
Conclusions

The proposed LRSML formulates a trace norm regularized convex objective which is proven to yield globally optimal solutions with a certain low-rank structure. With this optimality guarantee, the challenging high-dimensional metric learning task can be reduced to a lower dimensional metric learning problem after a simple SVD projection. A linearized ADMM is further developed to solve the reduced problem efficiently. Our approach maintains linear space and time complexities in the input dimensionality and is consequently scalable to high-dimensional data domains. Through extensive experiments carried out on four benchmark datasets with up to 87,000 dimensions, we verify that the learned low-rank similarity metrics are well suited to high-dimensional problems and exhibit prominent performance gains in kNN classification. In future work, we would like to investigate the generalization properties of the proposed LRSML, in the spirit of (Cao, Guo, and Ying 2012), aiming at a tighter generalization bound for trace norm regularized convex metric learning.

Acknowledgments

When working on an earlier version of this work, Wei Liu was supported by the Josef Raviv Memorial Postdoctoral Fellowship. The authors would like to thank Prof. Gang Hua and Dr. Fuxin Li for constructive suggestions that helped shape the final version.
References

Bar-Hillel, A.; Hertz, T.; Shental, N.; and Weinshall, D. 2005. Learning a Mahalanobis metric from equivalence constraints. JMLR 6:937–965.
Bellet, A.; Habrard, A.; and Sebban, M. 2013. A survey on metric learning for feature vectors and structured data. arXiv:1306.6709.
Boyd, S. P.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122.
Cai, D.; He, X.; and Han, J. 2011. Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering 23(6):902–913.
Candès, E. J., and Recht, B. 2009. Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9(6):717–772.
Cao, Q.; Guo, Z.-C.; and Ying, Y. 2012. Generalization bounds for metric and similarity learning. Technical report.
Chatpatanasiri, R.; Korsrilabutr, T.; Tangchanachaianan, P.; and Kijsirikul, B. 2010. A new kernelization framework for Mahalanobis distance learning algorithms. Neurocomputing 73(10):1570–1579.
Chechik, G.; Sharma, V.; Shalit, U.; and Bengio, S. 2010. Large scale online learning of image similarity through ranking. JMLR 11:1109–1135.
Cheng, L. 2013. Riemannian similarity learning. In Proc. ICML.
Crammer, K., and Chechik, G. 2012. Adaptive regularization for weight matrices. In Proc. ICML.
Davis, J. V., and Dhillon, I. S. 2008. Structured metric learning for high dimensional problems. In Proc. KDD.
Davis, J. V.; Kulis, B.; Jain, P.; Sra, S.; and Dhillon, I. S. 2007. Information-theoretic metric learning. In Proc. ICML.
Deerwester, S. C.; Dumais, S. T.; Landauer, T. K.; Furnas, G. W.; and Harshman, R. A. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6):391–407.
Fazel, M.; Hindi, H.; and Boyd, S. P. 2001. A rank minimization heuristic with application to minimum order system approximation. In Proc. the American Control Conference, volume 6, 4734–4739.
Globerson, A., and Roweis, S. 2005. Metric learning by collapsing classes. In NIPS 18.
Goldberger, J.; Roweis, S.; and Salakhutdinov, R. 2004. Neighbourhood components analysis. In NIPS 17.
Hastie, T.; Tibshirani, R.; and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, Springer.
Hoi, S. C.; Liu, W.; and Chang, S.-F. 2010. Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Transactions on Multimedia Computing, Communications and Applications 6(3).
Kar, P., and Jain, P. 2011. Similarity-based learning via data driven embeddings. In NIPS 24.
Kar, P., and Jain, P. 2012. Supervised learning with similarity functions. In NIPS 25.
Kriegel, H.-P.; Schubert, M.; and Zimek, A. 2008. Angle-based outlier detection in high-dimensional data. In Proc. KDD.
Kulis, B. 2012. Metric learning: a survey. Foundations and Trends in Machine Learning 5(4):287–364.
Kunapuli, G., and Shavlik, J. 2012. Mirror descent for metric learning: A unified approach. In Proc. ECML.
Lazebnik, S.; Schmid, C.; and Ponce, J. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR.
Li, L.-J., and Fei-Fei, L. 2007. What, where and who? Classifying events by scene and object recognition. In Proc. ICCV.
Lim, D. K. H., and Lanckriet, G. 2014. Efficient learning of Mahalanobis metrics for ranking. In Proc. ICML.
Lim, D. K. H.; McFee, B.; and Lanckriet, G. 2013. Robust structural metric learning. In Proc. ICML.
Liu, W.; Ma, S.; Tao, D.; Liu, J.; and Liu, P. 2010a. Semi-supervised sparse metric learning using alternating linearization optimization. In Proc. KDD.
Liu, W.; Tian, X.; Tao, D.; and Liu, J. 2010b. Constrained metric learning via distance gap maximization. In Proc. AAAI.
McFee, B., and Lanckriet, G. 2010. Metric learning to rank. In Proc. ICML.
Mensink, T.; Verbeek, J.; Perronnin, F.; and Csurka, G. 2013. Distance-based image classification: Generalizing to new classes at near zero cost. TPAMI 35(11):2624–2637.
Niu, G.; Dai, B.; Yamada, M.; and Sugiyama, M. 2012. Information-theoretic semi-supervised metric learning via entropy regularization. In Proc. ICML.
Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3):471–501.
Shen, C.; Kim, J.; Wang, L.; and van den Hengel, A. 2012. Positive semidefinite metric learning using boosting-like algorithms. JMLR 13:1007–1036.
Toh, K. C.; Todd, M. J.; and Tütüncü, R. H. 1999. SDPT3 - a Matlab software package for semidefinite programming, version 1.3. Optimization Methods and Software 11(1-4):545–581.
Torresani, L., and Lee, K.-C. 2006. Large margin component analysis. In NIPS 19.
Wang, J.; Yang, J.; Yu, K.; Lv, F.; Huang, T.; and Gong, Y. 2010. Locality-constrained linear coding for image classification. In Proc. CVPR.
Weinberger, K. Q., and Saul, L. K. 2009. Distance metric learning for large margin nearest neighbor classification. JMLR 10:207–244.
Wu, L.; Jin, R.; Hoi, S. C. H.; Zhu, J.; and Yu, N. 2009. Learning Bregman distance functions and its application for semi-supervised clustering. In NIPS 22.
Xing, E. P.; Ng, A. Y.; Jordan, M. I.; and Russell, S. 2002. Distance metric learning with application to clustering with side-information. In NIPS 15.
Yang, J., and Yuan, X. 2013. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation 82(281):301–329.
Yang, J., and Zhang, Y. 2011. Alternating direction algorithms for ℓ1 problems in compressive sensing. SIAM Journal on Scientific Computing 33(1):250–278.
Ye, G.-B.; Chen, Y.; and Xie, X. 2009. Efficient variable selection in support vector machines via the alternating direction method of multipliers. In Proc. AISTATS.
Ying, Y., and Li, P. 2012. Distance metric learning with eigenvalue optimization. JMLR 13:1–26.
Ying, Y.; Huang, K.; and Campbell, C. 2009. Sparse metric learning via smooth optimization. In NIPS 22.
Zhang, X.; Burger, M.; Bresson, X.; and Osher, S. 2010. Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM Journal on Imaging Sciences 3(3):253–276.
Table 2: Reuters-28 dataset: classification error rates and training time of eight competing metric/similarity learning methods.

                               5×28 training samples               30×28 training samples
  Method       Measure         Error Rate (%)   Train Time (sec)   Error Rate (%)   Train Time (sec)
  Original     –               23.94±1.63       –                  16.56±0.55       –
  Linear SVM   –               21.25±1.46       0.03               9.16±0.43        0.12
  LSA          distance        42.21±4.48       0.02               24.56±5.30       0.55
               cosine          21.99±1.89                          11.85±0.48
  FLDA         distance        39.46±6.10       0.03               15.19±1.38       0.75
               cosine          21.34±3.07                          10.05±0.65
  LSA-ITML     distance        20.17±0.95       47.2               10.06±0.99       6158.0
               cosine          21.29±1.02                          9.06±0.45
  CM-ITML      distance        33.38±3.57       7.3                18.18±0.74       219.3
               cosine          23.49±1.42                          13.28±0.76
  MLR          distance        27.08±2.75       17.8               13.29±1.14       1805.7
               cosine          20.22±1.44                          9.86±0.53
  MLNCM        distance        21.07±1.62       1.1                12.62±0.55       138.5
               cosine          22.88±2.52                          13.51±0.79
  AROMA        similarity      21.25±1.53       33.1               13.49±0.63       2328.1
  LRSML        inner-product   15.56±1.55       9.1                8.84±0.61        676.0
               cosine          14.83±1.20                          7.29±0.40
Table 3: TDT2-30 dataset: classification error rates and training time of eight competing metric/similarity learning methods.

                               5×30 training samples               30×30 training samples
  Method       Measure         Error Rate (%)   Train Time (sec)   Error Rate (%)   Train Time (sec)
  Original     –               10.12±1.24       –                  6.75±0.39        –
  Linear SVM   –               9.80±2.46        0.06               2.59±0.24        0.37
  LSA          distance        30.01±10.14      0.02               8.52±1.19        0.68
               cosine          8.19±1.49                           4.26±0.44
  FLDA         distance        36.57±4.80       0.04               7.50±0.62        1.08
               cosine          8.77±1.96                           3.56±0.34
  LSA-ITML     distance        11.35±2.99       35.7               5.68±0.76        1027.4
               cosine          5.85±1.14                           2.84±0.33
  CM-ITML      distance        32.56±4.87       7.8                7.67±0.71        297.7
               cosine          8.75±2.48                           3.42±0.36
  MLR          distance        18.50±3.51       30.1               7.50±0.75        1036.4
               cosine          8.04±1.11                           3.84±0.29
  MLNCM        distance        8.48±1.38        1.5                4.69±0.32        231.5
               cosine          8.58±1.64                           4.39±0.38
  AROMA        similarity      11.31±2.55       35.6               4.27±0.41        2655.6
  LRSML        inner-product   4.74±0.83        10.2               2.37±0.20        813.9
               cosine          4.95±1.06                           2.27±0.15