
Journal of Machine Learning Research 9 (2008) 203-233
Submitted 5/07; Published 2/08
Optimization Techniques for
Semi-Supervised Support Vector Machines
Olivier Chapelle∗
chap@yahoo-inc.com
Yahoo! Research
2821 Mission College Blvd
Santa Clara, CA 95054
Vikas Sindhwani
vikass@cs.uchicago.edu
University of Chicago, Department of Computer Science
1100 E 58th Street
Chicago, IL 60637
Sathiya S. Keerthi
selvarak@yahoo-inc.com
Yahoo! Research
2821 Mission College Blvd
Santa Clara, CA 95054
Editor: Nello Cristianini
Abstract
Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S3VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms has recently been proposed for solving S3VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3VM algorithms are studied together, under a common experimental setting.
Keywords: semi-supervised learning, support vector machines, non-convex optimization, transductive learning
1. Introduction
In many applications of machine learning, abundant amounts of data can be cheaply and automatically collected. However, manual labeling for the purposes of training learning algorithms is often
a slow, expensive, and error-prone process. The goal of semi-supervised learning is to employ the
large collection of unlabeled data jointly with a few labeled examples for improving generalization
performance.
The design of Support Vector Machines (SVMs) that can handle partially labeled data sets has
naturally been a vigorously active subject. A major body of work is based on the following idea:
solve the standard SVM problem while treating the unknown labels as additional optimization variables. By maximizing the margin in the presence of unlabeled data, one learns a decision boundary that traverses through low data-density regions while respecting labels in the input space. In
other words, this approach implements the cluster assumption for semi-supervised learning—that is,
∗. Most of the work was done while at MPI for Biological Cybernetics, Tübingen, Germany.
© 2008 Olivier Chapelle, Vikas Sindhwani and Sathiya S. Keerthi.
Figure 1: Two moons. There are 2 labeled points (the triangle and the cross) and 100 unlabeled points. The global optimum of S3VM correctly identifies the decision boundary (black line).
points in a data cluster have similar labels (Seeger, 2006; Chapelle and Zien, 2005). Figure 1 illustrates a low-density decision surface implementing the cluster assumption on a toy two-dimensional
data set. This idea was first introduced by Vapnik and Sterin (1977) under the name Transductive SVM, but since it learns an inductive rule defined over the entire input space, we refer to this
approach as Semi-Supervised SVM (S3VM).
Since its first implementation by Joachims (1999), a wide spectrum of techniques has been applied to solve the non-convex optimization problem associated with S3VMs, for example, local
combinatorial search (Joachims, 1999), gradient descent (Chapelle and Zien, 2005), continuation
techniques (Chapelle et al., 2006a), convex-concave procedures (Fung and Mangasarian, 2001;
Collobert et al., 2006), semi-definite programming (Bie and Cristianini, 2006; Xu et al., 2004),
non-differentiable methods (Astorino and Fuduli, 2007), deterministic annealing (Sindhwani et al.,
2006), and branch-and-bound algorithms (Bennett and Demiriz, 1998; Chapelle et al., 2006c).
While non-convexity is partly responsible for this diversity of methods, it is also a departure
from one of the nicest aspects of SVMs. Table 1 benchmarks the empirical performance of various
S3VM implementations against the globally optimal solution obtained by a Branch-and-Bound algorithm. These empirical observations strengthen the conjecture that the performance variability of S3VM implementations is closely tied to their susceptibility to sub-optimal local minima. Together with several subtle implementation differences, this makes it challenging to cross-compare different S3VM algorithms.
The aim of this paper is to provide a review of optimization techniques for semi-supervised
SVMs and to bring different implementations, and various aspects of their empirical performance,
under a common experimental setting.
In Section 2 we discuss the general formulation of S3VMs. In Sections 3 and 4 we provide an overview of various methods. We present a detailed empirical study in Section 5 and discuss complexity issues in Section 6.
            ∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA    Newton   BB
Coil3       61.6    61      56.6   56.7        61.6   61.5     0
2moons      61      37.7    63.1   68.8        22.5   11       0
Table 1: Generalization performance (error rates) of different S3VM algorithms on two (small) data sets, Coil3 and 2moons. Branch and Bound (BB) yields the globally optimal solution, which gives perfect separation. BB can only be applied to small data sets due to its high computational costs. See Section 5 for experimental details.
2. Semi-Supervised Support Vector Machines
We consider the problem of binary classification. The training set consists of $l$ labeled examples $\{(x_i, y_i)\}_{i=1}^{l}$, $y_i = \pm 1$, and $u$ unlabeled examples $\{x_i\}_{i=l+1}^{n}$, with $n = l + u$. In the linear S3VM classification setting, the following minimization problem is solved over both the hyperplane parameters $(w, b)$ and the label vector $y_u := [y_{l+1} \ldots y_n]^\top$,
$$\min_{(w,b),\, y_u} \; \mathcal{I}(w, b, y_u) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} V(y_i, o_i) + C^\star \sum_{i=l+1}^{n} V(y_i, o_i) \qquad (1)$$
where $o_i = w^\top x_i + b$ and $V$ is a loss function. The Hinge loss is a popular choice for $V$,
$$V(y_i, o_i) = \max(0,\, 1 - y_i o_i)^p. \qquad (2)$$
It is common to penalize the Hinge loss either linearly (p = 1) or quadratically (p = 2). In the rest
of the paper, we will consider p = 2. Non-linear decision boundaries can be constructed using the
kernel trick (Vapnik, 1998).
The first two terms in the objective function $\mathcal{I}$ in (1) define a standard SVM. The third term incorporates unlabeled data. The loss over labeled and unlabeled examples is weighted by two hyperparameters, $C$ and $C^\star$, which reflect confidence in labels and in the cluster assumption respectively. In general, $C$ and $C^\star$ need to be set at different values for optimal generalization performance.
The minimization problem (1) is solved under the following class balancing constraint,
$$\frac{1}{u}\sum_{i=l+1}^{n} \max(y_i, 0) = r \quad \text{or equivalently} \quad \frac{1}{u}\sum_{i=l+1}^{n} y_i = 2r - 1. \qquad (3)$$
This constraint helps in avoiding unbalanced solutions by enforcing that a certain user-specified
fraction, r, of the unlabeled data should be assigned to the positive class. It was introduced with
the first S3VM implementation (Joachims, 1999). Since the true class ratio is unknown for the
unlabeled data, r is estimated from the class ratio on the labeled set, or from prior knowledge about
the classification problem.
There are two broad strategies for minimizing $\mathcal{I}$:
1. Combinatorial Optimization: For a given fixed $y_u$, the optimization over $(w, b)$ is a standard SVM training.1 Let us define:
$$\mathcal{J}(y_u) = \min_{w,b}\; \mathcal{I}(w, b, y_u). \qquad (4)$$
1. The SVM training is slightly modified to take into account different values for $C$ and $C^\star$.
The goal now is to minimize J over a set of binary variables. This combinatorial view of the
optimization problem is adopted by Joachims (1999), Bie and Cristianini (2006), Xu et al.
(2004), Sindhwani et al. (2006), Bennett and Demiriz (1998), and Chapelle et al. (2006c).
There is no known algorithm that finds the global optimum efficiently. In Section 3 we review
this class of techniques.
2. Continuous Optimization: For a fixed $(w, b)$, $\arg\min_{y} V(y, o) = \mathrm{sign}(o)$. Therefore, the optimal $y_u$ is simply given by the signs of $o_i = w^\top x_i + b$. Eliminating $y_u$ in this manner gives a continuous objective function over $(w, b)$:
$$\min_{w,b}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \max(0,\, 1 - y_i o_i)^2 + C^\star \sum_{i=l+1}^{n} \max(0,\, 1 - |o_i|)^2. \qquad (5)$$
This form of the optimization problem illustrates how S3VMs implement the cluster assumption. The first two terms in (5) correspond to a standard SVM. The last term (see Figure 2) drives the decision boundary, that is, the zero output contour, away from unlabeled points. From Figure 2, it is clear that the objective function is non-convex.
Figure 2: The effective loss $\max(0,\, 1 - |o|)^2$ is shown above as a function of $o = w^\top x + b$, the real-valued output at an unlabeled point $x$.
Note that in this form, the balance constraint becomes $\frac{1}{u}\sum_{i=l+1}^{n} \mathrm{sign}(w^\top x_i + b) = 2r - 1$, which is non-linear in $(w, b)$ and not straightforward to enforce. In Section 4 we review this class of methods (Chapelle and Zien, 2005; Chapelle et al., 2006a; Fung and Mangasarian, 2001; Collobert et al., 2006).
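To make formulation (5) concrete, here is a small numerical sketch (our own illustration, not the authors' code; the function name s3vm_objective and the toy data are ours) that evaluates the objective for a linear classifier with the squared Hinge loss (p = 2):

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=1.0):
    """Objective (5): squared Hinge loss on labeled points plus the symmetric
    loss max(0, 1 - |o|)^2 on unlabeled points, for a linear classifier."""
    o_lab = X_lab @ w + b
    o_unl = X_unl @ w + b
    labeled = np.maximum(0.0, 1.0 - y_lab * o_lab) ** 2
    unlabeled = np.maximum(0.0, 1.0 - np.abs(o_unl)) ** 2
    return 0.5 * w @ w + C * labeled.sum() + C_star * unlabeled.sum()

# Toy usage on random data
rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(10, 2)), rng.choice([-1.0, 1.0], size=10)
X_unl = rng.normal(size=(100, 2))
print(s3vm_objective(rng.normal(size=2), 0.0, X_lab, y_lab, X_unl))
```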
3. Combinatorial Optimization
We now discuss combinatorial techniques in which the labels $y_u$ of the unlabeled points are explicit optimization variables. Many of the techniques discussed in this section call a standard (or slightly modified) supervised SVM as a subroutine to perform the minimization over $(w, b)$ for a fixed $y_u$ (see (4)).
3.1 Branch-and-Bound (BB) for Global Optimization
The objective function (4) can be globally optimized using Branch-and-Bound techniques. This was noted in the context of S3VM by Wapnik and Tscherwonenkis (1979) but no details were presented there. In general, global optimization can be computationally very demanding. The technique described in this section is impractical for large data sets. However, with effective heuristics it can produce globally optimal solutions for small-sized problems. This is useful for benchmarking practical S3VM implementations. Indeed, as Table 1 suggests, the exact solution can return excellent generalization performance in situations where other implementations fail completely. Branch-and-Bound was first applied by Bennett and Demiriz (1998) in association with integer programming for solving linear S3VMs. More recently Chapelle et al. (2006c) presented a Branch-and-Bound (BB) algorithm which we outline in this section. The main ideas are illustrated in Figure 3.
[Figure: a binary tree whose root is the initial labeled set (objective value 12.7); each branching fixes the label of one unlabeled point (y3, y5, y7, ...), and the objective function on the currently labeled points increases down the tree (14.3, 15.6, 17.8 = best solution so far); the subtree with bound 23.3 is marked "Do not explore".]
Figure 3: Branch-and-Bound Tree
Branch-and-Bound effectively performs an exhaustive search over $y_u$, pruning large parts of the solution space based on the following simple observation: if a lower bound on $\min_{y_u \in A} \mathcal{J}(y_u)$, for some subset $A$ of candidate solutions, is greater than $\mathcal{J}(\tilde{y}_u)$ for some $\tilde{y}_u$, then $A$ can be safely discarded from exhaustive search. BB organizes subsets of solutions into a binary
tree (Figure 3) where nodes are associated with a fixed partial labeling of the unlabeled data set and
the two children correspond to the labeling of some new unlabeled point. Thus, the root corresponds
to the initial set of labeled examples and the leaves correspond to a complete labeling of the data.
Any node is then associated with the subset of candidate solutions that appear at the leaves of the
subtree rooted at that node (all possible ways of completing the labeling, given the partial labeling
at that node). This subset can potentially be pruned from the search by the Branch-and-Bound procedure if a lower bound over corresponding objective values turns out to be worse than an available
solution.
The effectiveness of BB depends on the following design issues: (1) the lower bound at a node
and (2) the sequence of unlabeled examples to branch on. For the lower bound, Chapelle et al.
(2006c) use the objective value of a standard SVM trained on the associated (extended) labeled
set.2 As one goes down the tree, this objective value increases as additional loss terms are added,
and eventually equals $\mathcal{J}$ at the leaves. Note that once a node violates the balance constraint, it can be immediately pruned by resetting its lower bound to ∞. Chapelle et al. (2006c) use a labeling-confidence criterion to choose an unlabeled example and a label on which to branch. The tree is
explored on the fly by depth-first search. This confidence-based tree exploration is also intuitively
linked to Label Propagation methods (Zhu and Ghahramani, 2002) for graph-transduction. On many
small data sets (e.g., the data sets of Table 1 have up to 200 examples) BB is able to return the globally optimal solution in a reasonable amount of time. We point the reader to Chapelle et al. (2006c) for
pseudocode.
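For illustration, the following minimal sketch shows only the pruning logic just described; it is not the authors' implementation. The helper svm_lower_bound is hypothetical (it should return the SVM objective on the currently labeled points), branching is done in a fixed order rather than with the labeling-confidence heuristic, and the balance-constraint check is omitted.

```python
def branch_and_bound(partial_labels, remaining, best, svm_lower_bound):
    """Depth-first Branch-and-Bound over label assignments of the unlabeled points.

    partial_labels : dict {unlabeled index: +1 or -1} fixed so far
    remaining      : list of unlabeled indices still to be labeled
    best           : dict with keys 'value' and 'labels' (best complete solution so far)
    svm_lower_bound: hypothetical helper returning the SVM objective on the
                     currently labeled points (a lower bound on J for any completion)
    """
    bound = svm_lower_bound(partial_labels)
    if bound >= best['value']:
        return                                            # prune this subtree
    if not remaining:                                     # leaf: complete labeling
        best['value'], best['labels'] = bound, dict(partial_labels)
        return
    i, rest = remaining[0], remaining[1:]                 # branch on the next point
    for label in (+1, -1):
        branch_and_bound({**partial_labels, i: label}, rest, best, svm_lower_bound)
```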
3.2 S3VMlight
S3VMlight (Joachims, 1999) refers to the first S3VM algorithm implemented in the popular SVMlight software.3 It is based on local combinatorial search guided by a label switching procedure. The vector $y_u$ is initialized as the labeling given by an SVM trained on the labeled set, thresholding outputs so that $u \times r$ unlabeled examples are positive. Subsequent steps in the algorithm consist of switching the labels of two examples in opposite classes, thus always maintaining the balance constraint. Consider an iteration of the algorithm where $y_u$ is the temporary labeling of the unlabeled data and let $(\tilde{w}, \tilde{b}) = \arg\min_{w,b} \mathcal{I}(w, b, y_u)$ and $\mathcal{J}(y_u) = \mathcal{I}(\tilde{w}, \tilde{b}, y_u)$. Suppose a pair of unlabeled examples indexed by $(i, j)$ satisfies the following condition,4
$$y_i = 1, \quad y_j = -1, \quad V(1, o_i) + V(-1, o_j) > V(-1, o_i) + V(1, o_j) \qquad (6)$$
where $o_i$, $o_j$ are the outputs of $(\tilde{w}, \tilde{b})$ on the examples $x_i$, $x_j$. Then, after switching the labels of this pair of examples and retraining, the objective function $\mathcal{J}$ can easily be shown to strictly decrease. S3VMlight alternates between label switching and retraining. Since the number of possible $y_u$ is finite, the procedure is guaranteed to terminate in a finite number of steps at a local minimum of (4), that is, a point where no further improvement is possible by interchanging two labels.
In an outer loop, S3VMlight gradually increases the value of $C^\star$ from a small value to the final value. Since $C^\star$ controls the non-convex part of the objective function (4), this annealing loop can be interpreted as implementing a "smoothing" heuristic as a means to protect the algorithm from sub-optimal local minima. The pseudocode is provided in Algorithm 1.
3.3 Deterministic Annealing S3VM
Deterministic annealing (DA) is a global optimization heuristic that has been used to approach hard combinatorial or non-convex problems. In the context of S3VMs (Sindhwani et al., 2006), it consists of relaxing the discrete label variables $y_u$ to real-valued variables $p_u = (p_{l+1}, \ldots, p_{l+u})$, where $p_i$ is interpreted as the probability that $y_i = 1$. The following objective function is now considered:
$$\mathcal{I}'(w, b, p_u) = \mathbb{E}\left[\mathcal{I}(w, b, y_u)\right] = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} V(y_i, o_i) + C^\star \sum_{i=l+1}^{n} \Big( p_i V(1, o_i) + (1 - p_i) V(-1, o_i) \Big) \qquad (7)$$
2. Note that in this SVM training, the loss terms associated with (originally) labeled and (currently labeled) unlabeled examples are weighted by $C$ and $C^\star$ respectively.
3. Note that in the S3VM literature, this particular implementation is often referred to as "TSVM" or "Transductive SVM".
4. This switching condition is slightly weaker than that proposed by Joachims (1999).
Algorithm 1 S3VMlight
Train an SVM with the labeled points. $o_i \leftarrow w \cdot x_i + b$.
Assign $y_i \leftarrow 1$ to the $u r$ largest $o_i$, $-1$ to the others.
$\tilde{C} \leftarrow 10^{-5} C^\star$
while $\tilde{C} < C^\star$ do
  repeat
    Minimize (1) with $\{y_i\}$ fixed and $C^\star$ replaced by $\tilde{C}$.
    if $\exists (i, j)$ satisfying (6) then
      Swap the labels $y_i$ and $y_j$
    end if
  until No labels have been swapped
  $\tilde{C} \leftarrow \min(1.5\,\tilde{C},\, C^\star)$
end while
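The following Python sketch mirrors the structure of Algorithm 1 under the assumption that a weighted squared-Hinge SVM solver train_svm(X, y, per_example_C) is available; this helper and all variable names are ours and are not part of the SVMlight software.

```python
import numpy as np

def sq_hinge(t, y):
    """Squared Hinge loss V(y, t) = max(0, 1 - y t)^2."""
    return max(0.0, 1.0 - y * t) ** 2

def s3vm_light(X_lab, y_lab, X_unl, r, C, C_star, train_svm):
    """Sketch of Algorithm 1; `train_svm(X, y, per_example_C)` is a hypothetical
    helper returning (w, b) for a weighted L2-SVM."""
    n_u = len(X_unl)
    w, b = train_svm(X_lab, y_lab, np.full(len(y_lab), C))
    o = X_unl @ w + b
    y_u = -np.ones(n_u)
    y_u[np.argsort(-o)[:int(round(n_u * r))]] = 1.0     # u*r largest outputs -> +1

    C_tilde = 1e-5 * C_star
    while C_tilde < C_star:
        while True:                                      # inner label-switching loop
            X = np.vstack([X_lab, X_unl])
            y = np.concatenate([y_lab, y_u])
            weights = np.concatenate([np.full(len(y_lab), C), np.full(n_u, C_tilde)])
            w, b = train_svm(X, y, weights)
            o = X_unl @ w + b
            swapped = False
            for i in np.flatnonzero(y_u == 1):           # look for a pair satisfying (6)
                for j in np.flatnonzero(y_u == -1):
                    if sq_hinge(o[i], 1) + sq_hinge(o[j], -1) > \
                       sq_hinge(o[i], -1) + sq_hinge(o[j], 1):
                        y_u[i], y_u[j] = -1.0, 1.0
                        swapped = True
                        break
                if swapped:
                    break
            if not swapped:
                break
        C_tilde = min(1.5 * C_tilde, C_star)             # anneal towards C*
    return w, b, y_u
```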
where $\mathbb{E}$ denotes expectation under the probabilities $p_u$. Note that at optimality with respect to $p_u$, $p_i$ must concentrate all its mass on $y_i = \mathrm{sign}(w^\top x_i + b)$, which leads to the smaller of the two losses $V(1, o_i)$ and $V(-1, o_i)$. Hence, this relaxation step does not lead to loss of optimality and is simply a reformulation of the original objective in terms of continuous variables. In DA, an additional entropy term $-H(p_u)$ is added to the objective,
$$\mathcal{I}''(w, b, p_u; T) = \mathcal{I}'(w, b, p_u) - T H(p_u), \quad \text{where } H(p_u) = -\sum_i \big( p_i \log p_i + (1 - p_i)\log(1 - p_i) \big),$$
and $T \geq 0$ is usually referred to as 'temperature'. Instead of (3), the following class balance constraint is used,
$$\frac{1}{u}\sum_{i=l+1}^{n} p_i = r.$$
Note that when $T = 0$, $\mathcal{I}''$ reduces to (7) and the optimal $p_u$ identifies the optimal $y_u$. When $T = \infty$, $\mathcal{I}''$ is dominated by the entropy term, resulting in the maximum entropy solution ($p_i = r$ for all $i$). $T$ parameterizes a family of objective functions with increasing degrees of non-convexity (see Figure 4 and further discussion below).
At any $T$, let $(w_T, b_T, p_{u_T}) = \arg\min_{(w,b), p_u} \mathcal{I}''(w, b, p_u; T)$. This minimization can be performed in different ways:
1. Alternating Minimization: We sketch here the procedure proposed by Sindhwani et al. (2006). Keeping $p_u$ fixed, the minimization over $(w, b)$ is standard SVM training—each unlabeled example contributes two loss terms weighted by $C^\star p_i$ and $C^\star (1 - p_i)$. Keeping $(w, b)$ fixed, $\mathcal{I}''$ is minimized subject to the balance constraint $\frac{1}{u}\sum_{i=l+1}^{n} p_i = r$ using standard Lagrangian techniques. This leads to:
$$p_i = \frac{1}{1 + e^{(g_i - \nu)/T}} \qquad (8)$$
where $g_i = C^\star [V(1, o_i) - V(-1, o_i)]$ and $\nu$, the Lagrange multiplier associated with the balance constraint, is obtained by solving the root finding problem that arises by plugging (8) back into the balance constraint. The alternating optimization proceeds until $p_u$ stabilizes in a KL-divergence sense. This method will be referred to as DA in the rest of the paper.
2. Gradient Methods: An alternative possibility5 is to substitute the optimal $p_u$ (8) as a function of $(w, b)$ and obtain an objective function over $(w, b)$ for which gradient techniques can be used:
$$\mathcal{S}(w, b) := \min_{p_u} \mathcal{I}''(w, b, p_u; T). \qquad (9)$$
$\mathcal{S}(w, b)$ can be minimized by conjugate gradient descent. The gradient of $\mathcal{S}$ is easy to compute. Indeed, let us denote by $p_u^*(w, b)$ the argmin of (9). Then,
$$\frac{\partial \mathcal{S}}{\partial w_i} = \frac{\partial \mathcal{I}''}{\partial w_i} + \sum_{j=l+1}^{n} \underbrace{\frac{\partial \mathcal{I}''}{\partial p_j}\bigg|_{p_u = p_u^*(w,b)}}_{0} \frac{\partial p_j^*(w)}{\partial w_i} = \frac{\partial \mathcal{I}'(w, b, p_u^*(w))}{\partial w_i}.$$
The partial derivative of $\mathcal{I}''$ with respect to $p_j$ is 0 by the definition of $p_u^*(w, b)$. The argument goes through even in the presence of the constraint $\frac{1}{u}\sum p_i = r$; see Chapelle et al. (2002, Lemma 2) for a formal proof. In other words, we can compute the gradient of (7) with respect to $w$ and consider $p_u$ fixed. The same holds for $b$. This method will be referred to as ∇DA in the rest of the paper.
Figure 4 shows the effective loss terms in $\mathcal{S}$ associated with an unlabeled example for various values of $T$. In an outer loop, starting from a high value, $T$ is decreased geometrically by a constant factor. The vector $p_u$ is then tightened back close to discrete values (its entropy falls below some threshold), thus identifying a solution to the original problem. The pseudocode is provided in Algorithm 2.
Table 2 compares the DA and ∇DA solutions as $T \to 0$ at two different hyperparameter settings.6 Because DA does alternate minimization and ∇DA does direct minimization, the solutions returned by them can be quite different. Since ∇DA is faster than DA, we only report ∇DA results in Section 5.
Algorithm 2 DA/∇DA
Initialize $p_i = r$, $i = l+1, \ldots, n$.
Set $T = 10 C^\star$, $R = 1.5$, $\varepsilon = 10^{-6}$.
while $H(p_{u_T}) > \varepsilon$ do
  Solve $(w_T, b_T, p_{u_T}) = \arg\min_{(w,b), p_u} \mathcal{I}''(w, b, p_u; T)$ subject to $\frac{1}{u}\sum_{i=l+1}^{n} p_i = r$
  (find a local minimum starting from the previous solution—alternating optimization or gradient methods can be used.)
  $T = T / R$
end while
Return $w_T$, $b_T$
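As an illustration of the p-update used inside Algorithm 2, the following sketch (ours; the bracketing of the root-finding interval is our own choice) computes (8) and finds the Lagrange multiplier ν by bisection so that the relaxed balance constraint holds.

```python
import numpy as np

def update_p(o_unl, r, C_star, T, tol=1e-10):
    """Closed-form update (8) for the unlabeled probabilities p_i, with the
    Lagrange multiplier nu found by bisection so that mean(p) = r."""
    # g_i = C* [ V(1, o_i) - V(-1, o_i) ] with the squared Hinge loss
    g = C_star * (np.maximum(0, 1 - o_unl) ** 2 - np.maximum(0, 1 + o_unl) ** 2)

    def p_of(nu):
        z = np.clip((g - nu) / T, -500, 500)     # clip to avoid overflow in exp
        return 1.0 / (1.0 + np.exp(z))

    lo, hi = g.min() - 10 * T, g.max() + 10 * T  # mean p(lo) ~ 0 and mean p(hi) ~ 1
    while hi - lo > tol:                          # bisection: mean p is increasing in nu
        nu = 0.5 * (lo + hi)
        if p_of(nu).mean() < r:
            lo = nu
        else:
            hi = nu
    return p_of(0.5 * (lo + hi))
```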
3.4 Convex Relaxation
We follow Bie and Cristianini (2006) in this section, but outline the details for the squared Hinge
loss (see also Xu et al., 2004, for a similar derivation). Rewriting (1) as the familiar constrained
5. Strictly speaking, this approach is more along the lines of methods discussed in Section 4.
6. In Sindhwani et al. (2006), the best solution in the optimization path is returned.
Figure 4: DA parameterizes a family of loss functions (over unlabeled examples) where the degree
of non-convexity is controlled by T . As T → 0, the original loss function (Figure 2) is
recovered.
            DA (C* = C)   ∇DA (C* = C)   DA (C* = C/100)   ∇DA (C* = C/100)
g50c        6.5           7              8.3               6.7
Text        13.6          5.7            6.2               6.5
Uspst       22.5          27.2           11                11
Isolet      38.3          39.8           28.6              26.9
Coil20      3             12.3           19.2              18.9
Coil3       49.1          61.6           60.3              60.6
2moons      36.8          22.5           62.1              30
Table 2: Generalization performance (error rates) of the two DA algorithms. DA is the original algorithm (alternate optimization on $p_u$ and $w$). ∇DA is a direct gradient optimization on $(w, b)$, where $p_u$ should be considered as a function of $(w, b)$. The first two columns report results when $C^\star = C$ and the last two columns report results when $C^\star = C/100$. See Section 5 for more experimental details.
optimization problem of SVMs:
$$\min_{(w,b),\, y_u}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i^2 + C^\star \sum_{i=l+1}^{n} \xi_i^2 \quad \text{subject to: } y_i o_i \geq 1 - \xi_i, \; i = 1, \ldots, n.$$
Consider the associated dual problem:
$$\min_{\{y_i\}} \max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K_{ij} \quad \text{subject to: } \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \geq 0$$
where $K_{ij} = x_i^\top x_j + D_{ij}$ and $D$ is a diagonal matrix given by $D_{ii} = \frac{1}{2C}$, $i = 1, \ldots, l$ and $D_{ii} = \frac{1}{2C^\star}$, $i = l+1, \ldots, n$.
Introducing an $n \times n$ matrix $\Gamma$, the optimization problem can be reformulated as:
$$\min_{\Gamma} \max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j \Gamma_{ij} K_{ij} \qquad (10)$$
under constraints
$$\sum_i \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad \Gamma = y y^\top. \qquad (11)$$
The objective function (10) is now convex since it is the pointwise supremum of linear functions. However the constraint (11) is not. The idea of the relaxation is to replace the constraint $\Gamma = y y^\top$ by the following set of convex constraints:
$$\Gamma \succeq 0, \qquad \Gamma_{ij} = y_i y_j, \; 1 \leq i, j \leq l, \qquad \Gamma_{ii} = 1, \; l+1 \leq i \leq n.$$
Though the original method of Bie and Cristianini (2006) does not incorporate the class balancing constraint (3), one can additionally enforce it as $\frac{1}{u^2}\sum_{i,j=l+1}^{n} \Gamma_{ij} = (2r - 1)^2$. Such a soft constraint is also used in the continuous S3VM optimization methods of Section 4.
The convex problem above can be solved through Semi-Definite Programming. The labels of
the unlabeled points are estimated from Γ (through one of its columns or its largest eigenvector).
This method is very expensive and scales as $O((l + u^2)^2 (l + u)^{2.5})$. It is possible to try to optimize a low rank version of $\Gamma$, but the training remains slow even in that case. We therefore do not conduct empirical studies with this method.
4. Continuous Optimization
In this section we consider methods which do not include $y_u$, the labels of the unlabeled examples, as optimization variables, but instead solve suitably modified versions of (5) by continuous optimization techniques. We begin by discussing two issues that are common to these methods.
Balancing Constraint The balancing constraint (3) is relatively easy to enforce for all algorithms
presented in Section 3. It is more difficult for algorithms covered in this section. The proposed
workaround, first introduced by Chapelle and Zien (2005), is to instead enforce a linear constraint:
$$\frac{1}{u}\sum_{i=l+1}^{n} w^\top x_i + b = 2\tilde{r} - 1, \qquad (12)$$
where r̃ = r. The above constraint may be viewed as a “relaxation” of (3). For a given r̃, an easy way
of enforcing (12) is to translate all the points such that the mean of the unlabeled points is the origin, that is, $\sum_{i=l+1}^{n} x_i = 0$. Then, by fixing $b = 2\tilde{r} - 1$, we have an unconstrained optimization problem
on w. We will assume that the xi are translated and b is fixed in this manner; so the discussion will
focus on unconstrained optimization procedures for the rest of this section. In addition to being easy
to implement, this linear constraint may also add some robustness against uncertainty about the true
unknown class ratio in the unlabeled set (see also Chen et al., 2003, for related discussion).
However, since (12) relaxes (3),7 the solutions found by algorithms in this section cannot strictly
be compared with those in Section 3. In order to admit comparisons (this is particularly important
for the empirical study in Section 5), we vary r̃ in an outer loop and do a dichotomic search on this
value such that (3) is satisfied.
7. Note that simply setting r̃ = r in (12) will not enforce (3) exactly.
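The dichotomic search mentioned above can be sketched as follows; train_s3vm is a hypothetical routine that trains with the linear constraint (12) for a given r̃, and the bisection (our own minimal version) adjusts r̃ until the predicted positive fraction on the unlabeled set matches r.

```python
import numpy as np

def dichotomic_search(train_s3vm, X_unl, r, tol=1e-3, max_iter=30):
    """Outer bisection on r_tilde so that the hard constraint (3) is met.
    `train_s3vm(r_tilde)` is a hypothetical helper enforcing (12), returning (w, b)."""
    lo, hi = 0.0, 1.0
    for _ in range(max_iter):
        r_tilde = 0.5 * (lo + hi)
        w, b = train_s3vm(r_tilde)
        frac_pos = float(np.mean(X_unl @ w + b > 0))    # fraction predicted positive
        if abs(frac_pos - r) < tol:
            break
        if frac_pos < r:
            lo = r_tilde                                 # push more outputs positive
        else:
            hi = r_tilde
    return w, b, r_tilde
```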
Primal Optimization For linear classification, the variables in $w$ can be directly optimized. Non-linear decision boundaries require the use of the "kernel trick" (Boser et al., 1992) using a kernel function $k(x, x')$. While most of the methods of Section 3 can use a standard SVM solver as a subroutine, the methods of this section need to solve (5) with a non-convex loss function over unlabeled
examples. Therefore, they cannot directly use off-the-shelf dual-based SVM software. We use one
of the following primal methods to implement the techniques in this section.
Method 1 We find $z_i$ such that $z_i \cdot z_j = k(x_i, x_j)$. If $B$ is a matrix having columns $z_i$, this can be written in matrix form as $B^\top B = K$. The Cholesky factor of $K$ provides one such $B$. This decomposition was used for ∇S3VM (Chapelle and Zien, 2005). Another possibility is to perform the eigendecomposition of $K$ as $K = U \Lambda U^\top$ and set $B = \Lambda^{1/2} U^\top$. This latter case corresponds to the kernel PCA map introduced by Schölkopf and Smola (2002, Section 14.2). Once the $z_i$ are found, we can simply replace $x_i$ in (5) by $z_i$ and solve a linear classification problem. For more details, see Chapelle et al. (2006a).
Method 2 We set $w = \sum_{i=1}^{n} \beta_i \phi(x_i)$, where $\phi$ denotes a higher dimensional feature map associated with the nonlinear kernel. By the Representer theorem (Schölkopf and Smola, 2002), we indeed know that the optimal solution has this form. Substituting this form in (5) and using the kernel function yields an optimization problem with $\beta$ as the variables.
Note that the centering mentioned above to implement (12) corresponds to using the modified kernel (Schölkopf and Smola, 2002, page 431) defined by:
$$k(x, x') := k(x, x') - \frac{1}{u}\sum_{i=l+1}^{n} k(x, x_i) - \frac{1}{u}\sum_{i=l+1}^{n} k(x_i, x') + \frac{1}{u^2}\sum_{i,j=l+1}^{n} k(x_i, x_j). \qquad (13)$$
All the shifted kernel elements can be computed in $O(n^2)$ operations.
Finally, note that these methods are very general and can also be applied to algorithms of Section 3.
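Both ingredients are easy to express with standard linear algebra. The sketch below (ours, under the assumption that the full kernel matrix fits in memory) applies the shift (13) and builds the kernel PCA map of method 1.

```python
import numpy as np

def center_kernel(K, unl_idx):
    """Shift (13): center the kernel with respect to the unlabeled points."""
    col_mean = K[:, unl_idx].mean(axis=1)               # (1/u) sum_i k(., x_i)
    unl_mean = K[np.ix_(unl_idx, unl_idx)].mean()       # (1/u^2) sum_ij k(x_i, x_j)
    return K - col_mean[:, None] - col_mean[None, :] + unl_mean

def kpca_map(K, tol=1e-10):
    """Method 1: points z_i with z_i . z_j = K_ij via the eigendecomposition
    K = U Lambda U^T, B = Lambda^(1/2) U^T; the rows of the result are the z_i."""
    eigval, eigvec = np.linalg.eigh(K)
    keep = eigval > tol                                  # drop numerically zero directions
    return eigvec[:, keep] * np.sqrt(eigval[keep])
```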
4.1 Concave Convex Procedure (CCCP)
The CCCP method (Yuille and Rangarajan, 2003) has been applied to S3VMs by Fung and Mangasarian (2001), Collobert et al. (2006), and Wang et al. (2007). The description given here is close to that in Collobert et al. (2006).
CCCP essentially decomposes a non-convex function $f$ into a convex component $f_{vex}$ and a concave component $f_{cave}$. At each iteration, the concave part is replaced by a linear function (namely, the tangential approximation at the current point) and the sum of this linear function and the convex part is minimized to get the next iterate. The pseudocode is shown in Algorithm 3.
Algorithm 3 CCCP for minimizing $f = f_{vex} + f_{cave}$
Require: Starting point $x_0$
$t \leftarrow 0$
while $\nabla f(x_t) \neq 0$ do
  $x_{t+1} \leftarrow \arg\min_x\; f_{vex}(x) + \nabla f_{cave}(x_t) \cdot x$
  $t \leftarrow t + 1$
end while
In the case of S3VM, the first two terms in (5) are convex. Splitting the last non-convex term corresponding to the unlabeled part as the sum of a convex and a concave function, we have:
$$\max(0, 1 - |t|)^2 = \underbrace{\max(0, 1 - |t|)^2 + 2|t|}_{\text{convex}} \; \underbrace{-\, 2|t|}_{\text{concave}}.$$
If an unlabeled point is currently classified positive, then at the next iteration, the effective (convex) loss on this point will be
$$\tilde{L}(t) = \begin{cases} 0 & \text{if } t \geq 1, \\ (1 - t)^2 & \text{if } |t| < 1, \\ -4t & \text{if } t \leq -1. \end{cases}$$
A corresponding L̃ can be defined for the case of an unlabeled point being classified negative. The
CCCP algorithm specialized to S3VMs is given in Algorithm 4. For optimization variables we
employ method 1 given at the beginning of this section.
Algorithm 4 CCCP for S3VMs
Starting point: Use the $w$ obtained from the supervised SVM solution.
repeat
  $y_i \leftarrow \mathrm{sign}(w \cdot x_i + b)$, $l+1 \leq i \leq n$.
  $(w, b) = \arg\min\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \max(0, 1 - y_i(w \cdot x_i + b))^2 + C^\star \sum_{i=l+1}^{n} \tilde{L}(y_i(w \cdot x_i + b))$.
until convergence of $y_i$, $l+1 \leq i \leq n$.
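As an illustration, the effective loss L̃ used in the inner convex problem of Algorithm 4 can be written as follows (our own vectorized sketch of the piecewise definition above).

```python
import numpy as np

def cccp_effective_loss(t):
    """Effective convex loss on an unlabeled point currently classified positive:
    0 if t >= 1, (1 - t)^2 if |t| < 1, and -4t if t <= -1 (vectorized in t)."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 1, 0.0, np.where(t <= -1, -4.0 * t, (1.0 - t) ** 2))

def cccp_effective_loss_neg(t):
    """Same loss for a point currently classified negative (mirror image)."""
    return cccp_effective_loss(-t)
```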
The CCCP method given in Collobert et al. (2006) does not use annealing, that is, increasing $C^\star$ slowly in steps as in S3VMlight to help reduce local minima problems. We have however found performance improvements with annealing (see Table 12).
4.2 ∇S3VM
This method was proposed by Chapelle and Zien (2005) to directly minimize the objective function (5) by gradient descent. For optimization variables, method 1 given at the beginning of this section is used. Since the function $t \mapsto \max(0, 1 - |t|)^2$ is not differentiable, it is replaced by $t \mapsto \exp(-s t^2)$, with $s = 5$ (see Figure 5), to get the following smooth optimization problem:8
$$\min_{w,b}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\max(0,\, 1 - y_i(w \cdot x_i + b))^2 + C^\star\sum_{i=l+1}^{n}\exp(-s(w \cdot x_i + b)^2). \qquad (14)$$
As for S3VMlight, ∇S3VM performs annealing in an outer loop on $C^\star$. In the experiments we followed the same annealing schedule as in Chapelle and Zien (2005): $C^\star$ is increased in 10 steps to its final value. More precisely, at the $i$th step, $C^\star$ is set to $2^{i-10} C^\star_{\mathrm{final}}$.
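For concreteness, the smooth objective (14) and its gradient can be written as below for the linear case (our sketch; with a kernel, method 1 replaces each x_i by z_i). The returned gradient can be fed to any off-the-shelf gradient or conjugate-gradient routine.

```python
import numpy as np

def grad_s3vm(w, b, X_lab, y_lab, X_unl, C, C_star, s=5.0):
    """Objective (14) and its gradient with respect to (w, b), linear case."""
    o_l = X_lab @ w + b
    o_u = X_unl @ w + b
    hinge = np.maximum(0.0, 1.0 - y_lab * o_l)           # labeled squared Hinge part
    unl = np.exp(-s * o_u ** 2)                          # unlabeled surrogate loss
    obj = 0.5 * w @ w + C * np.sum(hinge ** 2) + C_star * np.sum(unl)
    d_l = -2.0 * C * hinge * y_lab                       # derivatives w.r.t. the outputs
    d_u = -2.0 * s * C_star * o_u * unl
    grad_w = w + X_lab.T @ d_l + X_unl.T @ d_u
    grad_b = d_l.sum() + d_u.sum()
    return obj, grad_w, grad_b
```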
4.3 Continuation S3VM (cS3VM)
Closely related to ∇S3VM, Chapelle et al. (2006a) propose a continuation method for minimizing (14). Gradient descent is performed on the same objective function (with the same loss for the
8. Chapelle and Zien (2005) used s = 3 with hinge loss, p = 1 in (2), but s = 5 seems to be a better choice for quadratic
hinge loss (p = 2).
Figure 5: The loss function on the unlabeled points, $t \mapsto \max(0, 1 - |t|)^2$, is replaced by a differentiable approximation $t \mapsto \exp(-5t^2)$.
unlabeled points as shown in Figure 5), but the method used for annealing is different. Instead of slowly increasing $C^\star$, it is kept fixed, and a continuation technique is used to transform the objective function.
This kind of method belongs to the field of global optimization techniques (Wu, 1996). The idea
is similar to deterministic annealing (see Figure 6). A smoothed version of the objective function is
first minimized. With enough smoothing the global minimum can hopefully be easily found. Then
the smoothing is decreased in steps and the minimum is tracked—the solution found in one step
serves as the starting point for the next step. The method is continued until there is no smoothing
and so we get back to the solution of (14). Algorithm 5 gives an instantiation of the method in which
smoothing is achieved by convolution with a Gaussian, but other smoothing functions can also be
used.
Algorithm 5 Continuation method for solving $\min_x f(x)$
Require: Function $f: \mathbb{R}^d \to \mathbb{R}$, initial point $x_0 \in \mathbb{R}^d$
Require: Sequence $\gamma_0 > \gamma_1 > \ldots > \gamma_{p-1} > \gamma_p = 0$.
Let $f_\gamma(x) = (\pi\gamma)^{-d/2} \int f(x - t)\, \exp(-\|t\|^2/\gamma)\, dt$.
for $i = 0$ to $p$ do
  Starting from $x_i$, find a local minimizer $x_{i+1}$ of $f_{\gamma_i}$.
end for
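The smoothed function f_γ of Algorithm 5 is simply the expectation of f(x − t) for Gaussian perturbations t with variance γ/2 per dimension. Chapelle et al. (2006a) compute this convolution in closed form for the S3VM objective; the generic Monte Carlo sketch below (ours, with a hypothetical local optimizer local_min) only illustrates the idea.

```python
import numpy as np

def smoothed(f, x, gamma, n_samples=256, rng=None):
    """Monte Carlo estimate of f_gamma(x) = E[f(x - t)] with t ~ N(0, (gamma/2) I),
    which equals the Gaussian convolution used in Algorithm 5."""
    rng = np.random.default_rng() if rng is None else rng
    t = rng.normal(scale=np.sqrt(gamma / 2.0), size=(n_samples, x.shape[0]))
    return float(np.mean([f(x - ti) for ti in t]))

def continuation_minimize(f, x0, gammas, local_min):
    """Algorithm 5 in sketch form; `local_min(g, x)` is a hypothetical local optimizer."""
    x = x0
    for gamma in gammas:                                  # gamma_0 > ... > gamma_p = 0
        obj = f if gamma == 0 else (lambda z, g=gamma: smoothed(f, z, g))
        x = local_min(obj, x)
    return x
```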
It is not clear if one should only smooth the last non-convex term of (14) (the first two terms
are convex) or the whole objective as in Chapelle et al. (2006a). It is noteworthy that since the loss
for the unlabeled points is bounded, its convolution with a Gaussian of infinite width tends to the
zero function. In other words, with enough smoothing, the unlabeled part of the objective function
vanishes and the optimization is identical to a standard SVM.
[Three panels, left to right: large γ; smaller γ; γ = 0.]
Figure 6: Illustration of the continuation method: the original objective function in blue has two
local minima. By smoothing it, we find one global minimum (the red star on the green
curve). By reducing the smoothing, the minimum moves toward the global minimum of
the original function.
4.4 Newton S3VM
One difficulty with the methods described in Sections 4.2 and 4.3 is that their complexity scales as $O(n^3)$ because they employ an unlabeled loss function that does not have a linear part, for example, see (14). Compared to a method like S3VMlight (see Section 3.2), which typically scales as $O(n_{sv}^3 + n^2)$, this can make a large difference in efficiency when $n_{sv}$ (the number of support vectors) is small. In this subsection we propose a new loss function for unlabeled points and an associated Newton method (along the lines of Chapelle, 2007) which brings down the $O(n^3)$ complexity of the ∇S3VM method.
To make efficiency gains we employ method 2 described at the beginning of this section (note that method 1 requires an $O(n^3)$ preprocessing step) and perform the minimization on $\beta$, where $w = \sum_{i=1}^{n} \beta_i \phi(x_i)$. Note that the $\beta_i$ are expansion coefficients and not the Lagrange multipliers $\alpha_i$ in standard SVMs. Let us consider general loss functions, $\ell_L$ for the labeled points and $\ell_U$ for the unlabeled points, replace $w$ by $\beta$ as the variables in (5), and get the following optimization problem,
$$\min_{\beta}\; \frac{1}{2}\beta^\top K \beta + C\sum_{i=1}^{l} \ell_L\big(y_i(K_i^\top \beta + b)\big) + C^\star \sum_{i=l+1}^{n} \ell_U\big(K_i^\top\beta + b\big), \qquad (15)$$
where $K$ is the kernel matrix with $K_{ij} = k(x_i, x_j)$ and $K_i$ is the $i$th column of $K$.9
As we will see in detail below, computational time is dictated by $n_{sv}$, the number of points that lie in the domain of the loss function where curvature is non-zero. With this motivation we choose the differentiable loss function plotted in Figure 7, having several linear and flat parts which are smoothly connected by small quadratic components.10
9. Note that, the kernel elements used here correspond to the modified kernel elements in (13).
10. The CCCP method of Collobert et al. (2006) also considers a loss function with a flat middle part, but it was not
motivated by computational gains.
Figure 7: The piecewise quadratic loss function $\ell_U$ and its derivative (divided by 4; dashed line).
Consider the solution of (15) with this choice of loss function. The gradient of (15) is
$$Kg \quad \text{with} \quad g_i = \begin{cases} \beta_i + C\,\ell_L'\big(y_i(K_i^\top\beta + b)\big)\, y_i & 1 \leq i \leq l \\ \beta_i + C^\star\,\ell_U'\big(K_i^\top\beta + b\big) & l+1 \leq i \leq n \end{cases} \qquad (16)$$
Using a gradient-based method like nonlinear conjugate gradient would still be costly because each evaluation of the gradient requires $O(n^2)$ effort. To improve the complexity when $n_{sv}$ is small, one can use Newton's method instead. Let us now go into these details.
The Hessian of (15) is
$$K + KDK, \quad \text{with } D \text{ diagonal}, \quad D_{ii} = \begin{cases} C\,\ell_L''\big(y_i(K_i^\top\beta + b)\big) & 1 \leq i \leq l \\ C^\star\,\ell_U''\big(K_i^\top\beta + b\big) & l+1 \leq i \leq n \end{cases} \qquad (17)$$
The corresponding Newton update is $\beta \leftarrow \beta - (K + KDK)^{-1} Kg$. The advantage of Newton optimization on this problem is that the step can be computed in $O(n_{sv}^3 + n^2)$ time11 as we will see below (see also Chapelle, 2007, for a similar derivation). The number of Newton steps required is usually a small finite constant.
A problem in performing Newton optimization with a non-convex objective function is that the step might not be a descent direction because the Hessian is not necessarily positive definite. To avoid this problem, we use the Levenberg-Marquardt algorithm (Fletcher, 1987, Algorithm 5.2.7). Roughly speaking, this algorithm is the same as Newton minimization, but a large enough ridge is added to the Hessian such that it becomes positive definite. For computational reasons, instead of adding a ridge to the Hessian, we will add a constant times $K$.
The goal is to choose a $\lambda \geq 1$ such that $\lambda K + KDK$ is positive definite and to solve $(\lambda K + KDK)^{-1} Kg$ efficiently. For this purpose, we reorder the points such that $D_{ii} \neq 0$ for $i \leq n_{sv}$ and $D_{ii} = 0$ for $i > n_{sv}$. Let $A$ be the Cholesky decomposition of $K$:12 $A$ is the upper triangular matrix satisfying $A^\top A = K$. We suppose that $K$ (and thus $A$) is invertible. Let us write
$$\lambda K + KDK = A^\top (\lambda I_n + A D A^\top) A.$$
11. Consistent with the way we defined it earlier, note here that $n_{sv}$ is the number of "support vectors", where a support vector is defined as a point $x_i$ such that $D_{ii} \neq 0$.
12. As we will see below, we will not need the Cholesky decomposition of $K$ but only that of $K_{sv}$.
The structure of $K$ and $D$ implies that
$$\lambda K + KDK \succ 0 \iff B := \lambda I_{n_{sv}} + A_{sv} D_{sv} A_{sv}^\top \succ 0,$$
where $A_{sv}$ is the Cholesky decomposition of $K_{sv}$ and $K_{sv}$ is the matrix formed using the first $n_{sv}$ rows and columns of $K$. After some block matrix algebra, we can also get the step as
$$-(\lambda K + KDK)^{-1} K g = -\begin{pmatrix} A_{sv}^{-1} B^{-1} A_{sv}\big(g_{sv} - \frac{1}{\lambda} D_{sv} K_{sv,nsv}\, g_{nsv}\big) \\ \frac{1}{\lambda}\, g_{nsv} \end{pmatrix}, \qquad (18)$$
where $nsv$ refers to the indices of the "non support vectors", that is, $\{i : D_{ii} = 0\}$. Computing this direction takes $O(n_{sv}^3 + n^2)$ operations. The checking of the positive definiteness of $B$ can be done by doing a Cholesky decomposition of $K_{sv}$. This decomposition can then be reused to solve the linear system involving $B$. Full details, following the ideas in Fletcher (1987, Algorithm 5.2.7), are given
in Algorithm 6.
Algorithm 6 Levenberg-Marquardt method
$\beta \leftarrow 0$.
$\lambda \leftarrow 1$.
repeat
  Compute $g$ and $D$ using (16) and (17).
  $sv \leftarrow \{i : D_{ii} \neq 0\}$ and $nsv \leftarrow \{i : D_{ii} = 0\}$.
  $A_{sv} \leftarrow$ Cholesky decomposition of $K_{sv}$.
  Do the Cholesky decomposition of $\lambda I_{n_{sv}} + A_{sv} D_{sv} A_{sv}^\top$. If it fails, $\lambda \leftarrow 4\lambda$ and try again.
  Compute the step $s$ as given by (18).
  $\rho \leftarrow \dfrac{\Omega(\beta + s) - \Omega(\beta)}{\frac{1}{2} s^\top (K + KDK) s + s^\top K g}$   % If the objective function $\Omega$ were quadratic, $\rho$ would be 1.
  If $\rho > 0$, $\beta \leftarrow \beta + s$.
  If $\rho < 0.25$, $\lambda \leftarrow 4\lambda$.
  If $\rho > 0.75$, $\lambda \leftarrow \max(1, \lambda/2)$.
until $\|g\| \leq \varepsilon$
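A sketch of the step computation (18), using Cholesky factorizations as in Algorithm 6, is given below (our illustration; g and the diagonal of D are assumed to have been computed from (16) and (17)).

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve, solve_triangular

def lm_step(K, D_diag, g, lam):
    """Levenberg-Marquardt/Newton step s = -(lam*K + K D K)^{-1} K g computed
    via the block formula (18) in O(n_sv^3 + n^2) operations."""
    sv = np.flatnonzero(D_diag != 0)                    # "support vectors"
    nsv = np.flatnonzero(D_diag == 0)                   # "non support vectors"
    A_sv = cholesky(K[np.ix_(sv, sv)])                  # upper triangular, A^T A = K_sv
    D_sv = D_diag[sv]
    B = lam * np.eye(len(sv)) + (A_sv * D_sv) @ A_sv.T  # B = lam*I + A_sv D_sv A_sv^T
    B_chol = cholesky(B)                                # raises LinAlgError if B is not PD
    h = g[sv] - (1.0 / lam) * D_sv * (K[np.ix_(sv, nsv)] @ g[nsv])
    s = np.empty_like(g)
    s[sv] = -solve_triangular(A_sv, cho_solve((B_chol, False), A_sv @ h), lower=False)
    s[nsv] = -g[nsv] / lam
    return s
```

For small problems the result can be checked against the dense solve -np.linalg.solve(lam * K + (K * D_diag) @ K, K @ g).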
As discussed above, the flat part in the loss (cf. Figure 7) provides computational value by reducing $n_{sv}$. But we feel it may also help in leading to a better local minimum. Take for instance a Gaussian kernel and consider an unlabeled point far from the labeled points. At the beginning of the optimization, the output on that point will be zero.13 This unlabeled point does not contribute to pushing the decision boundary one way or the other. This seems like a satisfactory behavior: it is better to wait until we have more information before taking a decision on an unlabeled point for which we are unsure. In Table 3 we compare the performance of the flat loss in Figure 7 and the original quadratic loss used in (5). The flat loss yields a huge gain in performance on 2moons. On the other data sets the two losses perform somewhat similarly. From a computational point of view, the flat part in the loss can sometimes reduce the training time by a factor of 10, as shown in Table 3.
5. Experiments
This section is organized around a set of empirical issues:
13. This is true only for balanced problems; otherwise, the output is b.
            Error rate             Training time (s)
            Flat    Quadratic      Flat    Quadratic
g50c        6.1     5.3            15      39
Text        5.4     7.7            2180    2165
Uspst       18.6    15.9           233     2152
Isolet      32.2    27.1           168     1253
Coil20      24.1    24.7           152     1244
Coil3       61.5    58.4           6       8
2moons      11      66.4           1.7     1.2
Table 3: Comparison of the Newton S3VM method with two different losses: the one with a flat part in the middle (see Figure 7) and the standard quadratic loss (Figure 2). Left: error rates on the unlabeled set; right: average training time in seconds for one split and one binary classifier training (with annealing and dichotomic search on the threshold as explained in the experimental section). The implementations have not been optimized, so the training times only constitute an estimate of the relative speeds.
1. While S3VMs have been very successful for text classification (Joachims, 1999), there are many data sets where they do not return state-of-the-art empirical performance (Chapelle and Zien, 2005). This performance variability is conjectured to be due to local minima problems. In Section 5.3, we discuss the suitability of the S3VM objective function for semi-supervised learning. In particular, we benchmark current S3VM implementations against the exact, globally optimal solution and we discuss whether one can expect significant improvements in generalization performance by better approaching the global solution.
2. Several factors influence the performance and behavior of S3VM algorithms. In Section 5.4 we study their quality of optimization, generalization performance, sensitivity to hyperparameters, effect of annealing and the robustness to uncertainty in class ratio estimates.
3. S3VMs were originally motivated by Transductive learning, the problem of estimating labels of unlabeled examples without necessarily producing a decision function over the entire input space. However, S3VMs are also semi-supervised learners as they are able to handle unseen test instances. In Section 5.5, we run S3VM algorithms in an inductive mode and analyze performance differences between unlabeled and test examples.
4. There is empirical evidence that S3VMs exhibit poor performance on "manifold" type data (where graph-based methods typically excel) or when the data has many distinct sub-clusters (Chapelle and Zien, 2005). We explore the issue of data geometry and S3VM performance in Section 5.6.
At the outset, we point out that this section does not provide an exhaustive cross-comparison
between algorithms. Such a comparison would require, say, cross-validation over multiple hyperparameters, randomization over choices of labels, dichotomic search to neutralize balance constraint
differences and handling different choices of annealing sequences. This is computationally quite
demanding and, more seriously, statistically brittle due to the lack of labeled validation data in
semi-supervised tasks. Our goal, therefore, is not so much to establish a ranking of algorithms
reviewed in this paper, but rather to observe their behavior under a neutral experimental protocol.
We next describe the data sets used in our experiments. Note that our choice of data sets is biased towards multi-class manifold-like problems which are particularly challenging for S3VMs. Because of this choice, the experimental results do not show the typically large improvements one might expect over standard SVMs. We caution the reader not to draw the conclusion that S3VM is a weak algorithm in general, but that it often does not return state-of-the-art performance on problems of this nature.
5.1 Data Sets
Most data sets come from Chapelle and Zien (2005). They are summarized in Table 4.
data set    classes   dims    points   labeled
g50c        2         50      550      50
Text        2         7511    1946     50
Uspst       10        256     2007     50
Isolet      9         617     1620     50
Coil20      20        1024    1440     40
Coil3       3         1024    216      6
2moons      2         102     200      2
Table 4: Basic properties of benchmark data sets.
The artificial data set g50c is inspired by Bengio and Grandvalet (2004): examples are generated
from two standard normal multi-variate Gaussians, the labels correspond to the Gaussians, and
the means are located in 50-dimensional space such that the Bayes error is 5%. The real world
data sets consist of two-class and multi-class problems. The Text data set is defined using the
classes mac and mswindows of the Newsgroup20 data set preprocessed as in Szummer and Jaakkola
(2001). The Uspst set contains the test data part of the well-known USPS data on handwritten
digit recognition. Isolet is a subset of the ISOLET spoken letter database (Cole et al., 1990) containing the speaker sets 1, 2 and 3 and 9 confusing letters {B,C,D,E,G,P,T,V,Z}. In Coil20 (respectively Coil3), the data are gray-scale images of 20 (respectively 3) different objects taken from different angles, in steps of 5 degrees (Nene et al., 1996). The Coil3 data set was first used by Chapelle et al. (2006c) and is particularly difficult since the 3 classes are 3 cars which look alike (see Figure 8).
Figure 8: The 3 cars from the COIL data set, subsampled to 32×32
Finally, 2moons has been used extensively in the semi-supervised learning literature (see for
instance Zhu and Ghahramani, 2002, and Figure 1). For this data set, the labeled points are fixed
and new unlabeled points are randomly generated for each repetition of the experiment.
5.2 Experimental Setup
To minimize the influence of external factors, unless stated otherwise, the experiments were run in
the following normalized way:
Hyperparameters The Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ was used. For simplicity the constant $C^\star$ in (1) was set to $C$.14 The same hyperparameters $C$ and $\sigma$ have been used for all the methods. They are found with cross-validation by training an inductive SVM on the entire data set (the unlabeled points being assigned their real label). These values are reported in Table 5. Even though it seems fair to compare the algorithms with the same hyperparameters, there is a possibility that some regime of hyperparameter settings is more suitable for a particular algorithm than for another. We discuss this issue in Section 5.4.3.
Multi-class Data sets with more than two classes are learned with a one-versus-the-rest approach.
The reported objective value is the mean of the objective values of the different classifiers.
We also conducted experiments on pair-wise binary classification problems (cf. Section 5.6.1)
constructed from the multi-class data sets.
Objective value Even though the objective function that we want to minimize is (5), some algorithms like ∇S3VM use another (differentiable) objective function. In order to compare the objective values of the different algorithms, we do the following: after training, we predict the labels of the unlabeled points and train a standard SVM on this augmented labeled set (with $p = 2$ in (2) and $C$, $C^\star$ weightings on originally labeled and unlabeled terms respectively). The objective value reported is the one of this SVM.
Balancing constraint For the sake of simplicity, we set r in the balancing constraint (3) to the true
ratio of the positive points in the unlabeled set. This constraint is relatively easy to enforce
for all algorithms presented in Section 3. It is more difficult for algorithms in Section 4 and,
for them we used the dichotomic search described at the beginning of Section 4.
For a given data set, we randomly split the data into a labeled set and an unlabeled set. We refer
to the error rate on the unlabeled set as the unlabeled error to differentiate it from the test error
which would be computed on an unseen test set. Results are averaged over 10 random splits. The
difference between unlabeled and test performance is discussed in Section 5.5.
5.3 Suitability of the S3VM Objective Function
Table 1 shows unlabeled error rates for common S3VM implementations on two small data sets, Coil3 and 2moons. On these data sets, we are able to also run Branch-and-Bound and get the true globally optimal solution. We see that the global optimum corresponds to a perfect solution, while the local minima found by approximate implementations yield very poor accuracies. From these results, it appears that the minimization of the S3VM objective function makes good sense, even
14. Alternatively, one could set $C^\star = C\,\frac{l}{u}$ to have equal contribution from labeled and unlabeled points.
            σ       C
g50c        38      19
Text        3.5     31
Uspst       7.4     38
Isolet      15      43
Coil20      2900    37
Coil3       3000    100
2moons      0.5     10
Table 5: Values of the hyperparameters used in the experiments.
though the performance of practical S3VM implementations may not consistently reflect this due to local minima problems.
Table 6 records the rank correlation between unlabeled error and objective function. The rank correlation has been computed in the following way. For each split, we take 10 different solutions and compute the associated unlabeled error and objective value. Ideally, we would like to sample these solutions at random around local minima. But since it is not obvious how to do such a sampling, we simply took the solutions given by the different S3VM algorithms as well as 4 "intermediate" solutions obtained as follows. A standard SVM is trained on the original labeled set and
a fraction of the unlabeled set (with their true labels). The fraction was either 0, 10, 20 or 30%.
The labels of the remaining unlabeled points are assigned through a thresholding of the real value
outputs of the SVM. This threshold is such that the balancing constraint (3) is satisfied. Finally, an
SVM is retrained using the entire training set. By doing so, we “sample” solutions varying from an
inductive SVM trained on only the labeled set to the optimal solution. Table 6 provides evidence
that the unlabeled error is correlated with the objective values.
            g50c   Text   Uspst   Isolet   Coil20   Coil3   2moons
Coefficient 0.2    0.67   0.24    0.23     0.4      0.17    0.45
Table 6: Kendall’s rank correlation (Abdi, 2006) between the unlabeled error and the objective
function averaged over the 10 splits (see text for details).
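For reference, the rank correlation used in Table 6 can be computed with standard tools; the numbers below are toy placeholders for illustration only, not values from the experiments.

```python
import numpy as np
from scipy.stats import kendalltau

# Toy placeholder values: objective value and unlabeled error of 10 candidate
# solutions on one split (in the experiments these come from the S3VM runs).
objective = np.array([12.1, 11.4, 10.9, 10.2, 9.8, 9.5, 9.1, 8.7, 8.2, 7.9])
error = np.array([0.31, 0.28, 0.30, 0.22, 0.25, 0.19, 0.21, 0.15, 0.12, 0.10])

tau, p_value = kendalltau(objective, error)
print(f"Kendall tau: {tau:.2f}")
```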
5.4 Behavior of S3VM Algorithms
Several factors influence the performance and behavior of S3VM algorithms. We study their quality of optimization, generalization performance, sensitivity to hyperparameters, effect of annealing and
the robustness to uncertainty in class ratio estimates.
5.4.1 Quality of Minimization
In Table 7 we compare the different algorithms in terms of minimization of the objective function. ∇S3VM and cS3VM clearly appear to be the methods achieving the lowest objective values. However, as we will see in the next section, this does not necessarily translate into lower unlabeled errors.
              ∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA    Newton
Average rank  1.7     1.9     4.5    4.9         4.3    3.7
Table 7: For each data set and each split, the algorithms were ranked according to the objective
function value they attained. This table shows the average ranks. These ranks are only
about objective function values; error rates are discussed below.
5.4.2 Quality of Generalization
Table 8 reports the unlabeled errors of the different algorithms on our benchmark data sets. A first
observation is that most algorithms perform quite well on g50c and Text. However, the unlabeled
errors on the other data sets, Uspst, Isolet, Coil20, Coil3, 2moons (see also Table 1) are poor and
sometimes even worse than a standard SVM. As pointed out earlier, these data sets are particularly challenging for S3VMs. Moreover, the honors are divided and no algorithm clearly outperforms the others. We therefore cannot give a recommendation on which one to use.
            ∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA    Newton   SVM    SVM-5cv
g50c        6.7     6.4     6.3    6.2         7      6.1      8.2    4.9
Text        5.1     5.3     8.3    8.1         5.7    5.4      14.8   2.8
Uspst       15.6    36.2    16.4   15.5        27.2   18.6     20.7   3.9
Isolet      25.8    59.8    26.7   30          39.8   32.2     32.7   6.4
Coil20      25.6    30.7    26.6   25.3        12.3   24.1     24.1   0
Table 8: Unlabeled errors of the different S3VM implementations. The next to last column is an SVM trained only on the labeled data, while the last column reports 5-fold cross-validation results of an SVM trained on the whole data set using the labels of the unlabeled points. The values in these two columns can be taken as upper and lower bounds on the best achievable error rates. See Table 1 for results on Coil3 and 2moons.
Note that these results may differ from the ones previously reported by Chapelle and Zien
(2005), Collobert et al. (2006), Sindhwani et al. (2006), and Chapelle et al. (2006a) on the same
data sets. Most of this difference comes from the choice of the hyperparameters. Indeed, as explained below, several algorithms are rather sensitive to the choice of hyperparameters. The exact
experimental setting is also different. In results reported elsewhere, r is often estimated from the labeled set and the constraint is often a “soft” one. In Table 8, the hard balance constraint is enforced
for all methods assuming r to be known exactly. Finally, in Chapelle et al. (2006a), only pair-wise
223
C HAPELLE , S INDHWANI AND K EERTHI
binary problems were considered for the cS3VM algorithm, while results in Table 8 for multi-class data sets are obtained under a one-versus-the-rest setup.
5.4.3 Sensitivity to Hyperparameters
As mentioned before, it is possible that different algorithms excel in different hyperparameter
regimes. This is more likely to happen due to better local minima at different hyperparameters
as opposed to a better global solution (in terms of error rates).
Due to computational constraints, instead of setting the three hyperparameters ($C$, $C^\star$ and the Gaussian kernel width $\sigma$) by cross-validation for each of the methods, we explored the influence of these hyperparameters on one split of the data. More specifically, Table 9 shows the relative improvement (in %) that one can gain by selecting other hyperparameters. These numbers may be seen as a measure of the robustness of a method. Note that these results have to be interpreted carefully because they are only on one split: for instance, it is possible that "by chance" the method did not get stuck in a bad local minimum for one of the hyperparameter settings, leading to a larger
number in Table 9.
            ∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA    Newton
g50c        31.2    31.2    27.8   13.3        35     7.7
Text        22      7.1     29.2   19.9        34.4   1.1
Uspst       12      70.2    19.2   0           41     15.6
Isolet      7.6     57.6    8      0           4.9    2.6
Coil20      46.4    42.7    27.9   5.8         5.7    16.9
Coil3       32.6    39.2    15.3   20.2        5.8    24.6
2moons      45.6    50      54.1   13.5        23.5   0
Mean        28.2    42.6    25.9   10.4        21.5   9.8
Table 9: On the 1st split of each data set, 27 sets of hyperparameters $(\sigma', C', C^{\star\prime})$ have been tested, with $\sigma' \in \{\frac{1}{2}\sigma, \sigma, 2\sigma\}$, $C' \in \{\frac{1}{10}C, C, 10C\}$, and $C^{\star\prime} \in \{\frac{1}{100}C', \frac{1}{10}C', C'\}$. The table shows the relative improvement (in %) by taking the best hyperparameters over the default ones.
By measuring the variation of the unlabeled errors with respect to the choice of the hyperparameters, Table 10 records an indication of the sensitivity of the method with respect to that choice.
From this point of view, S3VMlight appears to be the most stable algorithm.
            ∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA    Newton
Variance    6.8     8.5     6.5    2.7         8.4    8.7
Table 10: The variance of the unlabeled errors has been computed over the 27 possible hyperparameter settings (cf. Table 9). The table shows these variances averaged over the data sets. A small number shows that a given method is not too sensitive to the choice of the hyperparameters.
Finally, from the experiments in Table 9, we observed that in some cases a smaller value of $C^\star$ is helpful. We have thus rerun the algorithms on all the splits with $C^\star$ divided by 100: see Table 11. The overall performance does not necessarily improve, but the very poor results become better (see for instance Uspst, Isolet, Coil20 for cS3VM).
            ∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA    Newton
g50c        8.3     8.3     8.5    8.4         6.7    7.5
Text        5.7     5.8     8.5    8.1         6.5    14.5
Uspst       14.1    15.6    14.9   14.5        11     19.2
Isolet      27.8    28.5    25.3   29.1        26.9   32.1
Coil20      23.9    23.6    23.6   21.8        18.9   24.6
Coil3       60.8    55.0    56.3   59.2        60.6   60.5
2moons      65.0    49.8    66.3   68.7        30     33.5
Table 11: Same as Table 8, but with $C^\star$ divided by 100.
5.4.4 Effect of Annealing Schedule
All the algorithms described in this paper use some sort of annealing (e.g., gradually decreasing $T$ in DA or increasing $C^\star$ in S3VMlight in an outer loop) where the role of the unlabeled points is progressively increased. The three ingredients to fix are:
1. The starting point which is usually chosen in such a way that the unlabeled points have very
little influence and the problem is thus almost convex.
2. The stopping criterion which should be such that the annealed objective function and the
original objective function are very close.
3. The number of steps. Ideally, one would like to have as many steps as possible, but for
computational reasons the number of steps is limited.
In the experimental results presented in Figure 9, we only varied the number of steps. The starting and final values are as indicated in the description of the algorithms. The original CCCP paper did not have annealing and we used the same scheme as for ∇S3VM: $C^\star$ is increased exponentially from $10^{-3}C$ to $C$. For DA and ∇DA, the final temperature is fixed at a small constant and the decrease rate $R$ is such that we have the desired number of steps. For all algorithms, one step means that there is no annealing.
One has to be cautious when drawing conclusions from this plot. Indeed, the results are only for
one data set, one split and fixed hyperparameters. The goal is to get an idea of whether the annealing
for a given method is useful; and if so, how many steps should be taken. From this plot, it seems
that:
• All methods, except Newton, seem to benefit from annealing. However, on some other data
sets, annealing improved the performances of Newton’s method (results not shown).
• Most methods do not require a lot of steps. More precisely, we have noticed that the number
of steps does not matter as much as the minimization around a “critical” value of the annealing
parameter; if that point is missed, then the results can be bad.
[Figure 9 plots the test error (y-axis, from 0 to 0.4) against the number of annealing steps (x-axis: 1, 2, 4, 8, 16, 32) for ∇S3VM, cS3VM, CCCP, S3VMlight, DA, ∇DA and Newton.]

Figure 9: Influence of the number of annealing steps on the first split of Text.
               g50c   Text   Uspst  Isolet  Coil20  Coil3  2moons
No annealing    7.9   14.7   21.3    32.4    26.1   49.3    67.1
Annealing       6.3    8.3   16.4    26.7    26.6   56.6    63.1

Table 12: Performance of CCCP with and without annealing. The annealing schedule is the same as the one used for ∇S3VM and Newton: C* is increased in 10 steps from its final value divided by 1000.
• DA seems to be the method that relies the most on a relatively slow annealing schedule, while ∇DA can use a faster one.
• CCCP was originally proposed without annealing, but it seems that its performance can be improved by annealing. To confirm this, Table 12 compares the results of CCCP with and without annealing on the 10 splits of all the data sets.
The results provided in Table 8 are with annealing for all methods.
5.4.5 Robustness to Class Ratio Estimate
Several results reported in the previous tables are better than, for instance, the results in Chapelle
and Zien (2005); Chapelle et al. (2006a). This is because (a) we took into account the knowledge
of the true class ratio among the unlabeled examples, and (b) we enforced the constraint (3) exactly
(with the dichotomic search described at the beginning of Section 4).
Of course, in practice the true number of positive points is unknown. One can estimate it from
the labeled points as:
\[
r = \frac{1}{2}\left(\frac{1}{l}\sum_{i=1}^{l} y_i + 1\right).
\]
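For illustration, the sketch below computes this estimate from labels y_i ∈ {−1, +1} and shows one way a hard balance constraint can be enforced by a dichotomic (bisection) search on an additive offset applied to the predictions on the unlabeled points. The function names and tolerance are assumptions, not the exact procedure of Section 4.

    import numpy as np

    def estimate_class_ratio(y_labeled):
        """Estimated fraction of positive points, r = (mean(y) + 1) / 2, with y_i in {-1, +1}."""
        return 0.5 * (np.mean(y_labeled) + 1.0)

    def enforce_balance_by_offset(scores, r, tol=1e-3, max_iter=100):
        """Bisection on an additive offset b so that the fraction of unlabeled
        points with scores + b > 0 matches the target ratio r.
        Stops after max_iter iterations even if tol is not met exactly."""
        lo, hi = -np.max(scores) - 1.0, -np.min(scores) + 1.0
        for _ in range(max_iter):
            b = 0.5 * (lo + hi)
            frac_pos = np.mean(scores + b > 0)
            if abs(frac_pos - r) <= tol:
                break
            if frac_pos < r:
                lo = b   # larger offset -> more points classified as positive
            else:
                hi = b
        return b

    # Example usage on synthetic labels and scores.
    rng = np.random.default_rng(0)
    y_lab = rng.choice([-1, 1], size=50)
    scores = rng.normal(size=1000)
    b = enforce_balance_by_offset(scores, estimate_class_ratio(y_lab))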
Table 13 presents the generalization performances in this more realistic context, where class ratios are estimated as above and the original soft balance constraint is used where applicable (recall the discussion in Section 4).
             g50c   Text   Uspst  Isolet  Coil20  Coil3  2moons
∇S3VM         7.2    6.8   24.1    48.4    35.4   64.4    62.2
cS3VM         6.6    5     41.5    58.3    51.5   59.7    33.7
CCCP          6.7   12.8   24.3    43.8    34.5   59.4    55.6
S3VMlight     7.5    9.2   24.4    36      25.3   56.7    68.8
∇DA           8.4    8.1   29.8    46      12.3   61.7    22.5
Newton        5.8    6.1   25      45.5    25.4   62.9     8.9
SVM           9.1   23.1   24.2    38.4    26.2   51.8    44.4

Table 13: Constant r estimated from the labeled set. For the methods of Section 4, the original constraint (12) is used; there is no dichotomic search (see the beginning of Section 4).
5.5 Transductive Versus Semi-supervised Learning
S3 VMs were introduced as Transductive SVMs, originally designed for the task of directly estimating labels of unlabeled points. However S3 VMs provide a decision boundary in the entire input
space, and so they can provide labels of unseen test points as well. For this reason, we believe
that S3 VMs are inductive semi-supervised methods and not strictly transductive. A discussion on
the differences between semi-supervised learning and transduction can be found in Chapelle et al.
(2006b, Chapter 25).15
We expect S3VMs to perform equally well on the unlabeled set and on an unseen test set. To test this hypothesis, we took out 25% of the unlabeled set and used it as an unseen test set. We performed 420 experiments (6 algorithms, 7 data sets and 10 splits). Based on these 420 pairs of error rates, we did not observe a significant difference at the 5% level. Also, for each of the 7 data sets (resp. 6 algorithms), there were no statistically significant differences in the 60 (resp. 70) pairs of error rates.
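As an illustration of this kind of comparison, the sketch below runs a paired two-sided test on pairs of error rates. The choice of the Wilcoxon signed-rank test and the synthetic data are assumptions; the text does not specify the exact test used.

    import numpy as np
    from scipy.stats import wilcoxon

    # Hypothetical paired error rates: one pair per (algorithm, data set, split)
    # combination, 420 in total.  The values here are synthetic.
    rng = np.random.default_rng(1)
    err_unlabeled = rng.uniform(0.05, 0.4, size=420)
    err_test = err_unlabeled + rng.normal(0.0, 0.01, size=420)

    # Paired two-sided test of the null hypothesis that the differences are
    # symmetric around zero; reject at the 5% level if p < 0.05.
    stat, p_value = wilcoxon(err_unlabeled, err_test)
    print(f"p-value = {p_value:.3f}")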
Similar experiments were performed by Collobert et al. (2006, Section 6.2). Based on 10 splits, the error rate was found to be better on the unlabeled set than on the test set. The authors made the hypothesis that when the test and training data are not identically distributed, transduction can be helpful. Indeed, because of small sample effects, the unlabeled and test sets could appear as not coming from the same distribution (this is even more likely in high dimension).
We considered the g50c data set and biased the split between the unlabeled and test sets such that there is an angle between the principal directions of the two sets. Figure 10 shows a correlation between the difference in error rates and this angle: the error on the test set deteriorates as the angle increases. This confirms the hypothesis stated above.
15. Paper is available at http://www.kyb.tuebingen.mpg.de/ssl-book/discussion.pdf.
[Figure 10 plots the error difference (y-axis, from about −0.2 to 0.05) against the angle (x-axis, from about 0.4 to 1.1).]

Figure 10: The test set of g50c has been chosen with a bias such that there is an angle between the principal directions of the unlabeled and test sets. The figure shows the difference in error rates (negative means a better error rate on the unlabeled set) as a function of the angle for different random (biased) splits.
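For completeness, here is a minimal sketch of how the angle between the principal directions of the unlabeled and test sets can be computed; the helper names and the synthetic data are illustrative.

    import numpy as np

    def principal_direction(X):
        """First principal component (unit vector) of the rows of X."""
        Xc = X - X.mean(axis=0)
        # Leading right singular vector of the centered data matrix.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[0]

    def principal_angle(X_unlabeled, X_test):
        """Angle (in radians) between the first principal directions of two sets."""
        u = principal_direction(X_unlabeled)
        v = principal_direction(X_test)
        # Directions are sign-invariant, so take the absolute cosine.
        return np.arccos(np.clip(abs(u @ v), 0.0, 1.0))

    # Example on synthetic data in 50 dimensions (g50c-like).
    rng = np.random.default_rng(0)
    X_u, X_t = rng.normal(size=(400, 50)), rng.normal(size=(150, 50))
    print(principal_angle(X_u, X_t))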
5.6 Role of Data Geometry
There is empirical evidence that S3VMs exhibit poor performance on "manifold"-type data or when the data has many distinct sub-clusters (Chapelle and Zien, 2005). We now explore this issue and propose a hybrid method combining S3VM and LapSVM (Sindhwani et al., 2005).
5.6.1 Effect of Multiple Clusters
In Chapelle et al. (2006a), cS3VM exhibited poor performance on multiclass problems with one-versus-the-rest training, but worked well on pairwise binary problems that were constructed (using all the labels) from the same multiclass data sets. Note that in practice it is not possible to do semi-supervised one-vs-one multiclass training because the labels of the unlabeled points are truly unknown.
We compared the different algorithms on pairwise binary classification tasks for all the multiclass problems. Results for these pairwise binary tasks are shown in the Uspst, Isolet, Coil20 and Coil3 columns of Table 14. Except on Coil3, most S3VM algorithms show improved performance. There are two candidate explanations for this behavior:
1. The binary classification problems in one-versus-the-rest are unbalanced, and this creates difficulties for S3VMs.
2. In a one-versus-the-rest approach, the negative class is the concatenation of several classes and is thus made up of several clusters. This might accentuate the local minimum problem of S3VMs.
            Uspst  Isolet  Coil20  Coil3  Uspst2
∇S3VM        1.9    4.8     2.8    45.8   15.6
cS3VM        2     11.8     3.9    48.7   25.7
CCCP         2.8    5.5     2.9    40.3   16.6
S3VMlight    2.9    5.7     3.1    41.3   16.1
∇DA          4      5       2.7    44.5   20
Newton       1.9    6.3     2.4    47.2   16
SVM          5.3    7.3     3.3    36.7   17.7
SVM-5cv      0.9    1.2     0       0      3

Table 14: Experiments in a pairwise binary setting. Uspst2 is the same set as Uspst, but where the task is to classify digits 0 to 4 versus 5 to 9.
To test these hypotheses, we created a binary version of Uspst by classifying digits 0 to 4 versus 5 to 9: this data set (Uspst2 in Table 14) is balanced, but each class is made up of several clusters. The fact that the S3VM algorithms were not able to perform significantly better than the SVM baseline tends to support the second hypothesis: S3VM results deteriorate when there are several clusters per class.
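The construction of Uspst2 amounts to a simple relabeling of the ten digit classes; a minimal sketch (the sign convention is arbitrary):

    import numpy as np

    def uspst2_labels(digit_labels):
        """Binary labels for the Uspst2 task: digits 0-4 -> +1, digits 5-9 -> -1."""
        digit_labels = np.asarray(digit_labels)
        return np.where(digit_labels <= 4, 1, -1)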
5.6.2 Hybrid S3VM-Graph Methods
Recall that the results in Table 8 reinforce the empirical evidence that S3VMs do not return state-of-the-art performance on "manifold"-like data sets. On these data sets the cluster assumption is presumably satisfied under an intrinsic "geodesic" distance rather than the original Euclidean distance between data points. It is reasonable, therefore, to attempt to combine S3VMs with choices of kernels that conform to the particular geometry of the data.
LapSVM (Sindhwani et al., 2005) is a popular semi-supervised algorithm in which a kernel function is constructed from the Laplacian of an adjacency graph that models the data geometry; this kernel is then used with a standard SVM. This procedure was shown to be very effective on data sets with a manifold structure. In Table 15 we report results with a hybrid S3VM-graph method: the kernel is constructed as in LapSVM,16 but then it is plugged into S3VMlight. Such a hybrid was first described in Chapelle et al. (2006a), where cS3VM was combined with LapSVM.
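As a rough illustration of this kind of construction, the sketch below deforms a base RBF kernel with the Laplacian of a k-nearest-neighbor graph, using the warped-kernel Gram matrix K_tilde = K − K (I + M K)⁻¹ M K with M proportional to the graph Laplacian. The neighborhood size, edge weights and scaling constant are assumptions and do not correspond to the hyperparameters used in the experiments.

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.neighbors import kneighbors_graph

    def laplacian_deformed_gram(X, gamma=0.1, n_neighbors=10, scale=1.0):
        """Gram matrix of an RBF kernel warped by a kNN-graph Laplacian.

        Computes K_tilde = K - K (I + M K)^{-1} M K with M = scale * L,
        where L is the unnormalized Laplacian of a symmetrized kNN graph on X.
        """
        n = X.shape[0]
        K = rbf_kernel(X, X, gamma=gamma)

        # Symmetrized kNN adjacency and its unnormalized Laplacian L = D - W.
        W = kneighbors_graph(X, n_neighbors=n_neighbors, mode="connectivity").toarray()
        W = np.maximum(W, W.T)
        L = np.diag(W.sum(axis=1)) - W

        M = scale * L
        K_tilde = K - K @ np.linalg.solve(np.eye(n) + M @ K, M @ K)
        return K_tilde

    # Example: deformed Gram matrix for 200 synthetic points in 5 dimensions.
    # The result could be fed to any kernel machine accepting precomputed kernels.
    X = np.random.default_rng(0).normal(size=(200, 5))
    K_warped = laplacian_deformed_gram(X)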
The results of this hybrid method are very satisfactory, often outperforming both LapSVM and S3VMlight.
Such a method combines the strengths of both S3VM and graph-based approaches: the S3VM adds robustness to the construction of the graph, while the graph enforces the right cluster structure and alleviates the local minima problems of S3VMs. We believe that this kind of technique is probably one of the most robust and powerful approaches to a semi-supervised learning problem.
16. Hyperparameters were chosen based on experiments in Sindhwani et al. (2005) without any extensive tuning.
                      g50c   Text   Uspst  Isolet  Coil20  Coil3  2moons
LapSVM                 6.4   11     11.4    41.2    11.9   20.6     7.8
Exact r (Table 8 setting)
  S3VMlight            6.2    8.1   15.5    30.0    25.3   56.7    68.8
  LapSVM-S3VMlight     4.6    8.3    8.8    46.5    12.5   17.9     5.1
Estimated r (Table 13 setting)
  S3VMlight            7.5    9.2   24.4    36.0    25.3   56.7    68.8
  LapSVM-S3VMlight     6.1    9.0   19.6    49.0    12.5   17.9     5.1

Table 15: Comparison of a graph-based method, LapSVM (Sindhwani et al., 2005), with S3VMlight and the hybrid LapSVM-S3VMlight under the settings of Tables 8 and 13.
6. A Note on Computational Complexity
Even though a detailed comparison of the computational complexity of the different algorithms is beyond the scope of this paper, we can still give the following rough picture.
First, all methods use annealing, and so the complexity depends on the number of annealing steps. This dependency is probably sublinear because, when the number of steps is large, retraining after each (small) step is less costly. We can divide the methods into two categories:
1. Methods whose complexity is of the same order as that of a standard SVM trained with the predicted labels of the unlabeled points, which is O(n_sv^3 + n^2), where n_sv is the number of points lying in a non-linear part of the loss function. This is clearly the case for S3VMlight since it relies on an SVM optimization. Note that the training time of this algorithm can be sped up by swapping labels in "batch" rather than one by one (Sindhwani and Keerthi, 2006). CCCP is also in this category, as the optimization problem it has to solve at each step is very close to an SVM. Finally, even if the Newton method is not directly solved via an SVM, its complexity is also O(n_sv^3 + n^2) and we include it in this category. For both CCCP and Newton, the possibility of having a flat part in the middle of the loss function can reduce n_sv and thus the complexity. We have indeed observed with the Newton method that convergence can be an order of magnitude faster when the loss includes this flat part in the middle (see Table 3).
2. Gradient-based methods, namely ∇S3VM, cS3VM and ∇DA, do not have any linear part in the objective function and so they scale as O(n^3). Note that it is possible to devise a gradient-based method with a loss function that contains some linear parts; the complexity of such an algorithm would be O(n n_sv^2 + n^2).
The original DA algorithm alternates between the optimization of w and p_u and can be understood as block coordinate optimization. We found that it was significantly slower than the other algorithms; its direct gradient-based optimization counterpart, ∇DA, is usually much faster.
Finally, note that even if the algorithms of the second category have a complexity of O(n^3), one can find an approximate solution by reducing the dimensionality of the problem from n to m and obtain a complexity of O(nm^2). For instance, Chapelle et al. (2006a) report a speed-up of 100 times without loss in accuracy for cS3VM.
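The exact approximation used for cS3VM is not restated here; one generic way to obtain an O(nm^2) cost is to restrict computations to m basis functions chosen from the data, for example with a Nyström-style feature map as sketched below (the function name, subset size and kernel parameters are illustrative).

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    def reduced_set_features(X, m=200, gamma=0.1, seed=0):
        """Map the n points to m-dimensional features so that linear methods on
        the features approximate kernel methods on the full n x n Gram matrix.

        Columns of the kernel evaluated on a random subset of m points are
        whitened by the inverse square root of the m x m sub-Gram matrix.
        Cost is O(n m^2) instead of O(n^3)."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(X.shape[0], size=m, replace=False)
        K_nm = rbf_kernel(X, X[idx], gamma=gamma)        # n x m
        K_mm = rbf_kernel(X[idx], X[idx], gamma=gamma)   # m x m
        # Inverse square root of K_mm via a regularized eigendecomposition.
        w, V = np.linalg.eigh(K_mm + 1e-8 * np.eye(m))
        whitener = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
        return K_nm @ whitener                           # n x m feature matrix

    # Example: features for 500 synthetic points in 20 dimensions.
    X = np.random.default_rng(1).normal(size=(500, 20))
    Phi = reduced_set_features(X, m=100)                 # shape (500, 100)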
Ultimately a semi-supervised learning algorithm should be able to handle data sets with millions
of unlabeled points. The best way of scaling up S3 VMs is still an open question and should be the
topic of future research.
7. Conclusion
When practical S3 VM implementations fail to give good results on a problem, one might suspect
that either: (a) the cluster assumption does not hold; or, (b) the cluster assumption holds but local minima problems are severe; or, (c) the S3 VM objective function is unable to implement the
cluster assumption. We began our empirical study by benchmarking current S 3 VM implementations against a global optimizer. Our results (see Section 5.3) narrowed the cause for performance
loss down to suboptimal local minima, and established a correlation between generalization performance and the S3 VM objective function. For problems where the cluster assumption is true, we
expect the S3 VM objective to indeed be an appropriate quantity to minimize. Due to non-convexity
however, this minimization is not completely straightforward—an assortment of optimization techniques have been brought to bear on this problem with varying degrees of success. In this paper,
we have reviewed these techniques and studied them empirically, taking several subtle differences
into account. In a neutral experimental protocol, we were unable to identify any single technique
as being consistently superior to another in terms of generalization performance. We believe better
methods are still needed to optimize the S3 VM objective function.
While S3VMs return good performance on textual data sets, they are currently not competitive with graph-based methods on domains such as image classification, which are often characterized by multiple, highly non-Gaussian clusters. A particularly promising class of techniques (see Section 5.6.2) is based on combining S3VMs with graph methods.
S3 VMs have been sparingly explored in domains other than text and image classification. New
application domains may provide additional insight into the behavior of S 3 VM methods while enhancing their general appeal for semi-supervised learning. Another major theme is the extension of
S3 VMs for structured output problems, possibly building on one of several recent lines of work for
handling complex supervised learning tasks. A first step towards such an extension would require a
clear statement of the cluster assumption applicable to the semi-supervised structured output setting.
These are fertile areas for future research.
References
H. Abdi. Kendall rank correlation. In N.J. Salkind, editor, Encyclopedia of Measurement and
Statistics. SAGE, 2006.
A. Astorino and A. Fuduli. Nonsmooth optimization techniques for semi-supervised classification.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2135–2142, 2007.
Y. Bengio and Y. Grandvalet. Semi-supervised learning by entropy minimization. In Advances in
Neural Information Processing Systems, volume 17, 2004.
K. Bennett and A. Demiriz. Semi-supervised support vector machines. In Advances in Neural
Information Processing Systems 12, 1998.
T. De Bie and N. Cristianini. Semi-supervised learning using semi-definite programming. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning. MIT Press, 2006.
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers.
In Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, New
York, NY, 1992.
O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5):1155–
1178, 2007.
O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Tenth International Workshop on Artificial Intelligence and Statistics, 2005.
O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support
vector machines. Machine Learning, 46(1-3):131–159, 2002.
O. Chapelle, M. Chi, and A. Zien. A continuation method for semi-supervised SVMs. In International Conference on Machine Learning, 2006a.
O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge,
MA, 2006b. URL http://www.kyb.tuebingen.mpg.de/ssl-book.
O. Chapelle, V. Sindhwani, and S. Keerthi. Branch and bound for semi-supervised support vector
machines. In Advances in Neural Information Processing Systems, 2006c.
Y. Chen, G. Wang, and S. Dong. Learning with progressive transductive support vector machine. Pattern Recognition Letters, 24(12):1845–1855, 2003.
R. Cole, Y. Muthusamy, and M. Fanty. The ISOLET spoken letter database. Technical Report CS/E
90-004, Oregon Graduate Institute, 1990.
R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine
Learning Research, 7:1687–1712, 2006.
R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, 1987.
G. Fung and O. Mangasarian. Semi-supervised support vector machines for unlabeled data classification. Optimization Methods and Software, 15:29–44, 2001.
T. Joachims. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning, 1999.
H. Liu and S.-T. Huang. Fuzzy transductive support vector machines for hypertext classification.
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 12(1):21–36,
2004.
S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (coil-20). Technical Report
CUCS-005-96, Columbia Univ., USA, 1996.
B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
M. Seeger. A taxonomy of semi-supervised learning methods. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning. MIT Press, 2006.
M. Silva, T. Maia, and A. Braga. An evolutionary approach to transduction in support vector machines. In Fifth International Conference on Hybrid Intelligent Systems, pages 329–334, 2005.
V. Sindhwani and S. S. Keerthi. Large scale semi-supervised linear SVMs. In SIGIR, 2006.
V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: From transductive to semi-supervised learning. In International Conference on Machine Learning, 2005.
V. Sindhwani, S. Keerthi, and O. Chapelle. Deterministic annealing for semi-supervised kernel
machines. In International Conference on Machine Learning, 2006.
M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In Advances in Neural Information Processing Systems, volume 14, 2001.
V. Vapnik and A. Sterin. On structural risk minimization or overall risk in a problem of pattern
recognition. Automation and Remote Control, 10(3):1495–1503, 1977.
V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., New York, 1998.
L. Wang, X. Shen, and W. Pan. On transductive support vector machines. In J. Verducci, X. Shen,
and J. Lafferty, editors, Prediction and Discovery. American Mathematical Society, 2007.
W. Wapnik and A. Tscherwonenkis. Theorie der Zeichenerkennung. Akademie Verlag, Berlin,
1979.
Z. Wu. The effective energy transformation scheme as a special continuation approach to global
optimization with application to molecular conformation. SIAM Journal on Optimization, 6(3):
748–768, 1996.
L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Advances in
Neural Information Processing Systems, 2004.
A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936,
2003.
X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation.
Technical Report 02-107, CMU-CALD, 2002.