Journal of Machine Learning Research 9 (2008) 203-233    Submitted 5/07; Published 2/08

Optimization Techniques for Semi-Supervised Support Vector Machines

Olivier Chapelle*                                        chap@yahoo-inc.com
Yahoo! Research
2821 Mission College Blvd
Santa Clara, CA 95054

Vikas Sindhwani                                          vikass@cs.uchicago.edu
University of Chicago, Department of Computer Science
1100 E 58th Street
Chicago, IL 60637

Sathiya S. Keerthi                                       selvarak@yahoo-inc.com
Yahoo! Research
2821 Mission College Blvd
Santa Clara, CA 95054

Editor: Nello Cristianini

Abstract

Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S3VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms has recently been proposed for solving S3VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3VM algorithms are studied together, under a common experimental setting.

Keywords: semi-supervised learning, support vector machines, non-convex optimization, transductive learning

1. Introduction

In many applications of machine learning, abundant amounts of data can be cheaply and automatically collected. However, manual labeling for the purposes of training learning algorithms is often a slow, expensive, and error-prone process. The goal of semi-supervised learning is to employ the large collection of unlabeled data jointly with a few labeled examples for improving generalization performance.

The design of Support Vector Machines (SVMs) that can handle partially labeled data sets has naturally been a vigorously active subject. A major body of work is based on the following idea: solve the standard SVM problem while treating the unknown labels as additional optimization variables. By maximizing the margin in the presence of unlabeled data, one learns a decision boundary that traverses through low data-density regions while respecting labels in the input space. In other words, this approach implements the cluster assumption for semi-supervised learning, namely that points in a data cluster have similar labels (Seeger, 2006; Chapelle and Zien, 2005). Figure 1 illustrates a low-density decision surface implementing the cluster assumption on a toy two-dimensional data set. This idea was first introduced by Vapnik and Sterin (1977) under the name Transductive SVM, but since it learns an inductive rule defined over the entire input space, we refer to this approach as Semi-Supervised SVM (S3VM).

Figure 1: Two moons. There are 2 labeled points (the triangle and the cross) and 100 unlabeled points. The global optimum of S3VM correctly identifies the decision boundary (black line).

*. Most of the work was done while at MPI for Biological Cybernetics, Tübingen, Germany.

© 2008 Olivier Chapelle, Vikas Sindhwani and Sathiya S. Keerthi.
Since its first implementation by Joachims (1999), a wide spectrum of techniques has been applied to solve the non-convex optimization problem associated with S3VMs, for example, local combinatorial search (Joachims, 1999), gradient descent (Chapelle and Zien, 2005), continuation techniques (Chapelle et al., 2006a), convex-concave procedures (Fung and Mangasarian, 2001; Collobert et al., 2006), semi-definite programming (Bie and Cristianini, 2006; Xu et al., 2004), non-differentiable methods (Astorino and Fuduli, 2007), deterministic annealing (Sindhwani et al., 2006), and branch-and-bound algorithms (Bennett and Demiriz, 1998; Chapelle et al., 2006c). While non-convexity is partly responsible for this diversity of methods, it is also a departure from one of the nicest aspects of SVMs. Table 1 benchmarks the empirical performance of various S3VM implementations against the globally optimal solution obtained by a Branch-and-Bound algorithm. These empirical observations strengthen the conjecture that the performance variability of S3VM implementations is closely tied to their susceptibility to sub-optimal local minima. Together with several subtle implementation differences, this makes it challenging to cross-compare different S3VM algorithms.

The aim of this paper is to provide a review of optimization techniques for semi-supervised SVMs and to bring different implementations, and various aspects of their empirical performance, under a common experimental setting. In Section 2 we discuss the general formulation of S3VMs. In Sections 3 and 4 we provide an overview of various methods. We present a detailed empirical study in Section 5 and discuss complexity in Section 6.

             Coil3   2moons
∇S3VM         61.6     61.0
cS3VM         61.0     37.7
CCCP          56.6     63.1
S3VMlight     56.7     68.8
∇DA           61.6     22.5
Newton        61.5     11.0
BB               0        0

Table 1: Generalization performance (error rates) of different S3VM algorithms on the two (small) data sets Coil3 and 2moons. Branch and Bound (BB) yields the globally optimal solution, which gives perfect separation. BB can only be applied to small data sets due to its high computational cost. See Section 5 for experimental details.

2. Semi-Supervised Support Vector Machines

We consider the problem of binary classification. The training set consists of l labeled examples {(x_i, y_i)}_{i=1}^{l}, y_i = ±1, and u unlabeled examples {x_i}_{i=l+1}^{n}, with n = l + u. In the linear S3VM classification setting, the following minimization problem is solved over both the hyperplane parameters (w, b) and the label vector y_u := [y_{l+1} ... y_n]^\top:

\min_{(w,b),\, y_u} \; I(w, b, y_u) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l} V(y_i, o_i) + C^\star\sum_{i=l+1}^{n} V(y_i, o_i),   (1)

where o_i = w^\top x_i + b and V is a loss function. The Hinge loss is a popular choice for V,

V(y_i, o_i) = \max(0, 1 - y_i o_i)^p.   (2)

It is common to penalize the Hinge loss either linearly (p = 1) or quadratically (p = 2). In the rest of the paper, we will consider p = 2. Non-linear decision boundaries can be constructed using the kernel trick (Vapnik, 1998).

The first two terms in the objective function I in (1) define a standard SVM. The third term incorporates unlabeled data. The loss over labeled and unlabeled examples is weighted by two hyperparameters, C and C*, which reflect confidence in the labels and in the cluster assumption respectively. In general, C and C* need to be set to different values for optimal generalization performance.
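To make the objective concrete, the following sketch evaluates I(w, b, y_u) for a linear model with the quadratic hinge loss (p = 2). It is only an illustration of equation (1), not the implementation used in the paper; the function names, array conventions and the use of NumPy are our own choices.

```python
import numpy as np

def quadratic_hinge(y, o):
    """V(y, o) = max(0, 1 - y*o)^2, the loss (2) with p = 2."""
    return np.maximum(0.0, 1.0 - y * o) ** 2

def s3vm_objective(w, b, X_l, y_l, X_u, y_u, C, C_star):
    """Objective I(w, b, y_u) of equation (1) for a linear classifier.

    X_l, y_l : labeled examples and their +/-1 labels.
    X_u, y_u : unlabeled examples and a candidate +/-1 labeling.
    """
    o_l = X_l @ w + b                       # outputs on labeled points
    o_u = X_u @ w + b                       # outputs on unlabeled points
    reg = 0.5 * np.dot(w, w)                # margin term ||w||^2 / 2
    labeled_loss = C * quadratic_hinge(y_l, o_l).sum()
    unlabeled_loss = C_star * quadratic_hinge(y_u, o_u).sum()
    return reg + labeled_loss + unlabeled_loss
```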
The minimization problem (1) is solved under the following class balancing constraint,

\frac{1}{u}\sum_{i=l+1}^{n} \max(y_i, 0) = r, \quad \text{or equivalently} \quad \frac{1}{u}\sum_{i=l+1}^{n} y_i = 2r - 1.   (3)

This constraint helps in avoiding unbalanced solutions by enforcing that a certain user-specified fraction, r, of the unlabeled data should be assigned to the positive class. It was introduced with the first S3VM implementation (Joachims, 1999). Since the true class ratio is unknown for the unlabeled data, r is estimated from the class ratio on the labeled set, or from prior knowledge about the classification problem.

There are two broad strategies for minimizing I:

1. Combinatorial Optimization: For a given fixed y_u, the optimization over (w, b) is a standard SVM training.¹ Let us define:

J(y_u) = \min_{w,b} I(w, b, y_u).   (4)

The goal now is to minimize J over a set of binary variables. This combinatorial view of the optimization problem is adopted by Joachims (1999), Bie and Cristianini (2006), Xu et al. (2004), Sindhwani et al. (2006), Bennett and Demiriz (1998), and Chapelle et al. (2006c). There is no known algorithm that finds the global optimum efficiently. In Section 3 we review this class of techniques.

1. The SVM training is slightly modified to take into account different values for C and C*.

2. Continuous Optimization: For a fixed (w, b), arg min_y V(y, o) = sign(o). Therefore, the optimal y_u is simply given by the signs of o_i = w^\top x_i + b. Eliminating y_u in this manner gives a continuous objective function over (w, b):

\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\max(0, 1 - y_i o_i)^2 + C^\star\sum_{i=l+1}^{n}\max(0, 1 - |o_i|)^2.   (5)

This form of the optimization problem illustrates how S3VMs implement the cluster assumption. The first two terms in (5) correspond to a standard SVM. The last term (see Figure 2) drives the decision boundary, that is, the zero output contour, away from unlabeled points. From Figure 2, it is clear that the objective function is non-convex.

Figure 2: The effective loss max(0, 1 - |o|)^2 as a function of o = w^\top x + b, the real-valued output at an unlabeled point x.

Note that in this form, the balance constraint becomes \frac{1}{u}\sum_{i=l+1}^{n}\mathrm{sign}(w^\top x_i + b) = 2r - 1, which is non-linear in (w, b) and not straightforward to enforce. In Section 4 we review this class of methods (Chapelle and Zien, 2005; Chapelle et al., 2006a; Fung and Mangasarian, 2001; Collobert et al., 2006).
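As a hedged illustration of the label-elimination step, the sketch below evaluates the continuous objective (5): the unlabeled labels are replaced by the effective loss max(0, 1 - |o|)^2 of Figure 2. It complements the previous sketch; the names are again our own.

```python
import numpy as np

def effective_unlabeled_loss(o):
    """Loss of Figure 2: max(0, 1 - |o|)^2, obtained by plugging y = sign(o) into V."""
    return np.maximum(0.0, 1.0 - np.abs(o)) ** 2

def s3vm_continuous_objective(w, b, X_l, y_l, X_u, C, C_star):
    """Objective (5): labels of unlabeled points eliminated via y_i = sign(o_i)."""
    o_l = X_l @ w + b
    o_u = X_u @ w + b
    return (0.5 * np.dot(w, w)
            + C * (np.maximum(0.0, 1.0 - y_l * o_l) ** 2).sum()
            + C_star * effective_unlabeled_loss(o_u).sum())
```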
3. Combinatorial Optimization

We now discuss combinatorial techniques in which the labels y_u of the unlabeled points are explicit optimization variables. Many of the techniques discussed in this section call a standard (or slightly modified) supervised SVM as a subroutine to perform the minimization over (w, b) for a fixed y_u (see (4)).

3.1 Branch-and-Bound (BB) for Global Optimization

The objective function (4) can be globally optimized using Branch-and-Bound techniques. This was noted in the context of S3VM by Wapnik and Tscherwonenkis (1979) but no details were presented there. In general, global optimization can be computationally very demanding. The technique described in this section is impractical for large data sets. However, with effective heuristics it can produce globally optimal solutions for small-sized problems. This is useful for benchmarking practical S3VM implementations. Indeed, as Table 1 suggests, the exact solution can return excellent generalization performance in situations where other implementations fail completely. Branch-and-Bound was first applied by Bennett and Demiriz (1998) in association with integer programming for solving linear S3VMs. More recently, Chapelle et al. (2006c) presented a Branch-and-Bound (BB) algorithm which we outline in this section. The main ideas are illustrated in Figure 3.

Figure 3: Branch-and-Bound tree. Each node shows the objective function on the currently labeled points; this value increases down the tree. Subtrees whose bound exceeds the best solution found so far ("do not explore") are pruned.

Branch-and-Bound effectively performs an exhaustive search over y_u, pruning large parts of the solution space based on the following simple observation: if a lower bound on min_{y_u ∈ A} J(y_u), for some subset A of candidate solutions, is greater than J(ỹ_u) for some ỹ_u, then A can be safely discarded from exhaustive search. BB organizes subsets of solutions into a binary tree (Figure 3) where nodes are associated with a fixed partial labeling of the unlabeled data set and the two children correspond to the labeling of some new unlabeled point. Thus, the root corresponds to the initial set of labeled examples and the leaves correspond to a complete labeling of the data. Any node is then associated with the subset of candidate solutions that appear at the leaves of the subtree rooted at that node (all possible ways of completing the labeling, given the partial labeling at that node). This subset can potentially be pruned from the search by the Branch-and-Bound procedure if a lower bound over corresponding objective values turns out to be worse than an available solution.

The effectiveness of BB depends on the following design issues: (1) the lower bound at a node and (2) the sequence of unlabeled examples to branch on. For the lower bound, Chapelle et al. (2006c) use the objective value of a standard SVM trained on the associated (extended) labeled set.² As one goes down the tree, this objective value increases as additional loss terms are added, and eventually equals J at the leaves. Note that once a node violates the balance constraint, it can be immediately pruned by resetting its lower bound to ∞. Chapelle et al. (2006c) use a labeling-confidence criterion to choose an unlabeled example and a label on which to branch. The tree is explored on the fly by depth-first search. This confidence-based tree exploration is also intuitively linked to Label Propagation methods (Zhu and Ghahramani, 2002) for graph-transduction. On many small data sets (e.g., the Table 1 data sets have up to 200 examples) BB is able to return the globally optimal solution in a reasonable amount of time. We point the reader to Chapelle et al. (2006c) for pseudocode.

2. Note that in this SVM training, the loss terms associated with (originally) labeled and (currently labeled) unlabeled examples are weighted by C and C* respectively.

3.2 S3VMlight

S3VMlight (Joachims, 1999) refers to the first S3VM algorithm implemented in the popular SVMlight software.³ It is based on local combinatorial search guided by a label switching procedure. The vector y_u is initialized as the labeling given by an SVM trained on the labeled set, thresholding outputs so that u × r unlabeled examples are positive. Subsequent steps in the algorithm consist of switching the labels of two examples in opposite classes, thus always maintaining the balance constraint.

3. Note that in the S3VM literature, this particular implementation is often referred to as "TSVM" or "Transductive SVM".
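As a hedged illustration of this balance-preserving initialization (not the SVMlight code itself), the snippet below thresholds the real-valued SVM outputs on the unlabeled set so that exactly round(u·r) points receive a positive label; the function name and rounding choice are ours.

```python
import numpy as np

def init_unlabeled_labels(o_u, r):
    """Assign +1 to the u*r largest SVM outputs o_u, -1 to the rest (balance constraint (3))."""
    u = len(o_u)
    n_pos = int(round(u * r))               # number of unlabeled points forced positive
    y_u = -np.ones(u)
    top = np.argsort(o_u)[::-1][:n_pos]     # indices of the n_pos largest outputs
    y_u[top] = 1.0
    return y_u
```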
Consider an iteration of the algorithm where y_u is the temporary labeling of the unlabeled data, and let (w̃, b̃) = argmin_{w,b} I(w, b, y_u) and J(y_u) = I(w̃, b̃, y_u). Suppose a pair of unlabeled examples indexed by (i, j) satisfies the following condition,⁴

y_i = 1, \quad y_j = -1, \quad V(1, o_i) + V(-1, o_j) > V(-1, o_i) + V(1, o_j),   (6)

where o_i, o_j are the outputs of (w̃, b̃) on the examples x_i, x_j. Then after switching the labels of this pair of examples and retraining, the objective function J can easily be shown to strictly decrease. S3VMlight alternates between label switching and retraining. Since the number of possible y_u is finite, the procedure is guaranteed to terminate in a finite number of steps at a local minimum of (4), that is, a labeling for which no further improvement is possible by interchanging two labels.

In an outer loop, S3VMlight gradually increases the value of C* from a small value to the final value. Since C* controls the non-convex part of the objective function (4), this annealing loop can be interpreted as implementing a "smoothing" heuristic as a means to protect the algorithm from sub-optimal local minima. The pseudocode is provided in Algorithm 1.

4. This switching condition is slightly weaker than that proposed by Joachims (1999).

Algorithm 1 S3VMlight
  Train an SVM with the labeled points. o_i ← w · x_i + b.
  Assign y_i ← 1 to the u·r largest o_i, -1 to the others.
  C̃ ← 10^{-5} C*
  while C̃ < C* do
    repeat
      Minimize (1) with {y_i} fixed and C* replaced by C̃.
      if ∃ (i, j) satisfying (6) then
        Swap the labels y_i and y_j
      end if
    until no labels have been swapped
    C̃ ← min(1.5 C̃, C*)
  end while

3.3 Deterministic Annealing S3VM

Deterministic annealing (DA) is a global optimization heuristic that has been used to approach hard combinatorial or non-convex problems. In the context of S3VMs (Sindhwani et al., 2006), it consists of relaxing the discrete label variables y_u to real-valued variables p_u = (p_{l+1}, ..., p_{l+u}), where p_i is interpreted as the probability that y_i = 1. The following objective function is now considered:

I'(w, b, p_u) = E[I(w, b, y_u)] = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l} V(y_i, o_i) + C^\star\sum_{i=l+1}^{n}\big(p_i V(1, o_i) + (1 - p_i)V(-1, o_i)\big),   (7)

where E denotes expectation under the probabilities p_u. Note that at optimality with respect to p_u, p_i must concentrate all its mass on y_i = sign(w^\top x_i + b), which leads to the smaller of the two losses V(1, o_i) and V(-1, o_i). Hence, this relaxation step does not lead to loss of optimality and is simply a reformulation of the original objective in terms of continuous variables.

In DA, an additional entropy term -H(p_u) is added to the objective,

I''(w, b, p_u; T) = I'(w, b, p_u) - T H(p_u), \quad \text{where} \quad H(p_u) = -\sum_i \big(p_i \log p_i + (1 - p_i)\log(1 - p_i)\big),

and T ≥ 0 is usually referred to as the 'temperature'. Instead of (3), the following class balance constraint is used,

\frac{1}{u}\sum_{i=l+1}^{n} p_i = r.

Note that when T = 0, I'' reduces to (7) and the optimal p_u identifies the optimal y_u. When T = ∞, I'' is dominated by the entropy term, resulting in the maximum entropy solution (p_i = r for all i).
T parameterizes a family of objective functions with increasing degrees of non-convexity (see Figure 4 and further discussion below). At any T, let (w_T, b_T, p_u^T) = argmin_{(w,b), p_u} I''(w, b, p_u; T). This minimization can be performed in different ways:

1. Alternating Minimization: We sketch here the procedure proposed by Sindhwani et al. (2006). Keeping p_u fixed, the minimization over (w, b) is standard SVM training: each unlabeled example contributes two loss terms weighted by C* p_i and C*(1 - p_i). Keeping (w, b) fixed, I'' is minimized subject to the balance constraint \frac{1}{u}\sum_{i=l+1}^{n} p_i = r using standard Lagrangian techniques. This leads to:

p_i = \frac{1}{1 + e^{(g_i - \nu)/T}},   (8)

where g_i = C*[V(1, o_i) - V(-1, o_i)] and ν, the Lagrange multiplier associated with the balance constraint, is obtained by solving the root-finding problem that arises by plugging (8) back into the balance constraint. The alternating optimization proceeds until p_u stabilizes in a KL-divergence sense. This method will be referred to as DA in the rest of the paper.

2. Gradient Methods: An alternative possibility⁵ is to substitute the optimal p_u (8) as a function of (w, b) and obtain an objective function over (w, b) for which gradient techniques can be used:

S(w, b) := \min_{p_u} I''(w, b, p_u; T).   (9)

S(w, b) can be minimized by conjugate gradient descent. The gradient of S is easy to compute. Indeed, let us denote by p_u^*(w, b) the argmin of (9). Then,

\frac{\partial S}{\partial w_i} = \frac{\partial I''}{\partial w_i} + \sum_{j=l+1}^{n} \underbrace{\left.\frac{\partial I''}{\partial p_j}\right|_{p_u = p_u^*(w,b)}}_{0} \frac{\partial p_j^*(w)}{\partial w_i} = \frac{\partial I''}{\partial w_i}.

The partial derivative of I'' with respect to p_j is 0 by the definition of p_u^*(w, b). The argument goes through even in the presence of the constraint \frac{1}{u}\sum p_i = r; see Chapelle et al. (2002, Lemma 2) for a formal proof. In other words, we can compute the gradient of (7) with respect to w and consider p_u fixed. The same holds for b. This method will be referred to as ∇DA in the rest of the paper.

5. Strictly speaking, this approach is more along the lines of methods discussed in Section 4.

Figure 4 shows the effective loss terms in S associated with an unlabeled example for various values of T. In an outer loop, starting from a high value, T is decreased geometrically by a constant factor. The vector p_u is thereby driven back close to discrete values (its entropy falls below some threshold), thus identifying a solution to the original problem. The pseudocode is provided in Algorithm 2.

Figure 4: DA parameterizes a family of loss functions (over unlabeled examples) where the degree of non-convexity is controlled by T. As T → 0, the original loss function (Figure 2) is recovered.

Algorithm 2 DA/∇DA
  Initialize p_i = r, i = l+1, ..., n.
  Set T = 10 C*, R = 1.5, ε = 10^{-6}.
  while H(p_u^T) > ε do
    Solve (w_T, b_T, p_u^T) = argmin_{(w,b), p_u} I''(w, b, p_u; T) subject to \frac{1}{u}\sum_{i=l+1}^{n} p_i = r
      (find a local minimum starting from the previous solution; alternating optimization or gradient methods can be used).
    T ← T / R
  end while
  Return w_T, b_T

Table 2 compares the DA and ∇DA solutions as T → 0 at two different hyperparameter settings.⁶ Because DA does alternate minimization and ∇DA does direct minimization, the solutions returned by them can be quite different. Since ∇DA is faster than DA, we only report ∇DA results in Section 5.

6. In Sindhwani et al. (2006), the best solution in the optimization path is returned.

            C* = C             C* = C/100
           DA     ∇DA         DA     ∇DA
g50c       6.5     7.0        8.3     6.7
Text      13.6     5.7        6.2     6.5
Uspst     22.5    27.2       11.0    11.0
Isolet    38.3    39.8       28.6    26.9
Coil20     3.0    12.3       19.2    18.9
Coil3     49.1    61.6       60.3    60.6
2moons    36.8    22.5       62.1    30.0

Table 2: Generalization performance (error rates) of the two DA algorithms. DA is the original algorithm (alternate optimization on p_u and w). ∇DA is a direct gradient optimization on (w, b), where p_u should be considered as a function of (w, b). The first two columns report results when C* = C and the last two columns report results when C* = C/100. See Section 5 for more experimental details.
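A minimal sketch of the p_u update (8) used by DA is given below: for fixed (w, b) it computes g_i and finds the multiplier ν by one-dimensional root finding so that the balance constraint is met. It assumes squared hinge losses and 0 < r < 1; the bracket width and function names are our own choices, not taken from the paper's implementation.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit  # expit(x) = 1 / (1 + exp(-x))

def da_update_p(o_u, C_star, T, r):
    """Update (8): p_i = 1 / (1 + exp((g_i - nu)/T)) with (1/u) sum_i p_i = r."""
    V = lambda y, o: np.maximum(0.0, 1.0 - y * o) ** 2       # squared hinge loss
    g = C_star * (V(1.0, o_u) - V(-1.0, o_u))                # g_i of the paper

    def balance_gap(nu):
        # mean of p_i minus r; monotonically increasing in nu
        return expit((nu - g) / T).mean() - r

    lo, hi = g.min() - 50.0 * T, g.max() + 50.0 * T          # wide bracket for the root
    nu = brentq(balance_gap, lo, hi)
    return expit((nu - g) / T)
```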
3.4 Convex Relaxation

We follow Bie and Cristianini (2006) in this section, but outline the details for the squared Hinge loss (see also Xu et al., 2004, for a similar derivation). Rewriting (1) as the familiar constrained optimization problem of SVMs:

\min_{(w,b),\, y_u} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i^2 + C^\star\sum_{i=l+1}^{n}\xi_i^2 \quad \text{subject to: } y_i o_i \ge 1 - \xi_i, \; i = 1, ..., n.

Consider the associated dual problem:

\min_{\{y_i\}} \max_{\alpha} \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j K_{ij} \quad \text{subject to: } \sum_{i=1}^{n}\alpha_i y_i = 0, \; \alpha_i \ge 0,

where K_{ij} = x_i^\top x_j + D_{ij} and D is a diagonal matrix given by D_{ii} = 1/(2C), i = 1, ..., l, and D_{ii} = 1/(2C*), i = l+1, ..., n.

Introducing an n × n matrix Γ, the optimization problem can be reformulated as:

\min_{\Gamma} \max_{\alpha} \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j \Gamma_{ij} K_{ij}   (10)

under the constraints

\sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0, \quad \Gamma = y y^\top.   (11)

The objective function (10) is now convex since it is the pointwise supremum of linear functions. However the constraint (11) is not. The idea of the relaxation is to replace the constraint Γ = yy^\top by the following set of convex constraints:

\Gamma \succeq 0, \quad \Gamma_{ij} = y_i y_j, \; 1 \le i, j \le l, \quad \Gamma_{ii} = 1, \; l+1 \le i \le n.

Though the original method of Bie and Cristianini (2006) does not incorporate the class balancing constraint (3), one can additionally enforce it as \frac{1}{u^2}\sum_{i,j=l+1}^{n}\Gamma_{ij} = (2r - 1)^2. Such a soft constraint is also used in the continuous S3VM optimization methods of Section 4.

The convex problem above can be solved through Semi-Definite Programming. The labels of the unlabeled points are estimated from Γ (through one of its columns or its largest eigenvector). This method is very expensive and scales as O((l + u^2)^2 (l + u)^{2.5}). It is possible to try to optimize a low-rank version of Γ, but the training remains slow even in that case. We therefore do not conduct empirical studies with this method.

4. Continuous Optimization

In this section we consider methods which do not include y_u, the labels of unlabeled examples, as optimization variables, but instead solve suitably modified versions of (5) by continuous optimization techniques. We begin by discussing two issues that are common to these methods.

Balancing Constraint. The balancing constraint (3) is relatively easy to enforce for all algorithms presented in Section 3. It is more difficult for the algorithms covered in this section. The proposed workaround, first introduced by Chapelle and Zien (2005), is to instead enforce a linear constraint:

\frac{1}{u}\sum_{i=l+1}^{n} \big(w^\top x_i + b\big) = 2\tilde{r} - 1,   (12)

where r̃ = r. The above constraint may be viewed as a "relaxation" of (3). For a given r̃, an easy way of enforcing (12) is to translate all the points such that the mean of the unlabeled points is the origin, that is, \sum_{i=l+1}^{n} x_i = 0. Then, by fixing b = 2r̃ - 1, we have an unconstrained optimization problem on w. We will assume that the x_i are translated and b is fixed in this manner; the discussion will therefore focus on unconstrained optimization procedures for the rest of this section.
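The following sketch (our own NumPy illustration, not code from the paper) shows this centering trick: after translating the data by the unlabeled mean and fixing b = 2r̃ - 1, the linear constraint (12) holds for every w.

```python
import numpy as np

def center_and_fix_bias(X_l, X_u, r_tilde):
    """Translate all points by the unlabeled mean and fix b so that (12) holds for any w."""
    mean_u = X_u.mean(axis=0)
    X_l_c, X_u_c = X_l - mean_u, X_u - mean_u    # now the unlabeled points sum to zero
    b = 2.0 * r_tilde - 1.0
    return X_l_c, X_u_c, b

# quick check of (12) with random data and an arbitrary w (hypothetical shapes)
rng = np.random.default_rng(0)
X_l, X_u, r_tilde = rng.normal(size=(5, 3)), rng.normal(size=(20, 3)), 0.4
X_l_c, X_u_c, b = center_and_fix_bias(X_l, X_u, r_tilde)
w = rng.normal(size=3)
assert np.isclose((X_u_c @ w + b).mean(), 2 * r_tilde - 1)   # constraint (12) satisfied
```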
In addition to being easy to implement, this linear constraint may also add some robustness against uncertainty about the true unknown class ratio in the unlabeled set (see also Chen et al., 2003, for a related discussion). However, since (12) relaxes (3),⁷ the solutions found by the algorithms in this section cannot strictly be compared with those in Section 3. In order to admit comparisons (this is particularly important for the empirical study in Section 5), we vary r̃ in an outer loop and do a dichotomic search on this value such that (3) is satisfied.

7. Note that simply setting r̃ = r in (12) will not enforce (3) exactly.

Primal Optimization. For linear classification, the variables in w can be directly optimized. Nonlinear decision boundaries require the use of the "kernel trick" (Boser et al., 1992) using a kernel function k(x, x'). While most of the methods of Section 3 can use a standard SVM solver as a subroutine, the methods of this section need to solve (5) with a non-convex loss function over unlabeled examples. Therefore, they cannot directly use off-the-shelf dual-based SVM software. We use one of the following primal methods to implement the techniques in this section.

Method 1. We find z_i such that z_i · z_j = k(x_i, x_j). If B is a matrix having columns z_i, this can be written in matrix form as B^\top B = K. The Cholesky factor of K provides one such B. This decomposition was used for ∇S3VM (Chapelle and Zien, 2005). Another possibility is to perform the eigendecomposition of K as K = UΛU^\top and set B = Λ^{1/2} U^\top. This latter case corresponds to the kernel PCA map introduced by Schölkopf and Smola (2002, Section 14.2). Once the z_i are found, we can simply replace x_i in (5) by z_i and solve a linear classification problem. For more details, see Chapelle et al. (2006a).

Method 2. We set w = \sum_{i=1}^{n}\beta_i \phi(x_i), where φ denotes a higher dimensional feature map associated with the nonlinear kernel. By the Representer theorem (Schölkopf and Smola, 2002), we indeed know that the optimal solution has this form. Substituting this form in (5) and using the kernel function yields an optimization problem with β as the variables. Note that the centering mentioned above to implement (12) corresponds to using the modified kernel (Schölkopf and Smola, 2002, page 431) defined by:

k(x, x') := k(x, x') - \frac{1}{u}\sum_{i=l+1}^{n} k(x, x_i) - \frac{1}{u}\sum_{i=l+1}^{n} k(x_i, x') + \frac{1}{u^2}\sum_{i,j=l+1}^{n} k(x_i, x_j).   (13)

All the shifted kernel elements can be computed in O(n²) operations. Finally, note that these methods are very general and can also be applied to the algorithms of Section 3.
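As a hedged illustration of Method 1, the sketch below builds the vectors z_i from a kernel matrix via its eigendecomposition (the kernel PCA map). The small jitter added for numerical safety is our choice, not part of the paper's procedure.

```python
import numpy as np

def kernel_pca_map(K, jitter=1e-10):
    """Return Z with rows z_i such that z_i . z_j = K_ij (Method 1, B = Lambda^{1/2} U^T)."""
    K = 0.5 * (K + K.T)                           # symmetrize against round-off
    eigval, eigvec = np.linalg.eigh(K + jitter * np.eye(len(K)))
    eigval = np.clip(eigval, 0.0, None)           # K is PSD up to numerical error
    B = np.sqrt(eigval)[:, None] * eigvec.T       # B = Lambda^{1/2} U^T, columns are z_i
    return B.T                                    # row i is z_i

# sanity check on a random PSD matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
K = X @ X.T
Z = kernel_pca_map(K)
assert np.allclose(Z @ Z.T, K, atol=1e-6)
```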
4.1 Concave Convex Procedure (CCCP)

The CCCP method (Yuille and Rangarajan, 2003) has been applied to S3VMs by Fung and Mangasarian (2001), Collobert et al. (2006), and Wang et al. (2007). The description given here is close to that in Collobert et al. (2006). CCCP essentially decomposes a non-convex function f into a convex component f_vex and a concave component f_cave. At each iteration, the concave part is replaced by a linear function (namely, the tangential approximation at the current point) and the sum of this linear function and the convex part is minimized to get the next iterate. The pseudocode is shown in Algorithm 3.

Algorithm 3 CCCP for minimizing f = f_vex + f_cave
  Require: starting point x_0
  t ← 0
  while ∇f(x_t) ≠ 0 do
    x_{t+1} ← argmin_x f_vex(x) + ∇f_cave(x_t) · x
    t ← t + 1
  end while

In the case of S3VM, the first two terms in (5) are convex. Splitting the last non-convex term, corresponding to the unlabeled part, into the sum of a convex and a concave function, we have:

\max(0, 1 - |t|)^2 = \underbrace{\max(0, 1 - |t|)^2 + 2|t|}_{\text{convex}} \; \underbrace{- 2|t|}_{\text{concave}}.

If an unlabeled point is currently classified positive, then at the next iteration, the effective (convex) loss on this point will be

\tilde{L}(t) = \begin{cases} 0 & \text{if } t \ge 1, \\ (1 - t)^2 & \text{if } |t| < 1, \\ -4t & \text{if } t \le -1. \end{cases}

A corresponding L̃ can be defined for the case of an unlabeled point being classified negative. The CCCP algorithm specialized to S3VMs is given in Algorithm 4. For the optimization variables we employ Method 1 given at the beginning of this section.

Algorithm 4 CCCP for S3VMs
  Starting point: use the w obtained from the supervised SVM solution.
  repeat
    y_i ← sign(w · x_i + b), l+1 ≤ i ≤ n.
    (w, b) = argmin \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\max(0, 1 - y_i(w · x_i + b))^2 + C^\star\sum_{i=l+1}^{n} \tilde{L}(y_i(w · x_i + b)).
  until convergence of y_i, l+1 ≤ i ≤ n.

The CCCP method given in Collobert et al. (2006) does not use annealing, that is, increasing C* slowly in steps as in S3VMlight to help reduce local minima problems. We have however found performance improvements with annealing (see Table 12).

4.2 ∇S3VM

This method was proposed by Chapelle and Zien (2005) to minimize the objective function (5) directly by gradient descent. For the optimization variables, Method 1 given at the beginning of this section is used. Since the function t ↦ max(0, 1 - |t|)² is not differentiable, it is replaced by t ↦ exp(-st²), with s = 5 (see Figure 5), to get the following smooth optimization problem:⁸

\min_{w,b} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\max(0, 1 - y_i(w \cdot x_i + b))^2 + C^\star\sum_{i=l+1}^{n}\exp(-s(w \cdot x_i + b)^2).   (14)

As for S3VMlight, ∇S3VM performs annealing in an outer loop on C*. In the experiments we followed the same annealing schedule as in Chapelle and Zien (2005): C* is increased in 10 steps to its final value. More precisely, at the ith step, C* is set to 2^{i-10} C*_final.

8. Chapelle and Zien (2005) used s = 3 with the hinge loss, p = 1 in (2), but s = 5 seems to be a better choice for the quadratic hinge loss (p = 2).

Figure 5: The loss function on the unlabeled points, t ↦ max(0, 1 - |t|)², is replaced by the differentiable approximation t ↦ exp(-5t²).
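A minimal sketch of the smoothed objective (14) and its gradient with respect to w (with b held fixed by the centering of this section) is given below; it can be handed to a generic gradient-based optimizer. The use of SciPy's L-BFGS and the function names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def grad_s3vm_objective(w, X_l, y_l, X_u, b, C, C_star, s=5.0):
    """Value and gradient of (14) for a linear model; b is kept fixed."""
    o_l = X_l @ w + b
    o_u = X_u @ w + b
    margin = np.maximum(0.0, 1.0 - y_l * o_l)             # labeled quadratic hinge
    unl = np.exp(-s * o_u ** 2)                           # smooth unlabeled loss
    value = 0.5 * w @ w + C * (margin ** 2).sum() + C_star * unl.sum()
    grad = (w
            + C * (-2.0 * margin * y_l) @ X_l             # d/dw of the labeled part
            + C_star * (-2.0 * s * o_u * unl) @ X_u)      # d/dw of the unlabeled part
    return value, grad

# usage sketch: res = minimize(grad_s3vm_objective, w0, jac=True,
#                              args=(X_l, y_l, X_u, b, C, C_star), method="L-BFGS-B")
```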
4.3 Continuation S3VM (cS3VM)

Closely related to ∇S3VM, Chapelle et al. (2006a) propose a continuation method for minimizing (14). Gradient descent is performed on the same objective function (with the same loss for the unlabeled points as shown in Figure 5), but the method used for annealing is different. Instead of slowly increasing C*, it is kept fixed, and a continuation technique is used to transform the objective function. This kind of method belongs to the field of global optimization techniques (Wu, 1996). The idea is similar to deterministic annealing (see Figure 6). A smoothed version of the objective function is first minimized. With enough smoothing, the global minimum can hopefully be found easily. Then the smoothing is decreased in steps and the minimum is tracked: the solution found in one step serves as the starting point for the next step. The method is continued until there is no smoothing, and so we get back to the solution of (14). Algorithm 5 gives an instantiation of the method in which smoothing is achieved by convolution with a Gaussian, but other smoothing functions can also be used.

Algorithm 5 Continuation method for solving min_x f(x)
  Require: function f: R^d → R, initial point x_0 ∈ R^d
  Require: sequence γ_0 > γ_1 > ... > γ_{p-1} > γ_p = 0
  Let f_γ(x) = (πγ)^{-d/2} ∫ f(x - t) exp(-\|t\|^2/γ) dt.
  for i = 0 to p do
    Starting from x_i, find a local minimizer x_{i+1} of f_{γ_i}.
  end for

It is not clear if one should only smooth the last non-convex term of (14) (the first two terms are convex) or the whole objective as in Chapelle et al. (2006a). It is noteworthy that since the loss for the unlabeled points is bounded, its convolution with a Gaussian of infinite width tends to the zero function. In other words, with enough smoothing, the unlabeled part of the objective function vanishes and the optimization is identical to a standard SVM.

Figure 6: Illustration of the continuation method: the original objective function in blue has two local minima. By smoothing it, we find one global minimum (the red star on the green curve). By reducing the smoothing (large γ, smaller γ, γ = 0), the minimum moves toward the global minimum of the original function.
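Since the Gaussian convolution of Algorithm 5 is rarely available in closed form, one way to picture it is by Monte Carlo: f_γ(x) equals the expectation of f(x + sqrt(γ/2) ε) with ε standard normal. The sketch below is our own illustration of this smoothing, not the implementation of Chapelle et al. (2006a).

```python
import numpy as np

def smoothed(f, gamma, n_samples=2000, seed=0):
    """Monte Carlo estimate of f_gamma(x) = (pi*gamma)^{-d/2} int f(x-t) exp(-||t||^2/gamma) dt."""
    rng = np.random.default_rng(seed)

    def f_gamma(x):
        x = np.atleast_1d(x)
        eps = rng.standard_normal((n_samples, x.size))    # epsilon ~ N(0, I)
        return np.mean([f(x + np.sqrt(gamma / 2.0) * e) for e in eps])

    return f_gamma

# example: heavy smoothing flattens a bounded, non-convex 1-D loss
f = lambda x: float(np.exp(-5.0 * x[0] ** 2))             # unlabeled loss of (14) in 1-D
print(smoothed(f, gamma=0.01)(np.array([0.0])),           # close to f(0) = 1
      smoothed(f, gamma=100.0)(np.array([0.0])))          # close to 0: the term vanishes
```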
4.4 Newton S3VM

One difficulty with the methods described in Sections 4.2 and 4.3 is that their complexity scales as O(n³), because they employ an unlabeled loss function that does not have a linear part, for example, see (14). Compared to a method like S3VMlight (see Section 3.2), which typically scales as O(n³_sv + n²), this can make a large difference in efficiency when n_sv (the number of support vectors) is small. In this subsection we propose a new loss function for unlabeled points and an associated Newton method (along the lines of Chapelle, 2007) which brings down the O(n³) complexity of the ∇S3VM method.

To make efficiency gains we employ Method 2 described at the beginning of this section (note that Method 1 requires an O(n³) preprocessing step) and perform the minimization on β, where w = \sum_{i=1}^{n}\beta_i\phi(x_i). Note that the β_i are expansion coefficients and not the Lagrange multipliers α_i in standard SVMs. Let us consider general loss functions, ℓ_L for the labeled points and ℓ_U for the unlabeled points, replace w by β as the variables in (5), and get the following optimization problem,

\min_{\beta} \frac{1}{2}\beta^\top K\beta + C\sum_{i=1}^{l}\ell_L\big(y_i(K_i^\top\beta + b)\big) + C^\star\sum_{i=l+1}^{n}\ell_U\big(K_i^\top\beta + b\big),   (15)

where K is the kernel matrix with K_{ij} = k(x_i, x_j) and K_i is the ith column of K.⁹ As we will see in detail below, computational time is dictated by n_sv, the number of points that lie in the domain of the loss function where the curvature is non-zero. With this motivation we choose the differentiable loss function plotted in Figure 7, having several linear and flat parts which are smoothly connected by small quadratic components.¹⁰

9. Note that the kernel elements used here correspond to the modified kernel elements in (13).
10. The CCCP method of Collobert et al. (2006) also considers a loss function with a flat middle part, but it was not motivated by computational gains.

Figure 7: The piecewise quadratic loss function ℓ_U and its derivative (divided by 4; dashed line).

Consider the solution of (15) with this choice of loss function. The gradient of (15) is

Kg \quad \text{with} \quad g_i = \begin{cases} \beta_i + C\,\ell_L'\big(y_i(K_i^\top\beta + b)\big)\,y_i & 1 \le i \le l, \\ \beta_i + C^\star\,\ell_U'\big(K_i^\top\beta + b\big) & l+1 \le i \le n. \end{cases}   (16)

Using a gradient-based method like nonlinear conjugate gradient would still be costly because each evaluation of the gradient requires O(n²) effort. To improve the complexity when n_sv is small, one can use Newton's method instead. Let us now go into these details. The Hessian of (15) is

K + KDK, \quad \text{with } D \text{ diagonal}, \quad D_{ii} = \begin{cases} C\,\ell_L''\big(y_i(K_i^\top\beta + b)\big) & 1 \le i \le l, \\ C^\star\,\ell_U''\big(K_i^\top\beta + b\big) & l+1 \le i \le n. \end{cases}   (17)

The corresponding Newton update is β ← β - (K + KDK)^{-1} Kg. The advantage of Newton optimization on this problem is that the step can be computed in O(n³_sv + n²) time,¹¹ as we will see below (see also Chapelle, 2007, for a similar derivation). The number of Newton steps required is usually a small finite constant.

A problem in performing Newton optimization with a non-convex objective function is that the step might not be a descent direction because the Hessian is not necessarily positive definite. To avoid this problem, we use the Levenberg-Marquardt algorithm (Fletcher, 1987, Algorithm 5.2.7). Roughly speaking, this algorithm is the same as Newton minimization, but a large enough ridge is added to the Hessian such that it becomes positive definite. For computational reasons, instead of adding a ridge to the Hessian, we will add a constant times K.

The goal is to choose a λ ≥ 1 such that λK + KDK is positive definite and to solve (λK + KDK)^{-1} Kg efficiently. For this purpose, we reorder the points such that D_{ii} ≠ 0 for i ≤ n_sv and D_{ii} = 0 for i > n_sv. Let A be the Cholesky decomposition of K:¹² A is the upper triangular matrix satisfying A^\top A = K. We suppose that K (and thus A) is invertible. Let us write

\lambda K + KDK = A^\top(\lambda I_n + ADA^\top)A.

11. Consistent with the earlier definition, note that here n_sv is the number of "support vectors", where a support vector is defined as a point x_i such that D_{ii} ≠ 0.
12. As we will see below, we will not need the Cholesky decomposition of K but only that of K_sv.

The structure of K and D implies that

\lambda K + KDK \succ 0 \iff B := \lambda I_{n_{sv}} + A_{sv} D_{sv} A_{sv}^\top \succ 0,

where A_sv is the Cholesky decomposition of K_sv and K_sv is the matrix formed using the first n_sv rows and columns of K. After some block matrix algebra, we can also get the step as

-(\lambda K + KDK)^{-1}Kg = -\begin{pmatrix} A_{sv}^{-1} B^{-1} A_{sv}\big(g_{sv} - \frac{1}{\lambda} D_{sv} K_{sv,nsv}\, g_{nsv}\big) \\ \frac{1}{\lambda}\, g_{nsv} \end{pmatrix},   (18)

where nsv refers to the indices of the "non support vectors", that is, {i : D_{ii} = 0}. Computing this direction takes O(n³_sv + n²) operations. The positive definiteness of B can be checked by attempting its Cholesky decomposition; this decomposition can then be reused to solve the linear system involving B.
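For intuition, the sketch below computes the same direction naively by forming λK + KDK and solving the dense linear system; it reproduces what (18) computes, but without the O(n_sv³ + n²) block trick. The function name and the use of NumPy are our own.

```python
import numpy as np

def lm_newton_step(K, D_diag, g, lam):
    """Naive O(n^3) computation of the step -(lam*K + K D K)^{-1} K g.

    Equation (18) obtains the same vector in O(n_sv^3 + n^2) by exploiting the
    zeros of D (non support vectors), which this dense version ignores.
    """
    KD = K * D_diag[None, :]                 # K @ diag(D), using broadcasting
    H = lam * K + KD @ K                     # lam*K + K D K
    return -np.linalg.solve(H, K @ g)

# sanity check of the closed form on the non-support-vector block: s_nsv = -g_nsv / lam
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
K = X @ X.T + np.eye(12)                     # invertible kernel matrix
D = np.zeros(12); D[:5] = rng.uniform(1.0, 2.0, size=5)   # first 5 points are "support vectors"
g = rng.normal(size=12); lam = 2.0
s = lm_newton_step(K, D, g, lam)
assert np.allclose(s[5:], -g[5:] / lam)
```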
Full details, following the ideas in Fletcher (1987, Algorithm 5.2.7), are given in Algorithm 6.

Algorithm 6 Levenberg-Marquardt method
  β ← 0, λ ← 1.
  repeat
    Compute g and D using (16) and (17).
    sv ← {i : D_ii ≠ 0} and nsv ← {i : D_ii = 0}.
    A_sv ← Cholesky decomposition of K_sv.
    Do the Cholesky decomposition of λI_{n_sv} + A_sv D_sv A_sv^T. If it fails, λ ← 4λ and try again.
    Compute the step s as given by (18).
    ρ ← (Ω(β + s) - Ω(β)) / (½ s^T(K + KDK)s + s^T Kg)   % if the objective function Ω were quadratic, ρ would be 1
    If ρ > 0, β ← β + s.
    If ρ < 0.25, λ ← 4λ.
    If ρ > 0.75, λ ← max(1, λ/2).
  until ‖g‖ ≤ ε

As discussed above, the flat part in the loss (cf. Figure 7) provides computational value by reducing n_sv. But we feel it may also help in leading to better local minima. Take for instance a Gaussian kernel and consider an unlabeled point far from the labeled points. At the beginning of the optimization, the output on that point will be 0.¹³ This unlabeled point does not contribute to pushing the decision boundary one way or the other. This seems like a satisfactory behavior: it is better to wait for more information before taking a decision on an unlabeled point about which we are unsure.

13. This is true only for balanced problems; otherwise, the output is b.

In Table 3 we compare the performance of the flat loss in Figure 7 and the original quadratic loss used in (5). The flat loss yields a huge gain in performance on 2moons. On the other data sets the two losses perform somewhat similarly. From a computational point of view, the flat part in the loss can sometimes reduce the training time by a factor of 10, as shown in Table 3.

            Error rate             Training time (s)
          Flat    Quadratic       Flat    Quadratic
g50c       6.1       5.3            15        39
Text       5.4       7.7          2180      2165
Uspst     18.6      15.9           233      2152
Isolet    32.2      27.1           168      1253
Coil20    24.1      24.7           152      1244
Coil3     61.5      58.4             6         8
2moons    11.0      66.4           1.7       1.2

Table 3: Comparison of the Newton-S3VM method with two different losses: the one with a flat part in the middle (see Figure 7) and the standard quadratic loss (Figure 2). Left: error rates on the unlabeled set; right: average training time in seconds for one split and one binary classifier training (with annealing and dichotomic search on the threshold as explained in the experimental section). The implementations have not been optimized, so the training times only constitute an estimate of the relative speeds.

5. Experiments

This section is organized around a set of empirical issues:

1. While S3VMs have been very successful for text classification (Joachims, 1999), there are many data sets where they do not return state-of-the-art empirical performance (Chapelle and Zien, 2005). This performance variability is conjectured to be due to local minima problems. In Section 5.3, we discuss the suitability of the S3VM objective function for semi-supervised learning. In particular, we benchmark current S3VM implementations against the exact, globally optimal solution, and we discuss whether one can expect significant improvements in generalization performance by better approaching the global solution.

2. Several factors influence the performance and behavior of S3VM algorithms. In Section 5.4 we study their quality of optimization, generalization performance, sensitivity to hyperparameters, effect of annealing, and robustness to uncertainty in class ratio estimates.

3. S3VMs were originally motivated by transductive learning, the problem of estimating labels of unlabeled examples without necessarily producing a decision function over the entire input space. However, S3VMs are also semi-supervised learners, as they are able to handle unseen test instances. In Section 5.5, we run S3VM algorithms in an inductive mode and analyze performance differences between unlabeled and test examples.
4. There is empirical evidence that S3VMs exhibit poor performance on "manifold" type data (where graph-based methods typically excel) or when the data has many distinct sub-clusters (Chapelle and Zien, 2005). We explore the issue of data geometry and S3VM performance in Section 5.6.

At the outset, we point out that this section does not provide an exhaustive cross-comparison between algorithms. Such a comparison would require, say, cross-validation over multiple hyperparameters, randomization over choices of labels, dichotomic search to neutralize balance constraint differences, and handling different choices of annealing sequences. This is computationally quite demanding and, more seriously, statistically brittle due to the lack of labeled validation data in semi-supervised tasks. Our goal, therefore, is not so much to establish a ranking of the algorithms reviewed in this paper, but rather to observe their behavior under a neutral experimental protocol.

We next describe the data sets used in our experiments. Note that our choice of data sets is biased towards multi-class, manifold-like problems which are particularly challenging for S3VMs. Because of this choice, the experimental results do not show the typically large improvements one might expect over standard SVMs. We caution the reader not to draw the conclusion that S3VM is a weak algorithm in general, but rather that it often does not return state-of-the-art performance on problems of this nature.

5.1 Data Sets

Most data sets come from Chapelle and Zien (2005). They are summarized in Table 4.

data set   classes   dims   points   labeled
g50c           2       50      550       50
Text           2     7511     1946       50
Uspst         10      256     2007       50
Isolet         9      617     1620       50
Coil20        20     1024     1440       40
Coil3          3     1024      216        6
2moons         2      102      200        2

Table 4: Basic properties of the benchmark data sets.

The artificial data set g50c is inspired by Bengio and Grandvalet (2004): examples are generated from two standard normal multi-variate Gaussians, the labels correspond to the Gaussians, and the means are located in 50-dimensional space such that the Bayes error is 5%. The real world data sets consist of two-class and multi-class problems. The Text data set is defined using the classes mac and mswindows of the Newsgroup20 data set, preprocessed as in Szummer and Jaakkola (2001). The Uspst set contains the test data part of the well-known USPS data on handwritten digit recognition. Isolet is a subset of the ISOLET spoken letter database (Cole et al., 1990) containing the speaker sets 1, 2 and 3 and the 9 confusing letters {B,C,D,E,G,P,T,V,Z}. In Coil20 (respectively Coil3), the data are gray-scale images of 20 (respectively 3) different objects taken from different angles, in steps of 5 degrees (Nene et al., 1996). The Coil3 data set was first used by Chapelle et al. (2006c) and is particularly difficult since the 3 classes are 3 cars which look alike (see Figure 8).

Figure 8: The 3 cars from the COIL data set, subsampled to 32×32.

Finally, 2moons has been used extensively in the semi-supervised learning literature (see for instance Zhu and Ghahramani, 2002, and Figure 1). For this data set, the labeled points are fixed and new unlabeled points are randomly generated for each repetition of the experiment.
5.2 Experimental Setup

To minimize the influence of external factors, unless stated otherwise, the experiments were run in the following normalized way:

Hyperparameters. The Gaussian kernel k(x, y) = exp(-‖x - y‖²/(2σ²)) was used. For simplicity the constant C* in (1) was set to C.¹⁴ The same hyperparameters C and σ have been used for all the methods. They are found with cross-validation by training an inductive SVM on the entire data set (the unlabeled points being assigned their real label). These values are reported in Table 5. Even though it seems fair to compare the algorithms with the same hyperparameters, there is a possibility that some regime of hyperparameter settings is more suitable for a particular algorithm than for another. We discuss this issue in Section 5.4.3.

14. Alternatively, one could set C* = C l/u to have equal contribution from labeled and unlabeled points.

             σ      C
g50c         38     19
Text        3.5     31
Uspst       7.4     38
Isolet       15     43
Coil20     2900     37
Coil3      3000    100
2moons      0.5     10

Table 5: Values of the hyperparameters used in the experiments.

Multi-class. Data sets with more than two classes are learned with a one-versus-the-rest approach. The reported objective value is the mean of the objective values of the different classifiers. We also conducted experiments on pair-wise binary classification problems (cf. Section 5.6.1) constructed from the multi-class data sets.

Objective value. Even though the objective function that we want to minimize is (5), some algorithms like ∇S3VM use another (differentiable) objective function. In order to compare the objective values of the different algorithms, we do the following: after training, we predict the labels of the unlabeled points and train a standard SVM on this augmented labeled set (with p = 2 in (2) and C, C* weightings on the originally labeled and unlabeled terms respectively). The objective value reported is the one of this SVM.

Balancing constraint. For the sake of simplicity, we set r in the balancing constraint (3) to the true ratio of positive points in the unlabeled set. This constraint is relatively easy to enforce for all algorithms presented in Section 3. It is more difficult for the algorithms of Section 4, and for them we used the dichotomic search described at the beginning of Section 4.

For a given data set, we randomly split the data into a labeled set and an unlabeled set. We refer to the error rate on the unlabeled set as the unlabeled error to differentiate it from the test error which would be computed on an unseen test set. Results are averaged over 10 random splits. The difference between unlabeled and test performance is discussed in Section 5.5.

5.3 Suitability of the S3VM Objective Function

Table 1 shows unlabeled error rates for common S3VM implementations on the two small data sets Coil3 and 2moons. On these data sets, we are able to also run Branch-and-Bound and get the true globally optimal solution. We see that the global optimum corresponds to a perfect solution, while the local minima found by approximate implementations yield very poor accuracies. From these results, it appears that the minimization of the S3VM objective function makes good sense, even though the performance of practical S3VM implementations may not consistently reflect this due to local minima problems.

Table 6 records the rank correlation between unlabeled error and objective function. The rank correlation has been computed in the following way.
For each split, we take 10 different solutions and compute the associated unlabeled error and objective value. Ideally, we would like to sample these solutions at random around local minima. But since it is not obvious how to do such a sampling, we simply took the solutions given by the different S3VM algorithms as well as 4 "intermediate" solutions obtained as follows. A standard SVM is trained on the original labeled set and a fraction of the unlabeled set (with their true labels). The fraction was either 0, 10, 20 or 30%. The labels of the remaining unlabeled points are assigned through a thresholding of the real-valued outputs of the SVM. This threshold is such that the balancing constraint (3) is satisfied. Finally, an SVM is retrained using the entire training set. By doing so, we "sample" solutions varying from an inductive SVM trained on only the labeled set to the optimal solution. Table 6 provides evidence that the unlabeled error is correlated with the objective values.

               g50c   Text   Uspst   Isolet   Coil20   Coil3   2moons
Coefficient     0.2   0.67    0.24     0.23      0.4    0.17     0.45

Table 6: Kendall's rank correlation (Abdi, 2006) between the unlabeled error and the objective function, averaged over the 10 splits (see text for details).

5.4 Behavior of S3VM Algorithms

Several factors influence the performance and behavior of S3VM algorithms. We study their quality of optimization, generalization performance, sensitivity to hyperparameters, effect of annealing, and robustness to uncertainty in class ratio estimates.

5.4.1 Quality of Minimization

In Table 7 we compare the different algorithms in terms of minimization of the objective function. ∇S3VM and cS3VM appear clearly to be the methods achieving the lowest objective values. However, as we will see in the next section, this does not necessarily translate into lower unlabeled errors.

∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA   Newton
  1.7     1.9    4.5      4.9       4.3     3.7

Table 7: For each data set and each split, the algorithms were ranked according to the objective function value they attained. This table shows the average ranks. These ranks are only about objective function values; error rates are discussed below.

5.4.2 Quality of Generalization

Table 8 reports the unlabeled errors of the different algorithms on our benchmark data sets. A first observation is that most algorithms perform quite well on g50c and Text. However, the unlabeled errors on the other data sets, Uspst, Isolet, Coil20, Coil3 and 2moons (see also Table 1), are poor and sometimes even worse than a standard SVM. As pointed out earlier, these data sets are particularly challenging for S3VMs. Moreover, the honors are divided and no algorithm clearly outperforms the others. We therefore cannot give a recommendation on which one to use.

          ∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA   Newton    SVM   SVM-5cv
g50c        6.7     6.4    6.3      6.2       7.0    6.1      8.2     4.9
Text        5.1     5.3    8.3      8.1       5.7    5.4     14.8     2.8
Uspst      15.6    36.2   16.4     15.5      27.2   18.6     20.7     3.9
Isolet     25.8    59.8   26.7     30.0      39.8   32.2     32.7     6.4
Coil20     25.6    30.7   26.6     25.3      12.3   24.1     24.1     0

Table 8: Unlabeled errors of the different S3VM implementations. The next to the last column is an SVM trained only on the labeled data, while the last column reports 5-fold cross-validation results of an SVM trained on the whole data set using the labels of the unlabeled points. The values in these two columns can be taken as upper and lower bounds on the best achievable error rates. See Table 1 for results on Coil3 and 2moons.
Note that these results may differ from the ones previously reported by Chapelle and Zien (2005), Collobert et al. (2006), Sindhwani et al. (2006), and Chapelle et al. (2006a) on the same data sets. Most of this difference comes from the choice of the hyperparameters. Indeed, as explained below, several algorithms are rather sensitive to the choice of hyperparameters. The exact experimental setting is also different. In results reported elsewhere, r is often estimated from the labeled set and the constraint is often a "soft" one. In Table 8, the hard balance constraint is enforced for all methods assuming r to be known exactly. Finally, in Chapelle et al. (2006a), only pair-wise binary problems were considered for the cS3VM algorithm, while the results in Table 8 for multi-class data sets are obtained under a one-versus-the-rest setup.

5.4.3 Sensitivity to Hyperparameters

As mentioned before, it is possible that different algorithms excel in different hyperparameter regimes. This is more likely to happen due to better local minima at different hyperparameters as opposed to a better global solution (in terms of error rates). Due to computational constraints, instead of setting the three hyperparameters (C, C* and the Gaussian kernel width σ) by cross-validation for each of the methods, we explored the influence of these hyperparameters on one split of the data. More specifically, Table 9 shows the relative improvement (in %) that one can gain by selecting other hyperparameters. These numbers may be seen as a measure of the robustness of a method. Note that these results have to be interpreted carefully because they are only on one split: for instance, it is possible that "by chance" a method did not get stuck in a bad local minimum for one of the hyperparameter settings, leading to a larger number in Table 9.

          ∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA   Newton
g50c       31.2    31.2   27.8     13.3      35.0    7.7
Text       22.0     7.1   29.2     19.9      34.4    1.1
Uspst      12.0    70.2   19.2      0        41.0   15.6
Isolet      7.6    57.6    8.0      0         4.9    2.6
Coil20     46.4    42.7   27.9      5.8       5.7   16.9
Coil3      32.6    39.2   15.3     20.2       5.8   24.6
2moons     45.6    50.0   54.1     13.5      23.5    0
Mean       28.2    42.6   25.9     10.4      21.5    9.8

Table 9: On the 1st split of each data set, 27 sets of hyperparameters (σ', C', C*') have been tested, with σ' ∈ {σ/2, σ, 2σ}, C' ∈ {C/10, C, 10C} and C*' ∈ {C'/100, C'/10, C'}. The table shows the relative improvement (in %) obtained by taking the best hyperparameters over the default ones.

By measuring the variation of the unlabeled errors with respect to the choice of the hyperparameters, Table 10 records an indication of the sensitivity of each method to that choice. From this point of view, S3VMlight appears to be the most stable algorithm.

∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA   Newton
  6.8     8.5    6.5      2.7       8.4     8.7

Table 10: The variance of the unlabeled errors has been computed over the 27 possible hyperparameter settings (cf. Table 9) and averaged over the data sets. A small number indicates that a given method is not very sensitive to the choice of the hyperparameters.

Finally, from the experiments in Table 9, we observed that in some cases a smaller value of C* is helpful. We have thus rerun the algorithms on all the splits with C* divided by 100: see Table 11. The overall performance does not necessarily improve, but the very poor results become better (see for instance Uspst, Isolet and Coil20 for cS3VM).
          ∇S3VM   cS3VM   CCCP   S3VMlight   ∇DA   Newton
g50c        8.3     8.3    8.5      8.4       6.7    7.5
Text        5.7     5.8    8.5      8.1       6.5   14.5
Uspst      14.1    15.6   14.9     14.5      11.0   19.2
Isolet     27.8    28.5   25.3     29.1      26.9   32.1
Coil20     23.9    23.6   23.6     21.8      18.9   24.6
Coil3      60.8    55.0   56.3     59.2      60.6   60.5
2moons     65.0    49.8   66.3     68.7      30.0   33.5

Table 11: Same as Table 8, but with C* divided by 100.

5.4.4 Effect of Annealing Schedule

All the algorithms described in this paper use some sort of annealing (e.g., gradually decreasing T in DA or increasing C* in S3VMlight in an outer loop) where the role of the unlabeled points is progressively increased. The three ingredients to fix are:

1. The starting point, which is usually chosen in such a way that the unlabeled points have very little influence and the problem is thus almost convex.

2. The stopping criterion, which should be such that the annealed objective function and the original objective function are very close.

3. The number of steps. Ideally, one would like to have as many steps as possible, but for computational reasons the number of steps is limited.

In the experimental results presented in Figure 9, we only varied the number of steps. The starting and final values are as indicated in the description of the algorithms. The original CCCP paper did not have annealing and we used the same scheme as for ∇S3VM: C* is increased exponentially from 10⁻³C to C. For DA and ∇DA, the final temperature is fixed at a small constant and the decrease rate R is such that we have the desired number of steps. For all algorithms, one step means that there is no annealing.

One has to be cautious when drawing conclusions from this plot. Indeed, the results are only for one data set, one split and fixed hyperparameters. The goal is to get an idea of whether the annealing for a given method is useful; and if so, how many steps should be taken.

Figure 9: Influence of the number of annealing steps on the first split of Text (test error as a function of the number of steps, from 1 to 32, for ∇S3VM, cS3VM, CCCP, S3VMlight, DA, ∇DA and Newton).

From this plot, it seems that:

• All methods, except Newton, seem to benefit from annealing. However, on some other data sets, annealing improved the performance of Newton's method (results not shown).

• Most methods do not require a lot of steps. More precisely, we have noticed that the number of steps does not matter as much as the minimization around a "critical" value of the annealing parameter; if that point is missed, then the results can be bad.

• DA seems to be the method which relies the most on a relatively slow annealing schedule, while ∇DA can have a faster annealing schedule.

• CCCP was originally proposed without annealing, but it seems that its performance can be improved by annealing. To confirm this fact, Table 12 compares the results of CCCP with and without annealing on the 10 splits of all the data sets.

                g50c   Text   Uspst   Isolet   Coil20   Coil3   2moons
No annealing     7.9   14.7    21.3     32.4     26.1    49.3     67.1
Annealing        6.3    8.3    16.4     26.7     26.6    56.6     63.1

Table 12: Performance of CCCP with and without annealing. The annealing schedule is the same as the one used for ∇S3VM and Newton: C* is increased in 10 steps from one thousandth of its final value.

The results provided in Table 8 are with annealing for all methods.

5.4.5 Robustness to Class Ratio Estimate

Several results reported in the previous tables are better than, for instance, the results in Chapelle and Zien (2005) and Chapelle et al. (2006a).
5.4.5 Robustness to Class Ratio Estimate

Several results reported in the previous tables are better than, for instance, the results in Chapelle and Zien (2005) and Chapelle et al. (2006a). This is because (a) we took into account the knowledge of the true class ratio among the unlabeled examples, and (b) we enforced the constraint (3) exactly (with the dichotomic search described at the beginning of Section 4).

Of course, in practice the true number of positive points is unknown. One can estimate it from the labeled points as

    r = \frac{1}{2} \left( \frac{1}{l} \sum_{i=1}^{l} y_i + 1 \right).

Table 13 presents the generalization performances in this more realistic context, where class ratios are estimated as above and the original soft balance constraint is used where applicable (recall the discussion in Section 4).

             g50c   Text   Uspst   Isolet   Coil20   Coil3   2moons
∇S3VM        7.2    6.8    24.1    48.4     35.4     64.4    62.2
cS3VM        6.6    5      41.5    58.3     51.5     59.7    33.7
CCCP         6.7    12.8   24.3    43.8     34.5     59.4    55.6
S3VMlight    7.5    9.2    24.4    36       25.3     56.7    68.8
∇DA          8.4    8.1    29.8    46       12.3     61.7    22.5
Newton       5.8    6.1    25      45.5     25.4     62.9    8.9
SVM          9.1    23.1   24.2    38.4     26.2     51.8    44.4

Table 13: Constant r estimated from the labeled set. For the methods of Section 4, the original constraint (12) is used; there is no dichotomic search (see the beginning of Section 4).

5.5 Transductive Versus Semi-supervised Learning

S3VMs were introduced as Transductive SVMs, originally designed for the task of directly estimating the labels of the unlabeled points. However, S3VMs provide a decision boundary in the entire input space, and so they can also provide labels for unseen test points. For this reason, we believe that S3VMs are inductive semi-supervised methods and not strictly transductive. A discussion of the differences between semi-supervised learning and transduction can be found in Chapelle et al. (2006b, Chapter 25).15

15. The paper is available at http://www.kyb.tuebingen.mpg.de/ssl-book/discussion.pdf.

We expect S3VMs to perform equally well on the unlabeled set and on an unseen test set. To test this hypothesis, we took out 25% of the unlabeled set and used it as an unseen test set. We performed 420 experiments (6 algorithms, 7 data sets and 10 splits). Based on these 420 pairs of error rates, we did not observe a significant difference at the 5% confidence level. Also, for each of the 7 data sets (respectively, each of the 6 algorithms), there was no statistically significant difference in the corresponding 60 (respectively, 70) pairs of error rates.

Similar experiments were performed by Collobert et al. (2006, Section 6.2). Based on 10 splits, the error rate was found to be better on the unlabeled set than on the test set. The authors made the hypothesis that when the test and training data are not identically distributed, transduction can be helpful. Indeed, because of small-sample effects, the unlabeled and test sets can appear not to come from the same distribution (this is even more likely in high dimensions). We considered the g50c data set and biased the split between the unlabeled and test sets such that there is an angle between the principal directions of the two sets. Figure 10 shows a correlation between the difference in error rates and this angle: the error on the test set deteriorates as the angle increases. This confirms the hypothesis stated above.

[Figure 10: The test set of g50c has been chosen with a bias such that there is an angle between the principal directions of the unlabeled and test sets. The figure shows the difference in error rates (negative means a better error rate on the unlabeled set) as a function of the angle for different random (biased) splits.]
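The quantity on the horizontal axis of Figure 10 is the angle between the principal directions of the unlabeled and test sets. The paper does not spell out this computation; the following Python sketch shows one natural way to obtain it (leading principal component of each set, then the angle between the two directions), which we assume matches the intent.

    import numpy as np

    def principal_direction(X):
        # Leading principal direction of X (rows are examples): the first
        # right singular vector of the centered data matrix.
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[0]

    def principal_angle(X_unlabeled, X_test):
        # Angle (in radians) between the leading principal directions of the
        # two sets; a direction is defined up to sign, hence the absolute value.
        u = principal_direction(X_unlabeled)
        v = principal_direction(X_test)
        return np.arccos(np.clip(abs(np.dot(u, v)), 0.0, 1.0))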
5.6 Role of Data Geometry

There is empirical evidence that S3VMs exhibit poor performance on "manifold"-type data or when the data contains many distinct sub-clusters (Chapelle and Zien, 2005). We now explore this issue and propose a hybrid method combining S3VMs and LapSVM (Sindhwani et al., 2005).

5.6.1 Effect of Multiple Clusters

In Chapelle et al. (2006a), cS3VM exhibited poor performance on multiclass problems with one-versus-the-rest training, but worked well on pairwise binary problems that were constructed (using all the labels) from the same multiclass data sets. Note that in practice it is not possible to do semi-supervised one-versus-one multiclass training because the labels of the unlabeled points are truly unknown. We compared the different algorithms on pairwise binary classification tasks for all the multiclass problems. Results are shown in the first four columns of Table 14. Except on Coil3, most S3VM algorithms show improved performance. There are two candidate explanations for this behavior:

1. The binary classification problems in one-versus-the-rest are unbalanced, and this creates difficulties for S3VMs.

2. In a one-versus-the-rest approach, the negative class is the concatenation of several classes and is thus made up of several clusters. This might accentuate the local minima problem of S3VMs.

             Uspst   Isolet   Coil20   Coil3   Uspst2
∇S3VM        1.9     4.8      2.8      45.8    15.6
cS3VM        2       11.8     3.9      48.7    25.7
CCCP         2.8     5.5      2.9      40.3    16.6
S3VMlight    2.9     5.7      3.1      41.3    16.1
∇DA          4       5        2.7      44.5    20
Newton       1.9     6.3      2.4      47.2    16
SVM          5.3     7.3      3.3      36.7    17.7
SVM-5cv      0.9     1.2      0        0       3

Table 14: Experiments in a pairwise binary setting. Uspst2 is the same set as Uspst, but the task is to classify digits 0 to 4 versus 5 to 9.

To test these hypotheses, we created a binary version of Uspst by classifying digits 0 to 4 versus 5 to 9: this data set (Uspst2 in Table 14) is balanced, but each class is made of several clusters. The fact that the S3VM algorithms were not able to perform significantly better than the SVM baseline tends to support the second hypothesis: S3VM results deteriorate when there are several clusters per class.

5.6.2 Hybrid S3VM-Graph Methods

Recall that the results in Table 8 reinforce the empirical evidence that S3VMs do not return state-of-the-art performance on "manifold"-like data sets. On these data sets the cluster assumption is presumably satisfied under an intrinsic "geodesic" distance rather than the original Euclidean distance between data points. It is reasonable, therefore, to attempt to combine S3VMs with choices of kernels that conform to the particular geometry of the data. LapSVM (Sindhwani et al., 2005) is a popular semi-supervised algorithm in which a kernel function is constructed from the Laplacian of an adjacency graph that models the data geometry; this kernel is then used with a standard SVM. This procedure was shown to be very effective on data sets with a manifold structure. In Table 15 we report results with a hybrid S3VM-graph method: the kernel is constructed as in LapSVM,16 but it is then plugged into S3VMlight. Such a hybrid was first described in Chapelle et al. (2006a), where cS3VM was combined with LapSVM. The results of this hybrid method are very satisfactory, often outperforming both LapSVM and S3VMlight. Such a method combines the strengths of both the S3VM and graph-based approaches: the S3VM adds robustness to the construction of the graph, while the graph enforces the right cluster structure, which alleviates the local minima problems of S3VMs. We believe that this kind of technique is probably one of the most robust and powerful ways to approach a semi-supervised learning problem.
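In this hybrid, only the kernel handed to the S3VM solver changes. The sketch below builds a graph-deformed Gram matrix in the spirit of LapSVM (Sindhwani et al., 2005): a k-nearest-neighbor graph, its normalized Laplacian L, and the deformed matrix K(I + γLK)^{-1}, which can then be passed to any kernel S3VM such as S3VMlight. The exact graph construction, Laplacian power and hyperparameters used in the paper (see footnote 16) may differ; gamma and n_neighbors here are assumed values, and this is a sketch rather than the authors' implementation.

    import numpy as np
    from scipy.spatial.distance import cdist

    def rbf_gram(X, sigma):
        # Gaussian (RBF) Gram matrix on the labeled + unlabeled points.
        return np.exp(-cdist(X, X, 'sqeuclidean') / (2.0 * sigma ** 2))

    def knn_laplacian(X, n_neighbors=10):
        # Normalized Laplacian of a symmetrized k-nearest-neighbor graph
        # with binary weights (an assumed, common construction).
        n = X.shape[0]
        D2 = cdist(X, X, 'sqeuclidean')
        W = np.zeros((n, n))
        nearest = np.argsort(D2, axis=1)[:, 1:n_neighbors + 1]
        for i in range(n):
            W[i, nearest[i]] = 1.0
        W = np.maximum(W, W.T)
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
        return np.eye(n) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

    def deformed_gram(X, sigma, gamma=1.0, n_neighbors=10):
        # Graph-deformed Gram matrix K (I + gamma * L * K)^{-1}; feeding such
        # a kernel to a standard S3VM gives the hybrid of Section 5.6.2.
        K = rbf_gram(X, sigma)
        L = knn_laplacian(X, n_neighbors)
        n = X.shape[0]
        Kt = K @ np.linalg.inv(np.eye(n) + gamma * (L @ K))
        return 0.5 * (Kt + Kt.T)   # symmetrize against round-off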
16. Hyperparameters were chosen based on the experiments in Sindhwani et al. (2005), without any extensive tuning.

                                      g50c   Text   Uspst   Isolet   Coil20   Coil3   2moons
LapSVM                                6.4    11     11.4    41.2     11.9     20.6    7.8
Exact r (Table 8 setting)
  S3VMlight                           6.2    8.1    15.5    30.0     25.3     56.7    68.8
  LapSVM-S3VMlight                    4.6    8.3    8.8     46.5     12.5     17.9    5.1
Estimated r (Table 13 setting)
  S3VMlight                           7.5    9.2    24.4    36.0     25.3     56.7    68.8
  LapSVM-S3VMlight                    6.1    9.0    19.6    49.0     12.5     17.9    5.1

Table 15: Comparison of a graph-based method, LapSVM (Sindhwani et al., 2005), with S3VMlight and the hybrid LapSVM-S3VMlight, under the settings of Tables 8 and 13.

6. A Note on Computational Complexity

Even though a detailed comparison of the computational complexity of the different algorithms is beyond the scope of this paper, we can still give the following rough picture. First, all methods use annealing, and so the complexity depends on the number of annealing steps. This dependency is probably sublinear because when the number of steps is large, retraining after each (small) step is less costly. We can divide the methods into two categories:

1. Methods whose complexity is of the same order as that of a standard SVM trained with the predicted labels of the unlabeled points, which is O(n_sv^3 + n^2), where n_sv is the number of points lying in a non-linear part of the loss function. This is clearly the case for S3VMlight, since it relies on an SVM optimization. Note that the training time of this algorithm can be sped up by swapping labels in batches rather than one by one (Sindhwani and Keerthi, 2006). CCCP is also in this category, as the optimization problem it has to solve at each step is very close to an SVM. Finally, even though the Newton method is not directly solved via an SVM, its complexity is also O(n_sv^3 + n^2), and we include it in this category. For both CCCP and Newton, the possibility of having a flat part in the middle of the loss function can reduce n_sv and thus the complexity. We have indeed observed with the Newton method that convergence can be an order of magnitude faster when the loss includes this flat part in the middle (see Table 3).

2. Gradient-based methods, namely ∇S3VM, cS3VM and ∇DA, which do not have any linear part in the objective function and so scale as O(n^3). Note that it is possible to devise a gradient-based method with a loss function that contains some linear parts; the complexity of such an algorithm would be O(n n_sv^2 + n^2).

The original DA algorithm alternates between the optimization of w and p_u and can be understood as block coordinate optimization. We found that it was significantly slower than the other algorithms; its direct gradient-based optimization counterpart, ∇DA, is usually much faster.

Finally, note that even if the algorithms of the second category have a complexity of O(n^3), one can find an approximate solution by reducing the dimensionality of the problem from n to m and obtain a complexity of O(nm^2). For instance, Chapelle et al. (2006a) report a speed-up of 100 times without loss in accuracy for cS3VM.
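The reduction from O(n^3) to O(nm^2) amounts to working with an m-dimensional representation of the n points. One standard way to obtain such a representation is a low-rank (Nyström-type) approximation of the kernel matrix based on m landmark points, sketched below; this is offered only as an illustration and is not necessarily the scheme used by Chapelle et al. (2006a).

    import numpy as np

    def nystrom_features(K_nm, K_mm):
        # K_nm: kernel block between all n points and m landmark points;
        # K_mm: kernel matrix of the landmarks. Returns an n x m feature map
        # Phi with Phi @ Phi.T approximating the full n x n kernel matrix,
        # so that gradient-based training can work in m dimensions.
        eigvals, eigvecs = np.linalg.eigh(K_mm)
        eigvals = np.maximum(eigvals, 1e-12)   # guard against round-off
        return K_nm @ eigvecs / np.sqrt(eigvals)

A gradient-based S3VM applied to these features then scales with m rather than with n.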
Ultimately, a semi-supervised learning algorithm should be able to handle data sets with millions of unlabeled points. The best way of scaling up S3VMs is still an open question and should be a topic of future research.

7. Conclusion

When practical S3VM implementations fail to give good results on a problem, one might suspect that either (a) the cluster assumption does not hold; (b) the cluster assumption holds but local minima problems are severe; or (c) the S3VM objective function is unable to implement the cluster assumption. We began our empirical study by benchmarking current S3VM implementations against a global optimizer. Our results (see Section 5.3) narrowed the cause of performance loss down to suboptimal local minima, and established a correlation between generalization performance and the S3VM objective function. For problems where the cluster assumption is true, we expect the S3VM objective to indeed be an appropriate quantity to minimize. Due to non-convexity, however, this minimization is not completely straightforward; an assortment of optimization techniques has been brought to bear on this problem with varying degrees of success. In this paper, we have reviewed these techniques and studied them empirically, taking several subtle implementation differences into account. Under a neutral experimental protocol, we were unable to identify any single technique as being consistently superior to another in terms of generalization performance. We believe better methods are still needed to optimize the S3VM objective function.

While S3VMs return good performance on textual data sets, they are currently not competitive with graph-based methods on domains such as image classification, which are often characterized by multiple, highly non-Gaussian clusters. A particularly promising class of techniques (see Section 5.6.2) is based on combining S3VMs with graph methods.

S3VMs have been only sparingly explored in domains other than text and image classification. New application domains may provide additional insight into the behavior of S3VM methods while enhancing their general appeal for semi-supervised learning. Another major theme is the extension of S3VMs to structured output problems, possibly building on one of several recent lines of work for handling complex supervised learning tasks. A first step towards such an extension would require a clear statement of the cluster assumption applicable to the semi-supervised structured output setting. These are fertile areas for future research.

References

H. Abdi. Kendall rank correlation. In N. J. Salkind, editor, Encyclopedia of Measurement and Statistics. SAGE, 2006.

A. Astorino and A. Fuduli. Nonsmooth optimization techniques for semi-supervised classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2135–2142, 2007.

Y. Bengio and Y. Grandvalet. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, volume 17, 2004.

K. Bennett and A. Demiriz. Semi-supervised support vector machines. In Advances in Neural Information Processing Systems 12, 1998.

T. De Bie and N. Cristianini. Semi-supervised learning using semi-definite programming. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning. MIT Press, 2006.

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers.
In Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, New York, NY, 1992.

O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5):1155–1178, 2007.

O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Tenth International Workshop on Artificial Intelligence and Statistics, 2005.

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131–159, 2002.

O. Chapelle, M. Chi, and A. Zien. A continuation method for semi-supervised SVMs. In International Conference on Machine Learning, 2006a.

O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006b. URL http://www.kyb.tuebingen.mpg.de/ssl-book.

O. Chapelle, V. Sindhwani, and S. Keerthi. Branch and bound for semi-supervised support vector machines. In Advances in Neural Information Processing Systems, 2006c.

Y. Chen, G. Wang, and S. Dong. Learning with progressive transductive support vector machine. Pattern Recognition Letters, 24(12):1845–1855, 2003.

R. Cole, Y. Muthusamy, and M. Fanty. The ISOLET spoken letter database. Technical Report CS/E 90-004, Oregon Graduate Institute, 1990.

R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687–1712, 2006.

R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, 1987.

G. Fung and O. Mangasarian. Semi-supervised support vector machines for unlabeled data classification. Optimization Methods and Software, 15:29–44, 2001.

T. Joachims. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning, 1999.

H. Liu and S.-T. Huang. Fuzzy transductive support vector machines for hypertext classification. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 12(1):21–36, 2004.

S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-20). Technical Report CUCS-005-96, Columbia University, USA, 1996.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

M. Seeger. A taxonomy of semi-supervised learning methods. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning. MIT Press, 2006.

M. Silva, T. Maia, and A. Braga. An evolutionary approach to transduction in support vector machines. In Fifth International Conference on Hybrid Intelligent Systems, pages 329–334, 2005.

V. Sindhwani and S. S. Keerthi. Large scale semi-supervised linear SVMs. In SIGIR, 2006.

V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: From transductive to semi-supervised learning. In International Conference on Machine Learning, 2005.

V. Sindhwani, S. Keerthi, and O. Chapelle. Deterministic annealing for semi-supervised kernel machines. In International Conference on Machine Learning, 2006.

M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In Advances in Neural Information Processing Systems, volume 14, 2001.

V. Vapnik and A. Sterin. On structural risk minimization or overall risk in a problem of pattern recognition. Automation and Remote Control, 10(3):1495–1503, 1977.

V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., New York, 1998.

L. Wang, X. Shen, and W. Pan. On transductive support vector machines. In J. Verducci, X. Shen, and J.
Lafferty, editors, Prediction and Discovery. American Mathematical Society, 2007.

W. Wapnik and A. Tscherwonenkis. Theorie der Zeichenerkennung. Akademie Verlag, Berlin, 1979.

Z. Wu. The effective energy transformation scheme as a special continuation approach to global optimization with application to molecular conformation. SIAM Journal on Optimization, 6(3):748–768, 1996.

L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Advances in Neural Information Processing Systems, 2004.

A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.

X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.