[arXiv]

Learning with a Drifting Target Concept
Steve Hanneke, Varun Kanade, and Liu Yang
arXiv:1505.05215v1 [cs.LG] 20 May 2015
1
2
Princeton, NJ USA.
[email protected]
Département d’informatique, École normale supérieure, Paris, France.
[email protected]
3
IBM T.J. Watson Research Center, Yorktown Heights, NY USA.
[email protected]
Abstract. We study the problem of learning in the presence of a drifting target concept. Specifically, we provide bounds on the error rate at a
given time, given a learner with access to a history of independent samples labeled according to a target concept that can change on each round.
One of our main contributions is a refinement of the best previous results
for polynomial-time algorithms for the space of linear separators under
a uniform distribution. We also provide general results for an algorithm
capable of adapting to a variable rate of drift of the target concept. Some
of the results also describe an active learning variant of this setting, and
provide bounds on the number of queries for the labels of points in the
sequence sufficient to obtain the stated bounds on the error rates.
1
Introduction
Much of the work on statistical learning has focused on learning settings in
which the concept to be learned is static over time. However, there are many
application areas where this is not the case. For instance, in the problem of
face recognition, the concept to be learned actually changes over time as each
individual’s facial features evolve over time. In this work, we study the problem
of learning with a drifting target concept. Specifically, we consider a statistical
learning setting, in which data arrive i.i.d. in a stream, and for each data point,
the learner is required to predict a label for the data point at that time. We
are then interested in obtaining low error rates for these predictions. The target
labels are generated from a function known to reside in a given concept space,
and at each time t the target function is allowed to change by at most some
distance ∆t : that is, the probability the new target function disagrees with the
previous target function on a random sample is at most ∆t .
This framework has previously been studied in a number of articles. The
classic works of [HL91,HL94,BH96,Lon99,BBDK00] and [BL97] together provide
a general analysis of a very-much related setting. Though the objectives in these
works are specified slightly differently, the results established there are easily
translated into our present framework, and we summarize many of the relevant
results from this literature in Section 3.
2
Steve Hanneke, Varun Kanade, and Liu Yang
While the results in these classic works are general, the best guarantees on
the error rates are only known for methods having no guarantees of computational efficiency. In a more recent effort, the work of [CMEDV10] studies this
problem in the specific context of learning a homogeneous linear separator, when
all the ∆t values are identical. They propose a polynomial-time algorithm (based
on the modified Perceptron algorithm of [DKM09]), and prove a bound on the
number of mistakes it makes as a function of the number of samples, when the
data distribution satisfies a certain condition called “λ-good” (which generalizes a useful property of the uniform distribution on the origin-centered unit
sphere). However, their result is again worse than that obtainable by the known
computationally-inefficient methods.
Thus, the natural question is whether there exists a polynomial-time algorithm achieving roughly the same guarantees on the error rates known for the
inefficient methods. In the present work, we resolve this question in the case
of learning homogeneous linear separators under the uniform distribution, by
proposing a polynomial-time algorithm that indeed achieves roughly the same
bounds on the error rates known for the inefficient methods in the literature.
This represents the main technical contribution of this work.
We also study the interesting problem of adaptivity of an algorithm to the
sequence of ∆t values, in the setting where ∆t may itself vary over time. Since
the values ∆t might typically not be accessible in practice, it seems important
to have learning methods having no explicit dependence on the sequence ∆t .
We propose such a method below, and prove that it achieves roughly the same
bounds on the error rates known for methods in the literature which require
direct access to the ∆t values. Also in the context of variable ∆t sequences, we
discuss conditions on the sequence ∆t necessary and sufficient for there to exist
a learning method guaranteeing a sublinear rate of growth of the number of
mistakes.
We additionally study an active learning extension to this framework, in
which, at each time, after making its prediction, the algorithm may decide
whether or not to request access to the label assigned to the data point at
that time. In addition to guarantees on the error rates (for all times, including
those for which the label was not observed), we are also interested in bounding
the number of labels we expect the algorithm to request, as a function of the
number of samples encountered thus far.
2
Definitions and Notation
Formally, in this setting, there is a fixed distribution P over the instance space X ,
and there is a sequence of independent P-distributed unlabeled data X1 , X2 , . . ..
There is also a concept space C, and a sequence of target functions h∗ =
{h∗1 , h∗2 , . . .} in C. Each t has an associated target label Yt = h∗t (Xt ). In this
context, a (passive) learning algorithm is required, on each round t, to produce a classifier ĥt based on the observations (X1 , Y1 ), . . . , (Xt−1 , Yt−1 ), and
we denote by Ŷt = ĥt (Xt ) the corresponding prediction by the algorithm for
Learning with a Drifting Target Concept
3
the label of Xt . For any classifier h, we define ert (h) = P(x : h(x) 6= h∗t (x)).
We also say the algorithm makes a “mistake” on instance Xt if Ŷt 6= Yt ; thus,
ert (ĥt ) = P(Ŷt 6= Yt |(X1 , Y1 ), . . . , (Xt−1 , Yt−1 )).
For notational convenience, we will suppose the h∗t sequence is chosen independently from the Xt sequence (i.e., h∗t is chosen prior to the “draw” of
X1 , X2 , . . . ∼ P), and is not random.
In each of our results, we will suppose h∗ is chosen from some set S of
sequences in C. In particular, we are interested in describing the sequence h∗ in
terms of the magnitudes of changes in h∗t from one time to the next. Specifically,
for any sequence ∆ = {∆t }∞
t=2 in [0, 1], we denote by S∆ the set of all sequences
h∗ in C such that, ∀t ∈ N, P(x : ht (x) 6= ht+1 (x)) ≤ ∆t+1 .
Throughout this article, we denote by d the VC dimension of C [VC71], and
we suppose C is such that 1 ≤ d < ∞. Also, for any x ∈ R, define Log(x) =
ln(max{x, e}).
3
Background: (ǫ, S)-Tracking Algorithms
As mentioned, the classic literature on learning with a drifting target concept is
expressed in terms of a slightly different model. In order to relate those results to
our present setting, we first introduce the classic setting. Specifically, we consider
a model introduced by [HL94], presented here in a more-general form inspired
by [BBDK00]. For a set S of sequences {ht }∞
t=1 in C, and a value ǫ > 0, an
algorithm A is said to be (ǫ, S)-tracking if ∃tǫ ∈ N such that, for any choice of
h∗ ∈ S, ∀T ≥ tǫ , the prediction ŶT produced by A at time T satisfies
P ŶT 6= YT ≤ ǫ.
Note that the value of the probability in the above expression may be influenced
by {Xt }Tt=1 , {h∗t }Tt=1 , and any internal randomness of the algorithm A.
The focus of the results expressed in this classical model is determining sufficient conditions on the set S for there to exist an (ǫ, S)-tracking algorithm,
along with bounds on the sufficient size of tǫ . These conditions on S typically
take the form of an assumption on the drift rate, expressed in terms of ǫ. Below,
we summarize several of the strongest known results for this setting.
3.1
Bounded Drift Rate
The simplest, and perhaps most elegant, results for (ǫ, S)-tracking algorithms
is for the set S of sequences with a bounded drift rate. Specifically, for any
∆ ∈ [0, 1], define S∆ = S∆ , where ∆ is such that ∆t+1 = ∆ for every t ∈ N.
The study of this problem was initiated in the original work of [HL94]. The
best known general results are due to [Lon99]: namely, that for some ∆ǫ =
Θ(ǫ2 /d), for every ǫ ∈ (0, 1], there exists an (ǫ, S∆ )-tracking algorithm for all
values of ∆ ≤ ∆ǫ .4 This refined an earlier result of [HL94] by a logarithmic
4
In fact, [Lon99] also allowed the distribution P to vary gradually over time. For
simplicity, we will only discuss the case of fixed P.
4
Steve Hanneke, Varun Kanade, and Liu Yang
factor. [Lon99] further argued that this result can be achieved with tǫ = Θ(d/ǫ).
The algorithm itself involves a beautiful modification of the one-inclusion graph
prediction strategy of [HLW94]; since its specification is somewhat involved, we
refer the interested reader to the original work of [Lon99] for the details.
3.2
Varying Drift Rate: Nonadaptive Algorithm
In addition to the concrete bounds for the case h∗ ∈ S∆ , [HL94] additionally
present an elegant general
result.
Pm Specifically, they argue that, for any ǫ > 0,
and any m = Ω dǫ Log 1ǫ , if i=1 P(x : h∗i (x) 6= h∗m+1 (x)) ≤ mǫ/24, then for
Pm
ĥ = argminh∈C i=1 1[h(Xi ) 6= Yi ], P(ĥ(Xm+1 ) 6= h∗m+1 (Xm+1 )) ≤ ǫ.5 This result immediately inspires an algorithm A which, at everyPtime t, chooses a value
mt ≤ t−1, and predicts Ŷt = ĥt (Xt ), for ĥt = argminh∈C t−1
i=t−mt 1[h(Xi ) 6= Yi ].
We are then interested in choosing mt to minimize the value of ǫ obtainable via
the result of [HL94]. However, that method is based on the values P(x : h∗i (x) 6=
h∗t (x)), which would typically not be accessible to the algorithm. However, suppose instead we have access to a sequence ∆ such that h∗ ∈ S∆P
. In this case,
t
we could approximate P(x : h∗i (x) 6= h∗t (x)) by its upper bound
j=i+1 ∆j . In
this case,
we
are
interested
choosing
m
to
minimize
the
smallest
value
of ǫ such
t
Pt−1
Pt
that i=t−mt j=i+1 ∆j ≤ mt ǫ/24 and mt = Ω dǫ Log 1ǫ . One can easily verify
that this minimum is obtained at a value


t−1
t
X
X
dLog(m/d)
1
,
∆j +
mt = Θ argmin
m
m≤t−1 m i=t−m j=i+1
and via the result of [HL94] (applied to the sequence Xt−mt , . . . , Xt ) the resulting
algorithm has


t−1
t
X
X
1
dLog(m/d)
.
P Ŷt 6= Yt ≤ O  min
∆j +
(1)
1≤m≤t−1 m
m
i=t−m j=i+1
As a special case, if every
p t has ∆t = ∆ for a fixed value ∆ ∈ [0, 1], this
result recovers the bound d∆Log(1/∆), which is only slightly larger than that
obtainable from the best bound of [Lon99]. It also applies to far more general
and more intersting sequences ∆, including some that allow periodic large jumps
(i.e., ∆t = 1 for some indices t), others where the sequence ∆t converges to 0, and
so on. Note, however, that the algorithm obtaining this bound directly depends
on the sequence ∆. One of the contributions of the present work is to remove this
requirement, while maintaining essentially the same bound, though in a slightly
different form.
5
They in fact prove a more general result, which also applies to methods approximately minimizing the number of mistakes, but for simplicity we will only discuss
this basic version of the result.
Learning with a Drifting Target Concept
3.3
5
Computational Efficiency
[HL94] also proposed a reduction-based approach, which sometimes yields computationally efficient methods, though the tolerable ∆ value is smaller. Specifically, given any (randomized)
polynomial-time algorithm A that produces a clasP
1
[h(x
) 6= yt ] = 0 for any sequence (x1 , y1 ), . . . , (xm , ym )
sifier h ∈ C with m
t
t=1
for which such a classifier h exists (called the consistency problem), they propose
′
a polynomial-time
algorithm
that is (ǫ, S∆ )-tracking for all values of ∆ ≤ ∆ǫ ,
2
ǫ
where ∆′ǫ = Θ d2 Log(1/ǫ)
. This is slightly worse (by a factor of dLog(1/ǫ))
than the drift rate tolerable by the (typically inefficient) algorithm mentioned
above. However, it does sometimes yield computationally-efficient methods. For
instance, there are known polynomial-time algorithms for the consistency problem for the classes of linear separators, conjunctions, and axis-aligned rectangles.
3.4
Lower Bounds
[HL94] additionally prove lower bounds for specific concept spaces: namely, linear
separators and axis-aligned rectangles. They specifically argue that, for C a
concept space
BASICn = {∪ni=1 [i/n, (i + ai )/n) : a ∈ [0, 1]n }
on [0, 1], under P the uniform distribution on [0, 1], for any ǫ ∈ [0, 1/e2 ] and
∆ǫ ≥ e4 ǫ2 /n, for any algorithm A, and any T ∈ N, there exists a choice
ofh∗ ∈ S∆ǫ such that the prediction ŶT produced by A at time T satisfies
P ŶT 6= YT > ǫ. Based on this, they conclude that no (ǫ, S∆ǫ )-tracking algorithm exists. Furthermore, they observe that the space BASICn is embeddable
in many commonly-studied concept spaces, including halfspaces and axis-aligned
rectangles in Rn , so that for C equal to either of these spaces, there also is no
(ǫ, S∆ǫ )-tracking algorithm.
4
Adapting to Arbitrarily Varying Drift Rates
This section presents a general bound on the error rate at each time, expressed
as a function of the rates of drift, which are allowed to be arbitrary. Mostimportantly, in contrast to the methods from the literature discussed above, the
method achieving this general result is adaptive to the drift rates, so that it
requires no information about the drift rates in advance. This is an appealing
property, as it essentially allows the algorithm to learn under an arbitrary sequence h∗ of target concepts; the difficulty of the task is then simply reflected
in the resulting bounds on the error rates: that is, faster-changing sequences of
target functions result in larger bounds on the error rates, but do not require a
change in the algorithm itself.
6
Steve Hanneke, Varun Kanade, and Liu Yang
4.1
Adapting to a Changing Drift Rate
Recall that the method yielding (1) (based on the work of [HL94]) required
access to the sequence ∆ of changes to achieve the stated guarantee on the
expected number of mistakes. That method is based on choosing a classifier
to predict Ŷt by minimizing the number of mistakes among the previous mt
samples, where mt is a value chosen based on the ∆ sequence. Thus, the key to
modifying this algorithm to make it adaptive to the ∆ sequence is to determine
a suitable choice of mt without reference to the ∆ sequence. The strategy we
adopt here is to use the data to determine an appropriate value m̂t to use.
Roughly (ignoring logarithmic factors for now), the insight that enables us to
achieve this feat is that, for the mt used in the above strategy, one can show that
P
t−1
∗
i=t−mt 1[ht (Xi ) 6= Yi ] is roughly Õ(d), and that making the prediction Ŷt with
any h ∈ C with roughly Õ(d) mistakes on these samples will suffice to obtain the
stated bound on the error rate (up to logarithmic factors). Thus, if we replace
Pt−1
mt with the largest value m for which minh∈C i=t−m 1[h(Xi ) 6= Yi ] is roughly
Õ(d), then the above observation implies m ≥ mt . This then implies that, for
Pt−1
Pt−1
ĥ = argminh∈C i=t−m 1[h(Xi ) 6= Yi ], we have that i=t−mt 1[ĥ(Xi ) 6= Yi ] is
also roughly Õ(d), so that the stated bound on the error rate will be achieved
(aside from logarithmic factors) by choosing ĥt as this classifier ĥ. There are
a few technical modifications to this argument needed to get the logarithmic
factors to work out properly, and for this reason the actual algorithm and proof
below are somewhat more involved. Specifically, consider the following algorithm
(the value of the universal constant K ≥ 1 will be specified below).
0. For T = 1, 2, . . . 1.
2.
Let m̂T = max m ∈ {1, . . . , T −1} : min max
′
Let ĥT = argmin max
′
h∈C
m ≤m̂T
PT −1
1
h∈C m ≤m
[h(Xt )6=Yt ]
t=T −m′
dLog(m′ /d)+Log(1/δ)
PT −1
1
[h(Xt )6=Yt ]
t=T −m′
dLog(m′ /d)+Log(1/δ)
<K
Note that the classifiers ĥt chosen by this algorithm have no dependence on
∆, or indeed anything other than the data {(Xi , Yi ) : i < t}, and the concept
space C.
Theorem 1. Fix any δ ∈ (0, 1), and let A be the above algorithm. For any
sequence ∆ in [0, 1], for any P and any choice of h∗ ∈ S∆ , for every T ∈ N\{1},
with probability at least 1 − δ,


T
−1
T
X
X
dLog(m/d)
+
Log(1/δ)
1
.
∆j +
erT ĥT ≤ O  min
1≤m≤T −1 m
m
j=i+1
i=T −m
Before presenting the proof of this result, we first state a crucial lemma, which
follows immediately from a classic result of [Vap82,Vap98], combined with the
fact (from [Vid03], Theorem 4.5) that the VC dimension of the collection of sets
{{x : h(x) 6= g(x)} : h, g ∈ C} is at most 10d.
Learning with a Drifting Target Concept
7
Lemma 1. There exists a universal constant c ∈ [1, ∞) such that, for any class
C of VC dimension d, ∀m ∈ N, ∀δ ∈ (0, 1), with probability at least 1 − δ, every
h, g ∈ C have
m
1 X
1[h(Xt ) 6= g(Xt)]
P(x : h(x) 6= g(x)) −
m t=1
v
!
u
m
u 1 X
dLog(m/d) + Log(1/δ)
1[h(Xt ) 6= g(Xt )]
≤ ct
m t=1
m
+c
dLog(m/d) + Log(1/δ)
.
m
We are now ready for the proof of Theorem 1. For the constant K in the
algorithm, we will choose K = 145c2 , for c as in Lemma 1.
Proof (Proof of Theorem 1). Fix any T ∈ N with T ≥ 2, and define
(
m∗T = max m ∈ {1, . . . , T − 1} : ∀m′ ≤ m,
T
−1
X
t=T −m′
)
1[h∗T (Xt ) 6= Yt ] < K(dLog(m′ /d) + Log(1/δ)) .
Note that
T
−1
X
t=T −m∗
T
1[h∗T (Xt ) 6= Yt ] ≤ K(dLog(m∗T /d) + Log(1/δ)),
(2)
and also note that (since h∗T ∈ C) m̂T ≥ m∗T , so that (by definition of m̂T and
ĥT )
T
−1
X
1[ĥT (Xt ) 6= Yt ] ≤ K(dLog(m∗T /d) + Log(1/δ))
t=T −m∗
T
as well. Therefore,
T
−1
X
1[h∗T (Xt ) 6= ĥT (Xt )] ≤
t=T −m∗
T
T
−1
X
1[h∗T (Xt ) 6= Yt ] +
t=T −m∗
T
T
−1
X
1[Yt 6= ĥT (Xt )]
t=T −m∗
T
≤ 2K(dLog(m∗T /d) + Log(1/δ)).
Thus, by Lemma 1, for each m ∈ N, with probability at least 1 − δ/(6m2 ), if
m∗T = m, then
√
dLog(m∗T /d) + Log(6(m∗T )2 /δ)
P(x : ĥT (x) 6= h∗T (x)) ≤ (2K + c 2K + c)
.
m∗T
8
Steve Hanneke, Varun Kanade, and Liu Yang
√
2KdLog(m∗T /d), this is at most
Furthermore, since Log(6(m∗T )2 ) ≤
√
dLog(m∗T /d) + Log(1/δ)
2(K + c 2K)
.
m∗T
By aPunion bound (over values m ∈ N), we have that with probability at least
∞
1 − m=1 δ/(6m2 ) ≥ 1 − δ/3,
√
dLog(m∗T /d) + Log(1/δ)
P(x : ĥT (x) 6= h∗T (x)) ≤ 2(K + c 2K)
.
m∗T
Let us denote
m̃T =
T −1
1 X
m∈{1,...,T −1} m
T
X
argmin
∆j +
i=T −m j=i+1
dLog(m/d) + Log(1/δ)
.
m
Note that, for any m′ ∈ {1, . . . , T − 1} and δ ∈ (0, 1), if m̃T ≥ m′ , then
T −1
1 X
m∈{1,...,T −1} m
min
T
X
∆j +
i=T −m j=i+1
≥
T −1
1 X
m∈{m ,...,T −1} m
min
′
T
X
i=T −m j=i+1
dLog(m/d) + Log(1/δ)
m
∆j =
1
m′
T
−1
X
T
X
∆j ,
i=T −m′ j=i+1
while if m̃T ≤ m′ , then
T −1
1 X
m∈{1,...,T −1} m
min
T
X
i=T −m j=i+1
∆j +
dLog(m/d) + Log(1/δ)
m
dLog(m′ /d) + Log(1/δ)
dLog(m/d) + Log(1/δ)
.
=
≥
min ′
m
m′
m∈{1,...,m }
Either way, we have that
T
T −1
dLog(m/d) + Log(1/δ)
1 X X
∆j +
min
m
m∈{1,...,T −1} m
i=T −m j=i+1


−1
T
 dLog(m′ /d) + Log(1/δ) 1 TX

X
≥ min
∆
,
.
j


m′
m′
′ j=i+1
(3)
i=T −m
For any m ∈ {1, . . . , T − 1}, applying Bernstein’s inequality (see [BLM13],
equation 2.10) to the random variables 1[h∗T (Xi ) 6= Yi ]/d, i ∈ {T −m, . . . , T −1},
and again to the random variables −1[h∗T (Xi ) 6= Yi ]/d, i ∈ {T − m, . . . , T − 1},
together with a union bound, we obtain that, for any δ ∈ (0, 1), with probability
Learning with a Drifting Target Concept
9
at least 1 − δ/(3m2 ),
T −1
1 X
P(x : h∗T (x) 6= h∗i (x))
m
i=T −m
v
!
u
−1
u 1 TX
2 ln(3m2 /δ)
∗
∗
t
−
P(x : hT (x) 6= hi (x))
m
m
i=T −m
<
<
1
m
T
−1
X
i=T −m
1[h∗T (Xi ) 6= Yi ]
T −1
1 X
P(x : h∗T (x) 6= h∗i (x))
m
i=T −m
r

1 PT −1
∗
∗
i=T −m P(x : hT (x) 6= hi (x))
m
+ max
 (4/3) ln(3m2 /δ)
4 ln(3m2 /δ)
m
.
(4)
m
The left inequality implies that
T −1
1 X
P(x : h∗T (x) 6= h∗i (x)) ≤ max
m
i=T −m
(
)
T −1
8 ln(3m2 /δ)
2 X
∗
1[hT (Xi ) 6= Yi ],
.
m
m
i=T −m
Plugging this into the right inequality in (4), we obtain that
T −1
1 X
m
i=T −m
1[h∗T (Xi ) 6= Yi ] <
v
−1
u
u 1 TX
+ max t

m
i=T −m
T −1
1 X
P(x : h∗T (x) 6= h∗i (x))
m
i=T −m
1[h∗T (Xi ) 6= Yi ]
!
8 ln(3m2 /δ)
m

√
2
32 ln(3m /δ) 
,
.

m
By a union bound, this holds simultaneously for all m ∈ {1, . . . , T − 1} with
PT −1
probability at least 1 − m=1 δ/(3m2 ) > 1 − (2/3)δ. Note that, on this event,
we obtain
T −1
T −1
1 X
1 X
1[h∗T (Xi ) 6= Yi ]
P(x : h∗T (x) 6= h∗i (x)) >
m
m
i=T −m
i=T −m
v

!
u
√
T
−1
u 1 X
2
2
32 ln(3m /δ) 
8 ln(3m /δ)
− max t
1[h∗T (Xi ) 6= Yi ]
,
.


m
m
m
i=T −m
In particular, taking m = m∗T , and invoking maximality of m∗T , if m∗T < T − 1,
the right hand side is at least
√ dLog(m∗T /d) + Log(1/δ)
.
(K − 6c K)
m∗T
10
Steve Hanneke, Varun Kanade, and Liu Yang
PT −1 PT
PT −1
1
1
∗
∗
Since m
i=T −m
i=T −m P(x : hT (x) 6= hi (x)), taking
j=i+1 ∆j ≥ m
2
∗
K = 145c , we have that with probability at least 1 − δ, if mT < T − 1, then
T −1
T
1 X X
dLog(m/d) + Log(1/δ)
∆j +
m∈{1,...,T −1} m
m
i=T −m j=i+1


T
−1
T
 dLog(m∗ /d) + Log(1/δ) 1

X
X
√
T
∆
≥ 10(K + c 2K) min
,
j


m∗T
m∗T
∗
√
10(K + c 2K)
min
i=T −mT j=i+1
√
dLog(m∗T /d) + Log(1/δ)
≥ 10(K + c 2K)
m∗T
≥ P(x : ĥT (x) 6= h∗T (x)).
Furthermore, if m∗T = T −1, then we trivially have (on the same 1−δ probability
event as above)
√
10(K + c 2K)
T −1
1 X
m∈{1,...,T −1} m
√
≥ 10(K + c 2K)
min
T
X
∆j +
i=T −m j=i+1
dLog(m/d) + Log(1/δ)
m
dLog(m/d) + Log(1/δ)
m
√
dLog((T − 1)/d) + Log(1/δ)
= 10(K + c 2K)
T −1
√
dLog(m∗T /d) + Log(1/δ)
= 10(K + c 2K)
≥ P(x : ĥT (x) 6= h∗T (x)).
m∗T
min
m∈{1,...,T −1}
⊓
⊔
4.2
Conditions Guaranteeing a Sublinear Number of Mistakes
One immediate implication of Theorem 1 is that, if the sum of ∆t values grows
sublinearly, then there exists an algorithm achieving an expected number of
mistakes growing sublinearly in the number of predictions. Formally, we have
the following corollary.
PT
Corollary 1. If t=1 ∆t = o(T ), then there exists an algorithm A such that,
for every P and every choice of h∗ ∈ S∆ ,
#
" T
i
X h
1 Ŷt 6= Yt = o(T ).
E
t=1
Proof. For every T ∈ N with T ≥ 2, let
T −1
1 X
1≤m≤T −1 m
m̃T = argmin
T
X
i=T −m j=i+1
∆j +
dLog(m/d) + Log(1/δT )
,
m
Learning with a Drifting Target Concept
and define δT =
1
m̃T
11
. Then consider running the algorithm A from Theorem 1,
except that in choosing m̂T and ĥT for each T , we use the above value δT in
place of δ. Then Theorem 1 implies that, for each T , with probability at least
1 − δT ,

T −1
1 X
erT (ĥT ) ≤ O  min
1≤m≤T −1 m
i=T −m

dLog(m/d) + Log(1/δT ) 
∆j +
.
m
j=i+1
T
X
Since erT (ĥT ) ≤ 1, this implies that
h
i
P ŶT 6= YT = E erT (ĥT )


T
−1
T
X
X
1
dLog(m/d)
+
Log(1/δ
)
T
 + δT
≤ O  min
∆j +
1≤m≤T −1 m
m
i=T −m j=i+1


T
T
−1
X
X
dLog(m/d)
+
Log(m)
1
,
∆j +
= O  min
1≤m≤T −1 m
m
j=i+1
i=T −m
and since x 7→ xLog(m/x) is nondecreasing for x ≥ 1, Log(m) ≤ dLog(m/d), so
that this last expression is

O
min
1≤m≤T −1
1
m
T
−1
X
T
X
∆j +
i=T −m j=i+1

dLog(m/d) 
.
m
Now note that, for any t ∈ N and m ∈ {1, . . . , t − 1},
t−1
t
t−1
t
t
X
X
1 X X
1 X
∆r ≤
∆r =
∆s .
m s=t−m r=s+1
m s=t−m r=t−m+1
r=t−m+1
Let βt (m) = max
dLog(m/d)
m
nP
t
r=t−m+1 ∆r ,
dLog(m/d)
m
o
, and note that
Pt
(5)
r=t−m+1 ∆r
+
≤ 2βt (m). Thus, combining the above with (5), linearity of expectations, and the fact that the probability of a mistake on a given round is at most
1, we obtain
#
" T
T
i
X
X h
1 Ŷt 6= Yt = O
E
t=1
t=1
min
m∈{1,...,t−1}
!
βt (m) ∧ 1 .
12
Steve Hanneke, Varun Kanade, and Liu Yang
Fixing any M ∈ N, we have that for any T > M ,
T
X
t=1
min
m∈{1,...,t−1}
βt (m) ∧ 1 ≤ M +
T
X
t=M+1
βt (M ) ∧ 1
#
t
X
dLog(M/d)
dLog(M/d)
≥
∆r
1
≤M+
M
M
r=t−M+1
t=M+1
"
#
T
t
X
X
dLog(M/d)
+
∆r >
1
M
T
X
"
t=M+1
≤M+
=
r=t−M+1
T
X
dLog(M/d)
M
T+
M
dLog(M/d)
t=M+1
dLog(M/d)
T + gM (T ),
M
t
X
∆r
r=t−M+1
where gM is a function satisfying gM (T ) = o(T ) (holding M fixed). Since this is
true of any M ∈ N, we have that
T
dLog(M/d) gM (T )
1 X
+
min
βt (m) ∧ 1 ≤ lim lim
M→∞ T →∞
T →∞ T
M
T
m∈{1,...,t−1}
t=1
lim
= lim
M→∞
so that E
hP
T
h
t=1 1 Ŷt 6= Yt
ii
dLog(M/d)
= 0,
M
= o(T ), as claimed.
⊓
⊔
PT
For many concept spaces of interest, the condition t=1 ∆t = o(T ) in Corollary 1 is also a necessary condition for any algorithm to guarantee a sublinear
number of mistakes. For simplicity, we will establish this for the class of homogeneous linear separators on R2 , with P the uniform distribution on the unit circle,
in the following theorem. This can easily be extended to many other spaces, including higher-dimensional linear separators or axis-aligned rectangles in Rk , by
embedding an analogous setup into those spaces.
Theorem 2. If X = {x ∈ R2 : kxk = 1}, P is Uniform(X ), and C = {x 7→
21[w · x ≥ 0] − 1 : w ∈ R2 , kwk = 1} is the class of homogeneous linear separators,h then for
∆ in [0, 1], there exists an algorithm A such
ii
h any sequence
PT
that E
= o(T ) for every choice of h∗ ∈ S∆ if and only if
t=1 1 Ŷt 6= Yt
PT
t=1 ∆t = o(T ).
Proof. The “if” part follows immediately
from Corollary 1. For the “only if”
PT
part, suppose ∆ is such that t=1 ∆t 6= o(T ). It suffices hto arguehthat foriiany
PT
6=
algorithm A, there exists a choice of h∗ ∈ S∆ for which E
t=1 1 Ŷt 6= Yt
o(T ). Toward this end, fix any algorithm A. We proceed by the probabilistic
Learning with a Drifting Target Concept
13
method, constructing a random sequence h∗ ∈ S∆ . Let B1 , B2 , . . . be independent Bernoulli(1/2) random variables (also independent from the unlabeled data
X1 , X2 , . . .). We define the sequence h∗ inductively. For simplicity, we will represent each classifier in polar coordinates, writing hφ (for φ ∈ R) to denote the
classifier that, for x = (x1 , x2 ), classifies x as hφ (x) = 21[x1 cos(φ) + x2 sin(φ) ≥
0] − 1; note that hφ = hφ+2π for every φ ∈ R. As a base case, start by defining
a function h∗0 = h0 , and letting φ0 = 0. Now for any t ∈ N, supposing h∗t−1
is already defined to be hφt−1 , we define φt = φt−1 + min{∆t , 1/2}πBt, and
h∗t = hφt . Note that P(x : h∗t (x) 6= h∗t−1 (x)) = min{∆t , 1/2} for every t ∈ N, so
that this inductively defines a (random) choice of h∗ ∈ S∆ .
For each t ∈ N, let Yt = h∗t (Xt ). Now fix any algorithm A, and consider the
sequence Ŷt of predictions the algorithm makes for points Xt , when the target
sequence h∗ is chosen as above. Then note that, for any t ∈ N, since Ŷt and Bt
are independent,
h i
P Ŷt 6= Yt ≥ E P Ŷt 6= Yt Ŷt , φt−1
1
≥ E P hφt−1 +min{∆t ,1/2}π (Xt ) 6= hφt−1 −min{∆t ,1/2}π (Xt ) φt−1 .
2
Furthermore, since min{∆t , 1/2}π ≤ π/2, the regions {x : hφt−1 +min{∆t ,1/2}π (x) 6=
hφt−1 (x)} and {x : hφt−1 −min{∆t ,1/2}π (x) 6= hφt−1 (x)} have zero-probability overlap (indeed, are disjoint if ∆t < 1/2), the above equals min{∆t , 1/2}.
By Fatou’s lemma, linearity of expectations, and the law of total expectation,
we have that
##
" T
"
T
X
1X 1
1[Ŷt 6= Yt ]h∗ ≥ lim sup
P Ŷt 6= Yt
E lim sup E
T →∞ T t=1
T →∞ T
t=1
≥ lim sup
T →∞
T
1X
min{∆t , 1/2}.
T t=1
P
Since Tt=1 ∆t 6= o(T ), the rightmost expression is strictly greater than zero.
Thus, it must be that, with probility strictly greater than 0,
#
" T
X
1
1[Ŷt 6= Yt ]h∗ > 0.
lim sup E
T →∞ T
t=1
In particular, this implies
ii a (nonrandom) choice of the sequence
hP thaththere exists
T
6= o(T ). Since this holds for any choice
h∗ ∈ S∆ for which E
Ŷ
=
6
Y
1
t
t
t=1
of the algorithm A, this completes the proof.
⊓
⊔
5
Polynomial-Time Algorithms for Linear Separators
In this section, we suppose ∆t = ∆ for every t ∈ N, for a fixed constant ∆ > 0,
and we consider the special case of learning homogeneous linear separators in
14
Steve Hanneke, Varun Kanade, and Liu Yang
Rk under a uniform distribution on the origin-centered unit sphere. In this case,
the analysis of [HL94] mentioned in Section√3.3 implies that it is possible to
achieve a bound on the error rate that is Õ(d ∆), using an algorithm that runs
in time poly(d, 1/∆, log(1/δ)) (and independent of t) for each prediction. This
also implies that it is possible
to achieve expected number of mistakes among T
√
predictions that is Õ(d ∆)×T . [CMEDV10]6 have since proven that a variant of
the Perceptron algorithm is capable of achieving an expected number of mistakes
Õ((d∆)1/4 ) × T .
Below, we improve on this result by showing that there
√ exists an efficient
algorithm that achieves a bound on the error rate that is Õ( d∆), as was possible
with the inefficient algorithm of [HL94,Lon99] mentioned in Section
√ 3.1. This
leads to a bound on the expected number of mistakes that is Õ( d∆) × T .
Furthermore, our approach also allows us to present the method as an active
learning algorithm, and to bound the√
expected number of queries, as a function
of the number of samples T , by Õ( d∆) × T . The technique is based on a
modification of the algorithm of [HL94], replacing an empirical risk minimization
step with (a modification of) the computationally-efficient algorithm of [ABL13].
Formally, define the class of homogeneous linear separators as the set of
classifiers hw : Rd → {−1, +1}, for w ∈ Rd with kwk = 1, such that hw (x) =
sign(w · x) for every x ∈ Rd .
5.1
An Improved Guarantee for a Polynomial-Time Algorithm
We have the following result.
Theorem 3. When C is the space of homogeneous linear separators (with d ≥ 4)
and P is the uniform distribution on the surface of the origin-centered unit sphere
in Rd , for any fixed ∆ > 0, for any δ ∈ (0, 1/e), there is an algorithm that runs
in time poly(d, 1/∆, log(1/δ)) for each time t, such that for any h∗ ∈ S∆ , for
every sufficiently large t ∈ N, with probability at least 1 − δ,
s
!
1
∆d log
.
ert (ĥt ) = O
δ
√
∆d∧1/e, the
expected number of mistakes
1
∆d log ∆d
among the first T instances is O
T . Furthermore, the algorithm
can be run as an active learning algorithm, in which case, for this choice of δ, the
expected
requested by the algorithm among the first T instances
√ number of labels
3/2
1
∆d log
is O
∆d T .
Also, running this algorithm withq
δ=
6
This work in fact studies a much broader model of drift, which in fact allows the
distribution P to vary with time as well. However, this Õ((d∆)1/4 ) × T result can
be obtained from their more-general theorem by calculating the various parameters
for this particular setting.
Learning with a Drifting Target Concept
15
We first state the algorithm used to obtain this result. It is primarily based
on a margin-based learning strategy of [ABL13], combined with an initialization
step based on a modified Perceptron
rule
from [DKM09,CMEDV10]. For τ > 0
and x ∈ R, define ℓτ (x) = max 0, 1 − τx . Consider the following algorithm and
subroutine; parameters δk , mk , τk , rk , bk , α, and κ will all be specified in the
P⌈log (1/α)⌉
context of the proof; we suppose M = k=02
mk .
Algorithm: DriftingHalfspaces
0. Let h̃0 be an arbitrary classifier in C
1. For i = 1, 2, . . .
2. h̃i ← ABL(M (i − 1), h̃i−1 )
Subroutine: ModPerceptron(t, h̃)
0. Let wt be any element of Rd with kwt k = 1
1. For m = t + 1, t + 2, . . . , t + m0
2. Choose ĥm = h̃ (i.e., predict Ŷm = h̃(Xm ) as the prediction for Ym )
3. Request the label Ym
4. If hwm−1 (Xm ) 6= Ym
5.
wm ← wm−1 − 2(wm−1 · Xm )Xm
6. Else wm ← wm−1
7. Return wt+m0
Subroutine: ABL(t, h̃)
0. Let w0 be the return value of ModPerceptron(t, h̃)
1. For k = 1, 2, . . . , ⌈log2 (1/α)⌉
2. Wk ← {}
Pk
Pk−1
3. For s = t + j=0 mj + 1, . . . , t + j=0 mj
4.
Choose ĥs = h̃ (i.e., predict Ŷs = h̃(Xs ) as the prediction for Ys )
5.
If |wk−1 · Xs | ≤ bk−1 , Request label Ys and let Wk ← Wk ∪ {(Xs , Ys )}
d
6. Find
Pkvk k ≤ 1, and
P vk ∈ R with kvk − wk−1 k ≤ rk , 0 <
ℓτk (y(v · x)) + κ|Wk |
ℓτk (y(vk · x)) ≤
inf
(x,y)∈Wk
1
kvk k vk
v:kv−wk−1 k≤rk (x,y)∈W
k
7. Let wk =
8. Return hw⌈log2 (1/α)⌉−1
Before stating the proof, we have a few additional lemmas that will be needed.
The following result for ModPerceptron was proven by [CMEDV10].
1
Lemma 2. Suppose ∆ < 512
. Consider the values wm obtained during the execution of ModPerceptron(t, h̃). ∀m ∈ {t + 1, . . . , t + m0 }, P(x : hwm (x) 6=
π2
h∗m (x)) ≤ P(x : hwm−1 (x) 6= h∗m (x)). Furthermore, letting c1 = d·400·2
15 , if
P(x : hwm−1 (x) 6= h∗m (x)) ≥ 1/32, then with probability at least 1/64, P(x :
hwm (x) 6= h∗m (x)) ≤ (1 − c1 )P(x : hwm−1 (x) 6= h∗m (x)).
This implies the following.
16
Steve Hanneke, Varun Kanade, and Liu Yang
Lemma 3. Suppose ∆ ≤
π2
400·227 (d+ln(4/δ)) .
For m0 = max{⌈128(1/c1) ln(32)⌉,
⌈512 ln( 4δ )⌉},
with probability at least 1 − δ/4, ModPerceptron(t, h̃) returns a
vector w with P(x : hw (x) 6= h∗t+m0 +1 (x)) ≤ 1/16.
Proof. By Lemma 2 and a union bound, in general we have
P(x : hwm (x) 6= h∗m+1 (x)) ≤ P(x : hwm−1 (x) 6= h∗m (x)) + ∆.
(6)
Furthermore, if P(x : hwm−1 (x) 6= h∗m (x)) ≥ 1/32, then wth probability at least
1/64,
P(x : hwm (x) 6= h∗m+1 (x)) ≤ (1 − c1 )P(x : hwm−1 (x) 6= h∗m (x)) + ∆.
(7)
In particular, this implies that the number N of values m ∈ {t + 1, . . . , t + m0 }
with either P(x : hwm−1 (x) 6= h∗m (x)) < 1/32 or P(x : hwm (x) 6= h∗m+1 (x)) ≤
(1 − c1 )P(x : hwm−1 (x) 6= h∗m (x)) + ∆ is lower-bounded by a Binomial(m, 1/64)
random variable. Thus, a Chernoff bound implies that with probability at least
1 − exp{−m0 /512} ≥ 1 − δ/4, we have N ≥ m0 /128. Suppose this happens.
Since ∆m0 ≤ 1/32, if any m ∈ {t + 1, . . . , t + m0 } has P(x : hwm−1 (x) 6=
h∗m (x)) < 1/32, then inductively applying (6) implies that P(x : hwt+m0 (x) 6=
h∗t+m0 +1 (x)) ≤ 1/32 + ∆m0 ≤ 1/16. On the other hand, if all m ∈ {t + 1, . . . , t +
m0 } have P(x : hwm−1 (x) 6= h∗m (x)) ≥ 1/32, then in particular we have N values
of m ∈ {t+1, . . . , t+m0 } satisfying (7). Combining this fact with (6) inductively,
we have that
P(x : hwt+m0 (x) 6= h∗t+m0 +1 (x)) ≤ (1 − c1 )N P(x : hwt (x) 6= h∗t+1 (x)) + ∆m0
1
1
+ ∆m0 ≤
.
≤ (1 − c1 )(1/c1 ) ln(32) P(x : hwt (x) 6= h∗t+1 (x)) + ∆m0 ≤
32
16
⊓
⊔
Next, we consider the execution of ABL(t, h̃), and let the sets Wk be as in
that execution. We will denote by w∗ the weight vector with kw∗ k = 1 such that
h∗t+m0 +1 = hw∗ . Also denote by M1 = M − m0 .
The proof relies on a few results proven in the work of [ABL13], which we summarize in the following lemmas. Although the results were proven in a slightly
different setting in that work (namely, agnostic learning under a fixed joint distribution), one can easily verify that their proofs remain valid in our present
context as well.
Lemma 4. [ABL13] Fix any k ∈ {1, . . . , ⌈log2 (1/α)⌉}.
q For a universal constant
√
c7 > 0, suppose bk−1 = c7 21−k / d, and let zk = rk2 /(d − 1) + b2k−1 . For a
universal constant c1 > 0, if kw∗ − wk−1 k ≤ rk ,




X
X
∗
∗
E



ℓτk (|w · x|)wk−1 , |Wk | − E
ℓτk (y(w · x))wk−1 , |Wk | (x,y)∈Wk
(x,y)∈Wk
p
z
k
≤ c1 |Wk | 2k ∆M1 .
τk
Learning with a Drifting Target Concept
17
Lemma 5. [BL13] For any c > 0, there is a constant c′ > 0 depending only on
c (i.e., not depending on d) such that, for any u, v ∈ Rd with kuk = kvk = 1,
letting σ = P(x : hu (x) 6= hv (x)), if σ < 1/2, then
′ σ
≤ cσ.
P x : hu (x) 6= hv (x) and |v · x| ≥ c √
d
The following is a well-known lemma concerning concentration around the
equator for the uniform distribution (see e.g., [DKM09,BBZ07,ABL13]); for instance, it easily follows from the formulas for the area in a spherical cap derived
by [Li11].
Lemma 6. For any constant C > 0, there are constants c2 , c3 > 0 depending
only on C √
(i.e., independent of d) such that, for any w ∈ Rd with kwk = 1,
∀γ ∈ [0, C/ d],
√
√
c2 γ d ≤ P (x : |w · x| ≤ γ) ≤ c3 γ d.
Based on this lemma, [ABL13] prove the following.
Lemma 7. [ABL13]
For X ∼ P, for any w ∈ Rd with kwk = 1, for any C > 0
√
and τ, b ∈ [0, C/ d], for c2 , c3 as in Lemma 6,
h
i c τ
3
.
E ℓτ (|w∗ · X|)|w · X| ≤ b ≤
c2 b
The following is a slightly stronger version of a result of [ABL13] (specifically,
the size of mk , and consequently the bound on |Wk |, are both improved by a
factor of d compared to the original result).
Lemma 8. Fix any δ ∈ (0, 1/e). For universal constants c4 , c5 , c6 , c7 , c8 , c9 , c10 ∈
(0,q
∞), for an appropriate choice of κ ∈ (0, 1) (a universal constant), if α =
√
1
c9 ∆d log κδ
, for every k ∈ {1, . . . , ⌈log2 (1/α)⌉}, if bk−1 = c7 21−k / d, τk =
m
l k
√
c8 2−k / d, rk = c10 2−k , δk = δ/(⌈log2 (4/α)⌉−k)2 , and mk = c5 κ2 2 d log κδ1k ,
−k−3
, then with probability at least 1 −
and if P(x : hwk−1 (x) 6= hw∗ (x))
≤ 2
(4/3)δk , |Wk | ≤ c6 κ12 d log
1
κδk
and P(x : hwk (x) 6= hw∗ (x)) ≤ 2−k−4 .
Proof. By Lemma 6, and a Chernoff and union bound, for an appropriately large
choice of c5 and any c7 > 0, letting c2 , c3 be as in Lemma 6 (with C = c7 ∨(c8 /2)),
with probability at least 1 − δk /3,
c2 c7 2−k mk ≤ |Wk | ≤ 4c3 c7 2−k mk .
The claimed upper bound on |Wk | follows from this second inequality.
Next note that, if P(x : hwk−1 (x) 6= hw∗ (x)) ≤ 2−k−3 , then
√
max{ℓτk (y(w∗ · x)) : x ∈ Rd , |wk−1 · x| ≤ bk−1 , y ∈ {−1, +1}} ≤ c11 d
(8)
18
Steve Hanneke, Varun Kanade, and Liu Yang
for some universal constant c11 > 0. Furthermore, since P(x : hwk−1 (x) 6=
hw∗ (x)) ≤ 2−k−3 , we know that the angle between wk−1 and w∗ is at most
2−k−3 π, so that
q
p
kwk−1 − w∗ k = 2 − 2wk−1 · w∗ ≤ 2 − 2 cos(2−k−3 π)
q
√
√
≤ 2 − 2 cos2 (2−k−3 π) = 2 sin(2−k−3 π) ≤ 2−k−3 π 2.
√
For c10 = π 22−3 , this is rk . By Hoeffding’s inequality (under the conditional
distribution given |Wk |), the law of total probability, Lemma 4, and linearity of
conditional expectations, with probability at least 1 − δk /3, for X ∼ P,
i
h
X
ℓτk (y(w∗ · x)) ≤ |Wk |E ℓτk (|w∗ · X|)wk−1 , |wk−1 · X| ≤ bk−1
(x,y)∈Wk
q
p
zk
+ c1 |Wk | 2k ∆M1 + |Wk |(1/2)c211 d ln(3/δk ). (9)
τk
We bound each term on the right hand side separately. By Lemma 7, the first
τk
c3 c8
. Next,
= |Wk | 2c
term is at most |Wk | c2cb3k−1
2 c7
zk
=
τk
p
p
c210 2−2k /(d − 1) + 4c27 2−2k /d
2c210 + 4c27
√
,
≤
−k
c8
c8 2 / d
while 2k ≤ 2/α so that the second term is at most
r
p
√
2c210 + 4c27
∆m
2c1
|Wk |
.
c8
α
Noting that
M1 =
⌈log2 (1/α)⌉
X
k′ =1
mk ′
32c5 1
≤ 2 d log
κ α
1
κδ
,
(10)
we find that the second term on the right hand side of (9) is at most
s
p
r
√ p
1
2
2
∆d log κδ
8c1 c5 2c210 + 4c27
c5 8c1 2c10 + 4c7
|Wk |
=
|Wk |.
c9 κ
c8
α2
κ
c8 c9
2
−k
mk , and (8) implies 2−k mk ≤
Finally, since d ln(3/δk ) ≤ 2d ln(1/δk ) ≤ 2κ
c5 2
1
c2 c7 |Wk |, the third term on the right hand side of (9) is at most
c11 κ
|Wk | √
.
c2 c5 c7
Altogether, we have
X
(x,y)∈Wk
ℓτk (y(w∗ · x)) ≤ |Wk |
!
√ p
8c1 c5 2c210 + 4c27
c3 c8
c11 κ
.
+
+√
2c2 c7
κ
c8 c9
c2 c5 c7
Learning with a Drifting Target Concept
19
Taking c9 = 1/κ3 and c8 = κ, this is at most
q
√
c11
c3
2
2
.
+ 8c1 c5 2c10 + 4c7 + √
κ|Wk |
2c2 c7
c2 c5 c7
Next, note that because hwk (x) 6= y ⇒ ℓτk (y(vk · x)) ≥ 1, and because (as
proven above) kw∗ − wk−1 k ≤ rk ,
X
X
ℓτk (y(w∗ · x)) + κ|Wk |.
ℓτk (y(vk · x)) ≤
|Wk |erWk (hwk ) ≤
(x,y)∈Wk
(x,y)∈Wk
Combined with the above, we have
q
√
c3
c11
|Wk |erWk (hwk ) ≤ κ|Wk | 1 +
.
+ 8c1 c5 2c210 + 4c27 + √
2c2 c7
c2 c5 c7
√ p
Let c12 = 1 + 2cc23c7 + 8c1 c5 2c210 + 4c27 + √cc211
c5 c7 . Furthermore,
|Wk |erWk (hwk ) =
X
(x,y)∈Wk
≥
1[hwk (x) 6= y]
X
(x,y)∈Wk
1[hwk (x) 6= hw∗ (x)] −
X
(x,y)∈Wk
1[hw∗ (x) 6= y].
For an appropriately large value of c5 , by a Chernoff bound, with probability at
least 1 − δk /3,
Pk
mj
Pk−1
mj +1
t+
s=t+
j=0
X
j=0
1[hw∗ (Xs ) 6= Ys ] ≤ 2e∆M1mk + log2 (3/δk ).
In particular, this implies
X
1[hw∗ (x) 6= y] ≤ 2e∆M1mk + log2 (3/δk ),
(x,y)∈Wk
so that
X
(x,y)∈Wk
1[hwk (x) 6= hw∗ (x)] ≤ |Wk |erWk (hwk ) + 2e∆M1mk + log2 (3/δk ).
Noting that (10) and (8) imply
1
d log κδ
32c5
q
∆M1 mk ≤ ∆ 2
κ c ∆d log
9
=
1
κδ
32c5
2k
c c |Wk | ≤ c c c κ2
2 7
2 7 9
s
∆d log
32c5
32c5 κ4 k
32c5 κ4
k
α2
|W
|
=
α2
|W
|
≤
|Wk |,
k
k
c2 c7 c29 κ2
c2 c7
c2 c7
1
2k |Wk |
κδ
20
Steve Hanneke, Varun Kanade, and Liu Yang
and (8) implies log2 (3/δk ) ≤
2κ2
c2 c5 c7 |Wk |,
altogether we have
64ec5 κ4
2κ2
|Wk | +
|Wk |
c2 c7
c2 c5 c7
(x,y)∈Wk
2κ
64ec5 κ3
.
+
≤ κ|Wk | c12 +
c2 c7
c2 c5 c7
P
2
5
Letting c13 = c12 + 64ec
(x,y)∈Wk 1[hwk (x) 6=
c2 c7 + c2 c5 c7 , and noting κ ≤ 1, we have
hw∗ (x)] ≤ c13 κ|Wk |.
Lemma 1 (applied under the conditional distribution given |Wk |) and the law
of total probability imply that with probability at least 1 − δk /3,
|Wk |P x : hwk (x) 6= hw∗ (x)|wk−1 · x| ≤ bk−1
X
p
1[hwk (x) 6= hw∗ (x)] + c14 |Wk |(d log(|Wk |/d) + log(1/δk )),
≤
X
1[hwk (x) 6= hw∗ (x)] ≤ |Wk |erWk (hwk ) +
(x,y)∈Wk
for a universal constant c14 > 0. Combined with the above, and the fact that
2
(8) implies log(1/δk ) ≤ c2κc5 c7 |Wk | and


8c3 c5 c7 log κδ1k

d log(|Wk |/d) ≤ d log 
κ2
1
8c3 c5 c7
≤
3
log(8
max{c
,
1}c
)c
d
log
≤ d log
3
5 5
κ3 δk
κδk
3
log(8
max{c
,
1})
3
≤ 3 log(8 max{c3 , 1})κ2 2−k mk ≤
κ2 |Wk |,
c2 c7
we have
|Wk |P x : hwk (x) 6= hw∗ (x)|wk−1 · x| ≤ bk−1
s
3 log(8 max{c3 , 1}) 2
κ2
≤ c13 κ|Wk | + c14 |Wk |
κ |Wk | +
|Wk |
c2 c7
c2 c5 c7


s
3
log(8
max{c
,
1})
1
3
.
= κ|Wk | c13 + c14
+
c2 c7
c2 c5 c7
Thus, letting c15
q
3 ,1})
+
= c13 + c14 3 log(8 max{c
c2 c7
1
c2 c5 c7
, we have
P x : hwk (x) 6= hw∗ (x)|wk−1 · x| ≤ bk−1 ≤ c15 κ.
(11)
Next, note that kvk − wk−1 k2 = kvk k2 + 1 − 2kvk k cos(πP(x : hwk (x) 6=
hwk−1 (x))). Thus, one implication of the fact that kvk −wk−1 k ≤ rk is that kv2k k +
Learning with a Drifting Target Concept
21
2
1−rk
2kvk k
≤ cos(πP(x : hwk (x) 6= hwk−1 (x))); since the left hand side is positive, we
have P(x : hwk (x) 6= hwk−1 (x)) < 1/2. Additionally, by differentiating, one
p
can easily verify that for φp∈ [0, π], x 7→ x2 + 1 − 2x cos(φ) is minimized at
x = cos(φ), in which case x2 + 1 − 2x cos(φ) = sin(φ). Thus, kvk − wk−1 k ≥
sin(πP(x : hwk (x) 6= hwk−1 (x))). Since kvk − wk−1 k ≤ rk , we have sin(πP(x :
hwk (x) 6= hwk−1 (x))) ≤ rk . Since sin(πx) ≥ x for all x ∈ [0, 1/2], combining
this with the fact (proven above) that P(x : hwk (x) 6= hwk−1 (x)) < 1/2 implies
P(x : hwk (x) 6= hwk−1 (x)) ≤ rk .
In particular, we have that both P(x : hwk (x) 6= hwk−1 (x)) ≤ rk and P(x :
hw∗ (x) 6= hwk−1 (x)) ≤ 2−k−3 ≤ rk . Now Lemma 5 implies that, for any universal
constant c > 0, there exists a corresponding universal constant c′ > 0 such that
rk
P x : hwk (x) 6= hwk−1 (x) and |wk−1 · x| ≥ c′ √
≤ crk
d
and
rk
P x : hw∗ (x) 6= hwk−1 (x) and |wk−1 · x| ≥ c′ √
≤ crk ,
d
so that (by a union bound)
rk
P x : hwk (x) 6= hw∗ (x) and |wk−1 · x| ≥ c′ √
d
rk
≤ P x : hwk (x) 6= hwk−1 (x) and |wk−1 · x| ≥ c′ √
d
rk
≤ 2crk .
+ P x : hw∗ (x) 6= hwk−1 (x) and |wk−1 · x| ≥ c′ √
d
rk
In particular, letting c7 = c′ c10 /2, we have c′ √
= bk−1 . Combining this with
d
(11), Lemma 6, and a union bound, we have that
P (x : hwk (x) 6= hw∗ (x))
≤ P (x : hwk (x) 6= hw∗ (x) and |wk−1 · x| ≥ bk−1 )
+ P (x : hwk (x) 6= hw∗ (x) and |wk−1 · x| ≤ bk−1 )
≤ 2crk + P x : hwk (x) 6= hw∗ (x)|wk−1 · x| ≤ bk−1 P (x : |wk−1 · x| ≤ bk−1 )
√
≤ 2crk + c15 κc3 bk−1 d = 25 cc10 + c15 κc3 c7 25 2−k−4 .
Taking c = 261c10 and κ = 26 c31c7 c15 , we have P(x : hwk (x) 6= hw∗ (x)) ≤ 2−k−4 ,
as required.
By a union bound, this occurs with probability at least 1 − (4/3)δk .
⊓
⊔
Proof (Proof of Theorem 3). We begin with the bound on the error rate. If ∆ >
p
400·227
π2
∆(d + ln(4/δ)).
400·227 (d+ln(4/δ)) , the result trivially holds, since then 1 ≤
π2
Otherwise, suppose ∆ ≤
π2
400·227 (d+ln(4/δ)) .
22
Steve Hanneke, Varun Kanade, and Liu Yang
Fix any i ∈ N. Lemma 3 implies that, with probability at least 1 − δ/4,
the w0 returned in Step 0 of ABL(M (i − 1), h̃i−1 ) satisfies P(x : hw0 (x) 6=
h∗M(i−1)+m0 +1 (x)) ≤ 1/16. Taking this as a base case, Lemma 8 then inductively
implies that, with probability at least
δ
1− −
4
⌈log2 (1/α)⌉
X
k=1
δ
δ
≥ 1−
(4/3)
2(⌈log2 (4/α)⌉ − k)2
2
∞
X
1
1 + (4/3)
ℓ2
ℓ=2
!
≥ 1−δ,
every k ∈ {0, 1, . . . , ⌈log2 (1/α)⌉} has
P(x : hwk (x) 6= h∗M(i−1)+m0 +1 (x)) ≤ 2−k−4 ,
(12)
and furthermore the number of labels requested during ABL(M (i − 1), h̃i−1 )
total to at most (for appropriate universal constants ĉ1 , ĉ2 )


⌈logX
⌈log2 (1/α)⌉
2 (1/α)⌉
2
X
1
(⌈log
(4/α)⌉
−
k)
2

m0 +
|Wk | ≤ ĉ1 d + ln
+
d log
δ
δ
k=1
k=1
1
1
log
.
≤ ĉ2 d log
∆d
δ
In particular, by a union bound, (12) implies that for every k ∈ {1, . . . , ⌈log2 (1/α)⌉},
every


k
k−1


X
X
mj
mj + 1, . . . , M (i − 1) +
m ∈ M (i − 1) +


j=0
j=0
has
P(x : hwk−1 (x) 6= h∗m (x))
≤ P(x : hwk−1 (x) 6= h∗M(i−1)+m0 +1 (x)) + P(x : h∗M(i−1)+m0 +1 (x) 6= h∗m (x))
≤ 2−k−3 + ∆M.
Thus, noting that


⌈logX
2 (1/α)⌉
1
⌈log2 (1/α)⌉ − k 
M=
mk = Θ d + log
+
2k d log
δ
δ
k=0
k=1
s
!
1
1
d
1
=Θ
,
d log
log
=Θ
α
δ
∆
δ
⌈log 2 (1/α)⌉
X
with probability at least 1 − δ,
P(x : hw⌈log2 (1/α)⌉−1 (x) 6=
h∗Mi (x))
≤ O (α + ∆M ) = O
s
!
1
.
∆d log
δ
Learning with a Drifting Target Concept
23
In particular, this implies that, with probability at least 1 − δ, every t ∈ {M i +
1, . . . , M (i + 1) − 1} has
ert (ĥt ) ≤ P(x : hw⌈log2 (1/α)⌉−1 (x) 6= h∗Mi (x)) + P(x : h∗Mi (x) 6= h∗t (x))
s
s
!
!
1
1
+ ∆M = O
,
≤O
∆d log
∆d log
δ
δ
which completes √
the proof of the bound on the error rate.
Setting δ = ∆d, and noting that 1[Ŷt 6= Yt ] ≤ 1, we have that for any
t > M,
s
s
!
!
h
i
1
1
+δ =O
.
∆d log
∆d log
P Ŷt 6= Yt = E ert (ĥt ) ≤ O
δ
∆d
Thus, by linearity of the expectation,
s
s
" T
#
!
!
i
X h
1
1
E
1 Ŷt 6= Yt ≤ M + O
∆d log
∆d log
T =O
T .
∆d
∆d
t=1
Furthermore, as mentioned, with probability at least 1 − δ, the number of labels
requested during the execution of ABL(M (i − 1), h̃i−1 ) is at most
1
1
log
.
O d log
∆d
δ
Thus, since the number of labels requested
√ during the execution of ABL(M (i −
1), h̃i−1 ) cannot exceed M , letting δ = ∆d, the expected number of requested
labels during this execution is at most
s !
√
1
1
1
2
2
O d log
+ ∆dM ≤ O d log
+ O d log
∆d
∆d
∆d
1
= O d log2
.
∆d
Thus, by linearity of the expectation, the expected number of labels requested
among the first T samples is at most
√
T
1
1
=O
T ,
∆d log3/2
O d log2
∆d
M
∆d
which completes the proof.
⊓
⊔
Remark: The original work of [CMEDV10] additionally allowed for some number
K of “jumps”: times t at which ∆t = 1. Note that, in the above algorithm, since
the influence of each sample is localized to the predictors trained within that
24
Steve Hanneke, Varun Kanade, and Liu Yang
“batch” of M instances, the effect of allowing
would
only change
such jumps
q
√
d
d∆T + ∆ K . This compares
the bound on the number of mistakes to Õ
d1/4
favorably to the result of [CMEDV10], which is roughly O (d∆)1/4 T + ∆
3/4 K .
However, the result of [CMEDV10] was proven for a more general setting, allowing distributions P that are not uniform (though they do require a relation
between the angle between any two separators and the probability mass they
disagree on, similar to that holding for the uniform distribution, which seems
to require that the distributions approximately retain some properties of the
uniform distribution). It is not clear whether Theorem 3 can be generalized to
this larger family of distributions.
6
General Results for Active Learning
As mentioned, the above results on linear separators also provide results for the
number of queries in active learning. One can also state quite general results on
the expected number of queries and mistakes achievable by an active learning
algorithm. This section provides such results, for an algorithm based on the
the well-known strategy of disagreement-based active learning. Throughout this
section, we suppose h∗ ∈ S∆ , for a given ∆ ∈ (0, 1]: that is, P(x : h∗t+1 (x) 6=
h∗t (x)) ≤ ∆ for all t ∈ N.
First, we introduce a few definitions. For any set H ⊆ C, define the region of
disagreement
DIS(H) = {x ∈ X : ∃h, g ∈ H s.t. h(x) 6= g(x)}.
The analysis in this section is centered around the following algorithm. The
Active subroutine is from the work of [Han12] (slightly modified here), and is a
variant of the A2 (Agnostic Acive) algorithm of [BBL06]; the appropriate values
of M and T̂k (·) will be discussed below.
Algorithm: DriftingActive
0. For i = 1, 2, . . .
1. Active(M (i − 1))
Subroutine: Active(t)
0. Let ĥ0 be an arbitrary element of C, and let V0 ← C
1. Predict Ŷt+1 = ĥ0 (Xt+1 ) as the prediction for the value of Yt+1
2. For k = 0, 1, . . . , log2 (M/2)
3. Qk ← {}
4. For s = 2k + 1, . . . , 2k+1
5.
Predict Ŷs = ĥk (Xs ) as the prediction for the value of Ys
6.
If Xs ∈ DIS(Vk )
7.
Request the label Ys and let Qk ← Qk ∪ {(Xs , Ys )}
P
8. Let ĥk+1 = argminh∈Vk (x,y)∈Qk 1[h(x) 6= y]
P
9. Let Vk+1 ← {h ∈ Vk : (x,y)∈Qk 1[h(x) 6= y] − 1[ĥk+1 (x) 6= y] ≤ T̂k }
Learning with a Drifting Target Concept
25
As in the DriftingHalfspaces algorithm above, this DriftingActive algorithm
proceeds in batches, and in each batch runs an active learning algorithm designed to be robust to classification noise. This robustness to classification noise
translates into our setting as tolerance for the fact that there is no classifier in
C that perfectly classifies all of the data. The specific algorithm employed here
maintains a set Vk ⊆ C of candidate classifiers, and requests the labels of samples
Xs for which there is some disagreement on the classification among classifiers in
Vk . We maintain the invariant that there is a low-error classifier contained in Vk
at all times, and thus the points we query provide some information to help us
determine which among these remaining candidates has low error rate. Based on
these queries, we periodically (in Step 9) remove from Vk those classifiers making
a relatively excessive number of mistakes on the queried samples, relative to the
minimum among classifiers in Vk . All predictions are made with an element of
Vk .7
We prove an abstract bound on the number of labels requested by this algorithm, expressed in terms of the disagreement coefficient [Han07], defined as
follows. For any r ≥ 0 and any classifier h, define B(h, r) = {g ∈ C : P(x :
g(x) 6= h(x)) ≤ r}. Then for r0 ≥ 0 and any classifier h, define the disagreement
coefficient of h with respect to C under P:
θh (r0 ) = sup
r>r0
P(DIS(B(h, r)))
.
r
Usually, the disagreement coefficient would be used with h equal the target concept; however, since the target concept is not fixed in our setting, we will make
use of the worst-case value of the disagreement coefficient: θC (r0 ) = suph∈C θh (r0 ).
This quantity has been bounded for a variety of spaces C and distributions P
(see e.g., [Han07,EYW12,BL13]). It is useful in bounding how quickly the region
DIS(Vk ) collapses in the algorithm. Thus, since the probability the algorithm
requests the label of the next instance is P(DIS(Vk )), the quantity θC (r0 ) naturally arises in characterizing the number of labels we expect this algorithm to
request. Specifically, we have the following result.8
Theorem 4. For an appropriate universal
constant c1 ∈ [1, ∞), if h∗ ∈ S∆ for
q √
d
, and T̂k = log2 (1/ d∆)+22k+2 e∆,
some ∆ ∈ (0, 1], then taking M = c1 ∆
2
√
and defining ǫ∆ = d∆Log(1/(d∆)), the above DriftingActive algorithm makes
an expected number of mistakes among the first T instances that is
√ O (ǫ∆ Log(d/∆)T ) = Õ
d∆ T
and requests an expected number of labels among the first T instances that is
√
√ O (θC (ǫ∆ )ǫ∆ Log(d/∆)T ) = Õ θC ( d∆) d∆ T.
7
8
One could alternatively proceed as in DriftingHalfspaces, using the final classifier
from the previous batch, which would also add a guarantee on the error rate achieved
at all sufficiently large t.
Here, we define ⌈x⌉2 = 2⌈log2 (x)⌉ , for x ≥ 1.
26
Steve Hanneke, Varun Kanade, and Liu Yang
The proof of Theorem 4 relies on an analysis of the behavior of the Active
subroutine, characterized in the following lemma.
Lemma 9. Fix any t ∈ N, and consider the values obtained in the execution
of Active(t). Under the conditions of Theorem 4, there is a universal constant
c2 ∈ [1, ∞)
√ such that, for any k ∈ {0, 1, . . . , log2 (M/2)}, with probability at
least 1 − 2 d∆, if h∗t+1 ∈ Vk , then h∗t+1 ∈ Vk+1 and suph∈Vk+1 P(x : h(x) 6=
√
h∗t+1 (x)) ≤ c2 2−k dLog(c1 / d∆).
Proof. By a Chernoff bound, with probability at least 1 −
k+1
2X
√
d∆,
√
s=2k +1
1[h∗t+1 (Xs ) 6= Ys ] ≤ log2 (1/ d∆) + 22k+2 e∆ = T̂k .
Therefore, if h∗t+1 ∈ Vk , then since every g ∈ Vk agrees with h∗t+1 on those points
Xs ∈
/ DIS(Vk ), in the update in Step 9 defining Vk+1 , we have
X
(x,y)∈Qk
=
1[h∗t+1 (x) 6= y] − 1[ĥk+1 (x) 6= y]
k+1
2X
1
[h∗t+1 (Xs )
s=2k +1
≤
k+1
2X
s=2k +1
6= Ys ] − min
g∈Vk
k+1
2X
s=2k +1
1[g(Xs ) 6= Ys ]
1[h∗t+1 (Xs ) 6= Ys ] ≤ T̂k ,
so that h∗t+1 ∈ Vk+1 as well.
Furthermore, if h∗t+1 ∈ Vk , then by the definition of Vk+1 , we know every
h ∈ Vk+1 has
k+1
2X
s=2k +1
1[h(Xs ) 6= Ys ] ≤ T̂k +
k+1
2X
s=2k +1
1[h∗t+1 (Xs ) 6= Ys ],
so that a triangle inequality implies
k+1
2X
s=2k +1
1[h(Xs ) 6=
h∗t+1 (Xs )]
≤
k+1
2X
s=2k +1
≤ T̂k + 2
1[h(Xs ) 6= Ys ] + 1[h∗t+1 (Xs ) 6= Ys ]
k+1
2X
s=2k +1
1[h∗t+1 (Xs ) 6= Ys ] ≤ 3T̂k .
Learning with a Drifting Target Concept
27
Lemma 1 then implies that, on an additional event of probability at least 1 −
√
d∆, every h ∈ Vk+1 has
P(x : h(x) 6= h∗t+1 (x))
q
√
≤ 2−k 3T̂k + c2−k 3T̂k (dLog(2k /d) + Log(1/ d∆))
√
+ c2−k (dLog(2k /d) + Log(1/ d∆))
q
√
√
√
≤ 2−k 3 log2 (1/ d∆) + 2k 12e∆ + c2−k 6 log2 (1/ d∆)dLog(c1 / d∆)
q
√
√
−k
+ c2
22k 24e∆dLog(c1 / d∆) + 2c2−k dLog(c1 / d∆)
√
√
√
√
≤ 2−k 3 log2 (1/ d∆) + 12ec1 d∆ + 3c2−k dLog(c1 / d∆)
q
√
√
√
+ 24ec d∆Log(c1 / d∆) + 2c2−k dLog(c1 / d∆),
where c is as in Lemma 1. Since
5 + 12ec21 + 3c +
√
d∆ ≤ 2c1 d/M ≤ c1 d2−k , this is at most
√
√
24ecc1 + 2c 2−k dLog(c1 / d∆).
√
Letting c2 = 5 + 12ec21 + 3c + 24ecc1 + 2c, we have the result by a union bound.
⊓
⊔
We are now ready for the proof of Theorem 4.
Proof (Proof of Theorem 4). Fix any i ∈ N, and consider running Active(M (i −
1)). Since h∗M(i−1)+1 ∈ C, by Lemma 9, a union bound, and induction, with
√
p
√
probability at least 1 − 2 d∆ log2 (M/2) ≥ 1 − 2 d∆ log2 (c1 d/∆), every k ∈
{0, 1, . . . , log2 (M/2)} has
√
sup P(x : h(x) 6= h∗M(i−1)+1 (x)) ≤ c2 21−k dLog(c1 / d∆).
(13)
h∈Vk
Thus, since ĥk ∈ Vk for each k, the expected number of mistakes among the
predictions ŶM(i−1)+1 , . . . , ŶMi is
28
Steve Hanneke, Varun Kanade, and Liu Yang
k+1
2X
log2 (M/2)
X
1+
k=0
s=2k +1
log2 (M/2)
≤1+
X
k=0
+
k=0
k+1
2X
s=2k +1
log2 (M/2)
X
P(ĥk (XM(i−1)+s ) 6= YM(i−1)+s )
k+1
2X
s=2k +1
P(h∗M(i−1)+1 (XM(i−1)+s ) 6= YM(i−1)+s )
P(ĥk (XM(i−1)+s ) 6= h∗M(i−1)+1 (XM(i−1)+s ))
log2 (M/2)
2
≤ 1 + ∆M +
X
k=0
√
√
2k c2 21−k dLog(c1 / d∆) + 2 d∆ log2 (M/2)
√
p
p
≤ 1 + 4c21 d + 2c2 dLog(c1 / d∆) log2 (2c1 d/∆) + 4c1 d log2 (c1 d/∆)
= O (dLog(d/∆)Log(1/(d∆))) .
Furthermore, (13) implies the algorithm only requests the label YM(i−1)+s for
√
s ∈ {2k +1, . . . , 2k+1 } if XM(i−1)+s ∈ DIS(B(h∗M(i−1)+1 , c2 21−k dLog(c1 / d∆))),
so that the expected number of labels requested among YM(i−1)+1 , . . . , YMi is at
most
log2 (M/2)
1+
X
k=0
√
2k E[P(DIS(B(h∗M(i−1)+1 , c2 21−k dLog(c1 / d∆))))]
p
√
+2 d∆ log2 (c1 d/∆)
p
√
√
≤ 1 + θC 4c2 dLog(c1 / d∆)/M 2c2 dLog(c2 / d∆) log2 (2c1 d/∆)
p
+ 4c1 d log2 (c1 d/∆)
√
= O θC
d∆Log(1/(d∆)) dLog(d/∆)Log(1/(d∆)) .
Thus, the expected number of mistakes among indices 1, . . . , T is at most
√
T
d∆Log(d/∆)Log(1/(d∆))T ,
=O
O dLog(d/∆)Log(1/(d∆))
M
and the expected number of labels requested among indices 1, . . . , T is at most
√
T
O θC
d∆Log(1/(d∆)) dLog(d/∆)Log(1/(d∆))
M
√
√
= O θC
d∆Log(1/(d∆))
d∆Log(d/∆)Log(1/(d∆))T .
⊓
⊔
Learning with a Drifting Target Concept
29
References
ABL13.
P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for
efficiently learning linear separators with noise. arXiv:1307.8371v2, 2013.
BBDK00. P. L. Bartlett, S. Ben-David, and S. R. Kulkarni. Learning changing
concepts by exploiting the structure of change. Machine Learning, 41:153–
174, 2000.
BBL06.
M. F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning.
In Proc. of the 23rd International Conference on Machine Learning, 2006.
BBZ07.
M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In
Proceedings of the 20th Conference on Learning Theory, 2007.
BH96.
P. L. Bartlett and D. P. Helmbold. Learning changing problems, 1996.
(unpublished).
BL96.
R. D. Barve and P. M. Long. On the complexity of learning from drifting
distributions. In Proceedings of the ninth annual conference on Computational learning theory, COLT ’96, pages 122–130, 1996.
BL97.
R. D. Barve and P. M. Long. On the complexity of learning from drifting
distributions. Inf. Comput., 138(2):170–193, 1997.
BL13.
M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In Proceedings of the 26th Conference
on Learning Theory, 2013.
BLM13.
S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities.
Oxford University Press, 2013.
CMEDV10. K. Crammer, Y. Mansour, E. Even-Dar, and J. Wortman Vaughan. Regret
minimization with concept drift. In COLT, pages 168–180, 2010.
DKM09.
S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based
active learning. Journal of Machine Learning Research, 10:281–299, 2009.
EYW12.
R. El-Yaniv and Y. Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13:255–279, 2012.
Han07.
S. Hanneke. A bound on the label complexity of agnostic active learning.
In Proceedings of the 24th International Conference on Machine Learning,
2007.
Han12.
S. Hanneke. Activized learning: Transforming passive to active with
improved label complexity. Journal of Machine Learning Research,
13(5):1469–1587, 2012.
HL91.
D. P. Helmbold and P. M. Long. Tracking drifting concepts using random
examples. In COLT, pages 13–23, 1991.
HL94.
D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing
disagreements. Machine Learning, 14(1):27–45, 1994.
HLW94.
D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0, 1}-functions
on randomly drawn points. Information and Computation, 115:248–292,
1994.
Li11.
S. Li. Concise formulas for the area and volume of a hyperspherical cap.
Asian Journal of Mathematics and Statistics, 4(1):66–70, 2011.
Lon99.
P. M. Long. The complexity of learning according to two models of a
drifting environment. Machine Learning, 37(3):337–354, 1999.
Vap82.
V. Vapnik. Estimation of Dependencies Based on Empirical Data.
Springer-Verlag, New York, 1982.
Vap98.
V. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., New
York, 1998.
30
VC71.
Vid03.
Steve Hanneke, Varun Kanade, and Liu Yang
V. Vapnik and A. Chervonenkis. On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and its
Applications, 16:264–280, 1971.
M. Vidyasagar. Learning and Generalization with Applications to Neural
Networks. Springer-Verlag, 2nd edition, 2003.