Active Learning

Jesús Cid-Sueiro
MLG, March 2013.
1
Warning
• The original presentation makes extensive use of Spanglish.
2
Logbook.
• Background:
◦ Some JMLR papers (Hanneke, 2012), (Dekel, 2012), (El-Yaniv, 2012)
• Starting work:
◦ Wikipedia :-)
◦ Googling “active learning” & “machine learning”
◦ active-learning.net
  ▪ Active Learning Tutorial (Dasgupta, 2009)
  ▪ Theory, Methods and Applications of Active Learning (Nowak & Castro, 2009)
◦ Two tutorials (Settles, 2009, 2011)
◦ Papers and sites.
3
Active Learning traces back to…
(Source: Nowak, Theory and Applications of Active Learning, http://videolectures.net/mlss09us_nowak_castro_tmaal/)
4
We humans are active learners
(Source: Nowak, Theory and Applications of Active Learning, http://videolectures.net/mlss09us_nowak_castro_tmaal/)
See also:
◦ R. Castro, C. Kalish, R. Nowak, R. Qian, T. Rogers, and X. Zhu. Human active learning, NIPS, 2008.
◦ http://www.youtube.com/watch?v=L5O3DYjZ_IE&feature=share
◦ http://www.youtube.com/watch?v=vJG698U2Mvo
◦ http://www.youtube.com/watch?feature=player_embedded&v=AqOEdzanMCE
5
Active learning
• The learner gathers training data adaptively or interactively, typically by requesting labels from an oracle.
• Goal: outperform a standard supervised method.
◦ Comparable performance with fewer labels…
6
Applications
• Those in which labeling is expensive, e.g., when each label involves an experiment or a clinical trial…
◦ Computational biology (Mohri, 2012).
◦ Drug design (Warmuth, 2001)
• … or when the volume of data is prohibitive for exhaustive manual labeling
◦ Understanding complex systems (Internet, Twitter…)
[Figure: active learning example. The slide text is truncated: "Goal: find compounds w…"; "unlabeled point ≡ …"; "label ≡ …"; "getting a label ≡ …"]
7
Applications
◦ Sensor networks
◦ Natural language processing (Olsson, 2009)
◦ Visual object detection (Abramson).
  ▪ E.g. pedestrian detection (Freund, 2003)
◦ Text annotation (Krithara, 2007)
8
Applications
• Image processing
◦ Application to laser ballistic imaging.
9
• Regression (example): image processing
10
Scenarios
• Query synthesis (Angluin, 1988):
◦ Sequentially synthesizes N observations whose labels are to be obtained (e.g. function interpolation).
  ▪ Estimating the position of a robotic hand from angular measurements of its arm (Cohn, 1996)
  ▪ Robot scientist (King, 2009): synthesizes biological experiments to discover metabolic pathways in yeast
  ▪ Problematic when the oracle is human (e.g. handwritten character recognition (Lang & Baum, 1992))
• Stream-based (Selective Sampling / Sequential AL):
◦ From a data stream, discards or selects sample by sample until a budget of N labels is spent.
  ▪ Part-of-speech tagging (Dagan, 1995), sensor scheduling (Krishnamurthy, 2002), learning ranking functions in information retrieval (Yu, 2005), word sense disambiguation (Fujii, 1998)
• Pool-based:
◦ From a (very large) set of K potential training samples, selects N (sequentially).
  ▪ This is the most widely used scenario: text classification, information extraction, image/video classification and retrieval, speech recognition, cancer diagnosis, …
11
Techniques
• There is a bit of everything (search queries and result counts):
◦ (boosting “active learning”, 11700 results)
  ▪ P. Melville, R.J. Mooney, Diverse ensembles for active learning, Machine Learning Int. Workshop, 2004.
  ▪ Cited by 143
◦ (kernel “active learning”, 7340 results)
  ▪ S. Tong, D. Koller, Support vector machine active learning with applications to text classification
◦ (“Gaussian process” “active learning”, 864 results)
  ▪ A. Krause, C. Guestrin, Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach, ICML, 2007.
◦ (“bayesian active learning”, 143 results)
  ▪ N. Houlsby, F. Huszár, Z. Ghahramani, M. Lengyel, Bayesian Active Learning for Classification and Preference Learning, arXiv:1112.5745
12
Taxonomy
• Passive Learning (PL)
• Active Learning (AL)
◦ By the way the oracle is used and accessed:
  ▪ Query synthesis
  ▪ Pool-based
  ▪ Stream-based
◦ By the type of function to be learned:
  ▪ Noise-free case (deterministic oracle)
  ▪ Noisy case (stochastic oracle)
◦ By the relationship between the function to be learned and the observation space:
  ▪ Realizable AL (the hypothesis class includes the correct one)
  ▪ Agnostic AL
13
FAQ.
• Does active learning work?
◦ Sometimes, yes.
14
AL can work
• A noise-free case:
◦ Estimate the transition point of a step function with precision ε.
• Passive learning:
◦ (Best case) Equally spaced samples: ~1/ε samples are needed.
• Active learning:
◦ Binary search: ~log(1/ε) samples are needed.
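The gap between ~1/ε and ~log(1/ε) labels can be checked with a small sketch. Everything here is an illustrative assumption rather than part of the slides: the hidden threshold value, the interval [0, 1], and the function names `oracle`, `passive_estimate` and `active_estimate`.

```python
import math

HIDDEN_THRESHOLD = 0.6180339887  # assumed value, unknown to the learner


def oracle(x):
    """Noise-free oracle: label 1 at or to the right of the hidden threshold."""
    return int(x >= HIDDEN_THRESHOLD)


def passive_estimate(eps):
    """Label an equally spaced grid of spacing <= eps; return (estimate, labels used)."""
    n = math.ceil(1 / eps)
    labels = [(i / n, oracle(i / n)) for i in range(n + 1)]
    # The threshold lies between the last point labeled 0 and the first labeled 1.
    first_one = next(x for x, y in labels if y == 1)
    return first_one - 0.5 / n, len(labels)


def active_estimate(eps):
    """Binary search on [0, 1]; return (estimate, labels used)."""
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if oracle(mid):
            hi = mid   # threshold is at or left of mid
        else:
            lo = mid   # threshold is right of mid
    return (lo + hi) / 2, queries


if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3):
        print(eps, passive_estimate(eps)[1], active_estimate(eps)[1])
```

Both estimates end up within ε of the threshold; only the number of labels requested differs.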
15
AL can work
• The advantages of AL in this case persist even if…
◦ …the scenario is stream-based, pool-based, or query-synthesis-based,
◦ …p(x) is not uniform,
◦ …the oracle is adversarial (and keeps changing the threshold s.p.j.).
16
CAL algorithm (Generic Mellow Learner) (Cohn, 1994)
• A sample-by-sample algorithm for separable data that is streaming in (stream-based, realizable case).
• Algorithm:
H1 = hypothesis class
Repeat for t = 1, 2, . . .
    Receive unlabeled point xt
    If there is any disagreement within Ht about xt's label:
        query label yt and set Ht+1 = {h ∈ Ht : h(xt) = yt}
    else
        Ht+1 = Ht
• Ht = current candidate hypotheses. Is a label needed? Only when xt falls in the region of uncertainty.
• Problems: (1) intractable to maintain Ht; (2) nonseparable data.
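For one-dimensional threshold classifiers the version space Ht is just an interval, so CAL can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the slides' implementation: the hypotheses are h_t(x) = 1 if x ≥ t else 0, the true threshold is assumed to lie in (0, 1], and `cal_threshold` is a hypothetical name.

```python
import random


def cal_threshold(stream, oracle):
    """CAL for thresholds h_t(x) = int(x >= t) on a stream of points in [0, 1].

    The version space is the interval of thresholds lo < t <= hi that agree
    with every label requested so far.
    """
    lo, hi = 0.0, 1.0
    queries = 0
    for x in stream:
        if lo < x < hi:
            # Some consistent thresholds predict 0 on x and others predict 1:
            # x is in the region of uncertainty, so its label is requested.
            queries += 1
            if oracle(x) == 1:
                hi = x   # keep only thresholds t <= x
            else:
                lo = x   # keep only thresholds t > x
        # Otherwise all remaining hypotheses agree on x and no label is needed.
    return lo, hi, queries


if __name__ == "__main__":
    rng = random.Random(0)
    stream = [rng.random() for _ in range(10000)]
    lo, hi, queries = cal_threshold(stream, lambda x: int(x >= 0.3))
    print(lo, hi, queries)
```

Most of the stream is never labeled: a point triggers a query only if it falls inside the shrinking interval (lo, hi).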
17
CAL algorithm (Cohn, 1994)
• Some important concepts:
◦ Concept class / hypothesis class (H1)
  ▪ Set of classifiers (hypotheses) that the AL algorithm can explore.
◦ Version space (Ht)
  ▪ Set of classifiers consistent with the labels seen so far.
◦ Disagreement set
  ▪ Set of samples on which the classifiers in the version space disagree.
18
CAL algorithm (Cohn, 1994)
• If computing Ht is intractable, it can be done implicitly.
S = {} (labeled samples)
Repeat
    Receive xt
    If train(S ∪ (xt, 1)) and train(S ∪ (xt, 0)) both return an answer:
        query yt
    otherwise
        set yt equal to the label for which an answer was returned
    Add (xt, yt) to S
19
CAL algorithm (Generic Mellow Learner) (Cohn, 1994)
• Label complexity (Hanneke, 2007):
◦ Noise-free case (dimension d):
  ▪ Passive learning: d/ε
  ▪ CAL: θ d log(1/ε)
◦ Noisy (but realizable) case:
  ▪ Passive learning: d/ε
  ▪ CAL (modified): θ (d log²(1/ε) + d ν²/ε²)
20
Generalized binary search
(Splitting Algorithm (SA)) (~1970)
• It predates CAL, but it is better…
• Algorithm:
H1 = hypothesis class
Repeat
    Select the xt that maximizes the disagreement within Ht about the label of xt
    Query yt
    Ht+1 = {h ∈ Ht : h(xt) = yt}
    If there are no disputed samples, Ht+1 = Ht
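The splitting step can be illustrated with a finite hypothesis class and a finite pool. The sketch below measures "disagreement" as how evenly a point splits the version space; that scoring rule, the function name `generalized_binary_search`, and the toy class of 17 thresholds are assumptions made for illustration.

```python
def generalized_binary_search(hypotheses, pool, oracle):
    """Repeatedly query the pool point that splits the version space most evenly."""
    version_space = list(hypotheses)
    queries = 0
    while True:
        best_x, best_split = None, 0
        for x in pool:
            ones = sum(h(x) for h in version_space)
            # Size of the smaller side of the split induced by x.
            split = min(ones, len(version_space) - ones)
            if split > best_split:
                best_x, best_split = x, split
        if best_x is None:
            # No disputed samples: every remaining hypothesis agrees on the pool.
            return version_space, queries
        y = oracle(best_x)
        queries += 1
        version_space = [h for h in version_space if h(best_x) == y]


# Toy setup (assumed): thresholds k/16 and pool points halfway between them.
thresholds = [k / 16 for k in range(17)]
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]
pool = [(i + 0.5) / 16 for i in range(16)]

if __name__ == "__main__":
    vs, n_queries = generalized_binary_search(
        hypotheses, pool, lambda x: int(x >= 5 / 16))
    print(len(vs), n_queries)
```

Unlike CAL, which queries any point in the disagreement region, this variant picks the most informative one, so each answer discards roughly half of the remaining hypotheses.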
21
Extension to noisy cases
• Stochastic version space
◦ (a version space Ht is maintained with all the hypotheses whose errors can be “explained” by the noise)
• Repeated querying
◦ (takes several labels for uncertain samples to increase the confidence in the class assignment).
• Hypothesis weighting
◦ (weights each hypothesis according to its predictive ability).
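Of the three ideas above, repeated querying is the simplest to sketch: ask the noisy oracle several times and take a majority vote. The noise model (labels flipped independently with probability 0.2), the number of repetitions and the function names are all assumptions for illustration.

```python
import random


def noisy_oracle(x, rng, threshold=0.5, flip_prob=0.2):
    """Stochastic oracle: the true label int(x >= threshold), flipped with prob. flip_prob."""
    y = int(x >= threshold)
    return 1 - y if rng.random() < flip_prob else y


def repeated_query(x, rng, k=15):
    """Query the same point k times (k odd) and return the majority label."""
    votes = sum(noisy_oracle(x, rng) for _ in range(k))
    return int(2 * votes > k)


if __name__ == "__main__":
    rng = random.Random(0)
    single = sum(noisy_oracle(0.9, rng) for _ in range(200))
    voted = sum(repeated_query(0.9, rng) for _ in range(200))
    print(single, voted)  # counts of correct labels out of 200
```

The price is k labels per point, so this only pays off when the extra confidence is worth more than labeling new points.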
22
QBC
(Query By Committee) (Freund, 1997)
• It is (only in a certain sense) a Bayesian version of CAL.
• Like SA, it also quantifies the degree of disagreement about a sample.
• There is a kernelized version, KQBC (Gilad-Bachrach, 2005), with Matlab code available at http://www.cs.huji.ac.il/labs/learning/code/qbc.
23
QBC
(Query By Committee) (Freund, 1997)
• Example: KQBC for gender classification in face images
24
FAQ.
• Does active learning work?
• Does active learning always work?
◦ Put that way, no.
25
Sampling bias (Dasgupta, 2009)
• A typical heuristic:
Repeat
    Fit a classifier to the labels seen so far
    Query the unlabeled point that is closest to the boundary (or most uncertain, or most likely to decrease overall uncertainty, ...)
• CAL in an agnostic scenario. Example: data in four groups with probability mass 45%, 5%, 5%, 45%.
◦ Even with infinitely many samples, CAL can converge to a classifier with 5% error instead of the best achievable error, 2.5%.
  ⇒ There is no consistency.
  ⇐ Labeling biases the data distribution.
◦ The problem arises in practical scenarios (Schutze et al., 2003)
26
FAQ.
• Does active learning work?
• Does active learning always work?
• Does active learning always work when done properly?
◦ Asymptotically, yes... or, at least, it does no harm.
27
Passive vs active
• CAL algorithm, agnostic learning
◦ If h* is not in H, can performance at least equal to that of passive learning be guaranteed?
◦ In theory, asymptotically, yes:
  ▪ Split the sample budget into three equal parts:
    - 1/3 for active learning → hA.
    - 1/3 for passive learning → hP.
    - 1/3, drawn from the region of disagreement between hA and hP. Keep the better hypothesis.
◦ In this way, an active learning algorithm is never catastrophic… even if it is bad.
28
Algorithm A2 (Active Agnostic): the agnostic case
◦ Problem:
  ▪ Find the optimal threshold function on the [0, 1] interval in a noisy domain.
◦ Steps:
◦ 1. Sampling.
  ▪ Label samples at random.
◦ 2. Bounding.
  ▪ Compute upper and lower bounds on the error rate of each hypothesis.
◦ 3. Elimination.
  ▪ Skip the labeling of samples in the eliminated region.
Theorem: For all H, for all D, for all numbers of samples m,
Pr(|true error rate − empirical error rate| ≤ f(H, δ, m)) ≥ 1 − δ
[Figure: "The A2 Algorithm in Action". Panels: Sampling, Bounding, Eliminating. Axes: Threshold/Input Feature (0 to 1) vs. Error Rate (0 to 0.5). Curves: Upper Bound, Lower Bound. Labeled samples are shown as 0s and 1s.]
29
A2 (error ε, classifiers H)
• Steps:
Agnostic Active (error rate ε, classifiers H)
while Done(H, D) > ε:
    S = ∅, H' = H
    while Disagree(H', D) ≥ (1/2) Disagree(H, D):
        if Done(H', D) < ε: return h ∈ H'
        S' = 2|S| + 1 unlabeled x which H disagrees on
        S = {(x, Label(x)) : x ∈ S'}
        H' ← {h ∈ H : LB(S, h) ≤ min_{h' ∈ H} UB(S, h')}
    H ← H'
return h ∈ H
30
A2 (error ε, classifiers H)
• In theory, A2 works well
Theorem: There exists an algorithm Agnostic Active that:
1. (Correctness) For all H, D, ε, with probability 0.99 returns an ε-optimal c.
2. (Fall-Back) For all H, D the number of labeled samples required is O(Batch).
3. (Structured Low noise) For all H, D with disagreement coefficient θ and d = VC(H) with ν < ε, Õ(θ² d ln²(1/ε)) labeled examples suffice.
4. (Structured High noise) For all H, D with disagreement coefficient θ and d = VC(H) with ν > ε, Õ(θ² (ν²/ε²) d) labeled examples suffice.
31
Problems of A2 (What’s wrong with A2?)
1. Unlabeled complexity: You need infinite unlabeled data to measure Disagree(C, D) to infinite precision.
2. Computation: You need to enumerate and check hypotheses, which is exponentially slower and takes exponentially more space than common learning algorithms.
3. Label complexity: Can’t get logarithmic label complexity for ε < ν.
4. Label complexity: Throwing away examples from previous iterations can’t be optimal, right?
5. Label complexity: Bounds are often way too loose.
6. Generality: We care about more than 0/1 loss.
32
Bayesian Active Learning
• CAL and QBC are, in a certain sense, Bayesian.
• Bayesian methods naturally provide an estimator and a measure of the uncertainty of the estimator.
• Initial idea:
◦ Passive learning: select the data D that minimize the posterior uncertainty (entropy) (NP-hard)
◦ Active learning (myopic): at each step, choose the sample that minimizes the posterior uncertainty.
• Two examples:
◦ Classification (Houlsby, 2011)
◦ Regression (Guestrin, 2005) (Krause, 2008)
33
Bayesian AL
• Near-Optimal Sensor Placement (Guestrin, 2005) (Krause, 2008)
◦ They apply GPs for regression
34
Bayesian AL
• BAL for Classification (Houlsby, 2011)
35
FAQ.
• Does active learning work?
• Does active learning always work?
• Does active learning always work when done properly?
◦ Asymptotically, yes... or, at least, it does no harm.
• So, when does active learning work well?
36
Some theoretical results
• Realizable AL:
◦ Active learning can exponentially reduce the need for labeled data (e.g. threshold estimation).
◦ The exponential gain is a minimax bound in many problems (Dasgupta, 2005), i.e. it persists under the most unfavorable circumstances (for a given distribution p(x) and a given hypothesis space).
◦ Negative results: there exist target hypotheses and distributions for which no verifiable AL algorithm (one that adaptively determines how many samples it needs) can have a minimax label complexity better than that of passive learning (Balcan, 2010)…
◦ … but, for non-verifiable algorithms: for any hypothesis space with finite VC dimension and any fixed data distribution, for every given passive algorithm there exists an AL algorithm with asymptotically better label complexity (Balcan, 2010)
◦ Every passive learning algorithm can be activized (Hanneke, 2012)
37
Some theoretical results
• Agnostic AL:
◦ In some cases, A2 exponentially reduces the need for labeling (Balcan, 2009)
◦ Many error bounds that improve on passive learning by constant factors…
◦ Without imposing conditions on the noise, improvements beyond multiplicative factors are not possible (Kääriäinen, 2006)
◦ Under certain conditions on the noise, exponential improvements can be obtained (Castro, 2008, and others)
◦ Many unsolved problems.
38
Elements of active learning theory
• Excess risk:
◦ (G is the region in which 1 is decided; G* is the G of the MAP classifier).
• Expected excess risk:
• Sample complexity:
39
Elements of active learning theory
• Sample noise
40
Elements of active learning theory
• Label noise
41
Dealing with noise.
• Bounded case.
42
Disagreement coefficient [Hanneke]
• It is a key concept in the analysis of active learning algorithms.
• Basics:
◦ Let P be the underlying probability distribution on the input space X. Every input distribution P(x) induces a (pseudo-)metric between hypotheses (classifiers):
  ▪ d(h, h') = P{h(X) ≠ h'(X)}.
  ▪ Corresponding notion of ball: B(h, r) = {h' ∈ H : d(h, h') < r}.
◦ Disagreement region of a set of candidate hypotheses V ⊆ H:
  ▪ DIS(V) = {x : ∃h, h' ∈ V such that h(x) ≠ h'(x)}.
◦ Disagreement coefficient for a target hypothesis h* ∈ H:
  ▪ θ = sup_r P[DIS(B(h*, r))] / r.
[Figure: target hypothesis h* and a hypothesis h, with d(h*, h) = P[shaded region]; some elements of B(h*, r); the region DIS(B(h*, r)).]
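As a worked example of these definitions (not taken from the slides), consider threshold classifiers on [0, 1] with P uniform and a target threshold assumed to satisfy 0 < t* < 1. The disagreement coefficient then equals 2, which is consistent with the logarithmic label complexity seen earlier for threshold estimation:

```latex
\begin{align*}
h_t(x) &= \mathbb{1}[x \ge t], \qquad
d(h_t, h_{t'}) = P[h_t(X) \ne h_{t'}(X)] = |t - t'|, \\
B(h_{t^*}, r) &= \{\, h_t : |t - t^*| < r \,\}, \\
\mathrm{DIS}\big(B(h_{t^*}, r)\big) &= (t^* - r,\; t^* + r) \cap [0, 1], \\
\theta &= \sup_{r > 0} \frac{P\big[\mathrm{DIS}(B(h_{t^*}, r))\big]}{r}
        = \sup_{r > 0} \frac{\min(t^* + r, 1) - \max(t^* - r, 0)}{r} = 2 .
\end{align*}
```

The supremum is attained for every r small enough that the interval (t* − r, t* + r) stays inside [0, 1]; for larger r the ratio only decreases.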
43
Active Learning strategies
(Settles, 2010).
• Algorithms for deciding which data should be labeled:
◦ Uncertainty sampling:
  ▪ Label the points for which the current model is least certain about the correct label.
◦ Query by committee
  ▪ Several models are trained on the available labels and vote on the output for the unlabeled data. The most disputed samples are labeled.
◦ Expected model change
  ▪ Label the points that could change the current model the most.
◦ Expected error reduction
  ▪ Label the points that could reduce the generalization error the most.
◦ Variance reduction
  ▪ Label the points that could minimize the output variance.
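The first two strategies reduce to scoring every unlabeled point and querying the top-scoring one. The sketch below uses two common scores, least confidence for uncertainty sampling and vote entropy for query by committee; the function names and the toy numbers are assumptions, and the class probabilities and committee votes are taken as given rather than produced by a trained model.

```python
import math


def least_confidence(probs):
    """Uncertainty sampling score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def vote_entropy(votes, n_classes):
    """Query-by-committee score: entropy of the committee's vote distribution."""
    total = len(votes)
    entropy = 0.0
    for c in range(n_classes):
        frac = votes.count(c) / total
        if frac > 0:
            entropy -= frac * math.log(frac)
    return entropy


def select_query(scores):
    """Index of the pool item with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy pool of three unlabeled points (assumed values).
model_probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]       # P(class | x) per point
committee_votes = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 1, 1, 0]]  # one vote per member

if __name__ == "__main__":
    print(select_query([least_confidence(p) for p in model_probs]))
    print(select_query([vote_entropy(v, 2) for v in committee_votes]))
```

The other three strategies follow the same pattern with a more expensive score, since they require retraining or approximating the effect of each candidate label.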
44
Problems related to AL
• Problems that are AL
◦ Optimal experimental design
  ▪ The analogue of AL for regression.
◦ Selective sampling
  ▪ Sample-by-sample active learning
• Problems related to AL
◦ Dataset shift
  ▪ AL algorithms distort the input distribution, which therefore differs from the test distribution.
◦ Semi-supervised learning
  ▪ It also works with many unlabeled data.
45
Active vs Semi-Supervised Learning
46
AL vs Semisupervised Learning.
Case I: Exploiting cluster structure in data
• Exploiting the cluster structure in the data:
◦ If the data form clusters, perhaps it would be enough to request a single label per class…
◦ Suppose the unlabeled data looks like this. Then perhaps we just need five labels!
• A cluster-based active learning scheme (Zhu, 2003) [ZGL 03]:
(1) Build neighborhood graph
(2) Query some random points
(3) Propagate labels
(4) Make query and go to (3)
[Figure: neighborhood graph over the unlabeled data; after propagation each node carries a label value between 0 and 1.]
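Steps (1) to (4) can be sketched on a tiny graph. The propagation below is a plain iterative averaging that approximates the harmonic solution; the query rule (pick the node whose value is closest to 0.5) is a simplification assumed here for illustration and is not necessarily the criterion used in (Zhu, 2003). The graph and all names are hypothetical.

```python
def propagate_labels(neighbors, labeled, n_iter=1000):
    """Step (3): each unlabeled node repeatedly takes the mean of its neighbors."""
    values = {v: labeled.get(v, 0.5) for v in neighbors}
    for _ in range(n_iter):
        for v in neighbors:
            if v not in labeled:
                values[v] = sum(values[u] for u in neighbors[v]) / len(neighbors[v])
    return values


def most_uncertain(values, labeled):
    """Step (4), simplified: the unlabeled node whose value is closest to 0.5."""
    candidates = [v for v in values if v not in labeled]
    return min(candidates, key=lambda v: abs(values[v] - 0.5))


# Step (1): a path graph 0 - 1 - 2 - 3 - 4 standing in for the neighborhood graph.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
# Step (2): two queried points with their labels.
labeled = {0: 0.0, 4: 1.0}

if __name__ == "__main__":
    values = propagate_labels(neighbors, labeled)
    print(values)
    print(most_uncertain(values, labeled))
```

After the chosen node is labeled it is added to `labeled` and propagation is run again, which is the loop between steps (3) and (4).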
47
AL vs Semisupervised Learning
• Hierarchical sampling from hierarchical clusters (Dasgupta, 2008)
Using a hierarchical clustering. Rules:
◦ Always work with some pruning of the hierarchy: a clustering induced by the tree. Pick a cluster (intelligently) and query a random point in it.
◦ For each tree node (i.e. cluster) v maintain: (i) majority label L(v); (ii) empirical label frequencies p̂_{v,l}; and (iii) confidence interval [p^lb_{v,l}, p^ub_{v,l}].
Algorithm: hierarchical sampling
Input: Hierarchical clustering T
For each node v maintain: (i) majority label L(v); (ii) empirical label frequencies p̂_{v,l}; and (iii) confidence interval [p^lb_{v,l}, p^ub_{v,l}]
Initialize: pruning P = {root}, labeling L(root) = ℓ0
for t = 1, 2, 3, . . . :
    v = select-node(P)
    pick a random point z in subtree Tv and query its label
    update empirical counts for all nodes along path from z to v
    choose best pruning and labeling (P', L') of Tv
    P = (P \ {v}) ∪ P' and L(u) = L'(u) for all u in P'
for each v in P: assign each leaf in Tv the label L(v)
return the resulting fully labeled data set
v = select-node(P) ≡
    Prob[v] ∝ |Tv|                      (random sampling)
    Prob[v] ∝ |Tv| (1 − p^lb_{v,l})     (active sampling)
48
Variants of AL
• Active Feature Acquisition and Classification (Zheng, 2002)
• Active Class Selection (Lomasky, 2007)
• Active Clustering
• Learning from multiple teachers:
◦ The algorithm can decide who provides the label
• Learning from crowds / multiple annotators
◦ The annotators are not equally reliable. There may be malicious annotators.
• AL for structured outputs (Settles, 2008)
49
• AL in batch mode
• Variable labeling costs
50
• Multi-instance active learning
• Multi-task active learning
• AL with stopping rules
• Submodular optimization
• Equivalence query learning
• AL from partial labels (TBD)
51
Label savings with “real” data
• (Dasgupta, 2009)
52
Resources
• Sites:
◦ active-learning.net
• Software:
◦ DUALIST
  ▪ Active learning tool for text processing soliciting feedback on both instances and features, with a web-based user interface, in Java
◦ Vowpal Wabbit
  ▪ C++ library focused on large-scale and online machine learning, which includes selective sampling algorithms
◦ Curious Snake
  ▪ Small active learning library for Python
◦ Matlab code for KQBC (Gilad-Bachrach, 2005)
  ▪ http://www.cs.huji.ac.il/labs/learning/code/qbc.
53
References.
• Tutorials, reviews:
◦ S. Dasgupta, J. Langford, Active Learning Tutorial (slides), ICML, 2009.
◦ B. Settles, Active Learning Literature Survey, Computer Sciences TR 1648, Univ. Wisconsin-Madison, 2009.
◦ B. Settles, From theories to queries: Active learning in practice, Active Learning and Experimental Design, 2011.
• Algorithms:
◦ D. Cohn, L. Atlas, R. Ladner, Improving generalization with active learning, Machine Learning, 1994.
◦ S. Dasgupta, D. Hsu, Hierarchical sampling for active learning, ICML, 2008.
◦ Y. Freund, H.S. Seung, E. Shamir, N. Tishby, Selective Sampling Using the Query By Committee Algorithm, Machine Learning, 1997.
◦ S. Hanneke, A Bound on the Label Complexity of Agnostic Active Learning, ICML, 2007.
◦ S. Hanneke, Activized Learning: Transforming Passive to Active with Improved Label Complexity, JMLR, 2012.
◦ S. Minsker, A plug-in approach to active learning, JMLR, 2012.
◦ X. Zhu, J. Lafferty, Z. Ghahramani, Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions, ICML 2003 workshop.
54
References.
• Applications:
◦ Y. Abramson, Y. Freund, Active Learning for Visual Object Recognition, UCSD Tech Report.
◦ C. Guestrin, A. Krause, A.P. Singh, Near-Optimal Sensor Placements in Gaussian Processes, ICML, 2005.
◦ F. Olsson, A Literature Survey of Active Machine Learning in the Context of Natural Language Processing, SICS Technical Report T2009:06, 2009.
◦ N. Rubens, D. Kaplan, M. Sugiyama, Recommender Systems Handbook: Active Learning in Recommender Systems, Springer, 201.
◦ M.K. Warmuth, G. Ratsch, M. Mathieson, J. Liao, C. Lemmon, Active Learning in the Drug Discovery Process, NIPS, 2001.
◦ V. Krishnamurthy, Algorithms for optimal scheduling and management of hidden Markov model sensors, IEEE TSP, 2002.
◦ A. Krithara, Active, Semi-Supervised Learning for Textual Information Access, IAS, 2006.
• Others:
◦ R. Castro, C. Kalish, R. Nowak, R. Qian, T. Rogers, X. Zhu, Human active learning, NIPS, 2008.
55
Trends
1. Active learning machine learning
2. Probabilistic programming
3. Deep learning
4. Big data
56
Doing research on/with AL
• AL theory
◦ There are many open problems, but this is not what we work on.
• New algorithms
◦ New algorithms for AL in classification
◦ AL and sensor networks
◦ AL and partial labels
• Using AL
◦ It could make labeling easier for us, provided that the classification scheme is clear to us a priori…
57
