Archit Gupta
IIT Delhi
Rahul Agrawal
Microsoft AdCenter
Manik Varma
Microsoft Research India
Yashoteja Prabhu
Microsoft Research India
Multi-Label Learning with
Millions of Labels for
Query Recommendation
Recommending Advertiser Bid Phrases
geico auto insurance
geico car insurance
geico insurance
www geico com
care geicos
geico com
need cheap auto insurance
wisconsin cheap car
insurance quotes
cheap auto insurance
florida
all state car insurance
coupon code
Query Rewriting
geico auto insurance
geico car insurance
geico insurance
www geico com
care geicos
geico com
need cheap auto insurance
wisconsin cheap car
insurance quotes
cheap auto insurance
florida
all state car insurance
coupon code
Absolutely
cheapest car
insurance”
Ranking & Relevance Meta Stream
geico auto insurance
geico car insurance
geico insurance
www geico com
care geicos
geico com
need cheap auto insurance
wisconsin cheap car
insurance quotes
cheap auto insurance
florida
geico twitter
Recommending Advertiser Bid Phrases
geico auto insurance
geico car insurance
geico insurance
www geico com
care geicos
geico com
need cheap auto insurance
wisconsin cheap car
insurance quotes
cheap auto insurance
florida
all state car insurance
coupon code
Learning to Predict a Set of Queries
X: Ads
Y: Queries
f : X 2
Y
car insurance
italian
restaurant
iphone
geico
online
quote
need cheap
auto insurance
f ( )
Learning to Predict a Set of Queries
need cheap
auto insurance
geico
car insurance
Infinite number of labels (queries)
Training data acquisition
Efficient training
Cost of prediction
f ( )
Multi-Label Learning Challenges
need cheap
auto insurance
geico
car insurance
Infinite number of labels (queries)
Training data acquisition
Efficient training
Cost of prediction
h( , geico)
Binary Classification & Ranking
h : (X, Y) { , }
h( , iphone)
Infinite number of labels (queries)
Training data acquisition
Efficient training
Cost of prediction
Binary Classification
car insurance
italian
restaurant
iphone
geico
online
quote
need cheap
auto insurance
h : (X, Y) { , }
Infinite number of labels (queries)
Training data acquisition
Efficient training
Cost of prediction
Binary Classification KEX
h : (X, Y) { , }
car insurance
geico
online
quote
switching
to geico
Query Recommendations by KEX
Query Recommendations by KEX
h( , car insurance) ?
h( , iphone) ?
Query Recommendations by KEX
plastic ponies
simone
plastics
clothing and
accessories
sylvia
pony clothing
couture
playground
plastic recycling
children's clothing
Multi-Label Learning Formulation
X: Ads
Y: Queries
car insurance
italian
restaurant
iphone
geico
online
quote
need cheap
auto insurance
f : X 2
Y
Learning with Millions of Labels
X: Ads
Y: 10 Million Queries
car insurance
italian
restaurant
iphone
geico
online
quote
need cheap
auto insurance
f : X 2
Y
We develop Multi-Label Random Forests with logarithmic
prediction costs that make predictions in a few milliseconds.
We train on 200 M points, 100 M categories and 10 M
features in 28 hours on a grid with 1000 compute nodes.
We develop a tree growing criterion which learns from
positive data alone.
We generate training data automatically from click logs.
We develop a sparse SSL formulation to infer beliefs about
the state of missing and noisy labels.
Multi-Label Random Forests
No annotator can mark all the relevant labels for a data point.
Training Data Missing Labels
We have missing labels
during
Training
Validation
Testing.
Even fundamental ML
techniques such as
validation can go awry.
One can’t design error
metrics invariant to
missing labels.
Training Data and Features
TF-IDF Bag of
Words Features
iphone
color
material
Training Labels
TF-IDF Bag of
Words Features
iphone
color
material
case for iphone best iphone case
apple
iphone
3g metallic slim fit
case
best iphone nn4 cases
iphone cases best iphone cases apple iphone 4g cases
black white premium bumper
case apple iphone nn4 att
best iphone nn4 case case iphone
apple iphone 4g premium soft
silicone rubber black phone
protector skin cover case
bunny rabbit silicone case skin
iphone nn4 stand tail holder
iphone 3gs cases
otterbox universal defender case
iphone nn4 black silicone black
plastic
apple iphone nn4 cases iphone case
iphone 4s case sena iphone cases
belkin grip vue tint case iphone
nn4 clear
iphone 4g cases
iphone case speck iphone case best case iphone 4s iphone 4gs cases
iphone nn4 case
switcheasy neo case iphone 3g
black
best case iphone nn4 iphone 4s defender series case
3g iphone cases waterproof iphone case best iphone 3g cases iphone case design
apple iphone cases waterproof iphone cases best iphone 4s case iphone cases 3g
best iphone 3g case
amazonbasics
protective tpu
case
screen protector att verizon
iphone nn4 iphone 4s clear
best iphone 4s cases iphone cases 4g
Training Labels
Missing and Noisy Labels
best italian restaurants
philadelphia
italian restaurant chains
italian restaurants
italian
restaurant
connecticut
italian restaurant
italian restaurant district
columbia
italian restaurants arkansas thai restaurant
italian restaurants connecticut thai restaurants
italian restaurants idaho restaurants
italian restaurants phoenix mexican restaurants
Missing and Noisy Labels
Biased Training Data
Query
Frequency
Zipf's Law
Most labels will have very few positive training examples
Multi-Label Prediction Costs
1-vs-All Classification
Linear prediction costs are infeasible
geico
car insurance
iphone cases
pizza
Label and Feature Space Compression
1K Dimensional Embedding Space
10M Dimensional Label Space
6M Dimensional Feature Space
car
auto
motor
vehicle
iphone
cases
cases
iphone
Car
Ads
iphone
Case Ads
iphone case
Hierarchical Prediction
Prediction in logarithmic time
Gating Tree Based Prediction
0
0.1
0.2
0.3
0.4
Is the word “insurance” present in the ad?
Is the word geico
present in the ad?
Yes
Yes
No
No
Prediction in logarithmic time
Ensemble of Randomized Gating Trees
0
0.1
0.2
0.3
0.4
0
0.1
0.2
0.3
0.4
0
0.1
0.2
0.3
0.4
We seek classifiers and optimization algorithms that
Are massively parallelizable
Don’t need to load the feature vectors (1 Tb) into RAM
Don’t need to load the label matrix (100 Gb) into RAM
Efficient Training
Number of training points 200 Million
Number of labels 100 Million
10 Million
Number of cores 500 1000
RAM per core 2 Gb
Training time 28 hours
0
0.05
0.1
0.15
0.2
Multi-Label Random Forests
0
0.05
0.1
0.15
0.2
The splitting cost needs to be calculated in a 2
10M
space
Is the word “insurance” present?
0
0.2
0.4
0.6
l1 l2 l3
Learning from Positively Labeled Data
Split condition :



  
  
  


0
0.2
0.4
0.6
0.8
l1 l2 l3
0
1
l1 l2 l3
Multi-Label Random Forests
0
1
l1 l2 l3
0
1
l1 l2 l3
0
0.5
l1 l2 l3







Query Recommendation Data Sets
Data set statistics
Data Set
# of Training
Points (M)
#
of Test
Points (M)
#
of Dimensions
(M)
# of Labels
(M)
Wikipedia
1.53 0.66 1.89 0.97
Ads1
8.00 0.50 1.58 1.22
Web
40.00 1.50 2.62 1.22
Ads2
90.00 5.00 5.80 9.70
We use loss functions where the penalty incurred for predicting the
real (but unknown) ground truth is never more than that of predicting
any other labelling





Hamming Loss
Precision at k
We found Precision at 10 to be robust for our application.
Performance Evaluation Precision@k
Query Recommendation Results
0.00
5.00
10.00
15.00
20.00
25.00
30.00
Wikipedia Ads1 Web Ads2
MLRF
KEX
Percentage of top 10 predictions that were clicked queries
Query Recommendation Results
0.00
10.00
20.00
30.00
40.00
50.00
60.00
Wikipedia Ads1 Web Ads2
MLRF
KEX
Percentage of top 10 predictions that were relevant
Geico Car Insurance
KEX MLRF
geico auto insurance
geico car insurance
geico insurance
www geico com
care geicos
geico com
need cheap auto insurance
wisconsin
cheap car
insurance
quotes
cheap auto insurance florida
all state car insurance coupon
code
Domino’s Pizza
KEX MLRF
dominos
dominos pizza
domino pizza
domino pasta bowls
domino pizza
coupons
domino pizza deals
domino pizza
locations
domino pizza menu
domino pizza online
Simone & Sylvia Kid’s Clothing
KEX MLRF
plastic ponies toddlers clothes
simone toddlers clothing
plastics toddler costumes
clothing and accessories children clothes sale
sylvia children clothes
pony clothing
designer children clothes
couture cute children clothes
playground retro clothing
Plastic recycling retro baby clothes
children's clothing baby clothing
KCS Flowers
KEX MLRF
funeral flowers flowers delivery
sympathy funeral flowers funeral arrangements
web home birthday flowers
bleitz funeral home funeral flowers
funeral flowers discount funeral planning
yarington's funeral home flowers valentines
harvey funeral home free delivery flowers
green lake funeral home cheap flowers
howden kennedy funeral
home
florists
arranging flowers cheap flowers funeral
Vistaprint Designer T-Shirts
KEX MLRF
embroidered apparel
custom t shirts
custom apparel funny t shirts
readymade apparel
hanes
beefy t shirts
customizable hanes t shirts
apparel long sleeve t shirts
customizable apparel
personalized t shirts
leading print printed t shirts
online business cards
retro gamer t shirts
apparel and accessories
t shirts
own text buy custom t shirts
Metlife Auto Insurance
KEX MLRF
metlife
auto home insurance
metlife auto insurance
auto home insurance auto Insurance
auto insurance car Insurance
massachusetts automobile Insurance
metlife agent geico insurance
driver discount cheap car insurance
additional cost metlife auto
saving benefits insurance broker
car discount insurance
auto quote home insurance
Wanta Thai Restaurant
KEX MLRF
authentic
thai
restaurant
thai restaurant
delicious thai food thai restaurants
thai cuisine mexican restaurants
thai restaurant cheap hotels
thai food hotels
wanta fast food restaurants
best thai restaurant restaurants coupons
thai eateries
best web hosting
restaurants
thai vegetarian foods
contemporary thai
new
york
restaurants
best italian restaurants
philadelphia
italian restaurant chains
italian restaurants
italian
restaurant
connecticut
italian restaurant
italian restaurant district
columbia
italian restaurants arkansas thai restaurant
italian restaurants connecticut thai restaurants
italian restaurants idaho restaurants
italian restaurants phoenix mexican restaurants
Compensating for Missing Labels
Progressive insurance
Allstate auto insurance
American family insurance
Esurance
Auto insurance quotes
Case-mate phone cases
Maggiano’s restaurant
0.9
0.7
0.5
0.8
0
1
l1 l2 l3
Training on Belief Vectors
0
1
l1 l2 l3
0
1
l1 l2 l3
0
0.5
l1 l2 l3




Graph-based SSL optimizes label belief smoothness and
fidelity to original labels
F* = Min
F

  




  
s. t.

Document-document similarity matrix

Diagonal matrix representing the row sums of W

0/1 label matrix

Real valued label belief matrix
Trade-off Hyperparameter
M Number of documents
L Number of labels
K Sparsity constant
Sparse Semi-Supervised Learning
Graph-based SSL optimizes label belief smoothness and
fidelity to original labels
F* = Min





 









 

s. t.

Document-document similarity matrix

Diagonal matrix representing the row sums of W

0/1 label matrix

Real valued label belief matrix
Trade-off Hyperparameter
M Number of documents
L Number of labels
K Sparsity constant
Sparse Semi-Supervised Learning
Sparse SSL formulation
F* = Min
F

  




  
s. t.
The iterative hard thresholding algorithm converges to a
global/local optimum










Iterative Hard Thresholding
If

  and W is positive definite then
The sequence
,
, … converges to a stationary point
.



If 
then
is a globally optimal solution
If 
then
is a locally optimal solution
 

 
  
  


Iterative Hard Thresholding
Semi-Supervised Learning Results
Data Set
Click Labels (%) Human Verification (%)
MLRF
MLRF+
SSL
KEX MLRF
MLRF+
SSL
KEX
Wikipedia
15.72 18.53
11.63
24.46 27.17 17.51
Ads1
18.13 19.88
11.96
45.86 47.53 41.95
Bing
22.51 25.32
18.42
50.47 51.83 47.69
Ads2
15.91 17.12
12.45
41.28 43.78 36.69
Precision@10 as judged by automatically generated click
labels as well as by human experts.
Query Expansion Results
Data Set
Click Labels (%) Human Verification (%)
MLRF+
SSL+KSP
KEX+KSP
MLRF+
SSL+KSP
KEX+KSP
Wikipedia
18.01 10.81 31.48 22.14
Ads1
21.54 12.38 51.08 43.27
Web
26.66 19.88 53.69 48.13
Ads2
19.24 14.35 46.77 40.07
Query expansion techniques can help both KEX and MLRF
Query Recommendation Results
Edit distance [Ravi et al. WSDM 2010]
Data Set
Click Labels (%)
KEX KEX+KSP MLRF MLRF+SSL
MLRF+SSL+
KSP
Wikipedia
0.81 0.78 0.71 0.66 0.63
Ads1
0.83 0.76 0.71 0.65 0.61
Web
0.73 0.68 0.65 0.62 0.58
Ads2
0.77 0.73 0.69 0.63 0.59
Query recommendation can be posed as multi-label
learning.
Learning with millions of labels can be tractable and
accurate.
Other applications
Query expansion.
Document and ad relevance and ranking.
Fine-grained query intent classification.
Conclusions
Deepak Bapna
Prateek Jain
A. Kumaran
Mehul Parsana
Krishna Leela Poola
Adarsh Prasad
Varun Singla
Acknowledgements
Can generalize to other domains such as images on Flickr or videos
on YouTube.
Advantages of an ML Approach
System Architecture
Evaluator
1
Combiner
1
Maximizer 1 Maximizer 2
Combiner
2
Combiner
3
Evaluator
4
Evaluator
3
Evaluator
2
We leverage the Map/Reduce
framework.
Trees are grown in parallel
breadth-wise.
Number of compute nodes
Evaluators 500
Combiners 100
Maximizers 25
Our objective is to balance
the compute load across
machines while minimizing
data flow
X
N+1
,Y
N+1
to
X
2N
, Y
2N
X
2N+1
,Y
2N+1
to
X
3N
, Y
3N
X
3N+1
,Y
3N+1
to
X
4N
, Y
4N
X
1
,Y
1
to
X
N
, Y
N
F*, T*
Evaluators
Evaluator
1
Combiner
1
Maximizer 1 Maximizer 2
Combiner
2
Combiner
3
Evaluator
4
Evaluator
3
Evaluator
2
Input
N training instances
Set of keys Tree ID, Node
ID, Feature ID and
threshold
Output
Partial label distributions
for the keys
Computation
N * # of keys
X
N+1
,Y
N+1
to
X
2N
, Y
2N
X
2N+1
,Y
2N+1
to
X
3N
, Y
3N
X
3N+1
,Y
3N+1
to
X
4N
, Y
4N
X
1
,Y
1
to
X
N
, Y
N
F*, T*
Combiners
Evaluator
1
Combiner
1
Maximizer 1 Maximizer 2
Combiner
2
Combiner
3
Evaluator
4
Evaluator
3
Evaluator
2
Input
Partial label distributions
for assigned keys
Output
Objective function values
for the keys.
Computation
# of keys * Avg # of
Evaluators / key * # of
labels in the distribution
for the key.
X
N+1
,Y
N+1
to
X
2N
, Y
2N
X
2N+1
,Y
2N+1
to
X
3N
, Y
3N
X
3N+1
,Y
3N+1
to
X
4N
, Y
4N
X
1
,Y
1
to
X
N
, Y
N
F*, T*
Maximizers
Evaluator
1
Combiner
1
Maximizer 1 Maximizer 2
Combiner
2
Combiner
3
Evaluator
4
Evaluator
3
Evaluator
2
Input
Objective function values
for assigned keys
Output
Optimal feature and
threshold for assigned
nodes in trees.
Computation
# of keys * Avg # of
features per key * Avg # of
thresholds per feature
X
N+1
,Y
N+1
to
X
2N
, Y
2N
X
2N+1
,Y
2N+1
to
X
3N
, Y
3N
X
3N+1
,Y
3N+1
to
X
4N
, Y
4N
X
1
,Y
1
to
X
N
, Y
N
F*, T*