Journal of Machine Learning Research 13 (2012) 2617-2654 Submitted 8/11; Revised 5/12; Published 9/12
Static Prediction Games for Adversarial Learning Problems
Michael Brückner MIBRUECK@CS.UNI-POTSDAM.DE
Department of Computer Science
University of Potsdam
August-Bebel-Str. 89
14482 Potsdam, Germany
Christian Kanzow KANZOW@MATHEMATIK.UNI-WUERZBURG.DE
Institute of Mathematics
University of Würzburg
Emil-Fischer-Str. 30
97074 Würzburg, Germany
Tobias Scheffer SCHEFFER@CS.UNI-POTSDAM.DE
Department of Computer Science
University of Potsdam
August-Bebel-Str. 89
14482 Potsdam, Germany
Editor: Nicolò Cesa-Bianchi
Abstract
The standard assumption of identically distributed training and test data is violated when the test
data are generated in response to the presence of a predictive model. This becomes apparent, for
example, in the context of email spam filtering. Here, email service providers employ spam fil-
ters, and spam senders engineer campaign templates to achieve a high rate of successful deliveries
despite the filters. We model the interaction between the learner and the data generator as a static
game in which the cost functions of the learner and the data generator are not necessarily antag-
onistic. We identify conditions under which this prediction game has a unique Nash equilibrium
and derive algorithms that find the equilibrial prediction model. We derive two instances, the Nash
logistic regression and the Nash support vector machine, and empirically explore their properties
in a case study on email spam filtering.
Keywords: static prediction games, adversarial classification, Nash equilibrium
1. Introduction
A common assumption on which most learning algorithms are based is that training and test data
are governed by identical distributions. However, in a variety of applications, the distribution that
governs data at application time may be influenced by an adversary whose interests are in conflict
with those of the learner. Consider, for instance, the following three scenarios. In computer and
network security, scripts that control attacks are engineered with botnet and intrusion detection
systems in mind. Credit card defrauders adapt their unauthorized use of credit cards—in particular,
amounts charged per transactions and per day and the type of businesses that amounts are charged
from—to avoid triggering alerting mechanisms employed by credit card companies. Email spam
senders design message templates that are instantiated by nodes of botnets. These templates are
specifically designed to produce a low spam score with popular spam filters. The domain of email
spam filtering will serve as a running example throughout the paper. In all of these applications, the
party that creates the predictive model and the adversarial party that generates future data are aware
of each other, and factor the possible actions of their opponent into their decisions.
The interaction between learner and data generators can be modeled as a game in which one
player controls the predictive model whereas another exercises some control over the process of
data generation. The adversary’s influence on the generation of the data can be formally modeled as
a transformation that is imposed on the distribution that governs the data at training time. The trans-
formed distribution then governs the data at application time. The optimization criterion of either
player takes as arguments both the predictive model chosen by the learner and the transformation
carried out by the adversary.
Typically, this problem is modeled under the worst-case assumption that the adversary desires
to impose the highest possible costs on the learner. This amounts to a zero-sum game in which
the loss of one player is the gain of the other. In this setting, both players can maximize their
expected outcome by following a minimax strategy. Lanckriet et al. (2002) study the minimax
probability machine (MPM). This classifier minimizes the maximal probability of misclassifying
new instances for a given mean and covariance matrix of each class. Geometrically, these class
means and covariances define two hyper-ellipsoids which are equally scaled such that they intersect;
their common tangent is the minimax probabilistic decision hyperplane. Ghaoui et al. (2003) derive
a minimax model for input data that are known to lie within some hyper-rectangles around the
training instances. Their solution minimizes the worst-case loss over all possible choices of the data
in these intervals. Similarly, worst-case solutions to classification games in which the adversary
deletes input features (Globerson and Roweis, 2006; Globerson et al., 2009) or performs an arbitrary
feature transformation (Teo et al., 2007; Dekel and Shamir, 2008; Dekel et al., 2010) have been
studied.
Several applications motivate problem settings in which the goals of the learner and the data
generator, while still conflicting, are not necessarily entirely antagonistic. For instance, a defrauder’s
goal of maximizing the profit made from exploiting phished account information is not the inverse
of an email service provider’s goal of achieving a high spam recognition rate at close-to-zero false
positives. When playing a minimax strategy, one often makes overly pessimistic assumptions about
the adversary’s behavior and may not necessarily obtain an optimal outcome.
Games in which a leader—typically, the learner—commits to an action first whereas the adver-
sary can react after the leader’s action has been disclosed are naturally modeled as a Stackelberg
competition. This model is appropriate when the follower—the data generator—has full informa-
tion about the predictive model. This assumption is usually a pessimistic approximation of reality
because, for instance, neither email service providers nor credit card companies disclose a com-
prehensive documentation of their current security measures. Stackelberg equilibria of adversarial
classification problems can be identified by solving a bilevel optimization problem (Brückner and
Scheffer, 2011).
This paper studies static prediction games in which both players act simultaneously; that is,
without prior information on their opponent’s move. When the optimization criterion of both play-
ers depends not only on their own action but also on their opponent’s move, then the concept of
a player’s optimal action is no longer well-defined. Therefore, we resort to the concept of a Nash
equilibrium of static prediction games. A Nash equilibrium is a pair of actions chosen such that
no player benefits from unilaterally selecting a different action. If a game has a unique Nash
equilibrium and is played by rational players that aim at maximizing their optimization criteria, it is
reasonable for each player to assume that the opponent will play according to the Nash equilib-
rium strategy. If one player plays according to the equilibrium strategy, the optimal move for the
other player is to play this equilibrium strategy as well. If, however, multiple equilibria exist and
the players choose their strategy according to distinct ones, then the resulting combination may be
arbitrarily disadvantageous for either player. It is therefore interesting to study whether adversarial
prediction games have a unique Nash equilibrium.
Our work builds on an approach that Brückner and Scheffer (2009) developed for finding a
Nash equilibrium of a static prediction game. We will discuss a flaw in Theorem 1 of Brückner
and Scheffer (2009) and develop a revised version of the theorem that identifies conditions under
which a unique Nash equilibrium of a prediction game exists. In addition to the inexact linesearch
approach to finding the equilibrium that Brückner and Scheffer (2009) develop, we will follow a
modified extragradient approach and develop Nash logistic regression and the Nash support vector
machine. This paper also develops a kernelized version of these methods. An extended empirical
evaluation explores the applicability of the Nash instances in the context of email spam filtering.
We empirically verify the assumptions made in the modeling process and compare the performance
of Nash instances with baseline methods on several email corpora including a corpus from an email
service provider.
The rest of this paper is organized as follows. Section 2 introduces the problem setting. We
formalize the Nash prediction game and study conditions under which a unique Nash equilibrium
exists in Section 3. Section 4 develops strategies for identifying equilibrial prediction models, and
in Section 5, we detail two instances of the Nash prediction game. In Section 6, we report on
experiments on email spam filtering; Section 7 concludes.
2. Problem Setting
We study static prediction games between two players: the learner ($v = -1$) and an adversary, the
data generator ($v = +1$). In our running example of email spam filtering, we study the competition
between recipient and senders, not competition among senders. Therefore, $v = -1$ refers to the
recipient whereas $v = +1$ models the entirety of all legitimate and abusive email senders as a single,
amalgamated player.
At training time, the data generator $v = +1$ produces a sample $D = \{(x_i, y_i)\}_{i=1}^n$ of $n$ training
instances $x_i \in X$ with corresponding class labels $y_i \in Y = \{-1,+1\}$. These object-class pairs are
drawn according to a training distribution with density function $p(x,y)$. By contrast, at application
time the data generator produces object-class pairs according to some test distribution with density
$\dot p(x,y)$ which may differ from $p(x,y)$.
The task of the learner $v = -1$ is to select the parameters $w \in W \subseteq \mathbb{R}^m$ of a predictive model
$h(x) = \operatorname{sign} f_w(x)$ implemented in terms of a generalized linear decision function $f_w : X \rightarrow \mathbb{R}$ with
$f_w(x) = w^{\mathsf T}\phi(x)$ and feature mapping $\phi : X \rightarrow \mathbb{R}^m$. The learner's theoretical costs at application
time are given by
\[
\theta_{-1}(w, \dot p) = \sum_{y \in Y} \int_X c_{-1}(x,y)\, \ell_{-1}(f_w(x), y)\, \dot p(x,y)\, \mathrm{d}x,
\]
where weighting function $c_{-1} : X \times Y \rightarrow \mathbb{R}$ and loss function $\ell_{-1} : \mathbb{R} \times Y \rightarrow \mathbb{R}$ compose the
weighted loss $c_{-1}(x,y)\, \ell_{-1}(f_w(x), y)$ that the learner incurs when the predictive model classifies
instance $x$ as $h(x) = \operatorname{sign} f_w(x)$ while the true label is $y$. The positive class- and instance-specific
weighting factors $c_{-1}(x,y)$ with $\mathbb{E}_{X,Y}[c_{-1}(x,y)] = 1$ specify the importance of minimizing the loss
$\ell_{-1}(f_w(x), y)$ for the corresponding object-class pair $(x,y)$. For instance, in spam filtering, the
correct classification of non-spam messages can be business-critical for email service providers while
failing to detect spam messages runs up processing and storage costs, depending on the size of the
message.
The data generator $v = +1$ can modify the data generation process at application time. In practice,
spam senders update their campaign templates which are disseminated to the nodes of botnets.
Formally, the data generator transforms the training distribution with density $p$ to the test distribution
with density $\dot p$. The data generator incurs transformation costs by modifying the data generation
process, which are quantified by $\Omega_{+1}(p, \dot p)$. This term acts as a regularizer on the transformation and
may implicitly constrain the possible difference between the distributions at training and application
time, depending on the nature of the application that is to be modeled. For instance, the email sender
may not be allowed to alter the training distribution for non-spam messages, or to modify the nature
of the messages by changing the label from spam to non-spam or vice versa. Additionally, changing
the training distribution for spam messages may incur costs depending on the extent of distortion
inflicted on the informational payload. The theoretical costs of the data generator at application
time are the sum of the expected prediction costs and the transformation costs,
\[
\theta_{+1}(w, \dot p) = \sum_{y \in Y} \int_X c_{+1}(x,y)\, \ell_{+1}(f_w(x), y)\, \dot p(x,y)\, \mathrm{d}x + \Omega_{+1}(p, \dot p),
\]
where, in analogy to the learner's costs, $c_{+1}(x,y)\, \ell_{+1}(f_w(x), y)$ quantifies the weighted loss that
the data generator incurs when instance $x$ is labeled as $h(x) = \operatorname{sign} f_w(x)$ while the true label is
$y$. The weighting factors $c_{+1}(x,y)$ with $\mathbb{E}_{X,Y}[c_{+1}(x,y)] = 1$ express the significance of $(x,y)$ from
the perspective of the data generator. In our example scenario, this reflects that costs of correctly or
incorrectly classified instances may vary greatly across different physical senders that are aggregated
into the amalgamated player.
Since the theoretical costs of both players depend on the test distribution, they can, for all practical
purposes, not be calculated. Hence, we focus on a regularized, empirical counterpart of the theoretical
costs based on the training sample $D$. The empirical counterpart $\hat\Omega_{+1}(D, \dot D)$ of the data generator's
regularizer $\Omega_{+1}(p, \dot p)$ penalizes the divergence between training sample $D = \{(x_i, y_i)\}_{i=1}^n$
and a perturbed training sample $\dot D = \{(\dot x_i, y_i)\}_{i=1}^n$ that would be the outcome of applying the
transformation that translates $p$ into $\dot p$ to sample $D$. The learner's cost function, instead of integrating
over $\dot p$, sums over the elements of the perturbed training sample $\dot D$. The players' empirical cost
functions can still only be evaluated after the learner has committed to parameters $w$ and the data
generator to a transformation. However, this transformation needs only be represented in terms of
the effects that it will have on the training sample $D$. The transformed training sample $\dot D$ must not
be mistaken for test data; test data are generated under $\dot p$ at application time after the players have
committed to their actions.
The empirical costs incurred by the predictive model $h(x) = \operatorname{sign} f_w(x)$ with parameters $w$ and
the shift from $p$ to $\dot p$ amount to
\[
\hat\theta_{-1}(w, \dot D) = \sum_{i=1}^n c_{-1,i}\, \ell_{-1}(f_w(\dot x_i), y_i) + \rho_{-1} \hat\Omega_{-1}(w), \tag{1}
\]
\[
\hat\theta_{+1}(w, \dot D) = \sum_{i=1}^n c_{+1,i}\, \ell_{+1}(f_w(\dot x_i), y_i) + \rho_{+1} \hat\Omega_{+1}(D, \dot D), \tag{2}
\]
where we have replaced the weighting terms $\frac{1}{n} c_v(\dot x_i, y_i)$ by constant cost factors $c_{v,i} > 0$ with
$\sum_i c_{v,i} = 1$. The learner's regularizer $\hat\Omega_{-1}(w)$ in (1) accounts for the fact that $\dot D$ does not constitute
the test data itself, but is merely a training sample transformed to reflect the test distribution and then used
to learn the model parameters $w$. The trade-off between the empirical loss and the regularizer is
controlled by each player's regularization parameter $\rho_v > 0$ for $v \in \{-1,+1\}$.
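As a concrete illustration of the empirical costs (1) and (2), the following sketch instantiates them with the logistic loss and with quadratic choices for $\hat\Omega_{-1}$ and $\hat\Omega_{+1}$; these particular losses and regularizers are assumptions for illustration only, since the paper keeps them generic until Section 5.

```python
import numpy as np

def logistic_loss(z, y):
    # log(1 + exp(-y * z)), computed in a numerically stable way
    return np.logaddexp(0.0, -y * z)

def theta_learner(w, Xdot, y, c, rho):
    # empirical cost (1): weighted loss on the transformed sample plus
    # rho_{-1} * Omega_{-1}(w); Omega_{-1}(w) = 0.5 * ||w||^2 is an assumed instance
    return np.sum(c * logistic_loss(Xdot @ w, y)) + rho * 0.5 * (w @ w)

def theta_generator(w, Xdot, X, y, c, rho):
    # empirical cost (2): weighted loss plus rho_{+1} * Omega_{+1}(D, Ddot);
    # the squared-distance penalty on the perturbation is an assumed instance
    return np.sum(c * logistic_loss(Xdot @ w, y)) + rho * 0.5 * np.sum((Xdot - X) ** 2)
```

With `Xdot = X` the transformation penalty vanishes, so both players face only the weighted loss terms.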
Note that either player's empirical costs $\hat\theta_v$ depend on both players' actions: $w \in W$ and
$\dot D \in (X \times Y)^n$. Because of the potentially conflicting players' interests, the decision process for $w$ and $\dot D$
becomes a non-cooperative two-player game, which we call a prediction game. In the following
section, we will refer to the Nash prediction game (NPG) which identifies the concept of an optimal
move of the learner and the data generator under the assumption of simultaneously acting players.
3. The Nash Prediction Game
The outcome of a prediction game is one particular combination of actions $(w^*, \dot D^*)$ that incurs costs
$\hat\theta_v(w^*, \dot D^*)$ for the players. Each player is aware that this outcome is affected by both players' actions
and that, consequently, their potential to choose an action can have an impact on the other player's
decision. In general, there is no action that minimizes one player's cost function independent of the
other player's action. In a non-cooperative game, the players are not allowed to communicate while
making their decisions and therefore they have no information about the other player's strategy. In
this setting, any concept of an optimal move requires additional assumptions on how the adversary
will act.
We model the decision process for $w^*$ and $\dot D^*$ as a static two-player game with complete
information. In a static game, both players commit to an action simultaneously, without information
about their opponent's action. In a game with complete information, both players know their
opponent's cost function and action space.
When $\hat\theta_{-1}$ and $\hat\theta_{+1}$ are known and antagonistic, the assumption that the adversary will seek
the greatest advantage by inflicting the greatest damage on $\hat\theta_{-1}$ justifies the minimax strategy:
$\operatorname{argmin}_{w} \max_{\dot D} \hat\theta_{-1}(w, \dot D)$. However, when the players' cost functions are not antagonistic, assuming
that the adversary will inflict the greatest possible damage is overly pessimistic. Instead, assuming
that the adversary acts rationally in the sense of seeking the greatest possible personal advantage
leads to the concept of a Nash equilibrium. An equilibrium strategy is a steady state of the game in
which neither player has an incentive to unilaterally change their plan of actions.
In static games, equilibrium strategies are called Nash equilibria, which is why we refer to the
resulting predictive model as Nash prediction game (NPG). In a two-player game, a Nash
equilibrium is defined as a pair of actions such that no player can benefit from changing their action
unilaterally; that is,
\[
\hat\theta_{-1}(w^*, \dot D^*) = \min_{w \in W} \hat\theta_{-1}(w, \dot D^*),
\]
\[
\hat\theta_{+1}(w^*, \dot D^*) = \min_{\dot D \in (X \times Y)^n} \hat\theta_{+1}(w^*, \dot D),
\]
where $W$ and $(X \times Y)^n$ denote the players' action spaces.
However, a static prediction game may not have a Nash equilibrium, or it may possess multiple
equilibria. If $(w', \dot D')$ and $(w'', \dot D'')$ are distinct Nash equilibria and each player decides to act
according to a different one of them, then the combinations $(w', \dot D'')$ and $(w'', \dot D')$ may incur arbitrarily
high costs for both players. Hence, one can argue that it is rational for an adversary to play a Nash
equilibrium only when the following assumption is satisfied.
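The defining property of a Nash equilibrium can be made concrete on a hypothetical toy game with scalar actions and convex quadratic costs (a stand-in for the prediction game, not the model studied in this paper): solving both players' first-order conditions yields the unique equilibrium, from which no unilateral deviation pays off.

```python
import numpy as np

# A hypothetical two-player game with scalar actions and convex quadratic costs.
def cost_a(a, b):
    return (a - 1.0) ** 2 + 0.5 * a * b

def cost_b(a, b):
    return (b + 1.0) ** 2 + 0.5 * a * b

# First-order conditions give the best responses a = 1 - b/4 and b = -1 - a/4;
# stacking them as a linear system yields the unique Nash equilibrium.
A = np.array([[1.0, 0.25], [0.25, 1.0]])
rhs = np.array([1.0, -1.0])
a_star, b_star = np.linalg.solve(A, rhs)
```

At `(a_star, b_star)` neither player can lower their cost by deviating unilaterally, which is exactly the defining property above.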
Assumption 1 The following statements hold:
1. both players act simultaneously;
2. both players have full knowledge about both (empirical) cost functions $\hat\theta_v(w, \dot D)$ defined in (1)
and (2), and both action spaces $W$ and $(X \times Y)^n$;
3. both players act rationally with respect to their cost function in the sense of securing their
lowest possible costs;
4. a unique Nash equilibrium exists.
Whether Assumptions 1.1-1.3 are adequate—especially the assumption of simultaneous actions—
strongly depends on the application. For example, in some applications, the data generator may
unilaterally be able to acquire information about the model $f_w$ before committing to $\dot D$. Such situations
are better modeled as a Stackelberg competition (Brückner and Scheffer, 2011). On the other hand,
when the learner is able to treat any executed action as part of the training data $D$ and update the
model $w$, the setting is better modeled as repeated executions of a static game with simultaneous
actions. The adequacy of Assumption 1.4, which we discuss in the following sections, depends
on the chosen loss functions, the cost factors, and the regularizers.
3.1 Existence of a Nash Equilibrium
Theorem 1 of Br
¨
uckner and Scheffer (2009) identifies conditions under which a unique Nash equi-
librium exists. Kanzow located a flaw in the proof of this theorem: The proof argues that the
pseudo-Jacobian can be decomposed into two (strictly) positive stable matrices by showing that the
real part of every eigenvalue of those two matrices is positive. However, this does not generally
imply that the sum of these matrices is positive stable as well since this would require a common
Lyapunov solution (cf. Problem 2.2.6 of Horn and Johnson, 1991). But even if such a solution
exists, the positive definiteness cannot be concluded from the positiveness of all eigenvalues as the
pseudo-Jacobian is generally non-symmetric.
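The last observation, that positiveness of all eigenvalues does not imply positive definiteness for non-symmetric matrices, can be checked with a two-line numerical example:

```python
import numpy as np

# A non-symmetric matrix can have only positive eigenvalues and still fail
# to be positive definite: here both eigenvalues equal 1, yet the quadratic
# form z^T A z is negative for z = (1, 1).
A = np.array([[1.0, -4.0], [0.0, 1.0]])
eigenvalues = np.linalg.eigvals(A)
z = np.array([1.0, 1.0])
quad_form = float(z @ A @ z)
```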
Having thus "unproven" the prior claim, we now derive sufficient conditions for the existence of a
Nash equilibrium. To this end, we first define
\[
x := \left[\phi(x_1)^{\mathsf T}, \phi(x_2)^{\mathsf T}, \ldots, \phi(x_n)^{\mathsf T}\right]^{\mathsf T} \in \phi(X)^n \subseteq \mathbb{R}^{m \cdot n},
\]
\[
\dot x := \left[\phi(\dot x_1)^{\mathsf T}, \phi(\dot x_2)^{\mathsf T}, \ldots, \phi(\dot x_n)^{\mathsf T}\right]^{\mathsf T} \in \phi(X)^n \subseteq \mathbb{R}^{m \cdot n},
\]
as long, concatenated column vectors induced by feature mapping $\phi$, training sample $D = \{(x_i, y_i)\}_{i=1}^n$,
and transformed training sample $\dot D = \{(\dot x_i, y_i)\}_{i=1}^n$, respectively. For terminological harmony, we
refer to vector $\dot x$ as the data generator's action with corresponding action space $\phi(X)^n$.
We make the following assumptions on the action spaces and the cost functions which enable
us to state the main result on the existence of at least one Nash equilibrium in Lemma 1.
Assumption 2 The players' cost functions defined in Equations 1 and 2, and their action sets $W$
and $\phi(X)^n$ satisfy the properties:
1. loss functions $\ell_v(z,y)$ with $v \in \{-1,+1\}$ are convex and twice continuously differentiable
with respect to $z \in \mathbb{R}$ for all fixed $y \in Y$;
2. regularizers $\hat\Omega_v$ are uniformly strongly convex and twice continuously differentiable with
respect to $w \in W$ and $\dot x \in \phi(X)^n$, respectively;
3. action spaces $W$ and $\phi(X)^n$ are non-empty, compact, and convex subsets of finite-dimensional
Euclidean spaces $\mathbb{R}^m$ and $\mathbb{R}^{m \cdot n}$, respectively.
Lemma 1 Under Assumption 2, at least one equilibrium point $(w^*, \dot x^*) \in W \times \phi(X)^n$ of the Nash
prediction game defined by
\[
\min_{w} \hat\theta_{-1}(w, \dot x^*) \ \text{ s.t. } w \in W, \qquad
\min_{\dot x} \hat\theta_{+1}(w^*, \dot x) \ \text{ s.t. } \dot x \in \phi(X)^n \tag{3}
\]
exists.
Proof. Each player $v$'s cost function is a sum over $n$ loss terms resulting from loss function $\ell_v$ and
regularizer $\hat\Omega_v$. By Assumption 2, these loss functions are convex and continuous, and the
regularizers are uniformly strongly convex and continuous. Hence, both cost functions $\hat\theta_{-1}(w, \dot x)$ and
$\hat\theta_{+1}(w, \dot x)$ are continuous in all arguments and uniformly strongly convex in $w \in W$ and $\dot x \in \phi(X)^n$,
respectively. As both action spaces $W$ and $\phi(X)^n$ are non-empty, compact, and convex subsets
of finite-dimensional Euclidean spaces, a Nash equilibrium exists—see Theorem 4.3 of Basar and
Olsder (1999).
3.2 Uniqueness of the Nash Equilibrium
We will now derive conditions for the uniqueness of an equilibrium of the Nash prediction game
defined in (3). We first reformulate the two-player game into an (n + 1)-player game. In Lemma 2,
we then present a sufficient condition for the uniqueness of the Nash equilibrium in this game, and
by applying Proposition 4 and L emma 5-7 we verify whether this condition is met. Finally, we state
the main result in Theorem 8: The Nash equilibrium is unique under certain properties of the loss
functions, the regularizers, and the cost factors which all can be verified easily.
Taking into account the Cartesian product structure of the data generator's action space $\phi(X)^n$,
it is not difficult to see that $(w^*, \dot x^*)$ with $\dot x^* = \left[\dot x_1^{\mathsf T}, \ldots, \dot x_n^{\mathsf T}\right]^{\mathsf T}$ and $\dot x_i := \phi(\dot x_i)$ is a solution of the
two-player game if, and only if, $(w^*, \dot x_1, \ldots, \dot x_n)$ is a Nash equilibrium of the $(n+1)$-player game
defined by
\[
\min_{w} \hat\theta_{-1}(w, \dot x) \ \text{ s.t. } w \in W, \qquad
\min_{\dot x_1} \hat\theta_{+1}(w, \dot x) \ \text{ s.t. } \dot x_1 \in \phi(X), \qquad \cdots \qquad
\min_{\dot x_n} \hat\theta_{+1}(w, \dot x) \ \text{ s.t. } \dot x_n \in \phi(X), \tag{4}
\]
which results from (3) by repeating the cost function $\hat\theta_{+1}$ $n$ times and minimizing this function with
respect to $\dot x_i \in \phi(X)$ for $i = 1, \ldots, n$. Then the pseudo-gradient (in the sense of Rosen, 1965) of the
game in (4) is defined by
\[
g_r(w, \dot x) :=
\begin{bmatrix}
r_0 \nabla_w \hat\theta_{-1}(w, \dot x) \\
r_1 \nabla_{\dot x_1} \hat\theta_{+1}(w, \dot x) \\
r_2 \nabla_{\dot x_2} \hat\theta_{+1}(w, \dot x) \\
\vdots \\
r_n \nabla_{\dot x_n} \hat\theta_{+1}(w, \dot x)
\end{bmatrix}
\in \mathbb{R}^{m + m \cdot n}, \tag{5}
\]
with any fixed vector $r = [r_0, r_1, \ldots, r_n]^{\mathsf T}$ where $r_i > 0$ for $i = 0, \ldots, n$. The derivative of $g_r$—that is,
the pseudo-Jacobian of (4)—is given by
\[
J_r(w, \dot x) = \Lambda_r
\begin{bmatrix}
\nabla^2_{w,w} \hat\theta_{-1}(w, \dot x) & \nabla^2_{w,\dot x} \hat\theta_{-1}(w, \dot x) \\
\nabla^2_{\dot x,w} \hat\theta_{+1}(w, \dot x) & \nabla^2_{\dot x,\dot x} \hat\theta_{+1}(w, \dot x)
\end{bmatrix}, \tag{6}
\]
where
\[
\Lambda_r :=
\begin{bmatrix}
r_0 I_m & 0 & \cdots & 0 \\
0 & r_1 I_m & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & r_n I_m
\end{bmatrix}
\in \mathbb{R}^{(m + m \cdot n) \times (m + m \cdot n)}. \tag{7}
\]
Note that the pseudo-gradient $g_r$ and the pseudo-Jacobian $J_r$ exist when Assumption 2 is satisfied.
The above definition of the pseudo-Jacobian enables us to state the following result about the
uniqueness of a Nash equilibrium.
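A minimal sketch of how $\Lambda_r$ from (7) and the stacked pseudo-gradient from (5) can be assembled numerically; the partial gradients themselves are passed in as precomputed vectors, so the function names and signatures are illustrative assumptions only.

```python
import numpy as np

def lambda_r(r, m):
    # block-diagonal scaling matrix (7): diag(r_0 I_m, ..., r_n I_m),
    # built as a Kronecker product; r has length n + 1
    return np.kron(np.diag(np.asarray(r, dtype=float)), np.eye(m))

def pseudo_gradient(r, grad_learner, grads_generator):
    # stacked, r-weighted partial gradients as in (5); grad_learner is the
    # learner's gradient in R^m, grads_generator the n per-instance gradients
    parts = [r[0] * grad_learner] + [ri * g for ri, g in zip(r[1:], grads_generator)]
    return np.concatenate(parts)
```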
Lemma 2 Let Assumption 2 hold and suppose there exists a fixed vector $r = [r_0, r_1, \ldots, r_n]^{\mathsf T}$ with
$r_i > 0$ for all $i = 0, 1, \ldots, n$ such that the corresponding pseudo-Jacobian $J_r(w, \dot x)$ is positive definite
for all $(w, \dot x) \in W \times \phi(X)^n$. Then the Nash prediction game in (3) has a unique equilibrium.
Proof. The existence of a Nash equilibrium follows from Lemma 1. Recall from our previous
discussion that the original Nash game in (3) has a unique solution if, and only if, the game from (4)
with one learner and $n$ data generators admits a unique solution. In view of Theorem 2 of Rosen
(1965), the latter attains a unique solution if the pseudo-gradient $g_r$ is strictly monotone; that is, if
for all actions $w, w' \in W$ and $\dot x, \dot x' \in \phi(X)^n$, the inequality
\[
\left(g_r(w, \dot x) - g_r(w', \dot x')\right)^{\mathsf T}
\begin{bmatrix}
w - w' \\
\dot x - \dot x'
\end{bmatrix}
> 0
\]
holds. A sufficient condition for this pseudo-gradient being strictly monotone is the positive
definiteness of the pseudo-Jacobian $J_r$ (see, e.g., Theorem 7.11 and Theorem 6, respectively, in Geiger
and Kanzow, 1999; Rosen, 1965).
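Numerically, the positive definiteness required by Lemma 2 can be checked via the symmetric part of $J_r$, since $z^{\mathsf T} J z = z^{\mathsf T} \tfrac{1}{2}(J + J^{\mathsf T}) z$ for any square $J$; a sketch:

```python
import numpy as np

def is_positive_definite(J, tol=1e-12):
    # z^T J z > 0 for all z != 0 iff the symmetric part (J + J^T)/2 has only
    # positive eigenvalues; J itself may be non-symmetric, as the
    # pseudo-Jacobian generally is
    sym = 0.5 * (J + J.T)
    return bool(np.min(np.linalg.eigvalsh(sym)) > tol)
```

Note that testing the eigenvalues of $J$ itself would not suffice, as discussed in Section 3.1.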
To verify whether the positive definiteness condition of Lemma 2 is satisfied, we first derive the
pseudo-Jacobian $J_r(w, \dot x)$. We subsequently decompose it into a sum of three matrices and analyze
the definiteness of these matrices for the particular choice of vector $r$ with $r_0 := 1$ and
$r_i := \frac{c_{-1,i}}{c_{+1,i}} > 0$ for all $i = 1, \ldots, n$, with corresponding matrix
\[
\Lambda_r :=
\begin{bmatrix}
I_m & 0 & \cdots & 0 \\
0 & \frac{c_{-1,1}}{c_{+1,1}} I_m & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{c_{-1,n}}{c_{+1,n}} I_m
\end{bmatrix}. \tag{8}
\]
This finally provides us with sufficient conditions which ensure the uniqueness of the Nash
equilibrium.
3.2.1 DERIVATION OF THE PSEUDO-JACOBIAN
Throughout this section, we denote by $\ell'_v(z,y)$ and $\ell''_v(z,y)$ the first and second derivative of the
mapping $\ell_v(z,y)$ with respect to $z \in \mathbb{R}$ and use the abbreviations
\[
\ell'_{v,i} := \ell'_v(\dot x_i^{\mathsf T} w, y_i), \qquad
\ell''_{v,i} := \ell''_v(\dot x_i^{\mathsf T} w, y_i),
\]
for both players $v \in \{-1,+1\}$ and $i = 1, \ldots, n$.
To state the pseudo-Jacobian for the empirical costs given in (1) and (2), we first derive their
first-order partial derivatives,
\[
\nabla_w \hat\theta_{-1}(w, \dot x) = \sum_{i=1}^n c_{-1,i}\, \ell'_{-1,i}\, \dot x_i + \rho_{-1} \nabla_w \hat\Omega_{-1}(w), \tag{9}
\]
\[
\nabla_{\dot x_i} \hat\theta_{+1}(w, \dot x) = c_{+1,i}\, \ell'_{+1,i}\, w + \rho_{+1} \nabla_{\dot x_i} \hat\Omega_{+1}(x, \dot x). \tag{10}
\]
This allows us to calculate the entries of the pseudo-Jacobian given in (6),
\[
\nabla^2_{w,w} \hat\theta_{-1}(w, \dot x) = \sum_{i=1}^n c_{-1,i}\, \ell''_{-1,i}\, \dot x_i \dot x_i^{\mathsf T} + \rho_{-1} \nabla^2_{w,w} \hat\Omega_{-1}(w),
\]
\[
\nabla^2_{w,\dot x_i} \hat\theta_{-1}(w, \dot x) = c_{-1,i}\, \ell''_{-1,i}\, \dot x_i w^{\mathsf T} + c_{-1,i}\, \ell'_{-1,i}\, I_m,
\]
\[
\nabla^2_{\dot x_i,w} \hat\theta_{+1}(w, \dot x) = c_{+1,i}\, \ell''_{+1,i}\, w \dot x_i^{\mathsf T} + c_{+1,i}\, \ell'_{+1,i}\, I_m,
\]
\[
\nabla^2_{\dot x_i,\dot x_j} \hat\theta_{+1}(w, \dot x) = \delta_{ij}\, c_{+1,i}\, \ell''_{+1,i}\, w w^{\mathsf T} + \rho_{+1} \nabla^2_{\dot x_i,\dot x_j} \hat\Omega_{+1}(x, \dot x),
\]
where $\delta_{ij}$ denotes Kronecker's delta, which is 1 if $i$ equals $j$ and 0 otherwise.
We can express these equations more compactly as matrix equations. To this end, we use the
diagonal matrix $\Lambda_r$ as defined in (7) and set $\Gamma_v := \operatorname{diag}(c_{v,1} \ell''_{v,1}, \ldots, c_{v,n} \ell''_{v,n})$. Additionally, we define
$\dot X \in \mathbb{R}^{n \times m}$ as the matrix with rows $\dot x_1^{\mathsf T}, \ldots, \dot x_n^{\mathsf T}$, and $n$ matrices $W_i \in \mathbb{R}^{n \times m}$ with all entries set to zero
except for the $i$-th row, which is set to $w^{\mathsf T}$. Then,
\[
\nabla^2_{w,w} \hat\theta_{-1}(w, \dot x) = \dot X^{\mathsf T} \Gamma_{-1} \dot X + \rho_{-1} \nabla^2_{w,w} \hat\Omega_{-1}(w),
\]
\[
\nabla^2_{w,\dot x_i} \hat\theta_{-1}(w, \dot x) = \dot X^{\mathsf T} \Gamma_{-1} W_i + c_{-1,i}\, \ell'_{-1,i}\, I_m,
\]
\[
\nabla^2_{\dot x_i,w} \hat\theta_{+1}(w, \dot x) = W_i^{\mathsf T} \Gamma_{+1} \dot X + c_{+1,i}\, \ell'_{+1,i}\, I_m,
\]
\[
\nabla^2_{\dot x_i,\dot x_j} \hat\theta_{+1}(w, \dot x) = W_i^{\mathsf T} \Gamma_{+1} W_j + \rho_{+1} \nabla^2_{\dot x_i,\dot x_j} \hat\Omega_{+1}(x, \dot x).
\]
Hence, the pseudo-Jacobian in (6) can be stated as follows,
\[
J_r(w, \dot x) = \Lambda_r
\begin{bmatrix}
\dot X & 0 & \cdots & 0 \\
0 & W_1 & \cdots & W_n
\end{bmatrix}^{\mathsf T}
\begin{bmatrix}
\Gamma_{-1} & \Gamma_{-1} \\
\Gamma_{+1} & \Gamma_{+1}
\end{bmatrix}
\begin{bmatrix}
\dot X & 0 & \cdots & 0 \\
0 & W_1 & \cdots & W_n
\end{bmatrix}
+ \Lambda_r
\begin{bmatrix}
\rho_{-1} \nabla^2_{w,w} \hat\Omega_{-1}(w) & c_{-1,1} \ell'_{-1,1} I_m & \cdots & c_{-1,n} \ell'_{-1,n} I_m \\
c_{+1,1} \ell'_{+1,1} I_m & \rho_{+1} \nabla^2_{\dot x_1,\dot x_1} \hat\Omega_{+1}(x, \dot x) & \cdots & \rho_{+1} \nabla^2_{\dot x_1,\dot x_n} \hat\Omega_{+1}(x, \dot x) \\
\vdots & \vdots & \ddots & \vdots \\
c_{+1,n} \ell'_{+1,n} I_m & \rho_{+1} \nabla^2_{\dot x_n,\dot x_1} \hat\Omega_{+1}(x, \dot x) & \cdots & \rho_{+1} \nabla^2_{\dot x_n,\dot x_n} \hat\Omega_{+1}(x, \dot x)
\end{bmatrix}.
\]
We now aim at decomposing the right-hand expression in order to verify the definiteness of the
pseudo-Jacobian.
3.2.2 DECOMPOSITION OF THE PSEUDO-JACOBIAN
To verify the positive definiteness of the pseudo-Jacobian, we further decompose the second
summand of the above expression into a positive semi-definite and a strictly positive definite matrix.
To this end, let us denote the smallest eigenvalues of the Hessians of the regularizers on the
corresponding action spaces $W$ and $\phi(X)^n$ by
\[
\lambda_{-1} := \inf_{w \in W} \lambda_{\min}\left(\nabla^2_{w,w} \hat\Omega_{-1}(w)\right), \tag{11}
\]
\[
\lambda_{+1} := \inf_{\dot x \in \phi(X)^n} \lambda_{\min}\left(\nabla^2_{\dot x,\dot x} \hat\Omega_{+1}(x, \dot x)\right), \tag{12}
\]
where $\lambda_{\min}(A)$ denotes the smallest eigenvalue of the symmetric matrix $A$.
Remark 3 Note that the minimum in (11) and (12) is attained and is strictly positive: The mapping
$\lambda_{\min} : M^{k \times k} \rightarrow \mathbb{R}$ is concave on the set of symmetric matrices $M^{k \times k}$ of dimension $k \times k$ (cf.
Example 3.10 in Boyd and Vandenberghe, 2004), and in particular, it therefore follows that this mapping
is continuous. Furthermore, the mappings $u_{-1} : W \rightarrow M^{m \times m}$ with $u_{-1}(w) := \nabla^2_{w,w} \hat\Omega_{-1}(w)$ and
$u_{+1} : \phi(X)^n \rightarrow M^{m \cdot n \times m \cdot n}$ with $u_{+1}(\dot x) := \nabla^2_{\dot x,\dot x} \hat\Omega_{+1}(x, \dot x)$ are continuous (for any fixed $x$) by
Assumption 2. Hence, the mappings $w \mapsto \lambda_{\min}(u_{-1}(w))$ and $\dot x \mapsto \lambda_{\min}(u_{+1}(\dot x))$ are also continuous since
each is precisely the composition $\lambda_{\min} \circ u_v$ of the continuous functions $\lambda_{\min}$ and $u_v$ for $v \in \{-1,+1\}$.
Taking into account that a continuous mapping on a non-empty compact set attains its minimum, it
follows that there exist elements $w \in W$ and $\dot x \in \phi(X)^n$ such that
\[
\lambda_{-1} = \lambda_{\min}\left(\nabla^2_{w,w} \hat\Omega_{-1}(w)\right), \qquad
\lambda_{+1} = \lambda_{\min}\left(\nabla^2_{\dot x,\dot x} \hat\Omega_{+1}(x, \dot x)\right).
\]
Moreover, since the Hessians of the regularizers are positive definite by Assumption 2, we see that
$\lambda_v > 0$ holds for $v \in \{-1,+1\}$.
By the above definitions, we can decompose the regularizers' Hessians as follows,
\[
\nabla^2_{w,w} \hat\Omega_{-1}(w) = \lambda_{-1} I_m + \left(\nabla^2_{w,w} \hat\Omega_{-1}(w) - \lambda_{-1} I_m\right),
\]
\[
\nabla^2_{\dot x,\dot x} \hat\Omega_{+1}(x, \dot x) = \lambda_{+1} I_{m \cdot n} + \left(\nabla^2_{\dot x,\dot x} \hat\Omega_{+1}(x, \dot x) - \lambda_{+1} I_{m \cdot n}\right).
\]
As the regularizers are strictly convex, the $\lambda_v$ are positive, so that for each of the above equations the
first summand is positive definite and the second summand is positive semi-definite.
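This decomposition is easy to verify numerically for any positive definite Hessian $H$: with $\lambda = \lambda_{\min}(H)$, the residual $H - \lambda I$ is positive semi-definite by construction (the matrix below is an arbitrary example):

```python
import numpy as np

# Decompose a regularizer Hessian H as lambda*I + (H - lambda*I) with
# lambda = lambda_min(H): the first summand is positive definite (lambda > 0
# for a strongly convex regularizer) and the residual is positive semi-definite.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
lam = np.min(np.linalg.eigvalsh(H))
residual = H - lam * np.eye(2)
```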
Proposition 4 The pseudo-Jacobian has the representation
\[
J_r(w, \dot x) = J_r^{(1)}(w, \dot x) + J_r^{(2)}(w, \dot x) + J_r^{(3)}(w, \dot x), \tag{13}
\]
where
\[
J_r^{(1)}(w, \dot x) = \Lambda_r
\begin{bmatrix}
\dot X & 0 & \cdots & 0 \\
0 & W_1 & \cdots & W_n
\end{bmatrix}^{\mathsf T}
\begin{bmatrix}
\Gamma_{-1} & \Gamma_{-1} \\
\Gamma_{+1} & \Gamma_{+1}
\end{bmatrix}
\begin{bmatrix}
\dot X & 0 & \cdots & 0 \\
0 & W_1 & \cdots & W_n
\end{bmatrix},
\]
\[
J_r^{(2)}(w, \dot x) = \Lambda_r
\begin{bmatrix}
\rho_{-1} \lambda_{-1} I_m & c_{-1,1} \ell'_{-1,1} I_m & \cdots & c_{-1,n} \ell'_{-1,n} I_m \\
c_{+1,1} \ell'_{+1,1} I_m & \rho_{+1} \lambda_{+1} I_m & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
c_{+1,n} \ell'_{+1,n} I_m & 0 & \cdots & \rho_{+1} \lambda_{+1} I_m
\end{bmatrix},
\]
\[
J_r^{(3)}(w, \dot x) = \Lambda_r
\begin{bmatrix}
\rho_{-1} \nabla^2_{w,w} \hat\Omega_{-1}(w) - \rho_{-1} \lambda_{-1} I_m & 0 \\
0 & \rho_{+1} \nabla^2_{\dot x,\dot x} \hat\Omega_{+1}(x, \dot x) - \rho_{+1} \lambda_{+1} I_{m \cdot n}
\end{bmatrix}.
\]
The above proposition restates the pseudo-Jacobian as a sum of the three matrices $J_r^{(1)}(w, \dot x)$,
$J_r^{(2)}(w, \dot x)$, and $J_r^{(3)}(w, \dot x)$. Matrix $J_r^{(1)}(w, \dot x)$ contains all $\ell''_{v,i}$ terms, $J_r^{(2)}(w, \dot x)$ is a composition of
scaled identity matrices, and $J_r^{(3)}(w, \dot x)$ contains the Hessians of the regularizers, where the diagonal
entries are reduced by $\rho_{-1} \lambda_{-1}$ and $\rho_{+1} \lambda_{+1}$, respectively. We further analyze these matrices in the
following section.
3.2.3 DEFINITENESS OF THE SUMMANDS OF THE PSEUDO-JACOBIAN
Recall that we want to investigate whether the pseudo-Jacobian $J_r(w, \dot x)$ is positive definite for
each pair of actions $(w, \dot x) \in W \times \phi(X)^n$. A sufficient condition is that $J_r^{(1)}(w, \dot x)$, $J_r^{(2)}(w, \dot x)$, and
$J_r^{(3)}(w, \dot x)$ are positive semi-definite and at least one of these matrices is positive definite. From
the definition of $\lambda_v$, it becomes apparent that $J_r^{(3)}$ is positive semi-definite. In addition, $J_r^{(2)}(w, \dot x)$
obviously becomes positive definite for sufficiently large $\rho_v$ as, in this case, the main diagonal
dominates the non-diagonal entries. Finally, $J_r^{(1)}(w, \dot x)$ becomes positive semi-definite under some
mild conditions on the loss functions.
In the following, we derive these conditions, state lower bounds on the regularization parameters
$\rho_v$, and provide formal proofs of the above claims. To this end, we make the following assumptions
on the loss functions $\ell_v$ and the regularizers $\hat\Omega_v$ for $v \in \{-1,+1\}$. Instances of these functions
satisfying Assumptions 2 and 3 will be given in Section 5. A discussion of the practical implications
of these assumptions is given in the subsequent section.
Assumption 3 For all w W and
˙
x φ(X )
n
with
˙
x =
˙
x
T
1
,...,
˙
x
T
n
T
the following conditions are
satisfied:
1. the second derivatives of the loss functions are equal for all $y \in Y$ and $i = 1,\ldots,n$,
\[
\ell''_{-1}(f_w(\dot{x}_i), y) = \ell''_{+1}(f_w(\dot{x}_i), y),
\]
2. the players' regularization parameters satisfy
\[
\rho_{-1}\rho_{+1} > \tau^2 \frac{1}{\lambda_{-1}\lambda_{+1}} c_{-1}^T c_{+1},
\]
where $\lambda_{-1}$, $\lambda_{+1}$ are the smallest eigenvalues of the Hessians of the regularizers specified in (11) and (12), $c_v = [c_{v,1}, c_{v,2}, \ldots, c_{v,n}]^T$, and
\[
\tau = \sup_{(x,y) \in \phi(X) \times Y} \frac{1}{2} \left| \ell'_{-1}(f_w(x), y) + \ell'_{+1}(f_w(x), y) \right|, \tag{14}
\]
3. for all $i = 1,\ldots,n$ either both players have equal instance-specific cost factors, $c_{-1,i} = c_{+1,i}$, or the partial derivative $\nabla_{\dot{x}_i} \hat{\Omega}_{+1}(x,\dot{x})$ of the data generator's regularizer is independent of $\dot{x}_j$ for all $j \neq i$.
Notice that $\tau$ in Equation 14 can be chosen to be finite as the set $\phi(X) \times Y$ is assumed to be compact, and consequently, the values of both continuous mappings $\ell'_{-1}(f_w(x), y)$ and $\ell'_{+1}(f_w(x), y)$ are finite for all $(x,y) \in \phi(X) \times Y$.
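Once concrete loss functions are fixed, $\tau$ can be estimated numerically. The following sketch assumes the logistic losses introduced in Section 5.2 (the grid and all function names are ours) and confirms that the supremum stays strictly below 1:

```python
import numpy as np

# First derivatives of the logistic losses from Section 5.2:
# l'_{-1}(z, y) = -y / (1 + e^{yz})  and  l'_{+1}(z, y) = 1 / (1 + e^{-z}).
def dloss_learner(z, y):
    return -y / (1.0 + np.exp(y * z))

def dloss_generator(z, y):
    return 1.0 / (1.0 + np.exp(-z))

# tau = sup 0.5 * |l'_{-1} + l'_{+1}|, estimated on a grid of decision
# values z for both labels; for these losses it stays strictly below 1.
z = np.linspace(-10.0, 10.0, 2001)
tau = max(
    np.max(0.5 * np.abs(dloss_learner(z, y) + dloss_generator(z, y)))
    for y in (-1.0, +1.0)
)
print(tau)  # just below 1 (attained for y = -1 and large z)
```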
Lemma 5 Let $(w,\dot{x}) \in W \times \phi(X)^n$ be arbitrarily given. Under Assumptions 2 and 3, the matrix $J_r^{(1)}(w,\dot{x})$ is symmetric positive semi-definite (but not positive definite) for $\Lambda_r$ defined as in Equation 8.
Proof. The special structure of $\Lambda_r$, $\dot{X}$, and $W_i$ gives
\[
J_r^{(1)}(w,\dot{x}) =
\begin{bmatrix} \dot{X} & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}^T
\begin{bmatrix} r_0 \Gamma_{-1} & r_0 \Gamma_{-1} \\ \Upsilon \Gamma_{+1} & \Upsilon \Gamma_{+1} \end{bmatrix}
\begin{bmatrix} \dot{X} & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix},
\]
with $\Upsilon := \mathrm{diag}(r_1, \ldots, r_n)$. From the assumption $\ell''_{-1,i} = \ell''_{+1,i}$ and the definition $r_0 = 1$, $r_i = \frac{c_{-1,i}}{c_{+1,i}} > 0$ for all $i = 1,\ldots,n$, it follows that $\Gamma_{-1} = \Upsilon \Gamma_{+1}$, such that
\[
J_r^{(1)}(w,\dot{x}) =
\begin{bmatrix} \dot{X} & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix}^T
\begin{bmatrix} \Gamma_{-1} & \Gamma_{-1} \\ \Gamma_{-1} & \Gamma_{-1} \end{bmatrix}
\begin{bmatrix} \dot{X} & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & W_n \end{bmatrix},
\]
which is obviously a symmetric matrix. Furthermore, we show that $z^T J_r^{(1)}(w,\dot{x}) z \geq 0$ holds for all vectors $z \in \mathbb{R}^{m + m \cdot n}$. To this end, let $z$ be arbitrarily given, and partition this vector in $z = [z_0^T, z_1^T, \ldots, z_n^T]^T$ with $z_i \in \mathbb{R}^m$ for all $i = 0,1,\ldots,n$. Then a simple calculation shows that
\[
z^T J_r^{(1)}(w,\dot{x}) z = \sum_{i=1}^n \left( z_0^T \dot{x}_i + z_i^T w \right)^2 c_{-1,i} \ell''_{-1,i} \geq 0
\]
since $\ell''_{-1,i} \geq 0$ for all $i = 1,\ldots,n$ in view of the assumed convexity of mapping $\ell_{-1}(z,y)$. Hence, $J_r^{(1)}(w,\dot{x})$ is positive semi-definite. This matrix cannot be positive definite since we have $z^T J_r^{(1)}(w,\dot{x}) z = 0$ for the particular vector $z$ defined by $z_0 := -w$ and $z_i := \dot{x}_i$ for all $i = 1,\ldots,n$.
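The structure of $J_r^{(1)}$ lends itself to a quick numerical check. The sketch below is a hypothetical toy setup (random data and random curvature terms $c_{-1,i}\ell''_{-1,i}$; all names are ours) that assembles a matrix with exactly the quadratic form from the proof and confirms both positive semi-definiteness and the null vector:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                      # feature dimension and number of instances
w = rng.normal(size=m)           # learner's weight vector
X_dot = rng.normal(size=(n, m))  # rows play the role of the transformed x_i
curv = rng.uniform(0.1, 1.0, n)  # per-instance terms c_{-1,i} * l''_{-1,i} >= 0

# Row i of C extracts z_0^T x_i + z_i^T w from z = [z_0^T, z_1^T, ..., z_n^T]^T,
# so J1 = C^T diag(curv) C has the quadratic form derived in the proof.
C = np.zeros((n, m + m * n))
for i in range(n):
    C[i, :m] = X_dot[i]
    C[i, m + i * m: m + (i + 1) * m] = w
J1 = C.T @ np.diag(curv) @ C

eigvals = np.linalg.eigvalsh(J1)              # symmetric, so eigvalsh applies
z_null = np.concatenate([-w, X_dot.ravel()])  # z_0 = -w, z_i = x_i
print(eigvals.min())                          # >= 0 up to rounding: PSD
print(z_null @ J1 @ z_null)                   # ~0: not positive definite
```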
Lemma 6 Let $(w,\dot{x}) \in W \times \phi(X)^n$ be arbitrarily given. Under Assumptions 2 and 3, the matrix $J_r^{(2)}(w,\dot{x})$ is positive definite for $\Lambda_r$ defined as in Equation 8.
Proof. A sufficient and necessary condition for the (possibly asymmetric) matrix $J_r^{(2)}(w,\dot{x})$ to be positive definite is that the Hermitian matrix
\[
H(w,\dot{x}) := J_r^{(2)}(w,\dot{x}) + J_r^{(2)}(w,\dot{x})^T
\]
is positive definite, that is, all eigenvalues of $H(w,\dot{x})$ are positive. Let $\Lambda_r^{\frac{1}{2}}$ denote the square root of $\Lambda_r$, which is defined in such a way that the diagonal elements of $\Lambda_r^{\frac{1}{2}}$ are the square roots of the corresponding diagonal elements of $\Lambda_r$. Furthermore, we denote by $\Lambda_r^{-\frac{1}{2}}$ the inverse of $\Lambda_r^{\frac{1}{2}}$. Then, by Sylvester's law of inertia, the matrix
\[
\bar{H}(w,\dot{x}) := \Lambda_r^{-\frac{1}{2}} H(w,\dot{x}) \Lambda_r^{-\frac{1}{2}}
\]
has the same number of positive, zero, and negative eigenvalues as matrix $H(w,\dot{x})$ itself.
Hence, $J_r^{(2)}(w,\dot{x})$ is positive definite if, and only if, all eigenvalues of
\begin{align*}
\bar{H}(w,\dot{x}) &= \Lambda_r^{-\frac{1}{2}} \left( J_r^{(2)}(w,\dot{x}) + J_r^{(2)}(w,\dot{x})^T \right) \Lambda_r^{-\frac{1}{2}} \\
&= \Lambda_r^{-\frac{1}{2}} \Lambda_r
\begin{bmatrix}
\rho_{-1}\lambda_{-1} I_m & c_{-1,1}\ell'_{-1,1} I_m & \cdots & c_{-1,n}\ell'_{-1,n} I_m \\
c_{+1,1}\ell'_{+1,1} I_m & \rho_{+1}\lambda_{+1} I_m & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
c_{+1,n}\ell'_{+1,n} I_m & 0 & \cdots & \rho_{+1}\lambda_{+1} I_m
\end{bmatrix}
\Lambda_r^{-\frac{1}{2}}
+ \Lambda_r^{-\frac{1}{2}}
\begin{bmatrix}
\rho_{-1}\lambda_{-1} I_m & c_{+1,1}\ell'_{+1,1} I_m & \cdots & c_{+1,n}\ell'_{+1,n} I_m \\
c_{-1,1}\ell'_{-1,1} I_m & \rho_{+1}\lambda_{+1} I_m & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
c_{-1,n}\ell'_{-1,n} I_m & 0 & \cdots & \rho_{+1}\lambda_{+1} I_m
\end{bmatrix}
\Lambda_r \Lambda_r^{-\frac{1}{2}} \\
&=
\begin{bmatrix}
2\rho_{-1}\lambda_{-1} I_m & \tilde{c}_1 I_m & \cdots & \tilde{c}_n I_m \\
\tilde{c}_1 I_m & 2\rho_{+1}\lambda_{+1} I_m & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\tilde{c}_n I_m & 0 & \cdots & 2\rho_{+1}\lambda_{+1} I_m
\end{bmatrix}
\end{align*}
are positive, where $\tilde{c}_i := \sqrt{c_{-1,i} c_{+1,i}} \left( \ell'_{-1,i} + \ell'_{+1,i} \right)$. Each eigenvalue $\lambda$ of this matrix satisfies
\[
\left( \bar{H}(w,\dot{x}) - \lambda I_{m + m \cdot n} \right) v = 0
\]
for the corresponding eigenvector $v^T = [v_0^T, v_1^T, \ldots, v_n^T]$ with $v_i \in \mathbb{R}^m$ for $i = 0,1,\ldots,n$. This eigenvalue equation can be rewritten block-wise as
\begin{align}
(2\rho_{-1}\lambda_{-1} - \lambda) v_0 + \sum_{i=1}^n \tilde{c}_i v_i &= 0, \tag{15} \\
(2\rho_{+1}\lambda_{+1} - \lambda) v_i + \tilde{c}_i v_0 &= 0 \quad \forall i = 1,\ldots,n. \tag{16}
\end{align}
To compute all possible eigenvalues, we consider two cases. First, assume that $v_0 = 0$. Then (15) and (16) reduce to
\[
\sum_{i=1}^n \tilde{c}_i v_i = 0 \quad \text{and} \quad (2\rho_{+1}\lambda_{+1} - \lambda) v_i = 0 \quad \forall i = 1,\ldots,n.
\]
Since $v_0 = 0$ and eigenvector $v \neq 0$, at least one $v_i$ is non-zero. This implies that $\lambda = 2\rho_{+1}\lambda_{+1}$ is an eigenvalue. Using the fact that the null space of the linear mapping $v \mapsto \sum_{i=1}^n \tilde{c}_i v_i$ has dimension $(n-1) \cdot m$ (we have $n \cdot m$ degrees of freedom counting all components of $v_1, \ldots, v_n$ and $m$ equations in $\sum_{i=1}^n \tilde{c}_i v_i = 0$), it follows that $\lambda = 2\rho_{+1}\lambda_{+1}$ is an eigenvalue of multiplicity $(n-1) \cdot m$.
Now we consider the second case where $v_0 \neq 0$. We may further assume that $\lambda \neq 2\rho_{+1}\lambda_{+1}$ (since otherwise we get the same eigenvalue as before, just with a different multiplicity). We then get from (16) that
\[
v_i = -\frac{\tilde{c}_i}{2\rho_{+1}\lambda_{+1} - \lambda} v_0 \quad \forall i = 1,\ldots,n, \tag{17}
\]
and when substituting this expression into (15), we obtain
\[
\left( (2\rho_{-1}\lambda_{-1} - \lambda) - \sum_{i=1}^n \frac{\tilde{c}_i^2}{2\rho_{+1}\lambda_{+1} - \lambda} \right) v_0 = 0.
\]
Taking into account that $v_0 \neq 0$, this implies
\[
0 = (2\rho_{-1}\lambda_{-1} - \lambda) - \frac{1}{2\rho_{+1}\lambda_{+1} - \lambda} \sum_{i=1}^n \tilde{c}_i^2
\]
and, therefore,
\[
0 = \lambda^2 - 2(\rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1})\lambda + 4\rho_{-1}\rho_{+1}\lambda_{-1}\lambda_{+1} - \sum_{i=1}^n \tilde{c}_i^2.
\]
The roots of this quadratic equation are
\[
\lambda = \rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1} \pm \sqrt{(\rho_{-1}\lambda_{-1} - \rho_{+1}\lambda_{+1})^2 + \sum_{i=1}^n \tilde{c}_i^2}, \tag{18}
\]
and these are the remaining eigenvalues of $\bar{H}(w,\dot{x})$, each of multiplicity $m$ since there are precisely $m$ linearly independent vectors $v_0 \neq 0$ whereas the other vectors $v_i$ ($i = 1,\ldots,n$) are uniquely defined by (17) in this case. In particular, this implies that the dimensions of all three eigenspaces together is $(n-1)m + m + m = (n+1)m$; hence, other eigenvalues cannot exist. Since the eigenvalue $\lambda = 2\rho_{+1}\lambda_{+1}$ is positive by Remark 3, it remains to show that the roots in (18) are positive as well. By Assumption 3, we have
\[
\sum_{i=1}^n \tilde{c}_i^2 = \sum_{i=1}^n c_{-1,i} c_{+1,i} \left( \ell'_{-1,i} + \ell'_{+1,i} \right)^2 \leq 4\tau^2 c_{-1}^T c_{+1} < 4\rho_{-1}\rho_{+1}\lambda_{-1}\lambda_{+1},
\]
where $c_v = [c_{v,1}, c_{v,2}, \ldots, c_{v,n}]^T$. This inequality and Equation 18 give
\[
\lambda = \rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1} \pm \sqrt{(\rho_{-1}\lambda_{-1} - \rho_{+1}\lambda_{+1})^2 + \sum_{i=1}^n \tilde{c}_i^2}
> \rho_{-1}\lambda_{-1} + \rho_{+1}\lambda_{+1} - \sqrt{(\rho_{-1}\lambda_{-1} - \rho_{+1}\lambda_{+1})^2 + 4\rho_{-1}\rho_{+1}\lambda_{-1}\lambda_{+1}} = 0.
\]
As all eigenvalues of $\bar{H}(w,\dot{x})$ are positive, matrix $H(w,\dot{x})$ and, consequently, also the matrix $J_r^{(2)}(w,\dot{x})$ are positive definite.
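The spectrum derived above can be verified numerically. The sketch below (toy dimensions and random coupling terms; our own setup, not the paper's implementation, with `a` and `b` standing for $\rho_{-1}\lambda_{-1}$ and $\rho_{+1}\lambda_{+1}$) assembles the arrowhead matrix $\bar{H}$ and compares its eigenvalues with the predicted eigenvalue $2\rho_{+1}\lambda_{+1}$ and the two roots in (18):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 2, 4
a = 1.3                          # stands for rho_{-1} * lambda_{-1}
b = 0.7                          # stands for rho_{+1} * lambda_{+1}
c = rng.uniform(-1.0, 1.0, n)    # the coupling terms c~_i

# Assemble the arrowhead block matrix H_bar of size (n+1)m x (n+1)m.
I = np.eye(m)
H = np.zeros(((n + 1) * m, (n + 1) * m))
H[:m, :m] = 2 * a * I
for i in range(1, n + 1):
    H[i*m:(i+1)*m, i*m:(i+1)*m] = 2 * b * I
    H[:m, i*m:(i+1)*m] = c[i-1] * I
    H[i*m:(i+1)*m, :m] = c[i-1] * I

# Predicted spectrum: 2b with multiplicity (n-1)m, plus the two roots
# a + b +/- sqrt((a-b)^2 + sum_i c~_i^2), each with multiplicity m.
s = np.sqrt((a - b) ** 2 + np.sum(c ** 2))
predicted = np.sort(np.concatenate([
    np.full((n - 1) * m, 2 * b),
    np.full(m, a + b - s),
    np.full(m, a + b + s),
]))
computed = np.sort(np.linalg.eigvalsh(H))
print(np.max(np.abs(computed - predicted)))  # ~0
```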
Lemma 7 Let $(w,\dot{x}) \in W \times \phi(X)^n$ be arbitrarily given. Under Assumptions 2 and 3, the matrix $J_r^{(3)}(w,\dot{x})$ is positive semi-definite for $\Lambda_r$ defined as in Equation 8.
Proof. By Assumption 3, either both players have equal instance-specific costs, or the partial gradient $\nabla_{\dot{x}_i} \hat{\Omega}_{+1}(x,\dot{x})$ of the sender's regularizer is independent of $\dot{x}_j$ for all $j \neq i$ and $i = 1,\ldots,n$. Let us consider the first case where $c_{-1,i} = c_{+1,i}$, and consequently $r_i = 1$, for all $i = 1,\ldots,n$, such that
\[
J_r^{(3)}(w,\dot{x}) =
\begin{bmatrix}
\rho_{-1}\left( \nabla^2_{w,w} \hat{\Omega}_{-1}(w) - \lambda_{-1} I_m \right) & 0 \\
0 & \rho_{+1}\left( \nabla^2_{\dot{x},\dot{x}} \hat{\Omega}_{+1}(x,\dot{x}) - \lambda_{+1} I_{m \cdot n} \right)
\end{bmatrix}.
\]
The eigenvalues of this block diagonal matrix are the eigenvalues of the matrix $\rho_{-1}( \nabla^2_{w,w} \hat{\Omega}_{-1}(w) - \lambda_{-1} I_m )$ together with those of $\rho_{+1}( \nabla^2_{\dot{x},\dot{x}} \hat{\Omega}_{+1}(x,\dot{x}) - \lambda_{+1} I_{m \cdot n} )$. From the definition of $\lambda_v$ in (11) and (12), it follows that these matrices are positive semi-definite for $v \in \{-1,+1\}$. Hence, $J_r^{(3)}(w,\dot{x})$ is positive semi-definite as well.
Now, let us consider the second case where we assume that $\nabla_{\dot{x}_i} \hat{\Omega}_{+1}(x,\dot{x})$ is independent of $\dot{x}_j$ for all $j \neq i$. Hence, $\nabla^2_{\dot{x}_i,\dot{x}_j} \hat{\Omega}_{+1}(x,\dot{x}) = 0$ for all $j \neq i$ such that
\[
J_r^{(3)}(w,\dot{x}) =
\begin{bmatrix}
\rho_{-1} \tilde{\Omega}_{-1} & 0 & \cdots & 0 \\
0 & \rho_{+1} \frac{c_{-1,1}}{c_{+1,1}} \tilde{\Omega}_{+1,1} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \rho_{+1} \frac{c_{-1,n}}{c_{+1,n}} \tilde{\Omega}_{+1,n}
\end{bmatrix},
\]
where $\tilde{\Omega}_{-1} := \nabla^2_{w,w} \hat{\Omega}_{-1}(w) - \lambda_{-1} I_m$ and $\tilde{\Omega}_{+1,i} := \nabla^2_{\dot{x}_i,\dot{x}_i} \hat{\Omega}_{+1}(x,\dot{x}) - \lambda_{+1} I_m$. The eigenvalues of this block diagonal matrix are again the union of the eigenvalues of the single blocks $\rho_{-1} \tilde{\Omega}_{-1}$ and $\rho_{+1} \frac{c_{-1,i}}{c_{+1,i}} \tilde{\Omega}_{+1,i}$ for $i = 1,\ldots,n$. As in the first part of the proof, $\tilde{\Omega}_{-1}$ is positive semi-definite. The eigenvalues of $\nabla^2_{\dot{x},\dot{x}} \hat{\Omega}_{+1}(x,\dot{x})$ are the union of all eigenvalues of $\nabla^2_{\dot{x}_i,\dot{x}_i} \hat{\Omega}_{+1}(x,\dot{x})$. Hence, each of these eigenvalues is larger than or equal to $\lambda_{+1}$, and thus each block $\tilde{\Omega}_{+1,i}$ is positive semi-definite. The factors $\rho_{-1} > 0$ and $\rho_{+1} \frac{c_{-1,i}}{c_{+1,i}} > 0$ are multipliers that do not affect the definiteness of the blocks, and consequently, $J_r^{(3)}(w,\dot{x})$ is positive semi-definite as well.
The previous results guarantee the existence and uniqueness of a Nash equilibrium under the stated assumptions.

Theorem 8 Let Assumptions 2 and 3 hold. Then the Nash prediction game in (3) has a unique equilibrium.

Proof. The existence of an equilibrium of the Nash prediction game in (3) follows from Lemma 1. Proposition 4 and Lemmas 5 to 7 imply that there is a positive diagonal matrix $\Lambda_r$ such that $J_r(w,\dot{x})$ is positive definite for all $(w,\dot{x}) \in W \times \phi(X)^n$. Hence, the uniqueness follows from Lemma 2.
3.2.4 PRACTICAL IMPLICATIONS OF ASSUMPTIONS 2 AND 3
Theorem 8 guarantees the uniqueness of the equilibrium only if the cost functions of learner and data generator relate in a certain way that is defined by Assumption 3. In addition, each of the cost functions has to satisfy Assumption 2. This section discusses the practical implications of these assumptions.

The conditions of Assumption 2 impose rather technical limitations on the cost functions. The requirement of convexity is quite ordinary in the machine learning context. In addition, the loss function has to be twice continuously differentiable, which restricts the family of eligible loss functions. However, this condition can still be met easily; for instance, by smoothed versions of the hinge loss. The second requirement of uniformly strongly convex and twice continuously differentiable regularizers is, again, only a weak restriction in practice. These requirements are met by standard regularizers; they occur, for instance, in the optimization criteria of SVMs and logistic regression. The requirement of non-empty, compact, and convex action spaces may be a restriction when dealing with binary or multinomial attributes. However, relaxing the action spaces of the data generator would typically result in a strategy that is more defensive than would be optimal but still less defensive than a worst-case strategy.
The first condition of Assumption 3 requires the cost functions of learner and data generator to have the same curvatures. This is a crucial restriction; if the cost functions differ arbitrarily, the Nash equilibrium may not be unique. The requirement of identical curvatures is met, for instance, if one player chooses a loss function $\ell(f_w(\dot{x}_i), y)$ which only depends on the term $y f_w(\dot{x}_i)$, such as the SVM's hinge loss or the logistic loss. In this case, the condition is met when the other player chooses the loss $\ell(-f_w(\dot{x}_i), y)$. This loss is in some sense the opposite of $\ell(f_w(\dot{x}_i), y)$ as it approaches zero when the other goes to infinity and vice versa. In this case, the cost functions may still be non-antagonistic because the players' cost functions may contain instance-specific cost factors $c_{v,i}$ that can be modeled independently for the players.
The second part of Assumption 3 couples the degree of regularization of the players. If the data generator produces instances at application time that differ greatly from the instances at training time, then the learner is required to regularize strongly for a unique equilibrium to exist. If the distributions at training and application time are more similar, the equilibrium is unique for smaller values of the learner's regularization parameters. This requirement is in line with the intuition that when the training instances are a poor approximation of the distribution at application time, then imposing only weak regularization on the loss function will result in a poor model.

The final requirement of Assumption 3 is, again, rather a technical limitation. It states that the interdependencies between the players' instance-specific costs must be captured either by the regularizers, leading to a full Hessian, or by cost factors. These cost factors of learner and data generator may differ arbitrarily if the gradient of the data generator's costs of transforming an instance $x_i$ into $\dot{x}_i$ is independent of all other instances $\dot{x}_j$ with $j \neq i$. This is met, for instance, by cost models that only depend on some measure of the distance between $x_i$ and $\dot{x}_i$.
4. Finding the Unique Nash Equilibrium
According to Theorem 8, a unique equilibrium of the Nash prediction game in (3) exists for suitable
loss functions and regularizers. To find this equilibrium, we derive and study two distinct methods:
The first is based on the Nikaido-Isoda function that is constructed such that a minimax solution of
this function is an equilibrium of the Nash prediction game and vice versa. This problem is then
solved by inexact linesearch. In the second approach, we reformulate the Nash prediction game into
a variational inequality problem which is solved by a modified extragradient method.
The data generator's action of transforming the input distribution manifests in a concatenation of transformed training instances $\dot{x} \in \phi(X)^n$ mapped into the feature space by $\dot{x}_i := \phi(\dot{x}_i)$ for $i = 1,\ldots,n$, and the learner's action is to choose the weight vector $w \in W$ of classifier $h(x) = \mathrm{sign}\, f_w(x)$ with linear decision function $f_w(x) = w^T \phi(x)$.
4.1 An Inexact Linesearch Approach
To solve for a Nash equilibrium, we again consider the game from (4) with one learner and $n$ data generators. A solution of this game can be identified with the help of the weighted Nikaido-Isoda function (Equation 19). For any two combinations of actions $(w,\dot{x}) \in W \times \phi(X)^n$ and $(w',\dot{x}') \in W \times \phi(X)^n$ with $\dot{x} = [\dot{x}_1^T, \ldots, \dot{x}_n^T]^T$ and $\dot{x}' = [\dot{x}_1'^T, \ldots, \dot{x}_n'^T]^T$, this function is the weighted sum of relative cost savings that the $n+1$ players can enjoy by changing from strategy $w$ to $w'$ and $\dot{x}_i$ to $\dot{x}_i'$, respectively, while the other players continue to play according to $(w,\dot{x})$; that is,
\[
\vartheta_r(w,\dot{x},w',\dot{x}') := r_0 \left( \hat{\theta}_{-1}(w,\dot{x}) - \hat{\theta}_{-1}(w',\dot{x}) \right) + \sum_{i=1}^n r_i \left( \hat{\theta}_{+1}(w,\dot{x}) - \hat{\theta}_{+1}(w,\dot{x}^{(i)}) \right), \tag{19}
\]
where $\dot{x}^{(i)} := [\dot{x}_1^T, \ldots, \dot{x}_i'^T, \ldots, \dot{x}_n^T]^T$. Let us denote the weighted sum of greatest possible cost savings with respect to any given combination of actions $(w,\dot{x}) \in W \times \phi(X)^n$ by
\[
\bar{\vartheta}_r(w,\dot{x}) := \max_{(w',\dot{x}') \in W \times \phi(X)^n} \vartheta_r(w,\dot{x},w',\dot{x}'), \tag{20}
\]
where $(\bar{w}(w,\dot{x}), \bar{x}(w,\dot{x}))$ denotes the corresponding pair of maximizers. Note that the maximum in (20) is attained for any $(w,\dot{x})$, since $W \times \phi(X)^n$ is assumed to be compact and $\vartheta_r(w,\dot{x},w',\dot{x}')$ is continuous in $(w',\dot{x}')$.
By these definitions, a combination $(w^*,\dot{x}^*)$ is an equilibrium of the Nash prediction game if, and only if, $\bar{\vartheta}_r(w^*,\dot{x}^*)$ is a global minimum of the mapping $\bar{\vartheta}_r$ with $\bar{\vartheta}_r(w^*,\dot{x}^*) = 0$ for any fixed weights $r_i > 0$ and $i = 0,\ldots,n$; see Proposition 2.1(b) of von Heusinger and Kanzow (2009). Equivalently, a Nash equilibrium simultaneously satisfies both equations $\bar{w}(w^*,\dot{x}^*) = w^*$ and $\bar{x}(w^*,\dot{x}^*) = \dot{x}^*$.
The significance of this observation is that the equilibrium problem in (3) can be reformulated into a minimization problem of the continuous mapping $\bar{\vartheta}_r(w,\dot{x})$. To solve this minimization problem, we make use of Corollary 3.4 (von Heusinger and Kanzow, 2009). We set the weights $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\ldots,n$ as in (8), which ensures the main condition of Corollary 3.4; that is, the positive definiteness of the Jacobian $J_r(w,\dot{x})$ in (13) (cf. proof of Theorem 8). According to this corollary, the vectors
\[
d_{-1}(w,\dot{x}) := \bar{w}(w,\dot{x}) - w \quad \text{and} \quad d_{+1}(w,\dot{x}) := \bar{x}(w,\dot{x}) - \dot{x}
\]
form a descent direction $d(w,\dot{x}) := [d_{-1}(w,\dot{x})^T, d_{+1}(w,\dot{x})^T]^T$ of $\bar{\vartheta}_r(w,\dot{x})$ at any position $(w,\dot{x}) \in W \times \phi(X)^n$ (except for the Nash equilibrium where $d(w^*,\dot{x}^*) = 0$), and consequently, there exists $t \in [0,1]$ such that
\[
\bar{\vartheta}_r(w + t d_{-1}(w,\dot{x}), \dot{x} + t d_{+1}(w,\dot{x})) < \bar{\vartheta}_r(w,\dot{x}).
\]
Since $(w,\dot{x})$ and $(\bar{w}(w,\dot{x}), \bar{x}(w,\dot{x}))$ are feasible combinations of actions, the convexity of the action spaces ensures that $(w + t d_{-1}(w,\dot{x}), \dot{x} + t d_{+1}(w,\dot{x}))$ is a feasible combination for any $t \in [0,1]$ as well. The following algorithm exploits these properties.
Algorithm 1 ILS: Inexact Linesearch Solver for Nash Prediction Games
Require: Cost functions $\hat{\theta}_v$ as defined in (1) and (2), and action spaces $W$ and $\phi(X)^n$.
1: Select initial $w^{(0)} \in W$, set $\dot{x}^{(0)} := x$, set $k := 0$, and select $\sigma \in (0,1)$ and $\beta \in (0,1)$.
2: Set $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\ldots,n$.
3: repeat
4: Set $d_{-1}^{(k)} := \bar{w}^{(k)} - w^{(k)}$ where $\bar{w}^{(k)} := \arg\max_{w' \in W} \vartheta_r\left(w^{(k)}, \dot{x}^{(k)}, w', \dot{x}^{(k)}\right)$.
5: Set $d_{+1}^{(k)} := \bar{x}^{(k)} - \dot{x}^{(k)}$ where $\bar{x}^{(k)} := \arg\max_{\dot{x}' \in \phi(X)^n} \vartheta_r\left(w^{(k)}, \dot{x}^{(k)}, w^{(k)}, \dot{x}'\right)$.
6: Find maximal step size $t^{(k)} \in \left\{\beta^l \mid l \in \mathbb{N}\right\}$ with
\[
\bar{\vartheta}_r\left(w^{(k)}, \dot{x}^{(k)}\right) - \bar{\vartheta}_r\left(w^{(k)} + t^{(k)} d_{-1}^{(k)}, \dot{x}^{(k)} + t^{(k)} d_{+1}^{(k)}\right) \geq \sigma t^{(k)} \left( \left\|d_{-1}^{(k)}\right\|_2^2 + \left\|d_{+1}^{(k)}\right\|_2^2 \right).
\]
7: Set $w^{(k+1)} := w^{(k)} + t^{(k)} d_{-1}^{(k)}$.
8: Set $\dot{x}^{(k+1)} := \dot{x}^{(k)} + t^{(k)} d_{+1}^{(k)}$.
9: Set $k := k + 1$.
10: until $\left\|w^{(k)} - w^{(k-1)}\right\|_2^2 + \left\|\dot{x}^{(k)} - \dot{x}^{(k-1)}\right\|_2^2 \leq \varepsilon$.
The convergence properties of Algorithm 1 are discussed by von Heusinger and Kanzow (2009),
so we skip the details here.
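To illustrate the mechanics of Algorithm 1, the sketch below runs the inexact linesearch on a hypothetical two-player game with scalar actions, quadratic costs, and box constraints (the toy game and all names are our own assumptions, not the paper's spam-filtering setup). The inner maximizations of lines 4 and 5 reduce to clipped closed-form best responses here.

```python
import numpy as np

# Toy game: theta_1(w, x) = (w - x - 1)^2, theta_2(w, x) = (w + x)^2 on the
# box [-2, 2]^2; its unique Nash equilibrium is (w*, x*) = (0.5, -0.5).
clip = lambda a: float(np.clip(a, -2.0, 2.0))

def vartheta_bar(w, x):
    # Greatest possible cost savings (Equation 20); both inner maximizations
    # are attained at the clipped best responses w_bar = x + 1 and x_bar = -w.
    theta1 = lambda w_, x_: (w_ - x_ - 1.0) ** 2
    theta2 = lambda w_, x_: (w_ + x_) ** 2
    return (theta1(w, x) - theta1(clip(x + 1.0), x)
            + theta2(w, x) - theta2(w, clip(-w)))

w, x = 0.0, 0.0
sigma, beta = 0.1, 0.5
for _ in range(100):
    d_w = clip(x + 1.0) - w            # line 4: best response minus action
    d_x = clip(-w) - x                 # line 5
    t = 1.0                            # line 6: backtracking linesearch
    while (vartheta_bar(w, x) - vartheta_bar(w + t * d_w, x + t * d_x)
           < sigma * t * (d_w ** 2 + d_x ** 2)):
        t *= beta
        if t < 1e-12:                  # numerical safeguard near the equilibrium
            t = 0.0
            break
    w, x = w + t * d_w, x + t * d_x    # lines 7 and 8
print(w, x)  # converges to (0.5, -0.5)
```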
4.2 A Modified Extragradient Approach
In Algorithm 1, lines 4 and 5, as well as the linesearch in line 6, require solving a concave maximization problem within each iteration. As this may become computationally demanding, we derive a second approach based on extragradient descent. Instead of reformulating the equilibrium problem into a minimax problem, we directly address the first-order optimality conditions of each player's minimization problem in (4): Under Assumption 2, a combination of actions $(w^*,\dot{x}^*)$ with $\dot{x}^* = [\dot{x}_1^{*T}, \ldots, \dot{x}_n^{*T}]^T$ satisfies each player's first-order optimality conditions if, and only if, for all $(w,\dot{x}) \in W \times \phi(X)^n$ the following inequalities hold:
\begin{align*}
\nabla_w \hat{\theta}_{-1}(w^*,\dot{x}^*)^T (w - w^*) &\geq 0, \\
\nabla_{\dot{x}_i} \hat{\theta}_{+1}(w^*,\dot{x}^*)^T (\dot{x}_i - \dot{x}_i^*) &\geq 0 \quad \forall i = 1,\ldots,n.
\end{align*}
As the joint action space of all players $W \times \phi(X)^n$ is precisely the full Cartesian product of the learner's action set $W$ and the $n$ data generators' action sets $\phi(X)$, the (weighted) sum of those individual optimality conditions is also a sufficient and necessary optimality condition for the equilibrium problem. Hence, a Nash equilibrium $(w^*,\dot{x}^*) \in W \times \phi(X)^n$ is a solution of the variational inequality problem
\[
g_r(w^*,\dot{x}^*)^T \left( \begin{bmatrix} w \\ \dot{x} \end{bmatrix} - \begin{bmatrix} w^* \\ \dot{x}^* \end{bmatrix} \right) \geq 0 \quad \forall (w,\dot{x}) \in W \times \phi(X)^n \tag{21}
\]
and vice versa (cf. Proposition 7.1 of Harker and Pang, 1990). The pseudo-gradient $g_r$ in (21) is defined as in (5) with fixed vector $r = [r_0, r_1, \ldots, r_n]^T$ where $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\ldots,n$ (cf. Equation 8). Under Assumption 3, this choice of $r$ ensures that the mapping $g_r(w,\dot{x})$ is continuous and strictly monotone (cf. proof of Lemma 2 and Theorem 8). Hence, the variational inequality problem in (21) can be solved by modified extragradient descent (see, for instance, Chapter 7.2.3 of Geiger and Kanzow, 1999). Before presenting Algorithm 2, which is an extragradient-based algorithm for the Nash prediction game, let us denote the $l_2$-projection of $a$ into the non-empty, compact, and convex set $A$ by
\[
\Pi_A(a) := \arg\min_{a' \in A} \|a - a'\|_2^2.
\]
Notice that if $A := \{a \in \mathbb{R}^m \mid \|a\|_2 \leq \kappa\}$ is the closed $l_2$-ball of radius $\kappa > 0$ and $a \notin A$, this projection simply reduces to a rescaling of vector $a$ to length $\kappa$.
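For the ball case, the projection can be written out directly; a minimal sketch (the function name is ours):

```python
import numpy as np

def project_l2_ball(a, kappa):
    """L2-projection of a onto {a' : ||a'||_2 <= kappa}; outside the ball
    this is just a rescaling of a to length kappa."""
    norm = np.linalg.norm(a)
    return a if norm <= kappa else (kappa / norm) * a

a = np.array([3.0, 4.0])           # norm 5, outside the unit ball
p = project_l2_ball(a, 1.0)
print(p, np.linalg.norm(p))        # direction kept, length rescaled to 1
```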
Based on this definition of $\Pi_A$, we can now state an iterative method (Algorithm 2) which, apart from back-projection steps, does not require solving an optimization problem in each iteration. The proposed algorithm converges to a solution of the variational inequality problem in (21), that is, the unique equilibrium of the Nash prediction game, if Assumptions 2 and 3 hold (cf. Theorem 7.40 of Geiger and Kanzow, 1999).
Algorithm 2 EDS: Extragradient Descent Solver for Nash Prediction Games
Require: Cost functions $\hat{\theta}_v$ as defined in (1) and (2), and action spaces $W$ and $\phi(X)^n$.
1: Select initial $w^{(0)} \in W$, set $\dot{x}^{(0)} := x$, set $k := 0$, and select $\sigma \in (0,1)$ and $\beta \in (0,1)$.
2: Set $r_0 := 1$ and $r_i := \frac{c_{-1,i}}{c_{+1,i}}$ for all $i = 1,\ldots,n$.
3: repeat
4: Set $\begin{bmatrix} d_{-1}^{(k)} \\ d_{+1}^{(k)} \end{bmatrix} := \Pi_{W \times \phi(X)^n}\left( \begin{bmatrix} w^{(k)} \\ \dot{x}^{(k)} \end{bmatrix} - g_r\left(w^{(k)}, \dot{x}^{(k)}\right) \right) - \begin{bmatrix} w^{(k)} \\ \dot{x}^{(k)} \end{bmatrix}$.
5: Find maximal step size $t^{(k)} \in \left\{\beta^l \mid l \in \mathbb{N}\right\}$ with
\[
-g_r\left(w^{(k)} + t^{(k)} d_{-1}^{(k)}, \dot{x}^{(k)} + t^{(k)} d_{+1}^{(k)}\right)^T \begin{bmatrix} d_{-1}^{(k)} \\ d_{+1}^{(k)} \end{bmatrix} \geq \sigma \left( \left\|d_{-1}^{(k)}\right\|_2^2 + \left\|d_{+1}^{(k)}\right\|_2^2 \right).
\]
6: Set $\begin{bmatrix} \bar{w}^{(k)} \\ \bar{x}^{(k)} \end{bmatrix} := \begin{bmatrix} w^{(k)} \\ \dot{x}^{(k)} \end{bmatrix} + t^{(k)} \begin{bmatrix} d_{-1}^{(k)} \\ d_{+1}^{(k)} \end{bmatrix}$.
7: Set the step size of the extragradient
\[
\gamma^{(k)} := -t^{(k)} \frac{g_r\left(\bar{w}^{(k)}, \bar{x}^{(k)}\right)^T \begin{bmatrix} d_{-1}^{(k)} \\ d_{+1}^{(k)} \end{bmatrix}}{\left\| g_r\left(\bar{w}^{(k)}, \bar{x}^{(k)}\right) \right\|_2^2}.
\]
8: Set $\begin{bmatrix} w^{(k+1)} \\ \dot{x}^{(k+1)} \end{bmatrix} := \Pi_{W \times \phi(X)^n}\left( \begin{bmatrix} w^{(k)} \\ \dot{x}^{(k)} \end{bmatrix} - \gamma^{(k)} g_r\left(\bar{w}^{(k)}, \bar{x}^{(k)}\right) \right)$.
9: Set $k := k + 1$.
10: until $\left\|w^{(k)} - w^{(k-1)}\right\|_2^2 + \left\|\dot{x}^{(k)} - \dot{x}^{(k-1)}\right\|_2^2 \leq \varepsilon$.
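For intuition about the extragradient machinery, the sketch below applies a plain extragradient iteration with a fixed step size, a simplification of Algorithm 2's adaptive step-size rule, to a toy strictly monotone game (the pseudo-gradient, box, and step size are our illustrative assumptions):

```python
import numpy as np

def g(v):
    # Pseudo-gradient of the toy game with costs (w - x - 1)^2 and (w + x)^2;
    # strictly monotone, with unique equilibrium (w*, x*) = (0.5, -0.5).
    w, x = v
    return np.array([2.0 * (w - x - 1.0), 2.0 * (w + x)])

proj = lambda v: np.clip(v, -2.0, 2.0)   # projection onto the box [-2, 2]^2

v = np.zeros(2)
gamma = 0.2                              # fixed step size below 1/L for this g
for _ in range(300):
    v_bar = proj(v - gamma * g(v))       # prediction (extragradient) step
    v = proj(v - gamma * g(v_bar))       # correction step
print(v)  # close to [0.5, -0.5]
```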
5. Instances of the Nash Prediction Game
In this section, we present two instances of the Nash prediction game and investigate under which conditions those games possess unique Nash equilibria. We start by specifying both players' loss function and regularizer. An obvious choice for the loss function of the learner $\ell_{-1}(z,y)$ is the zero-one loss defined by
\[
\ell_{0/1}(z,y) := \begin{cases} 1, & \text{if } yz < 0 \\ 0, & \text{if } yz \geq 0 \end{cases}.
\]
A possible choice for the data generator's loss is $\ell_{0/1}(z,-1)$, which penalizes positive decision values $z$ independently of the class label. The rationale behind this choice is that the data generator experiences costs when the learner blocks an event, that is, assigns an instance to the positive class. For instance, a legitimate email sender experiences costs when a legitimate email is erroneously blocked, just like an abusive sender, also amalgamated into the data generator, experiences costs when spam messages are blocked. However, the zero-one loss violates Assumption 2 as it is neither convex nor twice continuously differentiable. In the following sections, we therefore approximate the zero-one loss by the logistic loss and a newly derived trigonometric loss, which both satisfy Assumption 2.
Recall that $\hat{\Omega}_{+1}(D,\dot{D})$ is an estimate of the transformation costs that the data generator incurs when transforming the distribution that generates the instances $x_i$ at training time into the distribution that generates the instances $\dot{x}_i$ at application time. In our analysis, we approximate these costs by the average squared $l_2$-distance between $x_i$ and $\dot{x}_i$ in the feature space induced by mapping $\phi$, that is,
\[
\hat{\Omega}_{+1}(D,\dot{D}) := \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left\| \phi(\dot{x}_i) - \phi(x_i) \right\|_2^2. \tag{22}
\]
The learner's regularizer $\hat{\Omega}_{-1}(w)$ penalizes the complexity of the predictive model $h(x) = \mathrm{sign}\, f_w(x)$. We consider Tikhonov regularization, which, for linear decision functions $f_w$, reduces to the squared $l_2$-norm of $w$,
\[
\hat{\Omega}_{-1}(w) := \frac{1}{2} \|w\|_2^2. \tag{23}
\]
Before presenting the Nash logistic regression (NLR) and the Nash support vector machine (NSVM),
we turn to a discussion on the applicability of general kernel functions.
5.1 Applying Kernels
So far, we assumed knowledge of the feature mapping $\phi : X \to \phi(X)$ such that we can compute an explicit feature representation $\phi(x_i)$ of the training instances $x_i$ for all $i = 1,\ldots,n$. However, in some applications, such a feature mapping is unwieldy or hard to identify. Instead, one is often equipped with a kernel function $k : X \times X \to \mathbb{R}$ which measures the similarity between two instances. Generally, the kernel function $k$ is assumed to be a positive-semidefinite kernel such that it can be stated in terms of a scalar product in the corresponding reproducing kernel Hilbert space, that is, $\exists \phi$ with $k(x,x') = \phi(x)^T \phi(x')$.
To apply the representer theorem (see, e.g., Schölkopf et al., 2001), we assume that the transformed instances lie in the span of the mapped training instances; that is, we restrict the data generator's action space such that the transformed instances $\dot{x}_i$ are mapped into the same subspace of the reproducing kernel Hilbert space as the unmodified training instances $x_i$. By this assumption, the weight vector $w \in W$ and the transformed instances $\phi(\dot{x}_i) \in \phi(X)$ for $i = 1,\ldots,n$ can be expressed as linear combinations of the mapped training instances; that is, $\exists \alpha_i, \Xi_{ij}$ such that
\[
w = \sum_{i=1}^n \alpha_i \phi(x_i) \quad \text{and} \quad \phi(\dot{x}_j) = \sum_{i=1}^n \Xi_{ij} \phi(x_i) \quad \forall j = 1,\ldots,n.
\]
Further, let us assume that the action spaces $W$ and $\phi(X)^n$ can be adequately translated into dual action spaces $A \subseteq \mathbb{R}^n$ and $Z \subseteq \mathbb{R}^{n \times n}$, which is possible, for instance, if $W$ and $\phi(X)^n$ are closed $l_2$-balls. Then, a kernelized variant of the Nash prediction game is obtained by inserting the above equations into the players' cost functions in (1) and (2) with regularizers in (22) and (23),
\begin{align}
\hat{\theta}_{-1}(\alpha,\Xi) &= \sum_{i=1}^n c_{-1,i} \ell_{-1}\left(\alpha^T K \Xi e_i, y_i\right) + \rho_{-1} \frac{1}{2} \alpha^T K \alpha, \tag{24} \\
\hat{\theta}_{+1}(\alpha,\Xi) &= \sum_{i=1}^n c_{+1,i} \ell_{+1}\left(\alpha^T K \Xi e_i, y_i\right) + \rho_{+1} \frac{1}{2n} \mathrm{tr}\left( (\Xi - I_n)^T K (\Xi - I_n) \right), \tag{25}
\end{align}
where $e_i \in \{0,1\}^n$ is the $i$-th unit vector, $\alpha \in A$ is the dual weight vector, $\Xi \in Z$ is the dual transformed data matrix, and $K \in \mathbb{R}^{n \times n}$ is the kernel matrix with $K_{ij} := k(x_i, x_j)$. In the dual Nash prediction game with cost functions (24) and (25), the learner chooses the dual weight vector $\alpha = [\alpha_1, \ldots, \alpha_n]^T$ and classifies a new instance $x$ by $h(x) = \mathrm{sign}\, f_\alpha(x)$ with $f_\alpha(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$. In contrast, the data generator chooses the dual transformed data matrix $\Xi$, which implicitly reflects the change of the training distribution.
The data generator's transformation costs are in proportion to the deviation of $\Xi$ from the identity matrix $I_n$; if $\Xi$ equals $I_n$, the learner's task reduces to standard kernelized empirical risk minimization. The proposed Algorithms 1 and 2 can be readily applied when replacing $w$ by $\alpha$ and $\dot{x}_i$ by $\Xi e_i$ for all $i = 1,\ldots,n$.
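The dual cost functions (24) and (25) translate directly into a few lines of numpy. The sketch below uses random toy data and the logistic losses of Section 5.2 as stand-ins (all names and parameter values are our illustrative assumptions); it also confirms that $\Xi = I_n$ makes the transformation penalty vanish.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
y = rng.choice([-1.0, 1.0], n)
c_neg = rng.uniform(0.5, 1.5, n)           # learner's cost factors  c_{-1,i}
c_pos = rng.uniform(0.5, 1.5, n)           # generator's cost factors c_{+1,i}
X = rng.normal(size=(n, 3))
K = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # RBF kernel

log_loss = lambda z, y: np.log1p(np.exp(-y * z))   # logistic loss (Section 5.2)

def theta_learner(alpha, Xi, rho=1.0):     # Equation (24)
    z = alpha @ K @ Xi                      # decision values alpha^T K Xi e_i
    return np.sum(c_neg * log_loss(z, y)) + rho * 0.5 * alpha @ K @ alpha

def theta_generator(alpha, Xi, rho=1.0):   # Equation (25)
    z = alpha @ K @ Xi
    D = Xi - np.eye(n)
    return (np.sum(c_pos * log_loss(z, -1.0))   # generator's loss l_l(z, -1)
            + rho * 0.5 / n * np.trace(D.T @ K @ D))

alpha, Xi = rng.normal(size=n) * 0.1, np.eye(n)
penalty = np.trace((Xi - np.eye(n)).T @ K @ (Xi - np.eye(n)))
print(theta_learner(alpha, Xi), theta_generator(alpha, Xi), penalty)
```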
An alternative approach to a kernelization of the Nash prediction game is to first construct an explicit feature representation with respect to the given kernel function $k$ and the training instances, and then to train the Nash model by applying this feature mapping. Here, we again assume that the transformed instances $\phi(\dot{x}_i)$ as well as the weight vector $w$ lie in the span of the explicitly mapped training instances $\phi(x)$. Let us consider the kernel PCA map (see, e.g., Schölkopf and Smola, 2002) defined by
\[
\phi_{PCA} : x \mapsto \Lambda^{\frac{1}{2}}_+ V^T \left[ k(x_1,x), \ldots, k(x_n,x) \right]^T, \tag{26}
\]
where $V$ is the column matrix of eigenvectors of kernel matrix $K$, $\Lambda$ is the diagonal matrix with the corresponding eigenvalues such that $K = V \Lambda V^T$, and $\Lambda^{\frac{1}{2}}_+$ denotes the pseudo-inverse of the square root of $\Lambda$ with $\Lambda = \Lambda^{\frac{1}{2}} \Lambda^{\frac{1}{2}}$.
Remark 9 Notice that for any positive-semidefinite kernel function $k : X \times X \to \mathbb{R}$ and fixed training instances $x_1, \ldots, x_n \in X$, the PCA map is a uniquely defined real function with $\phi_{PCA} : X \to \mathbb{R}^n$ such that $k(x_i,x_j) = \phi_{PCA}(x_i)^T \phi_{PCA}(x_j)$ for any $i,j \in \{1,\ldots,n\}$: We first show that $\phi_{PCA}$ is a real mapping from the input space $X$ to the Euclidean space $\mathbb{R}^n$. As $x \mapsto [k(x_1,x), \ldots, k(x_n,x)]^T$ is a real vector-valued function and $V$ is a real $n \times n$ matrix, it remains to show that the pseudo-inverse of $\Lambda^{\frac{1}{2}}$ is real as well. Since the kernel function is positive-semidefinite, all eigenvalues $\lambda_i$ of $K$ are non-negative, and hence, $\Lambda^{\frac{1}{2}}$ is a diagonal matrix with real diagonal entries $\sqrt{\lambda_i}$ for $i = 1,\ldots,n$. The pseudo-inverse of this matrix is the uniquely defined diagonal matrix $\Lambda^{\frac{1}{2}}_+$ with real non-negative diagonal entries $\frac{1}{\sqrt{\lambda_i}}$ if $\lambda_i > 0$ and zero otherwise. This proves the first claim. The PCA map also satisfies $k(x_i,x_j) = \phi_{PCA}(x_i)^T \phi_{PCA}(x_j)$ for any pair of training instances $x_i$ and $x_j$ as
\[
\phi_{PCA}(x_i) = \Lambda^{\frac{1}{2}}_+ V^T \left[ k(x_1,x_i), \ldots, k(x_n,x_i) \right]^T = \Lambda^{\frac{1}{2}}_+ V^T K e_i = \Lambda^{\frac{1}{2}}_+ V^T V \Lambda V^T e_i = \Lambda^{\frac{1}{2}}_+ \Lambda V^T e_i
\]
for all $i = 1,\ldots,n$, and consequently
\[
\phi_{PCA}(x_i)^T \phi_{PCA}(x_j) = e_i^T V \Lambda \Lambda^{\frac{1}{2}}_+ \Lambda^{\frac{1}{2}}_+ \Lambda V^T e_j = e_i^T V \Lambda \Lambda_+ \Lambda V^T e_j = e_i^T V \Lambda V^T e_j = e_i^T K e_j = K_{ij} = k(x_i,x_j),
\]
which proves the second claim.
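Remark 9 is easy to confirm numerically. The sketch below builds the PCA map (26) from an eigendecomposition of a toy RBF kernel matrix (the setup and names are ours) and checks that it reproduces the kernel on the training instances:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))                      # training instances
k = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
K = np.array([[k(a, b) for b in X] for a in X])  # positive semi-definite

lam, V = np.linalg.eigh(K)                       # K = V diag(lam) V^T
lam = np.clip(lam, 0.0, None)                    # clip tiny negative round-off
sqrt_pinv = np.where(lam > 1e-12, 1.0 / np.sqrt(lam), 0.0)

def phi_pca(x):
    # Equation (26): Lambda^{1/2}_+ V^T [k(x_1, x), ..., k(x_n, x)]^T
    return sqrt_pinv * (V.T @ np.array([k(xi, x) for xi in X]))

Phi = np.array([phi_pca(x) for x in X])          # rows are phi_PCA(x_i)
print(np.max(np.abs(Phi @ Phi.T - K)))           # ~0: the map reproduces K
```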
An equilibrium strategy pair $w^* \in W$ and $[\phi_{PCA}(\dot{x}_1^*)^T, \ldots, \phi_{PCA}(\dot{x}_n^*)^T]^T \in \phi(X)^n$ can be identified by applying the PCA map together with Algorithm 1 or 2. To classify a new instance $x \in X$, we may first map $x$ into the PCA map-induced feature space and apply the linear classifier $h(x) = \mathrm{sign}\, f_{w^*}(x)$ with $f_{w^*}(x) = w^{*T} \phi_{PCA}(x)$. Alternatively, we can derive a dual representation of $w^*$ such that $w^* = \sum_{i=1}^n \alpha_i^* \phi_{PCA}(x_i)$, and consequently $f_{w^*}(x) = f_{\alpha^*}(x) = \sum_{i=1}^n \alpha_i^* k(x_i,x)$, where $\alpha^* = [\alpha_1^*, \ldots, \alpha_n^*]^T$ is a not necessarily uniquely defined dual weight vector of $w^*$. Therefore, we have to identify a solution $\alpha^*$ of the linear system
\[
w^* = \Lambda^{\frac{1}{2}}_+ V^T K \alpha^*. \tag{27}
\]
A direct calculation shows that
\[
\alpha^* := V \Lambda^{\frac{1}{2}}_+ w^* \tag{28}
\]
is a solution of (27) provided that either all elements $\lambda_i$ of the diagonal matrix $\Lambda$ are positive or that $\lambda_i = 0$ implies that the same component of the vector $w^*$ is also equal to zero (in which case the solution is non-unique). In fact, inserting (28) in (27) then gives
\[
\Lambda^{\frac{1}{2}}_+ V^T K \alpha^* = \Lambda^{\frac{1}{2}}_+ V^T V \Lambda V^T V \Lambda^{\frac{1}{2}}_+ w^* = \Lambda^{\frac{1}{2}}_+ \Lambda^{\frac{1}{2}} \Lambda^{\frac{1}{2}} \Lambda^{\frac{1}{2}}_+ w^* = w^*,
\]
whereas in the other cases the linear system (27) is obviously inconsistent. The advantage of the latter approach is that classifying a new instance $x \in X$ requires the computation of the scalar product $\sum_{i=1}^n \alpha_i^* k(x_i,x)$ rather than a matrix multiplication when mapping $x$ into the PCA map-induced feature space (cf. Equation 26).
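The dual recovery (28) can likewise be checked numerically. The toy construction below (ours) makes $w^*$ consistent with the linear system (27) by building it from dual coefficients in the first place; a small ridge keeps the kernel matrix well conditioned.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
A = rng.normal(size=(n, n))
K = A @ A.T + np.eye(n)                      # positive definite kernel matrix

lam, V = np.linalg.eigh(K)
sqrt_pinv = np.diag(np.where(lam > 1e-12, 1.0 / np.sqrt(lam), 0.0))

# Construct w* in the PCA-map feature space from dual coefficients beta,
# so that the linear system (27) is consistent by construction.
beta = rng.normal(size=n)
w_star = sqrt_pinv @ V.T @ K @ beta          # w* = Lambda^{1/2}_+ V^T K beta

alpha = V @ sqrt_pinv @ w_star               # Equation (28)
residual = sqrt_pinv @ V.T @ K @ alpha - w_star   # should vanish, cf. (27)
print(np.max(np.abs(residual)))              # ~0
```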
When implementing a kernelized solution, the data generator has to generate instances in the input space with dual representation $K \Xi^* e_1, \ldots, K \Xi^* e_n$ and $\phi_{PCA}(\dot{x}_1^*), \ldots, \phi_{PCA}(\dot{x}_n^*)$, respectively. To this end, the data generator must solve a pre-image problem which typically has a non-unique solution. However, as every solution of this problem incurs the same costs to both players, the data generator is free to select any of them. To find such a solution, the data generator may solve a non-convex optimization problem as proposed by Mika et al. (1999), or may apply a non-iterative method (Kwok and Tsang, 2003) based on multidimensional scaling.
5.2 Nash Logistic Regression
In this section, we study the particular instance of the Nash prediction game where each player's loss function rests on the negative logarithm of the logistic function $\sigma(a) := \frac{1}{1+e^{-a}}$, that is, the logistic loss
\[
\ell_l(z,y) := -\log \sigma(yz) = \log\left(1 + e^{-yz}\right). \tag{29}
\]
We consider the regularizers in (22) and (23), respectively, which give rise to the following definition of the Nash logistic regression (NLR). In the following definition, column vectors $x := [x_1^T, \ldots, x_n^T]^T$ and $\dot{x} := [\dot{x}_1^T, \ldots, \dot{x}_n^T]^T$ again denote the concatenation of the original and the transformed training instances, respectively, which are mapped into the feature space by $x_i := \phi(x_i)$ and $\dot{x}_i := \phi(\dot{x}_i)$.
Definition 10 The Nash logistic regression (NLR) is an instance of the Nash prediction game with non-empty, compact, and convex action spaces $W \subseteq \mathbb{R}^m$ and $\phi(X)^n \subseteq \mathbb{R}^{m \cdot n}$ and cost functions
\begin{align*}
\hat{\theta}^l_{-1}(w,\dot{x}) &:= \sum_{i=1}^n c_{-1,i} \ell_l\left(w^T \dot{x}_i, y_i\right) + \rho_{-1} \frac{1}{2} \|w\|_2^2, \\
\hat{\theta}^l_{+1}(w,\dot{x}) &:= \sum_{i=1}^n c_{+1,i} \ell_l\left(w^T \dot{x}_i, -1\right) + \rho_{+1} \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left\| \dot{x}_i - x_i \right\|_2^2,
\end{align*}
where $\ell_l$ is specified in (29).
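Definition 10 translates directly into code. The sketch below evaluates both cost functions in their primal form on random toy data (all names and parameter values are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 8, 4
X = rng.normal(size=(n, m))                 # mapped training instances x_i
X_dot = X + 0.1 * rng.normal(size=(n, m))   # transformed instances x_i (dot)
y = rng.choice([-1.0, 1.0], n)
c_neg = np.full(n, 1.0 / n)                 # cost factors c_{-1,i}
c_pos = np.full(n, 1.0 / n)                 # cost factors c_{+1,i}
w = rng.normal(size=m) * 0.1

log_loss = lambda z, y: np.log1p(np.exp(-y * z))      # Equation (29)

def theta_learner(w, X_dot, rho=1.0):
    z = X_dot @ w                                      # decision values w^T x_i
    return np.sum(c_neg * log_loss(z, y)) + rho * 0.5 * w @ w

def theta_generator(w, X_dot, rho=1.0):
    z = X_dot @ w
    move = 0.5 * np.sum((X_dot - X) ** 2, axis=1)      # squared l2 move per instance
    return np.sum(c_pos * log_loss(z, -1.0)) + rho * np.mean(move)

print(theta_learner(w, X_dot), theta_generator(w, X_dot))  # both finite, positive
```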
As in our introductory discussion, the data generator's loss function $\ell_{+1}(z,y) := \ell_l(z,-1)$ penalizes positive decision values independently of the class label $y$. In contrast, instances that pass the classifier, that is, instances with negative decision values, incur little or almost no costs. By the above definition, the Nash logistic regression obviously satisfies Assumption 2 and, according to the following corollary, also satisfies Assumption 3 for suitable regularization parameters.
Corollary 11 Let the Nash logistic regression be specified as in Definition 10 with positive regularization parameters $\rho_{-1}$ and $\rho_{+1}$ which satisfy
$$\rho_{-1}\rho_{+1} \geq n\, \mathbf{c}_{-1}^{\mathsf{T}} \mathbf{c}_{+1}, \tag{30}$$
then Assumptions 2 and 3 hold, and consequently, the Nash logistic regression possesses a unique Nash equilibrium.
Proof. By Definition 10, both players employ the logistic loss with $\ell_{-1}(z,y) := \ell_l(z,y)$ and $\ell_{+1}(z,y) := \ell_l(z,-1)$ and the regularizers in (22) and (23), respectively. Let
$$\ell'_{-1}(z,y) = \frac{-y}{1+e^{yz}} \qquad\qquad \ell'_{+1}(z,y) = \frac{1}{1+e^{-z}}$$
$$\ell''_{-1}(z,y) = \frac{1}{1+e^{z}} \cdot \frac{1}{1+e^{-z}} \qquad \ell''_{+1}(z,y) = \frac{1}{1+e^{z}} \cdot \frac{1}{1+e^{-z}} \tag{31}$$
denote the first and second derivatives of the players' loss functions with respect to $z \in \mathbb{R}$. Further, let
$$\nabla_{\mathbf{w}} \hat{\Omega}_{-1}(\mathbf{w}) = \mathbf{w} \qquad\qquad \nabla_{\dot{\mathbf{x}}} \hat{\Omega}_{+1}(\mathbf{x},\dot{\mathbf{x}}) = \frac{1}{n}(\dot{\mathbf{x}} - \mathbf{x})$$
$$\nabla^2_{\mathbf{w},\mathbf{w}} \hat{\Omega}_{-1}(\mathbf{w}) = \mathbf{I}_m \qquad\quad \nabla^2_{\dot{\mathbf{x}},\dot{\mathbf{x}}} \hat{\Omega}_{+1}(\mathbf{x},\dot{\mathbf{x}}) = \frac{1}{n}\mathbf{I}_{m \cdot n}$$
denote the gradients and Hessians of the players' regularizers. Assumption 2 holds as:
1. The second derivatives of $\ell_{-1}(z,y)$ and $\ell_{+1}(z,y)$ are positive and continuous for all $z \in \mathbb{R}$ and $y \in \mathcal{Y}$. Consequently, $\ell_v(z,y)$ is convex and twice continuously differentiable with respect to $z$ for $v \in \{-1,+1\}$ and fixed $y$.
2. The Hessians of the players' regularizers are fixed, positive definite matrices and consequently both regularizers are twice continuously differentiable and uniformly strongly convex in $\mathbf{w} \in \mathcal{W}$ and $\dot{\mathbf{x}} \in \phi(\mathcal{X})^n$ (for any fixed $\mathbf{x} \in \phi(\mathcal{X})^n$), respectively.
3. By Definition 10, the players' action sets are non-empty, compact, and convex subsets of finite-dimensional Euclidean spaces.
Assumption 3 holds as for all $z \in \mathbb{R}$ and $y \in \mathcal{Y}$:
1. The second derivatives of $\ell_{-1}(z,y)$ and $\ell_{+1}(z,y)$ in (31) are equal.
2. The sum of the first derivatives of the loss functions is bounded,
$$\ell'_{-1}(z,y) + \ell'_{+1}(z,y) = \frac{-y}{1 + e^{yz}} + \frac{1}{1 + e^{-z}} = \begin{cases} \frac{1-e^{-z}}{1+e^{-z}}, & \text{if } y = +1 \\ \frac{2}{1+e^{-z}}, & \text{if } y = -1 \end{cases} \;\in (-1, 2),$$
which together with Equation 14 gives
$$\tau = \sup_{(\mathbf{x},y) \in \phi(\mathcal{X}) \times \mathcal{Y}} \frac{1}{2}\left|\ell'_{-1}(f_{\mathbf{w}}(\mathbf{x}),y) + \ell'_{+1}(f_{\mathbf{w}}(\mathbf{x}),y)\right| < 1.$$
The supremum $\tau$ is strictly less than 1 since $f_{\mathbf{w}}(\mathbf{x})$ is finite for compact action sets $\mathcal{W}$ and $\phi(\mathcal{X})^n$. The smallest eigenvalues of the players' regularizers are $\lambda_{-1} = 1$ and $\lambda_{+1} = \frac{1}{n}$, such that the inequalities
$$\rho_{-1}\rho_{+1} \geq n\, \mathbf{c}_{-1}^{\mathsf{T}} \mathbf{c}_{+1} > \tau^2 \frac{1}{\lambda_{-1}\lambda_{+1}} \mathbf{c}_{-1}^{\mathsf{T}} \mathbf{c}_{+1}$$
hold.
3. The partial gradient $\nabla_{\dot{\mathbf{x}}_i} \hat{\Omega}_{+1}(\mathbf{x},\dot{\mathbf{x}}) = \frac{1}{n}(\dot{\mathbf{x}}_i - \mathbf{x}_i)$ of the data generator's regularizer is independent of $\dot{\mathbf{x}}_j$ for all $j \neq i$ and $i = 1,\dots,n$.
As Assumptions 2 and 3 are satisfied, the existence of a unique Nash equilibrium follows immediately from Theorem 8.
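The closed-form derivatives in (31) and the bound on their sum can be spot-checked numerically; the following sketch (our own illustration, not part of the proof) compares them with central finite differences and samples the sum on a grid:

```python
import numpy as np

loss = lambda z, y: np.logaddexp(0.0, -y * z)    # l_l(z, y)
dl_m1 = lambda z, y: -y / (1.0 + np.exp(y * z))  # l'_{-1}(z, y)
dl_p1 = lambda z, y: 1.0 / (1.0 + np.exp(-z))    # l'_{+1}(z, y) = d/dz l_l(z, -1)

# finite-difference check of the derivative of the learner's loss
eps = 1e-6
for z in np.linspace(-5.0, 5.0, 21):
    for y in (-1, +1):
        fd = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
        assert abs(fd - dl_m1(z, y)) < 1e-6

# the sum of first derivatives stays within (-1, 2) on the sampled grid,
# consistent with tau = sup (1/2)|l'_{-1} + l'_{+1}| < 1
zs = np.linspace(-30.0, 30.0, 5001)
for y in (-1, +1):
    s = dl_m1(zs, y) + dl_p1(zs, y)
    assert s.min() > -1.0 and s.max() < 2.0
print("derivative formulas and bound confirmed on the sampled grid")
```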
Recall that the weighting factors $c_{v,i}$ are strictly positive with $\sum_{i=1}^n c_{v,i} = 1$ for both players $v \in \{-1,+1\}$. In particular, it therefore follows that in the unweighted case where $c_{v,i} = \frac{1}{n}$ for all $i = 1,\dots,n$ and $v \in \{-1,+1\}$, a sufficient condition to ensure the existence of a unique Nash equilibrium is to set the learner's regularization parameter to $\rho_{-1} \geq \rho_{+1}^{-1}$.
5.3 Nash Support Vector Machine
The Nash logistic regression tends to non-sparse solutions. This becomes particularly apparent if the Nash equilibrium $(\mathbf{w}^*, \dot{\mathbf{x}}^*)$ is an interior point of the joint action set $\mathcal{W} \times \phi(\mathcal{X})^n$, in which case the (partial) gradients in (9) and (10) are zero at $(\mathbf{w}^*, \dot{\mathbf{x}}^*)$. For regularizer (23), this implies that $\mathbf{w}^*$ is a linear combination of the transformed instances $\dot{\mathbf{x}}_i$ where all weighting factors are non-zero, since the first derivative of the logistic loss as well as the cost factors $c_{-1,i}$ are non-zero for all $i = 1,\dots,n$.
The support vector machine (SVM), which employs the hinge loss,
$$\ell_h(z,y) := \max(0, 1 - yz) = \begin{cases} 1 - yz, & \text{if } yz < 1 \\ 0, & \text{if } yz \geq 1 \end{cases},$$
does not suffer from non-sparsity; however, the hinge loss obviously violates Assumption 2 as it is not twice continuously differentiable. Therefore, we propose a twice continuously differentiable loss function that we call the trigonometric loss, which satisfies Assumptions 2 and 3.
Definition 12 For any fixed smoothness factor $s > 0$, the trigonometric loss is defined by
$$\ell_t(z,y) := \begin{cases} -yz, & \text{if } yz < -s \\ \frac{s - yz}{2} - \frac{s}{\pi}\cos\left(\frac{\pi}{2s}\, yz\right), & \text{if } |yz| \leq s \\ 0, & \text{if } yz > s \end{cases}. \tag{32}$$
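To make the piecewise definition concrete, here is a direct implementation (our own sketch; `s` defaults to 0.5 purely for illustration):

```python
import numpy as np

def trig_loss(z, y, s=0.5):
    """Trigonometric loss (32): perceptron-like loss with a smooth band of width 2s."""
    v = np.asarray(y * z, dtype=float)
    mid = (s - v) / 2.0 - (s / np.pi) * np.cos(np.pi * v / (2.0 * s))
    return np.where(v < -s, -v, np.where(v > s, 0.0, mid))

# outside the band it coincides with the perceptron loss max(0, -yz) ...
print(float(trig_loss(-2.0, +1)))   # -> 2.0 (misclassified: linear penalty)
print(float(trig_loss(+2.0, +1)))   # -> 0.0 (correctly classified: no cost)
# ... and at the band edge yz = -s the pieces join continuously:
print(float(trig_loss(-0.5, +1)))   # -> 0.5, matching -yz at yz = -s
```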
The trigonometric loss is similar to the hinge loss in that it, except around the decision boundary, penalizes misclassifications in proportion to the decision value $z \in \mathbb{R}$ and attains zero for correctly classified instances. Analogous to the once continuously differentiable Huber loss, where a polynomial is embedded into the hinge loss, the trigonometric loss combines the perceptron loss $\ell_p(z,y) := \max(0, -yz)$ with a trigonometric function. This trigonometric embedding yields a convex, twice continuously differentiable function.
Lemma 13 The trigonometric loss $\ell_t(z,y)$ is convex and twice continuously differentiable with respect to $z \in \mathbb{R}$ for any fixed $y \in \mathcal{Y}$.
Proof. Let
$$\ell'_t(z,y) = \begin{cases} -y, & \text{if } yz < -s \\ -\frac{1}{2}y + \frac{1}{2}y\sin\left(\frac{\pi}{2s}\, yz\right), & \text{if } |yz| \leq s \\ 0, & \text{if } yz > s \end{cases} \qquad \ell''_t(z,y) = \begin{cases} 0, & \text{if } yz < -s \\ \frac{\pi}{4s}\cos\left(\frac{\pi}{2s}\, yz\right), & \text{if } |yz| \leq s \\ 0, & \text{if } yz > s \end{cases}$$
denote the first and second derivatives of $\ell_t(z,y)$, respectively, with respect to $z \in \mathbb{R}$. The trigonometric loss $\ell_t(z,y)$ is convex in $z \in \mathbb{R}$ (for any fixed $y \in \mathcal{Y}$) as the second derivative $\ell''_t(z,y)$ is strictly positive if $|z| = |yz| < s$ and zero otherwise. Moreover, since the second derivative is continuous,
$$\lim_{|z| \to s-} \ell''_t(z,y) = \frac{\pi}{4s}\cos\left(\pm\frac{\pi}{2}\right) = 0 = \lim_{|z| \to s+} \ell''_t(z,y),$$
the trigonometric loss is also twice continuously differentiable.
Because of the similarities of the loss functions, we call the Nash prediction game that is based upon the trigonometric loss the Nash support vector machine (NSVM), where we again consider the regularizers in (22) and (23).
Definition 14 The Nash support vector machine (NSVM) is an instance of the Nash prediction game with non-empty, compact, and convex action spaces $\mathcal{W} \subset \mathbb{R}^m$ and $\phi(\mathcal{X})^n \subset \mathbb{R}^{m \cdot n}$ and cost functions
$$\hat{\theta}^t_{-1}(\mathbf{w}, \dot{\mathbf{x}}) := \sum_{i=1}^n c_{-1,i}\, \ell_t(\mathbf{w}^{\mathsf{T}} \dot{\mathbf{x}}_i, y_i) + \rho_{-1} \frac{1}{2}\|\mathbf{w}\|_2^2 \tag{33}$$
$$\hat{\theta}^t_{+1}(\mathbf{w}, \dot{\mathbf{x}}) := \sum_{i=1}^n c_{+1,i}\, \ell_t(\mathbf{w}^{\mathsf{T}} \dot{\mathbf{x}}_i, -1) + \rho_{+1} \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\|\dot{\mathbf{x}}_i - \mathbf{x}_i\|_2^2$$
where $\ell_t$ is specified in (32).
The following corollary states sufficient conditions under which the Nash support vector machine satisfies Assumptions 2 and 3, and consequently has a unique Nash equilibrium.
Corollary 15 Let the Nash support vector machine be specified as in Definition 14 with positive regularization parameters $\rho_{-1}$ and $\rho_{+1}$ which satisfy
$$\rho_{-1}\rho_{+1} > n\, \mathbf{c}_{-1}^{\mathsf{T}} \mathbf{c}_{+1}, \tag{34}$$
then Assumptions 2 and 3 hold, and consequently, the Nash support vector machine has a unique Nash equilibrium.
Proof. By Definition 14, both players employ the trigonometric loss with $\ell_{-1}(z,y) := \ell_t(z,y)$ and $\ell_{+1}(z,y) := \ell_t(z,-1)$ and the regularizers in (22) and (23), respectively. Assumption 2 holds:
1. According to Lemma 13, $\ell_t(z,y)$, and consequently $\ell_{-1}(z,y)$ and $\ell_{+1}(z,y)$, are convex and twice continuously differentiable with respect to $z \in \mathbb{R}$ (for any fixed $y \in \{-1,+1\}$).
2. The regularizers of the Nash support vector machine are equal to those of the Nash logistic regression and possess the same properties as in Corollary 11.
3. By Definition 14, the players' action sets are non-empty, compact, and convex subsets of finite-dimensional Euclidean spaces.
Assumption 3 holds:
1. The second derivatives of $\ell_{-1}(z,y)$ and $\ell_{+1}(z,y)$ are equal for all $z \in \mathbb{R}$ since
$$\ell''_t(z,y) = \begin{cases} \frac{\pi}{4s}\cos\left(\frac{\pi}{2s}\, z\right), & \text{if } |z| \leq s \\ 0, & \text{if } |z| > s \end{cases}$$
does not depend on $y \in \mathcal{Y}$.
2. The sum of the first derivatives of the loss functions is bounded, as for $y = -1$:
$$\ell'_{-1}(z,-1) + \ell'_{+1}(z,-1) = 2\,\ell'_t(z,-1) = \begin{cases} 0, & \text{if } z < -s \\ 1 + \sin\left(\frac{\pi}{2s}\, z\right), & \text{if } |z| \leq s \\ 2, & \text{if } z > s \end{cases} \;\in [0, 2],$$
and for $y = +1$:
$$\ell'_{-1}(z,+1) + \ell'_{+1}(z,+1) = \begin{cases} -1, & \text{if } z < -s \\ \sin\left(\frac{\pi}{2s}\, z\right), & \text{if } |z| \leq s \\ 1, & \text{if } z > s \end{cases} \;\in [-1, 1].$$
Together with Equation 14, it follows that
$$\tau = \sup_{(\mathbf{x},y) \in \phi(\mathcal{X}) \times \mathcal{Y}} \frac{1}{2}\left|\ell'_{-1}(f_{\mathbf{w}}(\mathbf{x}),y) + \ell'_{+1}(f_{\mathbf{w}}(\mathbf{x}),y)\right| \leq 1.$$
The smallest eigenvalues of the players' regularizers are $\lambda_{-1} = 1$ and $\lambda_{+1} = \frac{1}{n}$, such that the inequalities
$$\rho_{-1}\rho_{+1} > n\, \mathbf{c}_{-1}^{\mathsf{T}} \mathbf{c}_{+1} \geq \tau^2 \frac{1}{\lambda_{-1}\lambda_{+1}} \mathbf{c}_{-1}^{\mathsf{T}} \mathbf{c}_{+1}$$
hold.
3. As for Nash logistic regression, the partial gradient $\nabla_{\dot{\mathbf{x}}_i} \hat{\Omega}_{+1}(\mathbf{x},\dot{\mathbf{x}}) = \frac{1}{n}(\dot{\mathbf{x}}_i - \mathbf{x}_i)$ of the data generator's regularizer is independent of $\dot{\mathbf{x}}_j$ for all $j \neq i$ and $i = 1,\dots,n$.
Because Assumptions 2 and 3 are satisfied, the existence of a unique Nash equilibrium follows immediately from Theorem 8.
6. Experimental Evaluation
The goal of this section is to explore the relative strengths and weaknesses of the discussed instances of the Nash prediction game and existing reference methods in the context of email spam filtering. We compare a regular support vector machine (SVM), logistic regression (LR), the SVM with trigonometric loss (SVMT, a variant of the SVM which minimizes (33) for the given training data), the worst-case solution SVM for invariances with feature removal (Invar-SVM; Globerson and Roweis, 2006; Teo et al., 2007), and the Nash equilibrium strategies Nash logistic regression (NLR) and Nash support vector machine (NSVM).
data set       instances   features   delivery period
ESP            169,612     541,713    01/06/2007 - 27/04/2010
Mailinglist    128,117     266,378    01/04/1999 - 31/05/2006
Private        108,178     582,100    01/08/2005 - 31/03/2010
TREC 2007       75,496     214,839    04/08/2007 - 07/06/2007
Table 1: Data sets used in the experiments.
We use four corpora of chronologically sorted emails detailed in Table 1: The first data set contains emails of an email service provider (ESP) collected between 2007 and 2010. The second (Mailinglist) is a collection of emails from publicly available mailing lists augmented by spam emails from Bruce Guenter's spam trap of the same time period. The third corpus (Private) contains newsletters and spam and non-spam emails of the authors. The last corpus is the NIST TREC 2007 spam corpus.
Feature mapping $\phi(x)$ is defined such that email $x \in \mathcal{X}$ is tokenized with the X-tokenizer (Siefkes et al., 2004) and converted into the $m$-dimensional binary bag-of-word vector $\mathbf{x} \in \{0,1\}^m$. The value of $m$ is determined by the number of distinct terms in the data set, where we have removed all terms which occur only once. For each experiment and each repetition, we then construct the PCA mapping (26) with respect to the corresponding $n$ training emails using the linear kernel $k(x,x') := \mathbf{x}^{\mathsf{T}} \mathbf{x}'$, resulting in $n$-dimensional training instances $\phi_{\mathrm{PCA}}(x_i) \in \mathbb{R}^n$ for $i = 1,\dots,n$. To ensure the convexity as well as the compactness requirement in Assumption 2, we notionally restrict the players' action sets by defining $\phi(\mathcal{X}) := \{\phi_{\mathrm{PCA}}(x) \in \mathbb{R}^n \mid \|\phi_{\mathrm{PCA}}(x)\|_2^2 \leq \kappa\}$ and $\mathcal{W} := \{\mathbf{w} \in \mathbb{R}^n \mid \|\mathbf{w}\|_2^2 \leq \kappa\}$ for some fixed constant $\kappa$. Note that by choosing an arbitrarily large $\kappa$, the players' action sets become effectively unbounded.
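This restriction to a squared-norm ball corresponds to a simple Euclidean projection; a sketch of how such a clipping step could look ($\kappa$ is the constant from the text, the function name is our own):

```python
import numpy as np

def project_to_ball(v, kappa):
    """Project v onto {u : ||u||_2^2 <= kappa} by rescaling if it lies outside."""
    norm_sq = float(v @ v)
    if norm_sq <= kappa:
        return v
    return v * np.sqrt(kappa / norm_sq)

v = np.array([3.0, 4.0])              # squared norm 25
p = project_to_ball(v, kappa=4.0)     # rescaled so the squared norm is 4
print(float(p @ p))                   # ~ 4.0 (up to rounding)
```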
For both algorithms, ILS and EDS, we set $\sigma := 0.001$, $\beta := 0.2$, and $\varepsilon := 10^{-14}$. The algorithms are stopped if $l$ exceeds 30 in line 6 of ILS and line 5 of EDS, respectively; in this case, no convergence is achieved. In all experiments, we use the F-measure—that is, the harmonic mean of precision and recall—as evaluation measure and tune all parameters with respect to likelihood. The particular protocol and results of each experiment are detailed in the following sections.
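For reference, the evaluation measure used throughout the experiments is the following; this small helper is our own sketch:

```python
def f_measure(tp, fp, fn):
    """F-measure: harmonic mean of precision tp/(tp+fp) and recall tp/(tp+fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

# equivalently 2*tp / (2*tp + fp + fn)
print(f_measure(tp=90, fp=10, fn=30))   # precision 0.90, recall 0.75 -> ~0.818
```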
6.1 Convergence
Corollaries 11 (for Nash logistic regression) and 15 (for the Nash support vector machine) specify conditions on the regularization parameters $\rho_{-1}$ and $\rho_{+1}$ under which a unique Nash equilibrium necessarily exists. When this is the case, both the ILS and EDS algorithms will converge on that Nash equilibrium. In the first set of experiments, we study whether repeated restarts of the algorithm converge on the same equilibrium when the bounds in Equations 30 and 34 are satisfied, and when they are violated to increasingly large degrees.
We set $c_{v,i} := \frac{1}{n}$ for $v \in \{-1,+1\}$ and $i = 1,\dots,n$, such that for $\rho_{-1} > \rho_{+1}^{-1}$ both bounds (Equations 30 and 34) are satisfied. For each value of $\rho_{-1}$ and $\rho_{+1}$ and each of 10 repetitions, we randomly draw 400 emails from the data set and run EDS with randomly chosen initial solutions $(\mathbf{w}^{(0)}, \dot{\mathbf{x}}^{(0)})$ until convergence. We run ILS on the same training set; in each repetition, we randomly choose a distinct initial solution, and after each iteration $k$ we compute the Euclidean distance between the EDS solution and the current ILS iterate $\mathbf{w}^{(k)}$. Figure 1 reports on these average Euclidean distances between distinctly initialized runs. The blue curves ($\rho_{-1} = 2 \cdot \rho_{+1}^{-1}$) satisfy Equations 30 and 34, the yellow curves ($\rho_{-1} = \rho_{+1}^{-1}$) lie exactly on the boundary; all other curves violate the bounds. Dotted lines show the Euclidean distance between the Nash equilibrium and the solution of logistic regression.
Our findings are as follows. Logistic regression and the regular SVM never coincide with the Nash equilibrium—the Euclidean distances lie in the range between $10^{-2}$ and 2. ILS and EDS always converge to identical equilibria when (30) and (34) are satisfied (blue and yellow curves). The Euclidean distances lie at the threshold of numerical computing accuracy. When Equations 30 and 34 are violated by a factor up to 4 (turquoise and red curves), all repetitions still converge on the same equilibrium, indicating that the equilibrium is either still unique or a secondary equilibrium is unlikely to be found. When the bounds are violated by a factor of 8 or 16 (green and purple curves), then some repetitions of the learning algorithms do not converge or start to converge to distinct equilibria. In the latter case, learner and data generator may attain distinct equilibria and may experience an arbitrarily poor outcome when playing a Nash equilibrium.

[Figure 1: four panels, one for each of $\rho_{+1} = 2^6$, $2^4$, $2^2$, and 1, plotting the distance to the Nash equilibrium (log scale, $10^{-8}$ to $10^0$) against ILS iterations 0-40, with one curve per setting of $\rho_{-1}$ ranging from $2\rho_{+1}^{-1}$ down to small fractions of $\rho_{+1}^{-1}$.]
Figure 1: Average Euclidean distance between the EDS solution and the ILS solution at iteration $k = 0,\dots,40$ for Nash logistic regression on the ESP corpus. The dotted lines show the distance between the EDS solution and the solution of logistic regression. Error bars indicate standard deviation.
6.2 Regularization Parameters
The regularization parameters $\rho_v$ of the players $v \in \{-1,+1\}$ play a major role in the prediction game. The learner's regularizer determines the generalization ability of the predictive model, and the data generator's regularizer controls the amount of change in the data generation process. In order to tune these parameters, one would need access to labeled data that are governed by the transformed input distribution. In our second experiment, we explore to which extent those parameters can be estimated using a portion of the newest training data. Intuitively, the most recent training data may be more similar to the test data than older training data.
[Figure 2: left, F-measure over the $(\rho_{+1}, \rho_{-1})$ grid for hold-out and test data; right, sectional views of F-measure against $\rho_{-1}$ for fixed $\rho_{+1} \in \{4, 8, 1024\}$ (upper diagram) and against $\rho_{+1}$ for fixed $\rho_{-1} \in \{0.0001, 0.002, 0.125\}$ (lower diagram).]
Figure 2: Left: Performance of NLR on the hold-out and the test data with respect to regularization parameters. Right: Performance of NLR on the hold-out data (ho) and the test data (te) for fixed values of $\rho_v$.
We split the data set into three parts: The 2,000 oldest emails constitute the training portion, we use the next 2,000 emails as hold-out portion on which the parameters are tuned, and the remaining emails are used as the test set. We randomly draw 200 spam and 200 non-spam messages from the training portion and draw another subset of 400 emails from the hold-out portion. Both NPG instances are trained on the 400 training emails and evaluated against all emails of the test portion. To tune the parameters, we conduct a grid search maximizing the likelihood on the 400 hold-out emails. We repeat this experiment 10 times for all four data sets and report on the resulting parameters as well as the "optimal" reference parameters according to the maximal value of F-measure on the test set. Those optimal regularization parameters are not used in later experiments. The intuition of the experiment is that the data generation process has already changed between the oldest and the latest emails. This change may cause a distribution shift which is reflected in the hold-out portion. We expect that one can tune each player's regularization parameter with respect to this hold-out set.
In Figure 2 (left) we plot the performance of the Nash logistic regression (NLR) on the hold-out and the test data against the regularization parameters $\rho_{-1}$ and $\rho_{+1}$. The dashed line visualizes the bound in (30) on the regularization parameters for which NLR is guaranteed to possess a unique Nash equilibrium. Figure 2 (right) shows sectional views of the left plot along the $\rho_{-1}$-axis (upper diagram) and the $\rho_{+1}$-axis (lower diagram) for several values of $\rho_{+1}$ and $\rho_{-1}$, respectively. As expected, the effect of the regularization parameters on the test data is much stronger than on the hold-out data.
It turns out that the data generator's $\rho_{+1}$ has almost no impact on the value of F-measure on the hold-out data set (see lower right diagram of Figure 2). Hence, we conclude that estimating $\rho_{+1}$ without access to labeled data from the test distribution or additional knowledge about the data generator is difficult for this application; the most recent training data are still too different from the test data. In all remaining experiments and for all data sets we set $\rho_{+1} = 8$ for NLR and $\rho_{+1} = 2$ for NSVM. For those choices the Nash models performed generally best on the hold-out set for a large variety of values of $\rho_{-1}$. For Invar-SVM, the regularization of the data generator's transformation is controlled explicitly by the number $K$ of modifiable attributes per positive instance. We conducted the same experiment for Invar-SVM, resulting in an optimal value of $K = 25$; that is, the data generator is allowed to remove up to 25 tokens of each spam email of the training data set.
From the upper right diagram of Figure 2 we see that estimating $\rho_{-1}$ for any fixed $\rho_{+1}$ seems possible. Even if we slightly overestimate the learner's optimal regularization parameter—to compensate for the distributional difference between the transformed training sample and the marginally shifted hold-out set—the determined value of $\rho_{-1}$ is close to the optimum for all four data sets.
6.3 Evaluation for Adversary Following an Equilibrium Strategy
We evaluate both a regular classifier trained under the i.i.d. assumption and a model that follows a Nash equilibrium strategy against both an adversary who does not transform the input distribution and an adversary who executes the Nash-equilibrial transformation on the input distribution. Since we cannot be certain that actual spam senders play a Nash equilibrium, we use the following semi-artificial setting.
The learner observes a sample of 200 spam and 200 non-spam emails drawn from the training portion of the data and estimates the Nash-optimal prediction model with parameters $\dot{\mathbf{w}}$; the trivial baseline solution of regularized empirical risk minimization (ERM) is denoted by $\mathbf{w}$. The data generator observes a distinct sample $D$ of 200 spam and 200 non-spam messages, also drawn from the training portion, and computes their Nash-optimal response $\dot{D}$.
We again set $c_{v,i} := \frac{1}{n}$ for $v \in \{-1,+1\}$ and $i = 1,\dots,n$ and study the following four scenarios:
$(\mathbf{w}, D)$: Both players ignore the presence of an opponent; that is, the learner employs a regular classifier and the sender does not change the data generation process.
$(\mathbf{w}, \dot{D})$: The learner ignores the presence of an active data generator who changes the data generation process such that $D$ evolves to $\dot{D}$ by playing a Nash strategy.
$(\dot{\mathbf{w}}, D)$: The learner expects a rational data generator and chooses a Nash-equilibrial prediction model. However, the data generator does not change the input distribution.
$(\dot{\mathbf{w}}, \dot{D})$: Both players are aware of the opponent and play a Nash-equilibrial action to secure lowest costs.
We repeat this experiment 100 times for all four data sets. Table 2 reports on the average values of F-measure over all repetitions for both NPG instances and the corresponding baselines; numbers in boldface indicate significant differences ($\alpha = 0.05$) between the F-measures of $f_{\mathbf{w}}$ and $f_{\dot{\mathbf{w}}}$ for fixed sample $D$ and $\dot{D}$, respectively.
NLR vs. LR:
          ESP            Mailinglist    Private        TREC 2007
          w      ẇ       w      ẇ       w      ẇ       w      ẇ
    D     0.957  0.924   0.987  0.984   0.961  0.944   0.980  0.979
    Ḋ     0.912  0.925   0.958  0.976   0.903  0.912   0.955  0.961

NSVM vs. SVM:
          ESP            Mailinglist    Private        TREC 2007
          w      ẇ       w      ẇ       w      ẇ       w      ẇ
    D     0.955  0.939   0.987  0.985   0.961  0.957   0.979  0.981
    Ḋ     0.928  0.939   0.961  0.976   0.932  0.936   0.960  0.968

Table 2: Nash predictor and regular classifier against passive and Nash-equilibrial data generator.

As expected, when the data generator does not alter the input distribution, the regularized empirical risk minimization baselines, logistic regression and the SVM, are generally best. However, the performance of those baselines drops substantially when the data generator plays the Nash-equilibrial action $\dot{D}$. The Nash-optimal prediction models are more robust against this transformation of the input distribution and significantly outperform the reference methods for all four data sets.
6.4 Case Study on Email Spam Filtering
To study the performance of the Nash prediction models and the baselines for email spam filtering, we evaluate all methods into the future by processing the test set in chronological order. The test portion of each data set is split into 20 chronologically sorted disjoint subsets. We average the value of F-measure on each of those subsets over the 20 models (trained on different samples drawn from the training portion) for each method and perform a paired t-test. In the absence of information about player- and instance-specific costs, we again set $c_{v,i} := \frac{1}{n}$ for $v \in \{-1,+1\}$, $i = 1,\dots,n$. Note that the chosen loss functions and regularizers would allow us to select any positive cost factors without violating Assumption 1.
Figure 3 shows that, for all data sets, the NPG instances outperform logistic regression (LR), the SVM, and SVMT, which do not explicitly factor the adversary into the optimization criterion. Especially for the ESP corpus, the Nash logistic regression (NLR) and the Nash support vector machine (NSVM) are superior. On the TREC 2007 data set, the methods behave comparably, with a slight advantage for the Nash support vector machine. The period over which the TREC 2007 data have been collected is very short; we believe that the training and test instances are governed by nearly identical distributions. Consequently, for this data set, the game-theoretic models do not gain a significant advantage over logistic regression and the SVM that assume i.i.d. samples. With respect to the non-game-theoretic baselines, the regular SVM outperforms LR and SVMT for most of the data sets.
Table 3 shows aggregated results over all four data sets. For each point in each of the diagrams of
Figure 3, we conduct a pairwise comparison of all methods based on a paired t-test at a confidence
level of α = 0.05. When a difference is significant, we count this as a win for the method that
achieves a higher value of F-measure. Each line of Table 3 details the wins and, set in italics, the
losses of one method against all other methods.
The Nash logistic regression and the Nash support vector machine have more wins than they
have losses against each of the other methods. The ranking continues with Invar-SVM, the regular
SVM, logistic regression and the trigonometric loss SVM which loses more frequently than it wins
against all other methods.
[Figure 3: four panels plotting F-measure over time for the ESP, Mailinglist, Private, and TREC 2007 corpora; legend: SVM, LR, TSVM, Invar-SVM, NLR (ILS), NSVM (ILS).]
Figure 3: Value of F-measure of predictive models. Error bars indicate standard errors.
method vs. method   SVM     LR      SVMT    Invar-SVM   NLR     NSVM
SVM                 0:0     40:2    53:0    30:20       8:57    2:65
LR                  2:40    0:0     49:5    19:29       5:59    2:71
SVMT                0:53    5:49    0:0     9:47        2:70    2:74
Invar-SVM           20:30   29:19   47:9    0:0         5:57    3:57
NLR                 57:8    59:5    70:2    57:5        0:0     22:30
NSVM                65:2    71:2    74:2    57:3        30:22   0:0
Table 3: Results of paired t-test over all corpora: Number of trials in which each method (row)
has significantly outperformed each other method (column) vs. number of times it was
outperformed.
6.5 Efficiency versus Effectiveness
To assess the predictive performance as well as the execution time as a function of the sample size, we train the baselines and the two NPG instances for a varying number of training examples. We report on the results for the ESP data set in Figure 4. The game-theoretic models significantly outperform the trivial baseline methods logistic regression, the SVM, and the SVMT, especially for small data sets. However, this comes at the price of considerably higher computational cost. The ILS algorithm requires in general only a couple of iterations to converge; however, in each iteration several optimization problems have to be solved, so that the total execution time is up to a factor of 150 larger than that of the corresponding ERM baseline. In contrast to the ILS algorithm, a single iteration of the EDS algorithm does not require solving nested optimization problems. However, the execution time of the EDS algorithm is still higher as it often requires several thousand iterations to fully converge. For larger data sets, the discrepancy in predictive performance between game-theoretic models and the i.i.d. baselines decreases. Our results do not provide conclusive evidence whether ILS or EDS is faster at solving the optimization problems. We conclude that the benefit of the NPG prediction models over the classification baselines is greatest for small to medium sample sizes.

[Figure 4: left, F-measure on the ESP corpus against the number of training emails (50 to 3,200); right, execution time in seconds on a log scale; legend: SVM, LR, TSVM, Invar-SVM, NLR (ILS), NSVM (ILS), NLR (EDS), NSVM (EDS).]
Figure 4: Predictive performance (left) and execution time (right) for varying sizes of the training data set.
6.6 Nash-Equilibrial Transformation
In contrast to Invar-SVM, the Nash models allow the data generator to modify non-spam emails. However, in practice most senders of legitimate messages do not deliberately change their writing behavior in order to bypass spam filters, perhaps with the exception of senders of newsletters, who must be careful not to trigger filtering mechanisms.
In a final experiment, we want to study whether the Nash model reflects this aspect of reality, and how the data generator's regularizer affects this transformation. The training portion again contains $n_{+1} = 200$ spam and $n_{-1} = 200$ non-spam instances randomly chosen from the oldest 4,000 emails. We determine the Nash equilibrium and measure the number of additions and deletions to spam and non-spam emails in $\dot{D}$:
$$\mathrm{add}_{-1} := \frac{1}{n_{-1}} \sum_{i: y_i = -1} \sum_{j=1}^m \max(0,\, \dot{x}_{i,j} - x_{i,j}) \qquad \mathrm{add}_{+1} := \frac{1}{n_{+1}} \sum_{i: y_i = +1} \sum_{j=1}^m \max(0,\, \dot{x}_{i,j} - x_{i,j})$$
$$\mathrm{del}_{-1} := \frac{1}{n_{-1}} \sum_{i: y_i = -1} \sum_{j=1}^m \max(0,\, x_{i,j} - \dot{x}_{i,j}) \qquad \mathrm{del}_{+1} := \frac{1}{n_{+1}} \sum_{i: y_i = +1} \sum_{j=1}^m \max(0,\, x_{i,j} - \dot{x}_{i,j})$$
where $x_{i,j}$ indicates the presence of token $j$ in the $i$-th training email; that is, $\mathrm{add}_v$ and $\mathrm{del}_v$ denote the average number of word additions and deletions per spam and non-spam email performed by the sender.
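With rows of binary matrices `X` and `X_dot` as the original and transformed bag-of-word vectors, the four statistics can be computed as follows (a sketch with names of our own choosing):

```python
import numpy as np

def transformation_stats(X, X_dot, y):
    """Average per-email word additions and deletions, split by class label."""
    stats = {}
    for label in (-1, +1):
        mask = (y == label)
        diff = X_dot[mask] - X[mask]
        stats["add", label] = float(np.sum(np.maximum(0.0, diff)) / mask.sum())
        stats["del", label] = float(np.sum(np.maximum(0.0, -diff)) / mask.sum())
    return stats

X = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])      # one spam, one non-spam email
X_dot = np.array([[0.0, 1.0, 1.0], [0.0, 1.0, 1.0]])  # spam: drop token 0, add token 2
y = np.array([+1, -1])
print(transformation_stats(X, X_dot, y))
# spam (+1): one addition and one deletion on average; non-spam (-1): unchanged
```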
Figure 5 shows the number of additions and deletions of the Nash transformation as a function
of the adversary’s regularization parameter for the ESP data set. Table 4 reports on the average
number of word additions and deletions for all data sets. For Invar-SVM, we set the number of
possible deletions to K = 25.
             ESP                          Mailinglist
             non-spam      spam          non-spam      spam
model        add    del    add    del    add    del    add    del
Invar-SVM    0.0    0.0    0.0    24.8   0.0    0.0    0.0    23.9
NLR          0.7    1.0    22.5   31.2   0.3    0.4    8.6    10.9
NSVM         0.4    0.5    17.9   23.8   0.3    0.3    6.9    8.4

             Private                      TREC 2007
             non-spam      spam          non-spam      spam
model        add    del    add    del    add    del    add    del
Invar-SVM    0.0    0.0    0.0    24.2   0.0    0.0    0.0    24.7
NLR          0.4    0.2    24.3   11.2   0.2    0.2    15.0   11.4
NSVM         0.1    0.1    15.6   7.3    0.2    0.1    11.1   8.4

Table 4: Average number of word additions and deletions per training email.
The Nash-equilibrial transformation imposes almost no changes on any non-spam email; the number of modifications declines as the regularization parameter grows (see Figure 5). We observe for all data sets that even if the total amount of transformation differs for NLR and NSVM, both instances behave similarly insofar as the number of word additions and deletions continues to grow when the adversary's regularizer decreases.

[Figure 5: two panels plotting the number of modifications against the adversary's regularization parameter $\rho_{+1}$ (log scale): non-spam additions $\mathrm{add}_{-1}$, non-spam deletions $\mathrm{del}_{-1}$, spam additions $\mathrm{add}_{+1}$, and spam deletions $\mathrm{del}_{+1}$, for NLR (left) and NSVM (right).]
Figure 5: Average number of additions and deletions per spam/non-spam email for NLR (left) and NSVM (right) with respect to the adversary's regularization parameter $\rho_{+1}$ for fixed $\rho_{-1} = n^{-1}$.

7. Conclusion

We studied prediction games in which the learner and the data generator have conflicting but not necessarily directly antagonistic cost functions. We focused on static games in which learner and data generator have to commit simultaneously to a predictive model and a transformation of the input distribution, respectively.

The cost-minimizing action of each player depends on the opponent's move; in the absence of information about the opponent's move, players may choose to play a Nash equilibrium strategy, which constitutes a cost-minimizing move for each player if the other player follows the equilibrium as well. Because a combination of actions from distinct equilibria may lead to arbitrarily high costs for either player, we have studied conditions under which a prediction game can be guaranteed to possess a unique Nash equilibrium. Lemma 1 identifies conditions under which at least one equilibrium exists, and Theorem 8 elaborates on when this equilibrium is unique. We propose an inexact linesearch approach and a modified extragradient approach for identifying this unique equilibrium. Empirically, both approaches perform quite similarly.
We derived Nash logistic regression and Nash support vector machine models as well as kernelized
versions of these methods. Corollaries 11 and 15 specialize Theorem 8 and state conditions
on the players' regularization parameters under which the Nash logistic regression and the Nash support
vector machine converge to a unique Nash equilibrium. Empirically, we find that both methods
identify unique Nash equilibria when the bounds laid out in Corollaries 11 and 15 are satisfied, and
even when they are violated by a factor of up to 4. From our experiments on several email corpora we conclude that the Nash
logistic regression and the Nash support vector machine outperform their i.i.d. baselines and Invar-SVM
for the problem of classifying future emails based on training data from the past.
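As an illustration of the equilibrium computation, the modified extragradient idea can be sketched on a toy two-player game. The cost functions f1, f2, the step size eta, and the iteration count below are illustrative assumptions for this sketch, not the paper's actual learner and adversary objectives; the scheme shown is the basic extragradient prediction-correction step.

```python
# Toy two-player game: each player minimizes its own cost in its own variable.
# f1(x, y) = (x - 1)^2 + x*y   (player 1 controls x)
# f2(x, y) = (y + 1)^2 - x*y   (player 2 controls y)
# The unique Nash equilibrium solves 2(x-1) + y = 0 and 2(y+1) - x = 0,
# i.e. (x, y) = (1.2, -0.4).

def grad_x(x, y):
    # Partial derivative of player 1's cost with respect to x.
    return 2.0 * (x - 1.0) + y

def grad_y(x, y):
    # Partial derivative of player 2's cost with respect to y.
    return 2.0 * (y + 1.0) - x

def extragradient(x, y, eta=0.1, iters=500):
    for _ in range(iters):
        # Prediction step: both players take a provisional gradient step.
        xh = x - eta * grad_x(x, y)
        yh = y - eta * grad_y(x, y)
        # Correction step: gradients are re-evaluated at the predicted point.
        x = x - eta * grad_x(xh, yh)
        y = y - eta * grad_y(xh, yh)
    return x, y

x_star, y_star = extragradient(0.0, 0.0)
# Converges to the unique Nash equilibrium (1.2, -0.4).
```

The re-evaluation of the gradients at the predicted point is what distinguishes the extragradient step from plain simultaneous gradient descent, which can cycle or diverge on games of this form.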
Acknowledgments
This work was supported by the German Science Foundation (DFG) under grant
SCHE540/12-1 and by STRATO AG. We thank Niels Landwehr and Christoph Sawade for
constructive comments and suggestions, and the anonymous reviewer for helpful contributions and
careful proofreading of the manuscript.
References
Tamer Basar and Geert J. Olsder. Dynamic Noncooperative Game Theory. Society for Industrial
and Applied Mathematics, 1999.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Michael Brückner and Tobias Scheffer. Nash equilibria of static prediction games. In Advances in
Neural Information Processing Systems. MIT Press, 2009.
Michael Brückner and Tobias Scheffer. Stackelberg games for adversarial prediction problems. In
Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD), San Diego, CA, USA. ACM, 2011.
Ofer Dekel and Ohad Shamir. Learning to classify with missing and corrupted features. In
Proceedings of the International Conference on Machine Learning. ACM, 2008.
Ofer Dekel, Ohad Shamir, and Lin Xiao. Learning to classify with missing and corrupted features.
Machine Learning, 81(2):149–178, 2010.
Carl Geiger and Christian Kanzow. Theorie und Numerik restringierter Optimierungsaufgaben.
Springer, 1999.
Laurent El Ghaoui, Gert R. G. Lanckriet, and Georges Natsoulis. Robust classification with
interval data. Technical Report UCB/CSD-03-1279, EECS Department, University of California,
Berkeley, 2003.
Amir Globerson and Sam T. Roweis. Nightmare at test time: Robust learning by feature deletion.
In Proceedings of the International Conference on Machine Learning. ACM, 2006.
Amir Globerson, Choon Hui Teo, Alex J. Smola, and Sam T. Roweis. An adversarial view of
covariate shift and a minimax approach. In Dataset Shift in Machine Learning, pages 179–198.
MIT Press, 2009.
Patrick T. Harker and Jong-Shi Pang. Finite-dimensional variational inequality and nonlinear
complementarity problems: A survey of theory, algorithms and applications. Mathematical
Programming, 48(2):161–220, 1990.
Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press,
Cambridge, 1991.
James T. Kwok and Ivor W. Tsang. The pre-image problem in kernel methods. In Proceedings
of International Conference on Machine Learning, pages 408–415. AAAI Press, 2003. ISBN
1-57735-189-4.
Gert R. G. Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I. Jordan. A robust
minimax approach to classification. Journal of Machine Learning Research, 3:555–582, 2002.
Sebastian Mika, Bernhard Schölkopf, Alex J. Smola, Klaus-Robert Müller, Matthias Scholz, and
Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information
Processing Systems. MIT Press, 1999.
J. Ben Rosen. Existence and uniqueness of equilibrium points for concave n-person games.
Econometrica, 33(3):520–534, 1965.
Bernhard Schölkopf and Alex J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA,
2002.
Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In
COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann
Publishers, 2001.
Christian Siefkes, Fidelis Assis, Shalendra Chhabra, and William S. Yerazunis. Combining Winnow
and orthogonal sparse bigrams for incremental spam filtering. In Proceedings of the 8th European
Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), volume
3202 of Lecture Notes in Artificial Intelligence, pages 410–421. Springer, 2004.
Choon Hui Teo, Amir Globerson, Sam T. Roweis, and Alex J. Smola. Convex learning with
invariances. In Advances in Neural Information Processing Systems. MIT Press, 2007.
Anna von Heusinger and Christian Kanzow. Relaxation methods for generalized Nash equilibrium
problems with inexact line search. Journal of Optimization Theory and Applications, 143(1):
159–183, 2009.