3.2.4 PRACTICAL IMPLICATIONS OF ASSUMPTIONS 2 AND 3
Theorem 8 guarantees the uniqueness of the equilibrium only if the cost functions of learner and
data generator relate in a certain way that is defined by Assumption 3. In addition, each of the
cost functions has to satisfy Assumption 2. This section discusses the practical implications of these assumptions.
The conditions of Assumption 2 impose rather technical limitations on the cost functions. The requirement of convexity is quite ordinary in the machine learning context. In addition, the loss function has to be twice continuously differentiable, which restricts the family of eligible loss functions. However, this condition can still be met easily, for instance, by smoothed versions of the hinge loss. The second requirement of uniformly strongly convex and twice continuously differentiable regularizers is, again, only a weak restriction in practice. These requirements are met by
standard regularizers; they occur, for instance, in the optimization criteria of SVMs and logistic
regression. The requirement of non-empty, compact, and convex action spaces may be a restriction
when dealing with binary or multinomial attributes. However, relaxing the action spaces of the data
generator would typically result in a strategy that is more defensive than would be optimal but still
less defensive than a worst-case strategy.
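For illustration, one eligible choice (this particular smoothing and the parameter $\gamma$ are our example, not a construction prescribed by the assumption) is the softplus-smoothed hinge loss
\[
\ell_\gamma(z) = \tfrac{1}{\gamma} \log\bigl(1 + e^{\gamma(1-z)}\bigr), \qquad z = y f_{\mathbf{w}}(\dot{\mathbf{x}}_i),
\]
which approaches the hinge loss $\max(0, 1-z)$ as $\gamma \to \infty$ and satisfies
\[
\ell_\gamma''(z) = \gamma\, \sigma(\gamma(1-z)) \bigl(1 - \sigma(\gamma(1-z))\bigr) > 0,
\]
where $\sigma$ denotes the logistic function; the loss is therefore convex and twice continuously differentiable. Paired with the uniformly strongly convex $\ell_2$ regularizer $\frac{\rho}{2}\|\mathbf{w}\|_2^2$, both requirements on the cost functions are met.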
The first condition of Assumption 3 requires the cost functions of learner and data generator to have the same curvatures. This is a crucial restriction; if the cost functions differ arbitrarily, the Nash equilibrium may not be unique. The requirement of identical curvatures is met, for instance, if one player chooses a loss function $\ell(f_{\mathbf{w}}(\dot{\mathbf{x}}_i), y)$ that depends only on the term $y f_{\mathbf{w}}(\dot{\mathbf{x}}_i)$, such as the SVM's hinge loss or the logistic loss. In this case, the condition is met when the other player chooses the loss $\ell(-f_{\mathbf{w}}(\dot{\mathbf{x}}_i), y)$. This loss is in some sense the opposite of $\ell(f_{\mathbf{w}}(\dot{\mathbf{x}}_i), y)$, as it approaches zero when the other goes to infinity and vice versa. Even so, the cost functions may still be non-antagonistic because each player's cost function may contain instance-specific cost factors $c_{v,i}$ that can be modeled independently for the players.
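This relationship can be verified directly for the logistic loss. Writing $z = f_{\mathbf{w}}(\dot{\mathbf{x}}_i)$ and $y \in \{-1, +1\}$, the learner's loss $\log(1 + e^{-yz})$ and the opposing loss $\log(1 + e^{yz})$ possess identical second derivatives,
\[
\frac{\partial^2}{\partial z^2} \log\bigl(1 + e^{-yz}\bigr) = y^2\, \sigma(yz)\, \sigma(-yz) = \frac{\partial^2}{\partial z^2} \log\bigl(1 + e^{yz}\bigr),
\]
so both cost functions curve identically in the model output even though one loss increases exactly where the other decreases.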
The second part of Assumption 3 couples the degree of regularization of the players. If the data
generator produces instances at application time that differ greatly from the instances at training
time, then the learner is required to regularize strongly for a unique equilibrium to exist. If the
distributions at training and application time are more similar, the equilibrium is unique for smaller
values of the learner’s regularization parameters. This requirement is in line with the intuition that
when the training instances are a poor approximation of the distribution at application time, then
imposing only weak regularization on the loss function will result in a poor model.
The final requirement of Assumption 3 is, again, a rather technical limitation. It states that the interdependencies between the players' instance-specific costs must be captured either by the regularizers, leading to a full Hessian, or by cost factors. These cost factors of learner and data generator may differ arbitrarily if the gradient of the data generator's costs of transforming an instance $\mathbf{x}_i$ into $\dot{\mathbf{x}}_i$ is independent of all other instances $\dot{\mathbf{x}}_j$ with $j \neq i$. This is met, for instance, by cost models that only depend on some measure of the distance between $\mathbf{x}_i$ and $\dot{\mathbf{x}}_i$.
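As a sketch of such a cost model (the squared Euclidean distance is our example of a distance measure), consider transformation costs
\[
\Omega_{-1}(\dot{\mathbf{x}}) = \sum_{i=1}^{n} c_{-1,i}\, \|\dot{\mathbf{x}}_i - \mathbf{x}_i\|_2^2,
\]
whose gradient with respect to $\dot{\mathbf{x}}_i$ is $2 c_{-1,i} (\dot{\mathbf{x}}_i - \mathbf{x}_i)$ and thus involves no instance $\dot{\mathbf{x}}_j$ with $j \neq i$; the corresponding Hessian is block-diagonal, and the data generator's cost factors $c_{-1,i}$ may be chosen independently of the learner's factors $c_{+1,i}$.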
4. Finding the Unique Nash Equilibrium
According to Theorem 8, a unique equilibrium of the Nash prediction game in (3) exists for suitable
loss functions and regularizers. To find this equilibrium, we derive and study two distinct methods:
The first is based on the Nikaido-Isoda function that is constructed such that a minimax solution of
this function is an equilibrium of the Nash prediction game and vice versa.
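For reference, the Nikaido-Isoda function of this two-player game takes its standard form (writing $\theta_{+1}(\mathbf{w}, \dot{\mathbf{x}})$ and $\theta_{-1}(\mathbf{w}, \dot{\mathbf{x}})$ as shorthand, assumed here, for the learner's and the data generator's full cost functions),
\[
\Psi\bigl((\mathbf{w}, \dot{\mathbf{x}}), (\mathbf{w}', \dot{\mathbf{x}}')\bigr) = \theta_{+1}(\mathbf{w}, \dot{\mathbf{x}}) - \theta_{+1}(\mathbf{w}', \dot{\mathbf{x}}) + \theta_{-1}(\mathbf{w}, \dot{\mathbf{x}}) - \theta_{-1}(\mathbf{w}, \dot{\mathbf{x}}'),
\]
which sums the cost savings each player could realize by unilaterally deviating; a point is an equilibrium exactly when its worst-case value $\max_{\mathbf{w}', \dot{\mathbf{x}}'} \Psi$ equals zero. This problem is then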