May 2012 Report LIDS - 2884
Weighted Sup-Norm Contractions in Dynamic Programming:
A Review and Some New Applications
Dimitri P. Bertsekas
Abstract
We consider a class of generalized dynamic programming models based on weighted sup-norm contrac-
tions. We provide an analysis that parallels the one available for discounted MDP and for generalized models
based on unweighted sup-norm contractions. In particular, we discuss the main properties and associated
algorithms of these models, including value iteration, policy iteration, and their optimistic and approximate
variants. The analysis relies on several earlier works that use more specialized assumptions. In particular,
we review and extend the classical results of Denardo [Den67] for unweighted sup-norm contraction models,
as well as more recent results relating to approximation methods for discounted MDP. We also apply the
analysis to stochastic shortest path problems where all policies are assumed proper. For these problems we
extend three results that are known for discounted MDP. The first relates to the convergence of optimistic
policy iteration and extends a result of Rothblum [Rot79], the second relates to error bounds for approxi-
mate policy iteration and extends a result of Bertsekas and Tsitsiklis [BeT96], and the third relates to error
bounds for approximate optimistic policy iteration and extends a result of Thiery and Scherrer [ThS10b].
Dimitri Bertsekas is with the Dept. of Electr. Engineering and Comp. Science, M.I.T., Cambridge, Mass., 02139.
His research was supported by NSF Grant ECCS-0801549, and by the Air Force Grant FA9550-10-1-0412.
1. INTRODUCTION
Two key structural properties of total cost dynamic programming (DP) models are responsible for most of
the mathematical results one can prove about them. The first is the monotonicity property of the mappings
associated with Bellman’s equation. In many models, however, these mappings have another property
that strengthens the effects of monotonicity: they are contraction mappings with respect to a sup-norm,
unweighted in many models, such as discounted finite-spaces Markovian decision problems (MDP), but also
weighted in some other models, discounted or undiscounted. An important case of the latter is the class of
stochastic shortest path (SSP) problems, under certain conditions to be discussed in Section 7.
The role of contraction mappings in discounted DP was first recognized and exploited by Shapley
[Sha53], who considered two-player dynamic games. Since that time the underlying contraction properties
of discounted DP problems have been explicitly or implicitly used by most authors that have dealt with the
subject. An abstract DP model, based on unweighted sup-norm contraction assumptions, was introduced
in an important paper by Denardo [Den67]. This model provided generality and insight into the principal
analytical and algorithmic ideas underlying the discounted DP research up to that time. Denardo’s model
motivated a related model by the author [Ber77], which relies only on monotonicity properties, and not
on contraction assumptions. These two models were used extensively in the book by Bertsekas and Shreve
[BeS78] for the analysis of both discounted and undiscounted DP problems, ranging over MDP, minimax,
risk sensitive, Borel space models, and models based on outer integration. Related analysis, motivated by
problems in communications, was given by Verdú and Poor [VeP84], [VeP87]. See also Bertsekas and Yu
[BeY10b], which considers policy iteration methods using the abstract DP model of [Ber77].
In this paper, we extend Denardo’s model to weighted sup-norm contractions, and we provide a full
set of analytical and algorithmic results that parallel the classical ones for finite-spaces discounted MDP,
as well as some of the corresponding results for unweighted sup-norm contractions. These results include
extensions of relatively recent research on approximation methods, which have been shown for discounted
MDP with bounded cost per stage. Our motivation stems from the fact that there are important discounted
DP models with unbounded cost per stage, as well as undiscounted DP models of the SSP type, where there
is contraction structure that requires, however, a weighted sup-norm. We obtain, among others, three new
algorithmic results for SSP problems, which are given in Section 7. The first relates to the convergence of
optimistic (also commonly referred to as “modified” [Put94]) policy iteration, and extends the one originally
proved by Rothblum [Rot79] within Denardo’s unweighted sup-norm contraction framework. The second
relates to error bounds for approximate policy iteration, and extends a result of Bertsekas and Tsitsiklis
[BeT96] (Prop. 6.2), given for discounted MDP, and improves on another result of [BeT96] (Prop. 6.3) for
SSP. The third relates to error bounds for approximate optimistic policy iteration, and extends a result of
Thiery and Scherrer [ThS10a], [ThS10b], given for discounted MDP. A recently derived error bound for a
Q-learning framework for optimistic policy iteration in SSP problems, due to Yu and Bertsekas [YuB11], can
also be proved using our framework.
2. A WEIGHTED SUP-NORM CONTRACTION FRAMEWORK FOR DP
Let X and U be two sets, which in view of connections to DP that will become apparent shortly, we will
loosely refer to as a set of “states” and a set of “controls.” For each x ∈ X, let U(x) ⊂ U be a nonempty
subset of controls that are feasible at state x. Consistent with the DP context, we refer to a function
µ : X ↦ U with µ(x) ∈ U(x), for all x ∈ X, as a “policy.” We denote by M the set of all policies.
Let R(X) be the set of real-valued functions J : X ↦ ℜ, and let H : X × U × R(X) ↦ ℜ be a given
mapping. We consider the mapping T defined by
\[
(TJ)(x) = \inf_{u \in U(x)} H(x, u, J), \qquad x \in X.
\]
We assume that (TJ)(x) > −∞ for all x ∈ X, so that T maps R(X) into R(X). For each policy µ ∈ M, we
consider the mapping T_µ : R(X) ↦ R(X) defined by
\[
(T_\mu J)(x) = H\bigl(x, \mu(x), J\bigr), \qquad x \in X.
\]
We want to find a function J* ∈ R(X) such that
\[
J^*(x) = \inf_{u \in U(x)} H(x, u, J^*), \qquad x \in X,
\]
i.e., find a fixed point of T. We also want to obtain a policy µ* such that T_{µ*} J* = T J*.
Note that in view of the preceding definitions, H may be alternatively defined by first specifying T_µ
for all µ ∈ M [for any (x, u, J), H(x, u, J) is equal to (T_µ J)(x) for any µ such that µ(x) = u]. Moreover T may
be defined by
\[
(TJ)(x) = \inf_{\mu \in \mathcal{M}} (T_\mu J)(x), \qquad x \in X,\ J \in R(X).
\]
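For concreteness, the following Python sketch (an illustration with hypothetical data, not part of the paper's analysis) instantiates H, T_µ, and T for a small finite-state, finite-control problem, exactly as in the definitions above.

```python
# A minimal sketch of the abstract framework: X and U(x) are small finite sets,
# H is supplied as a function, and T, T_mu are built from it as defined above.
# The concrete H used here (a 2-state discounted example) is only an illustration.
ALPHA = 0.9
X = [0, 1]
U = {0: ['a', 'b'], 1: ['a']}                      # feasible controls U(x)

def H(x, u, J):
    # An illustrative discounted-cost mapping: cost g(x,u) plus ALPHA * J(next state).
    g = {(0, 'a'): 1.0, (0, 'b'): 2.0, (1, 'a'): 0.5}
    nxt = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1}
    return g[(x, u)] + ALPHA * J[nxt[(x, u)]]

def T_mu(mu, J):
    # (T_mu J)(x) = H(x, mu(x), J)
    return {x: H(x, mu[x], J) for x in X}

def T(J):
    # (T J)(x) = min over u in U(x) of H(x, u, J)
    return {x: min(H(x, u, J) for u in U[x]) for x in X}

J = {x: 0.0 for x in X}
print(T(J))                                        # one application of T
print(T_mu({0: 'b', 1: 'a'}, J))                   # one application of T_mu
```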
We give a few examples.
Example 2.1 (Discounted DP Problems)
Consider an α-discounted total cost DP problem. Here
\[
H(x, u, J) = E\bigl\{ g(x, u, w) + \alpha J\bigl(f(x, u, w)\bigr) \bigr\},
\]
where α ∈ (0, 1), g is a uniformly bounded function representing cost per stage, w is random with distribution
that may depend on (x, u), and the expected value is taken with respect to that distribution. The equation J = TJ, i.e.,
\[
J(x) = \inf_{u \in U(x)} H(x, u, J) = \inf_{u \in U(x)} E\bigl\{ g(x, u, w) + \alpha J\bigl(f(x, u, w)\bigr) \bigr\}, \qquad x \in X,
\]
is Bellman’s equation, and it is known to have unique solution J*. Variants of the above mapping H are
\[
H(x, u, J) = \min\Bigl[ V(x),\ E\bigl\{ g(x, u, w) + \alpha J\bigl(f(x, u, w)\bigr) \bigr\} \Bigr],
\]
and
\[
H(x, u, J) = E\Bigl\{ g(x, u, w) + \alpha \min\bigl[ V\bigl(f(x, u, w)\bigr),\ J\bigl(f(x, u, w)\bigr) \bigr] \Bigr\},
\]
where V is a known function that satisfies V(x) ≥ J*(x) for all x ∈ X. While the use of V in these variants
of H does not affect the solution J*, it may affect favorably the value and policy iteration algorithms to be
discussed in subsequent sections.
Example 2.2 (Discounted Semi-Markov Problems)
With x, y, u as in Example 2.1, consider the mapping
\[
H(x, u, J) = G(x, u) + \sum_{y=1}^{n} m_{xy}(u)\, J(y),
\]
where G is some function representing cost per stage, and m_{xy}(u) are nonnegative numbers with
\[
\sum_{y=1}^{n} m_{xy}(u) < 1, \qquad x \in X,\ u \in U(x).
\]
The equation J = TJ is Bellman’s equation for a continuous-time semi-Markov
decision problem, after it is converted into an equivalent discrete-time problem.
Example 2.3 (Minimax Problems)
Consider a minimax version of Example 2.1, where an antagonistic player chooses v from a set V(x, u), and let
\[
H(x, u, J) = \sup_{v \in V(x,u)} \bigl[ g(x, u, v) + \alpha J\bigl(f(x, u, v)\bigr) \bigr].
\]
Then the equation J = TJ is Bellman’s equation for an infinite horizon minimax DP problem. A generalization
is a mapping of the form
\[
H(x, u, J) = \sup_{v \in V(x,u)} E\bigl\{ g(x, u, v, w) + \alpha J\bigl(f(x, u, v, w)\bigr) \bigr\},
\]
where w is random with given distribution, and the expected value is with respect to that distribution. This
form appears in zero-sum sequential games [Sha53].
Example 2.4 (Deterministic and Stochastic Shortest Path Problems)
Consider a classical deterministic shortest path problem involving a graph of n nodes x = 1, . . . , n, plus a
destination 0, an arc length a_{xu} for each arc (x, u), and the mapping
\[
H(x, u, J) =
\begin{cases}
a_{xu} + J(u) & \text{if } u \neq 0, \\
a_{x0} & \text{if } u = 0,
\end{cases}
\qquad x = 1, \ldots, n,\ u = 0, 1, \ldots, n.
\]
Then the equation J = TJ is Bellman’s equation for the shortest distances J*(x) from the nodes x to node 0.
A generalization is a mapping of the form
\[
H(x, u, J) = p_{x0}(u)\, g(x, u, 0) + \sum_{y=1}^{n} p_{xy}(u) \bigl( g(x, u, y) + J(y) \bigr), \qquad x = 1, \ldots, n.
\]
It corresponds to an SSP problem, which is described in Section 7. A special case is stochastic finite-horizon,
finite-state DP problems.
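As an illustration of the last mapping above, the following sketch (with hypothetical transition data and unit costs, not taken from the paper) forms H for a small SSP problem and applies T repeatedly.

```python
import numpy as np

# Sketch of the SSP mapping H(x,u,J) = p_x0(u) g + sum_y p_xy(u) (g + J(y)), for a
# hypothetical 2-state problem with destination 0, two controls, and unit costs.
n = 2
# p[u][x][y]: probability of moving from state x to y under control u; y = 0 is the destination
p = {0: np.array([[0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]),
     1: np.array([[0.9, 0.1, 0.0], [0.2, 0.6, 0.2]])}
g = 1.0                                          # unit cost per transition, for simplicity

def H(x, u, J):
    probs = p[u][x]                              # [p_x0(u), p_x1(u), p_x2(u)]
    return probs[0] * g + sum(probs[y + 1] * (g + J[y]) for y in range(n))

def T(J):
    return np.array([min(H(x, u, J) for u in (0, 1)) for x in range(n)])

J = np.zeros(n)
for _ in range(200):                             # value iteration; converges when all
    J = T(J)                                     # policies are proper (see Section 7)
print(J)                                         # approximate shortest expected costs
```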
Example 2.5 (Q-Learning I)
Consider the case where X is the set of state-control pairs (i, w), i = 1, . . . , n, w ∈ W(i), of an MDP with
controls w taking values at state i from a finite set W(i). Let T_µ map a Q-factor vector
\[
Q = \bigl\{ Q(i, w) \mid i = 1, \ldots, n,\ w \in W(i) \bigr\}
\]
into the Q-factor vector
\[
\bar{Q}_\mu = \bigl\{ \bar{Q}_\mu(i, w) \mid i = 1, \ldots, n,\ w \in W(i) \bigr\}
\]
with components given by
\[
\bar{Q}_\mu(i, w) = g(i, w) + \alpha \sum_{j=1}^{n} p_{ij}\bigl(\mu(i)\bigr) \min_{v \in W(j)} Q(j, v), \qquad i = 1, \ldots, n,\ w \in W(i).
\]
This mapping corresponds to the classical Q-learning mapping of a finite-state MDP [in relation to the stan-
dard Q-learning framework, [Tsi94], [BeT96], [SuB98], µ applies a control µ(i) from the set U(i, w) = W(i)
independently of the value of w ∈ W(i)]. If α ∈ (0, 1), the MDP is discounted, while if α = 1, the MDP is
undiscounted and when there is a cost-free and absorbing state, it has the character of the SSP problem of the
preceding example.
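A short sketch of the Q-factor mapping above follows, with hypothetical transition probabilities and costs; it simply iterates the mapping for a fixed policy µ.

```python
import numpy as np

# Sketch of the Q-factor mapping of Example 2.5 for a hypothetical 2-state MDP:
# Q_bar_mu(i,w) = g(i,w) + alpha * sum_j p_ij(mu(i)) * min_v Q(j,v).
ALPHA = 0.95
n, m = 2, 2                                   # n states, m controls per state
g = np.array([[1.0, 2.0], [0.5, 0.3]])        # g[i, w]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],       # P[i, w, j] = p_ij(w)
              [[0.5, 0.5], [0.3, 0.7]]])

def T_mu_Q(mu, Q):
    Qmin = Q.min(axis=1)                      # min_v Q(j, v) for each j
    Qbar = np.empty_like(Q)
    for i in range(n):
        for w in range(m):
            Qbar[i, w] = g[i, w] + ALPHA * P[i, mu[i]] @ Qmin
    return Qbar

Q = np.zeros((n, m))
mu = [0, 1]                                   # a fixed policy mu(i)
for _ in range(50):
    Q = T_mu_Q(mu, Q)
print(Q)
```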
Example 2.6 (Q-Learning II)
Consider an alternative Q-learning framework introduced in [BeY10a] for discounted MDP and in [YuB11] for
SSP, where T_µ operates on pairs (Q, V ), and using the notation of the preceding example, Q is a Q-factor vector and
V is a cost vector of the forms
\[
\bigl\{ Q(i, w) \mid i = 1, \ldots, n,\ w \in W(i) \bigr\}, \qquad \bigl\{ V(i) \mid i = 1, \ldots, n \bigr\}.
\]
Let T_µ map a pair (Q, V ) into the pair (\bar{Q}_µ, \bar{V}_µ) with components given by
\[
\bar{Q}_\mu(i, w) = g(i, w) + \alpha \sum_{j=1}^{n} p_{ij}\bigl(\mu(i)\bigr) \sum_{v \in W(j)} \nu(v \mid j) \min\bigl[ V(j),\ Q(j, v) \bigr], \qquad i = 1, \ldots, n,\ w \in W(i),
\]
\[
\bar{V}_\mu(i) = \min_{w \in W(i)} \bar{Q}_\mu(i, w), \qquad i = 1, \ldots, n,
\]
where ν(· | j) is a given conditional distribution over W(j), and α ∈ (0, 1) for a discounted MDP and α = 1 for
an SSP problem.
We also note a variety of discounted countable-state MDP models with unbounded cost per stage,
whose Bellman equation mapping involves a weighted sup-norm contraction. Such models are described in
several sources, starting with works of Harrison [Har72], and Lippman [Lip73], [Lip75] (see also [Ber12],
Section 1.5, and [Put94], and the references quoted there).
Consider a function v : X ↦ ℜ with
\[
v(x) > 0, \qquad x \in X,
\]
denote by B(X) the space of real-valued functions J on X such that J(x)/v(x) is bounded as x ranges over
X, and consider the weighted sup-norm
\[
\|J\| = \sup_{x \in X} \frac{\bigl|J(x)\bigr|}{v(x)}
\]
on B(X). We will use the following assumption.
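In code, the weighted sup-norm over a finite state space is a one-liner; the following sketch uses a hypothetical weight function v.

```python
import numpy as np

# Sketch: the weighted sup-norm ||J|| = sup_x |J(x)| / v(x) over a finite state space,
# with a hypothetical positive weight function v.
v = np.array([1.0, 2.0, 5.0])          # v(x) > 0 for all x

def weighted_sup_norm(J, v):
    return np.max(np.abs(J) / v)

J1 = np.array([0.5, -3.0, 10.0])
J2 = np.array([0.4, -2.0, 12.0])
print(weighted_sup_norm(J1, v))        # ||J1||
print(weighted_sup_norm(J1 - J2, v))   # distance used in the contraction condition (2.1)
```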
Assumption 2.1: (Contraction) For all J ∈ B(X) and µ ∈ M, the functions T_µJ and TJ belong
to B(X). Furthermore, for some α ∈ (0, 1), we have
\[
\|T_\mu J - T_\mu J'\| \le \alpha \|J - J'\|, \qquad J, J' \in B(X),\ \mu \in \mathcal{M}. \tag{2.1}
\]
An equivalent way to state the condition (2.1) is
\[
\frac{\bigl| H(x, u, J) - H(x, u, J') \bigr|}{v(x)} \le \alpha \|J - J'\|, \qquad x \in X,\ u \in U(x),\ J, J' \in B(X).
\]
Note that Eq. (2.1) implies that
\[
\|TJ - TJ'\| \le \alpha \|J - J'\|, \qquad J, J' \in B(X). \tag{2.2}
\]
To see this we write
\[
(T_\mu J)(x) \le (T_\mu J')(x) + \alpha \|J - J'\|\, v(x), \qquad x \in X,
\]
from which, by taking the infimum of both sides over µ ∈ M, we have
\[
\frac{(TJ)(x) - (TJ')(x)}{v(x)} \le \alpha \|J - J'\|, \qquad x \in X.
\]
Reversing the roles of J and J', we also have
\[
\frac{(TJ')(x) - (TJ)(x)}{v(x)} \le \alpha \|J - J'\|, \qquad x \in X,
\]
and combining the preceding two relations, and taking the supremum of the left side over x ∈ X, we obtain
Eq. (2.2).
It can be seen that the Contraction Assumption 2.1 is satisfied for the mapping H in Examples 2.1-
2.3, and the discounted cases of Examples 2.5-2.6, with v equal to the unit function, i.e., v(x) ≡ 1. Generally, the
assumption is not satisfied in Example 2.4, and in the undiscounted cases of Examples 2.5-2.6, but it will be
seen later that it is satisfied for the special case of the SSP problem under the assumption that all stationary
policies are proper (lead to the destination with probability 1, starting from every state). In that case,
however, we cannot take v(x) ≡ 1, and this is one of our motivations for considering the more general case
where v is not the unit function.
The next two examples show how starting with mappings satisfying the contraction assumption, we
can obtain multistep mappings with the same fixed points and a stronger contraction modulus. For any
J ∈ R(X), we denote by T_{µ_0} · · · T_{µ_k} J the composition of the mappings T_{µ_0}, . . . , T_{µ_k} applied to J, i.e.,
\[
T_{\mu_0} \cdots T_{\mu_k} J = T_{\mu_0}\bigl( T_{\mu_1} \cdots ( T_{\mu_{k-1}} ( T_{\mu_k} J ) ) \cdots \bigr).
\]
Example 2.7 (Multistep Mappings)
Consider a set of mappings T_µ : ℜ^n ↦ ℜ^n, µ ∈ M, satisfying Assumption 2.1, let m be a positive integer,
and let M̄ be the set of m-tuples ν = (µ_0, . . . , µ_{m−1}), where µ_k ∈ M, k = 0, 1, . . . , m − 1. For each ν =
(µ_0, . . . , µ_{m−1}) ∈ M̄, define the mapping T̄_ν by
\[
\bar{T}_\nu J = T_{\mu_0} \cdots T_{\mu_{m-1}} J, \qquad J \in B(X).
\]
Then we have the contraction properties
\[
\|\bar{T}_\nu J - \bar{T}_\nu J'\| \le \alpha^m \|J - J'\|, \qquad J, J' \in B(X),
\]
and
\[
\|\bar{T} J - \bar{T} J'\| \le \alpha^m \|J - J'\|, \qquad J, J' \in B(X),
\]
where T̄ is defined by
\[
(\bar{T} J)(x) = \inf_{(\mu_0, \ldots, \mu_{m-1}) \in \bar{\mathcal{M}}} (T_{\mu_0} \cdots T_{\mu_{m-1}} J)(x), \qquad J \in B(X),\ x \in X.
\]
Thus the mappings T̄_ν, ν ∈ M̄, satisfy Assumption 2.1, and have contraction modulus α^m.
The following example considers mappings underlying weighted Bellman equations that arise in various
computational contexts in approximate DP; see Yu and Bertsekas [YuB12] for analysis, algorithms, and
related applications.
Example 2.8 (Weighted Multistep Mappings)
Consider a set of mappings L_µ : B(X) ↦ B(X), µ ∈ M, satisfying Assumption 2.1, i.e., for some α ∈ (0, 1),
\[
\|L_\mu J - L_\mu J'\| \le \alpha \|J - J'\|, \qquad J, J' \in B(X),\ \mu \in \mathcal{M}.
\]
Consider also the mappings T_µ : B(X) ↦ B(X) defined by
\[
(T_\mu J)(x) = \sum_{\ell=1}^{\infty} w_\ell(x)\, (L_\mu^\ell J)(x), \qquad x \in X,\ J \in \Re^n,
\]
where w_ℓ(x) are nonnegative scalars such that for all x ∈ X,
\[
\sum_{\ell=1}^{\infty} w_\ell(x) = 1.
\]
Then it follows that
\[
\|T_\mu J - T_\mu J'\| \le \sum_{\ell=1}^{\infty} w_\ell(x)\, \alpha^\ell\, \|J - J'\|,
\]
showing that T_µ is a contraction with modulus
\[
\bar{\alpha} = \max_{x \in X} \sum_{\ell=1}^{\infty} w_\ell(x)\, \alpha^\ell \le \alpha.
\]
Moreover L_µ and T_µ have a common fixed point for all µ ∈ M, and the same is true for the corresponding
mappings L and T.
We will now consider some general questions, first under the Contraction Assumption 2.1, and then
under an additional monotonicity assumption. Most of the results of this section are straightforward ex-
tensions of results that appear in Denardo’s paper [Den67] for the case where the sup-norm is unweighted
[v(x) ≡ 1].
2.1 Basic Results Under the Contraction Assumption
The contraction property of T_µ and T can be used to show the following proposition.
Proposition 2.1: Let Assumption 2.1 hold. Then:
(a) The mappings T_µ and T are contraction mappings with modulus α over B(X), and have unique
fixed points in B(X), denoted J_µ and J*, respectively.
(b) For any J ∈ B(X) and µ ∈ M,
\[
\lim_{k \to \infty} T_\mu^k J = J_\mu, \qquad \lim_{k \to \infty} T^k J = J^*.
\]
(c) We have T_µJ* = TJ* if and only if J_µ = J*.
(d) For any J ∈ B(X),
\[
\|J^* - J\| \le \frac{1}{1-\alpha} \|TJ - J\|, \qquad \|J^* - TJ\| \le \frac{\alpha}{1-\alpha} \|TJ - J\|.
\]
(e) For any J ∈ B(X) and µ ∈ M,
\[
\|J_\mu - J\| \le \frac{1}{1-\alpha} \|T_\mu J - J\|, \qquad \|J_\mu - T_\mu J\| \le \frac{\alpha}{1-\alpha} \|T_\mu J - J\|.
\]
Proof: We have already shown that T_µ and T are contractions with modulus α over B(X) [cf. Eqs. (2.1)
and (2.2)]. Parts (a) and (b) follow from the classical contraction mapping fixed point theorem. To show part
(c), note that if T_µJ* = TJ*, then in view of TJ* = J*, we have T_µJ* = J*, which implies that J* = J_µ,
since J_µ is the unique fixed point of T_µ. Conversely, if J_µ = J*, we have T_µJ* = T_µJ_µ = J_µ = J* = TJ*.
To show part (d), we use the triangle inequality to write for every k,
\[
\|T^k J - J\| \le \sum_{\ell=1}^{k} \|T^\ell J - T^{\ell-1} J\| \le \sum_{\ell=1}^{k} \alpha^{\ell-1} \|TJ - J\|.
\]
Taking the limit as k → ∞ and using part (b), the left-hand side inequality follows. The right-hand side
inequality follows from the left-hand side and the contraction property of T. The proof of part (e) is similar
to part (d) [indeed part (e) is the special case of part (d) where T is equal to T_µ, i.e., when U(x) = {µ(x)}
for all x ∈ X]. Q.E.D.
Part (c) of the preceding proposition shows that there exists a µ ∈ M such that J_µ = J* if and only if
the minimum of H(x, u, J*) over U(x) is attained for all x ∈ X. Of course the minimum is attained if U(x)
is finite for every x, but otherwise this is not guaranteed in the absence of additional assumptions. Part (d)
provides a useful error bound: we can evaluate the proximity of any function J ∈ B(X) to the fixed point
J* by applying T to J and computing ‖TJ − J‖. The left-hand side inequality of part (e) (with J = J*)
shows that for every ε > 0, there exists a µ_ε ∈ M such that ‖J_{µ_ε} − J*‖ ≤ ε, which may be obtained by
letting µ_ε(x) minimize H(x, u, J*) over U(x) within an error of (1 − α) ε v(x), for all x ∈ X.
2.2 The Role of Monotonicity
Our analysis so far in this section relies only on the contraction assumption. We now introduce a monotonicity
property of a type that is common in DP.
Assumption 2.2: (Monotonicity) If J, J' ∈ R(X) and J ≤ J', then
\[
H(x, u, J) \le H(x, u, J'), \qquad x \in X,\ u \in U(x). \tag{2.3}
\]
Note that the assumption is equivalent to
\[
J \le J' \quad \Rightarrow \quad T_\mu J \le T_\mu J', \qquad \mu \in \mathcal{M},
\]
and implies that
\[
J \le J' \quad \Rightarrow \quad TJ \le TJ'.
\]
An important consequence of monotonicity of H, when it holds in addition to contraction, is that it implies
an optimality property of J*.
Proposition 2.2: Let Assumptions 2.1 and 2.2 hold. Then
\[
J^*(x) = \inf_{\mu \in \mathcal{M}} J_\mu(x), \qquad x \in X. \tag{2.4}
\]
Furthermore, for every ε > 0, there exists µ_ε ∈ M such that
\[
J^*(x) \le J_{\mu_\epsilon}(x) \le J^*(x) + \epsilon\, v(x), \qquad x \in X. \tag{2.5}
\]
Proof: We note that the right-hand side of Eq. (2.5) holds by Prop. 2.1(e) (see the remark following its
proof). Thus inf_{µ∈M} J_µ(x) ≤ J*(x) for all x ∈ X. To show the reverse inequality as well as the left-hand
side of Eq. (2.5), we note that for all µ ∈ M, we have TJ* ≤ T_µJ*, and since J* = TJ*, it follows that
J* ≤ T_µJ*. By applying repeatedly T_µ to both sides of this inequality and by using the Monotonicity
Assumption 2.2, we obtain J* ≤ T_µ^k J* for all k > 0. Taking the limit as k → ∞, we see that J* ≤ J_µ for all
µ ∈ M. Q.E.D.
Propositions 2.1 and 2.2 collectively address the problem of finding µ ∈ M that minimizes J_µ(x)
simultaneously for all x ∈ X, consistently with DP theory. The optimal value of this problem is J*(x), and
µ is optimal for all x if and only if T_µJ* = TJ*. For this we just need the contraction and monotonicity
assumptions. We do not need any additional structure of H, such as for example a discrete-time dynamic
system, transition probabilities, etc. While identifying the proper structure of H and verifying its contraction
and monotonicity properties may require some analysis that is specific to each type of problem, once this is
done significant results are obtained quickly.
Note that without monotonicity, we may have inf_{µ∈M} J_µ(x) < J*(x) for some x. As an example, let
X = {x_1, x_2}, U = {u_1, u_2}, and let
\[
H(x_1, u, J) =
\begin{cases}
-\alpha J(x_2) & \text{if } u = u_1, \\
-1 + \alpha J(x_1) & \text{if } u = u_2,
\end{cases}
\qquad
H(x_2, u, J) =
\begin{cases}
0 & \text{if } u = u_1, \\
B & \text{if } u = u_2,
\end{cases}
\]
where B is a positive scalar. Then it can be seen that
\[
J^*(x_1) = -\frac{1}{1-\alpha}, \qquad J^*(x_2) = 0,
\]
and J_{µ*} = J*, where µ*(x_1) = u_2 and µ*(x_2) = u_1. On the other hand, for µ(x_1) = u_1 and µ(x_2) = u_2, we
have
\[
J_\mu(x_1) = -\alpha B, \qquad J_\mu(x_2) = B,
\]
so J_µ(x_1) < J*(x_1) for B sufficiently large.
Nonstationary Policies
The connection with DP motivates us to consider the set Π of all sequences π = {µ_0, µ_1, . . .} with µ_k ∈ M
for all k (nonstationary policies in the DP context), and define
\[
J_\pi(x) = \liminf_{k \to \infty} (T_{\mu_0} \cdots T_{\mu_k} J)(x), \qquad x \in X,
\]
with J being any function in B(X), where as earlier, T_{µ_0} · · · T_{µ_k} J denotes the composition of the mappings
T_{µ_0}, . . . , T_{µ_k} applied to J. Note that the choice of J in the definition of J_π does not matter, since for any
two J, J' ∈ B(X), we have from the Contraction Assumption 2.1,
\[
\|T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J - T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J'\| \le \alpha^{k+1} \|J - J'\|,
\]
so the value of J_π(x) is independent of J. Since by Prop. 2.1(b), J_µ(x) = lim_{k→∞} (T_µ^k J)(x) for all µ ∈ M,
J ∈ B(X), and x ∈ X, in the DP context we recognize J_µ as the cost function of the stationary policy
{µ, µ, . . .}.
We now claim that under our Assumptions 2.1 and 2.2, J*, the fixed point of T, is equal to the optimal
value of J_π, i.e.,
\[
J^*(x) = \inf_{\pi \in \Pi} J_\pi(x), \qquad x \in X.
\]
Indeed, since M defines a subset of Π, we have from Prop. 2.2,
\[
J^*(x) = \inf_{\mu \in \mathcal{M}} J_\mu(x) \ge \inf_{\pi \in \Pi} J_\pi(x), \qquad x \in X,
\]
while for every π ∈ Π and x ∈ X, we have
\[
J_\pi(x) = \liminf_{k \to \infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J)(x) \ge \lim_{k \to \infty} (T^{k+1} J)(x) = J^*(x)
\]
[the Monotonicity Assumption 2.2 can be used to show that
\[
T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J \ge T^{k+1} J,
\]
and the last equality holds by Prop. 2.1(b)]. Combining the preceding relations, we obtain J*(x) =
inf_{π∈Π} J_π(x).
Thus, in DP terms, we may view J* as an optimal cost function over all nonstationary policies. At the
same time, Prop. 2.2 states that stationary policies are sufficient in the sense that the optimal cost can be
attained to within arbitrary accuracy with a stationary policy [uniformly for all x ∈ X, as Eq. (2.5) shows].
Periodic Policies
Consider the multistep mappings T̄_ν = T_{µ_0} · · · T_{µ_{m−1}}, ν ∈ M̄, defined in Example 2.7, where M̄ is the set
of m-tuples ν = (µ_0, . . . , µ_{m−1}), with µ_k ∈ M, k = 0, 1, . . . , m − 1. Assuming that the mappings T_µ satisfy
Assumptions 2.1 and 2.2, the same is true for the mappings T̄_ν (with the contraction modulus of T̄_ν being
α^m). Thus the unique fixed point of T̄_ν is J_π, where π is the nonstationary but periodic policy
\[
\pi = \{\mu_0, \ldots, \mu_{m-1}, \mu_0, \ldots, \mu_{m-1}, \ldots\}.
\]
Moreover the mappings T_{µ_0} · · · T_{µ_{m−1}}, T_{µ_1} · · · T_{µ_{m−1}} T_{µ_0}, . . . , T_{µ_{m−1}} T_{µ_0} · · · T_{µ_{m−2}}, have unique corresponding
fixed points J_0, J_1, . . . , J_{m−1}, which satisfy
\[
J_0 = T_{\mu_0} J_1, \quad J_1 = T_{\mu_1} J_2, \quad \ldots, \quad J_{m-2} = T_{\mu_{m-2}} J_{m-1}, \quad J_{m-1} = T_{\mu_{m-1}} J_0.
\]
To verify the above equations, multiply the relation J_1 = T_{µ_1} · · · T_{µ_{m−1}} T_{µ_0} J_1 with T_{µ_0} to show that T_{µ_0} J_1 is
the fixed point of T_{µ_0} · · · T_{µ_{m−1}}, i.e., is equal to J_0, etc. Note that even though T̄_ν defines the cost functions
of periodic policies, T̄ has the same fixed point as T, namely J*. This gives rise to the computational
possibility of working with T̄_ν in place of T_µ in an effort to find J*. Moreover, periodic policies obtained
through approximation methods, such as the ones to be discussed in what follows, may hold an advantage
over stationary policies, as first shown by Scherrer [Sch12] in the context of approximate value iteration (see
also the discussion of Section 4).
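The following sketch (hypothetical affine monotone contractions T_µ J = g_µ + α P_µ J, not from the paper) checks numerically that the fixed point of the composition T_{µ_0} T_{µ_1} coincides with the cost of the periodic policy {µ_0, µ_1, µ_0, µ_1, . . .}.

```python
import numpy as np

# Sketch: two affine monotone contractions T_mu J = g_mu + ALPHA * P_mu J on a
# 2-state space (hypothetical data).  The fixed point of the composition T0 T1,
# found by iteration, matches the cost of the periodic policy {mu0, mu1, ...}.
ALPHA = 0.9
P0 = np.array([[0.7, 0.3], [0.4, 0.6]]); g0 = np.array([1.0, 2.0])
P1 = np.array([[0.2, 0.8], [0.5, 0.5]]); g1 = np.array([0.5, 3.0])
T0 = lambda J: g0 + ALPHA * P0 @ J
T1 = lambda J: g1 + ALPHA * P1 @ J

J = np.zeros(2)
for _ in range(500):                  # fixed point of the two-step mapping T0 T1
    J = T0(T1(J))

Jp = np.zeros(2)
for k in range(1000):                 # apply ... T0 T1 T0 T1 to the zero function,
    Jp = (T0 if k % 2 else T1)(Jp)    # with T0 outermost after an even number of steps

print(J, Jp)                          # the two limits agree (up to iteration error)
```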
Error Bounds Under Monotonicity
The assumptions of contraction and monotonicity together can be characterized in a form that is useful for
the derivation of error bounds.
Proposition 2.3: The Contraction and Monotonicity Assumptions 2.1 and 2.2 hold if and only if
for all J, J' ∈ B(X), µ ∈ M, and scalar c ≥ 0, we have
\[
J' \le J + c\, v \quad \Rightarrow \quad T_\mu J' \le T_\mu J + \alpha c\, v, \tag{2.6}
\]
where v is the weight function of the weighted sup-norm ‖ · ‖.
Proof: Let the contraction and monotonicity assumptions hold. If J' ≤ J + c v, we have
\[
H(x, u, J') \le H(x, u, J + c\, v) \le H(x, u, J) + \alpha c\, v(x), \qquad x \in X,\ u \in U(x), \tag{2.7}
\]
where the left-side inequality follows from the monotonicity assumption and the right-side inequality follows
from the contraction assumption, which together with ‖v‖ = 1, implies that
\[
\frac{H(x, u, J + c\, v) - H(x, u, J)}{v(x)} \le \alpha \|J + c\, v - J\| = \alpha c.
\]
The condition (2.7) implies the desired condition (2.6). Conversely, condition (2.6) for c = 0 yields the
monotonicity assumption, while for c = ‖J' − J‖ it yields the contraction assumption. Q.E.D.
We can use Prop. 2.3 to derive some useful variants of parts (d) and (e) of Prop. 2.1 (which assumes only
the contraction assumption). These variants will be used in the derivation of error bounds for computational
methods to be discussed in Sections 4-6.
Proposition 2.4: (Error Bounds Under Contraction and Monotonicity) Let Assumptions
2.1 and 2.2 hold.
(a) For any J ∈ B(X) and c ≥ 0, we have
\[
TJ \le J + c\, v \quad \Rightarrow \quad J^* \le J + \frac{c}{1-\alpha}\, v,
\]
\[
J \le TJ + c\, v \quad \Rightarrow \quad J \le J^* + \frac{c}{1-\alpha}\, v.
\]
(b) For any J ∈ B(X), µ ∈ M, and c ≥ 0, we have
\[
T_\mu J \le J + c\, v \quad \Rightarrow \quad J_\mu \le J + \frac{c}{1-\alpha}\, v,
\]
\[
J \le T_\mu J + c\, v \quad \Rightarrow \quad J \le J_\mu + \frac{c}{1-\alpha}\, v.
\]
(c) For all J ∈ B(X), c ≥ 0, and k = 0, 1, . . ., we have
\[
TJ \le J + c\, v \quad \Rightarrow \quad J^* \le T^k J + \frac{\alpha^k c}{1-\alpha}\, v,
\]
\[
J \le TJ + c\, v \quad \Rightarrow \quad T^k J \le J^* + \frac{\alpha^k c}{1-\alpha}\, v.
\]
Proof: (a) We show the first relation. Applying Eq. (2.6) with J' replaced by TJ, and taking minimum
over u ∈ U(x) for all x ∈ X, we see that if TJ ≤ J + c v, then T^2 J ≤ TJ + αc v. Proceeding similarly, it
follows that
\[
T^\ell J \le T^{\ell-1} J + \alpha^{\ell-1} c\, v.
\]
We now write for every k,
\[
T^k J - J = \sum_{\ell=1}^{k} (T^\ell J - T^{\ell-1} J) \le \sum_{\ell=1}^{k} \alpha^{\ell-1} c\, v,
\]
from which, by taking the limit as k → ∞, we obtain J* ≤ J + (c/(1 − α)) v. The second relation follows
similarly.
(b) This part is the special case of part (a) where T is equal to T_µ.
(c) We show the first relation. From part (a), the inequality TJ ≤ J + c v implies that J* ≤ J + (c/(1 − α)) v.
Applying T^k to both sides of this inequality, and using the fact that T^k is a monotone sup-norm contraction
of modulus α^k, with fixed point J*, we obtain J* ≤ T^k J + (α^k c/(1 − α)) v. The second relation follows
similarly. Q.E.D.
Approximations
As part of our subsequent analysis, we will consider approximations in the implementation of various VI and
PI algorithms. In particular, we will assume that given any J ∈ B(X), we cannot compute exactly TJ, but
instead may compute J̄ ∈ B(X) and µ ∈ M such that
\[
\|\bar{J} - TJ\| \le \delta, \qquad \|T_\mu J - TJ\| \le \epsilon, \tag{2.8}
\]
where δ and ε are nonnegative scalars. These scalars may be unknown, so the resulting analysis will have a
mostly qualitative character.
The case δ > 0 arises when the state space is either infinite or it is finite but very large. Then instead
of calculating (TJ)(x) for all states x, one may do so only for some states and estimate (TJ)(x) for the
remaining states x by some form of interpolation. Alternatively, one may use simulation data [e.g., noisy
values of (TJ)(x) for some or all x] and some kind of least-squares error fit of (TJ)(x) with a function from
a suitable parametric class. The function J̄ thus obtained will satisfy a relation such as (2.8) with δ > 0.
Note that δ may not be small in this context, and the resulting performance degradation may be a primary
concern.
Cases where ε > 0 may arise when the control space is infinite or finite but large, and the minimization
involved in the calculation of (TJ)(x) cannot be done exactly. Note, however, that it is possible that
\[
\delta > 0, \qquad \epsilon = 0,
\]
and in fact this occurs in several types of practical methods. In an alternative scenario, we may first obtain
the policy µ subject to a restriction that it belongs to a certain subset of structured policies, so it satisfies
‖T_µJ − TJ‖ ≤ ε for some ε > 0, and then we may set J̄ = T_µJ. In this case we have ε = δ.
3. LIMITED LOOKAHEAD POLICIES
A frequently used suboptimal approach in DP is to use a policy obtained by solving a finite-horizon problem
with some given terminal cost function J̃ that approximates J*. The simplest possibility is a one-step
lookahead policy µ̄ defined by
\[
\bar{\mu}(x) \in \arg\min_{u \in U(x)} H(x, u, \tilde{J}), \qquad x \in X. \tag{3.1}
\]
In a variant of the method that aims to reduce the computation to obtain µ̄(x), the minimization in Eq. (3.1)
is done over a subset Ū(x) ⊂ U(x). Thus, the control µ̄(x) used in this variant satisfies
\[
\bar{\mu}(x) \in \arg\min_{u \in \bar{U}(x)} H(x, u, \tilde{J}), \qquad x \in X,
\]
rather than Eq. (3.1). This is attractive for example when by using some heuristic or approximate optimization,
we can identify a subset Ū(x) of promising controls, and to save computation, we restrict attention to this
subset in the one-step lookahead minimization.
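A brief sketch of this scheme follows (with a hypothetical 2-state discounted problem); it computes a one-step lookahead policy from a terminal cost approximation, optionally restricting the minimization to a subset Ū(x).

```python
import numpy as np

# Sketch of one-step lookahead (cf. Eq. (3.1)) for a hypothetical 2-state discounted
# problem: given a terminal cost approximation J_tilde, choose at each state a control
# minimizing H(x, u, J_tilde), optionally over a restricted subset U_bar(x).
ALPHA = 0.9
g = {('s0', 'stay'): 1.0, ('s0', 'go'): 2.0, ('s1', 'stay'): 0.0}
nxt = {('s0', 'stay'): 's0', ('s0', 'go'): 's1', ('s1', 'stay'): 's1'}
U = {'s0': ['stay', 'go'], 's1': ['stay']}

def H(x, u, J):
    return g[(x, u)] + ALPHA * J[nxt[(x, u)]]

def one_step_lookahead(J_tilde, U_bar=None):
    U_bar = U_bar or U                      # default: minimize over all of U(x)
    mu_bar, J_hat = {}, {}
    for x, controls in U_bar.items():
        u_best = min(controls, key=lambda u: H(x, u, J_tilde))
        mu_bar[x] = u_best
        J_hat[x] = H(x, u_best, J_tilde)    # the value (3.2) computed along the way
    return mu_bar, J_hat

J_tilde = {'s0': 5.0, 's1': 0.0}            # some approximation of J*
print(one_step_lookahead(J_tilde))
print(one_step_lookahead(J_tilde, {'s0': ['stay'], 's1': ['stay']}))  # restricted U_bar
```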
The following proposition gives some bounds for the performance of such a one-step lookahead policy.
The first bound [part (a) of the following proposition] is given in terms of the vector Ĵ given by
\[
\hat{J}(x) = \inf_{u \in \bar{U}(x)} H(x, u, \tilde{J}), \qquad x \in X, \tag{3.2}
\]
which is computed in the course of finding the one-step lookahead control at state x.
Proposition 3.1: (One-Step Lookahead Error Bounds) Let Assumptions 2.1 and 2.2 hold,
and let µ̄ be a one-step lookahead policy obtained by minimization in Eq. (3.2).
(a) Assume that Ĵ ≤ J̃. Then J_µ̄ ≤ Ĵ.
(b) Assume that Ū(x) = U(x) for all x. Then
\[
\|J_{\bar{\mu}} - \hat{J}\| \le \frac{\alpha}{1-\alpha} \|\hat{J} - \tilde{J}\|, \tag{3.3}
\]
where ‖ · ‖ denotes the sup-norm. Moreover
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{2\alpha}{1-\alpha} \|\tilde{J} - J^*\|, \tag{3.4}
\]
and
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{2}{1-\alpha} \|\hat{J} - \tilde{J}\|. \tag{3.5}
\]
Proof: (a) We have
\[
\tilde{J} \ge \hat{J} = T_{\bar{\mu}} \tilde{J},
\]
from which by using the monotonicity of T_µ̄, we obtain
\[
\tilde{J} \ge \hat{J} \ge T_{\bar{\mu}}^k \tilde{J} \ge T_{\bar{\mu}}^{k+1} \tilde{J}, \qquad k = 1, 2, \ldots
\]
By taking the limit as k → ∞, we have Ĵ ≥ J_µ̄.
(b) The proof of this part may rely on Prop. 2.1(e), but we will give a direct proof. Using the triangle
inequality we write for every k,
\[
\|T_{\bar{\mu}}^k \hat{J} - \hat{J}\| \le \sum_{\ell=1}^{k} \|T_{\bar{\mu}}^\ell \hat{J} - T_{\bar{\mu}}^{\ell-1} \hat{J}\| \le \sum_{\ell=1}^{k} \alpha^{\ell-1} \|T_{\bar{\mu}} \hat{J} - \hat{J}\|.
\]
By taking the limit as k → ∞ and using the fact T_µ̄^k Ĵ → J_µ̄, we obtain
\[
\|J_{\bar{\mu}} - \hat{J}\| \le \frac{1}{1-\alpha} \|T_{\bar{\mu}} \hat{J} - \hat{J}\|. \tag{3.6}
\]
Since Ĵ = T_µ̄ J̃, we have
\[
\|T_{\bar{\mu}} \hat{J} - \hat{J}\| = \|T_{\bar{\mu}} \hat{J} - T_{\bar{\mu}} \tilde{J}\| \le \alpha \|\hat{J} - \tilde{J}\|,
\]
and Eq. (3.3) follows by combining the last two relations.
By repeating the proof of Eq. (3.6) with Ĵ replaced by J*, we obtain
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{1}{1-\alpha} \|T_{\bar{\mu}} J^* - J^*\|.
\]
We have T J̃ = T_µ̄ J̃ and J* = TJ*, so
\[
\|T_{\bar{\mu}} J^* - J^*\| \le \|T_{\bar{\mu}} J^* - T_{\bar{\mu}} \tilde{J}\| + \|T \tilde{J} - TJ^*\| \le \alpha \|J^* - \tilde{J}\| + \alpha \|\tilde{J} - J^*\| = 2\alpha \|\tilde{J} - J^*\|,
\]
and Eq. (3.4) follows by combining the last two relations.
Also, by repeating the proof of Eq. (3.6) with Ĵ replaced by J̃ and T_µ̄ replaced by T, we have, using
also Ĵ = T J̃,
\[
\|J^* - \tilde{J}\| \le \frac{1}{1-\alpha} \|T\tilde{J} - \tilde{J}\| = \frac{1}{1-\alpha} \|\hat{J} - \tilde{J}\|.
\]
We use this relation to write
\[
\|J_{\bar{\mu}} - J^*\| \le \|J_{\bar{\mu}} - \hat{J}\| + \|\hat{J} - \tilde{J}\| + \|\tilde{J} - J^*\| \le \frac{\alpha}{1-\alpha} \|\hat{J} - \tilde{J}\| + \|\hat{J} - \tilde{J}\| + \frac{1}{1-\alpha} \|\hat{J} - \tilde{J}\| = \frac{2}{1-\alpha} \|\hat{J} - \tilde{J}\|,
\]
where the second inequality follows from Eq. (3.3). Q.E.D.
Part (b) of the preceding proposition gives bounds on J_µ̄(x), the performance of the one-step lookahead
policy µ̄; the value of Ĵ(x) is obtained while finding the one-step lookahead control at x. In particular, the
bound (3.4) says that if the one-step lookahead approximation J̃ is within ε of the optimal (in the weighted
sup-norm sense), the performance of the one-step lookahead policy is within 2αε/(1 − α) of the optimal.
Unfortunately, this is not very reassuring when α is close to 1, in which case the error bound is very large
relative to ε. Nonetheless, the following example from [BeT96], Section 6.1.1, shows that this error bound is
tight in the sense that for any α < 1, there is a problem with just two states where the error bound is satisfied
with equality. What is happening is that an O(ε) difference in single stage cost between two controls can
generate an O(ε/(1 − α)) difference in policy costs, yet it can be “nullified” in Bellman’s equation by an O(ε)
difference between J* and J̃.
Example 3.1
Consider a discounted problem with two states, 1 and 2, and deterministic transitions. State 2 is absorbing,
but at state 1 there are two possible decisions: move to state 2 (policy µ*) or stay at state 1 (policy µ). The
cost of each transition is 0 except for the transition from 1 to itself under policy µ, which has cost 2αε, where
ε is a positive scalar and α ∈ [0, 1) is the discount factor. The optimal policy µ* is to move from state 1 to
state 2, and the optimal cost-to-go function is J*(1) = J*(2) = 0. Consider the vector J̃ with J̃(1) = −ε and
J̃(2) = ε, so that
\[
\|\tilde{J} - J^*\| = \epsilon,
\]
as assumed in Eq. (3.4) [cf. Prop. 3.1(b)]. The policy µ that decides to stay at state 1 is a one-step lookahead
policy based on J̃, because
\[
2\alpha\epsilon + \alpha \tilde{J}(1) = \alpha\epsilon = 0 + \alpha \tilde{J}(2).
\]
We have
\[
J_\mu(1) = \frac{2\alpha\epsilon}{1-\alpha} = \frac{2\alpha}{1-\alpha} \|\tilde{J} - J^*\|,
\]
so the bound of Eq. (3.4) holds with equality.
3.1 Multistep Lookahead Policies with Approximations
Let us now consider a more general form of lookahead involving multiple stages with intermediate approx-
imations. In particular, we assume that given any J ∈ B(X), we cannot compute exactly TJ, but instead
may compute J̄ ∈ B(X) and µ ∈ M such that
\[
\|\bar{J} - TJ\| \le \delta, \qquad \|T_\mu J - TJ\| \le \epsilon, \tag{3.7}
\]
where δ and ε are nonnegative scalars.
In a multistep method with approximations, we are given a positive integer m and a lookahead function
J_m, and we successively compute (backwards in time) J_{m−1}, . . . , J_0 and policies µ_{m−1}, . . . , µ_0 satisfying
\[
\|J_k - TJ_{k+1}\| \le \delta, \qquad \|T_{\mu_k} J_{k+1} - TJ_{k+1}\| \le \epsilon, \qquad k = 0, \ldots, m-1. \tag{3.8}
\]
Note that in the context of MDP, J_k can be viewed as an approximation to the optimal cost function of an
(m − k)-stage problem with terminal cost function J_m. We have the following proposition, which is based
on the recent work of Scherrer [Sch12].
Proposition 3.2: (Multistep Lookahead Error Bound) Let Assumption 2.1 hold. The periodic
policy
\[
\pi = \{\mu_0, \ldots, \mu_{m-1}, \mu_0, \ldots, \mu_{m-1}, \ldots\}
\]
generated by the method of Eq. (3.8) satisfies
\[
\|J_\pi - J^*\| \le \frac{2\alpha^m}{1-\alpha^m} \|J_m - J^*\| + \frac{\epsilon}{1-\alpha^m} + \frac{\alpha(\epsilon + 2\delta)(1-\alpha^{m-1})}{(1-\alpha)(1-\alpha^m)}. \tag{3.9}
\]
Proof: Using the triangle inequality, Eq. (3.8), and the contraction property of T, we have for all k
\[
\|J_{m-k} - T^k J_m\| \le \|J_{m-k} - TJ_{m-k+1}\| + \|TJ_{m-k+1} - T^2 J_{m-k+2}\| + \cdots + \|T^{k-1} J_{m-1} - T^k J_m\| \le \delta + \alpha\delta + \cdots + \alpha^{k-1}\delta, \tag{3.10}
\]
showing that
\[
\|J_{m-k} - T^k J_m\| \le \frac{\delta(1-\alpha^k)}{1-\alpha}, \qquad k = 1, \ldots, m. \tag{3.11}
\]
From Eq. (3.8), we have ‖J_k − T_{µ_k} J_{k+1}‖ ≤ δ + ε, so for all k
\[
\begin{aligned}
\|J_{m-k} - T_{\mu_{m-k}} \cdots T_{\mu_{m-1}} J_m\| \le{} & \|J_{m-k} - T_{\mu_{m-k}} J_{m-k+1}\| \\
& + \|T_{\mu_{m-k}} J_{m-k+1} - T_{\mu_{m-k}} T_{\mu_{m-k+1}} J_{m-k+2}\| + \cdots \\
& + \|T_{\mu_{m-k}} \cdots T_{\mu_{m-2}} J_{m-1} - T_{\mu_{m-k}} \cdots T_{\mu_{m-1}} J_m\| \\
\le{} & (\delta + \epsilon) + \alpha(\delta + \epsilon) + \cdots + \alpha^{k-1}(\delta + \epsilon),
\end{aligned}
\]
showing that
\[
\|J_{m-k} - T_{\mu_{m-k}} \cdots T_{\mu_{m-1}} J_m\| \le \frac{(\delta + \epsilon)(1-\alpha^k)}{1-\alpha}, \qquad k = 1, \ldots, m. \tag{3.12}
\]
Using the fact ‖T_{µ_0} J_1 − TJ_1‖ ≤ ε [cf. Eq. (3.8)], we obtain
\[
\begin{aligned}
\|T_{\mu_0} \cdots T_{\mu_{m-1}} J_m - T^m J_m\| &\le \|T_{\mu_0} \cdots T_{\mu_{m-1}} J_m - T_{\mu_0} J_1\| + \|T_{\mu_0} J_1 - TJ_1\| + \|TJ_1 - T^m J_m\| \\
&\le \alpha \|T_{\mu_1} \cdots T_{\mu_{m-1}} J_m - J_1\| + \epsilon + \alpha \|J_1 - T^{m-1} J_m\| \\
&\le \epsilon + \frac{\alpha(\epsilon + 2\delta)(1-\alpha^{m-1})}{1-\alpha},
\end{aligned}
\]
where the last inequality follows from Eqs. (3.11) and (3.12) for k = m − 1.
From this relation and the fact that T_{µ_0} · · · T_{µ_{m−1}} and T^m are contractions with modulus α^m, we obtain
\[
\begin{aligned}
\|T_{\mu_0} \cdots T_{\mu_{m-1}} J^* - J^*\| &\le \|T_{\mu_0} \cdots T_{\mu_{m-1}} J^* - T_{\mu_0} \cdots T_{\mu_{m-1}} J_m\| + \|T_{\mu_0} \cdots T_{\mu_{m-1}} J_m - T^m J_m\| + \|T^m J_m - J^*\| \\
&\le 2\alpha^m \|J^* - J_m\| + \epsilon + \frac{\alpha(\epsilon + 2\delta)(1-\alpha^{m-1})}{1-\alpha}.
\end{aligned}
\]
We also have, using Prop. 2.1(e), applied in the context of the multistep mapping of Example 2.7,
\[
\|J_\pi - J^*\| \le \frac{1}{1-\alpha^m} \|T_{\mu_0} \cdots T_{\mu_{m-1}} J^* - J^*\|.
\]
Combining the last two relations, we obtain the desired result. Q.E.D.
Note that for m = 1 and δ = ε = 0, i.e., the case of a one-step lookahead policy µ̄ with lookahead function
J_1 and no approximation error in the minimization involved in TJ_1, Eq. (3.9) yields the bound
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{2\alpha}{1-\alpha} \|J_1 - J^*\|,
\]
which coincides with the bound (3.4) derived earlier.
Also, in the special case where ε = δ and J_k = T_{µ_k} J_{k+1} (cf. the discussion preceding Prop. 3.2), the
bound (3.9) can be strengthened somewhat. In particular, we have for all k, J_{m−k} = T_{µ_{m−k}} · · · T_{µ_{m−1}} J_m, so
the right-hand side of Eq. (3.12) becomes 0 and the preceding proof yields, with some calculation,
\[
\|J_\pi - J^*\| \le \frac{2\alpha^m}{1-\alpha^m} \|J_m - J^*\| + \frac{\delta}{1-\alpha^m} + \frac{\alpha\delta(1-\alpha^{m-1})}{(1-\alpha)(1-\alpha^m)} = \frac{2\alpha^m}{1-\alpha^m} \|J_m - J^*\| + \frac{\delta}{1-\alpha}.
\]
We finally note that Prop. 3.2 shows that as m → ∞, the corresponding bound for ‖J_π − J*‖ tends to
ε + α(ε + 2δ)/(1 − α), or
\[
\limsup_{m \to \infty} \|J_\pi - J^*\| \le \frac{\epsilon + 2\alpha\delta}{1-\alpha}.
\]
We will see that this error bound is superior to corresponding error bounds for approximate versions of VI
and PI by essentially a factor 1/(1 − α). This is an interesting fact, which was first shown by Scherrer [Sch12]
in the context of discounted MDP.
4. GENERALIZED VALUE ITERATION
Generalized value iteration (VI) is the algorithm that starts from some J ∈ B(X), and generates TJ, T^2J, . . ..
Since T is a weighted sup-norm contraction under Assumption 2.1, the algorithm converges to J*, and the
rate of convergence is governed by
\[
\|T^k J - J^*\| \le \alpha^k \|J - J^*\|, \qquad k = 0, 1, \ldots.
\]
Similarly, for a given policy µ ∈ M, we have
\[
\|T_\mu^k J - J_\mu\| \le \alpha^k \|J - J_\mu\|, \qquad k = 0, 1, \ldots.
\]
From Prop. 2.1(d), we also have the error bound
\[
\|T^{k+1} J - J^*\| \le \frac{\alpha}{1-\alpha} \|T^{k+1} J - T^k J\|, \qquad k = 0, 1, \ldots.
\]
This bound does not rely on the Monotonicity Assumption 2.2.
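For illustration, the following sketch runs generalized VI on a hypothetical discounted MDP and evaluates the a posteriori bound above at each iteration (here with weight v ≡ 1, so the weighted sup-norm is the ordinary sup-norm).

```python
import numpy as np

# Sketch of generalized VI together with the bound
# ||T^{k+1}J - J*|| <= alpha/(1-alpha) * ||T^{k+1}J - T^k J||  (weight v = 1).
ALPHA = 0.9
g = np.array([[1.0, 4.0], [2.0, 0.5]])               # g[x, u]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],              # P[x, u, y]
              [[0.3, 0.7], [0.6, 0.4]]])

def T(J):
    return np.min(g + ALPHA * P @ J, axis=1)

J = np.zeros(2)
for k in range(100):
    TJ = T(J)
    bound = ALPHA / (1 - ALPHA) * np.max(np.abs(TJ - J))
    J = TJ
print(J, bound)                                      # iterate and bound on ||J - J*||
```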
Suppose now that we use generalized VI to compute an approximation J̃ to J*, and then we obtain a
policy µ̄ by minimization of H(x, u, J̃) over u ∈ U(x) for each x ∈ X. In other words J̃ and µ̄ satisfy
\[
\|\tilde{J} - J^*\| \le \gamma, \qquad T_{\bar{\mu}} \tilde{J} = T\tilde{J},
\]
where γ is some positive scalar. Then, with a proof essentially identical to that of the bound (3.4) of Prop. 3.1(b), we have
\[
J_{\bar{\mu}} \le J^* + \frac{2\alpha\gamma}{1-\alpha}\, v, \tag{4.1}
\]
which can be viewed as an error bound for the performance of a policy obtained by generic one-step lookahead.
We use this bound in the following proposition, which shows that if the set of policies is finite, then a
policy µ* with J_{µ*} = J* may be obtained after a finite number of VI.
Proposition 4.1: Let Assumption 2.1 hold and let J ∈ B(X). If the set of policies M is finite,
there exists an integer k̄ ≥ 0 such that J_{µ*} = J* for all µ* and k ≥ k̄ with T_{µ*} T^k J = T^{k+1} J.
Proof: Let M̄ be the set of nonoptimal policies, i.e., all µ such that J_µ ≠ J*. Since M̄ is finite, we have
\[
\min_{\mu \in \bar{\mathcal{M}}} \|J_\mu - J^*\| > 0,
\]
so by Eq. (4.1), there exists sufficiently small β > 0 such that
\[
\|\tilde{J} - J^*\| \le \beta \ \text{ and } \ T_\mu \tilde{J} = T\tilde{J} \quad \Rightarrow \quad \|J_\mu - J^*\| = 0 \quad \Rightarrow \quad \mu \notin \bar{\mathcal{M}}. \tag{4.2}
\]
It follows that if k is sufficiently large so that ‖T^kJ − J*‖ ≤ β, then T_{µ*} T^k J = T^{k+1} J implies that µ* ∉ M̄,
so J_{µ*} = J*. Q.E.D.
4.1 Approximate Value Iteration
Let us consider situations where the VI method may be implementable only through approximations. In
particular, given a function J, assume that we may only be able to calculate an approximation J̃ to TJ such
that
\[
\|\tilde{J} - TJ\| \le \delta, \tag{4.3}
\]
where δ is a given positive scalar. In the corresponding approximate VI method, we start from an arbitrary
bounded function J_0, and we generate a sequence {J_k} satisfying
\[
\|J_{k+1} - TJ_k\| \le \delta, \qquad k = 0, 1, \ldots. \tag{4.4}
\]
This approximation may be the result of representing J_{k+1} compactly, as a linear combination of basis
functions, through a projection or aggregation process, as is common in approximate DP.
We may also simultaneously generate a sequence of policies {µ_k} such that
\[
\|T_{\mu_k} J_k - TJ_k\| \le \epsilon, \qquad k = 0, 1, \ldots, \tag{4.5}
\]
where ε is some scalar [which could be equal to 0, as in the case of Eq. (3.8), considered earlier]. The following
proposition shows that the corresponding cost vectors J_{µ_k} “converge” to J* to within an error of order
O(δ/(1 − α)^2) [plus a less significant error of order O(ε/(1 − α))].
Proposition 4.2: (Error Bounds for Approximate VI) Let Assumption 2.1 hold. A sequence
{J_k} generated by the approximate VI method (4.4)-(4.5) satisfies
\[
\limsup_{k \to \infty} \|J_k - J^*\| \le \frac{\delta}{1-\alpha}, \tag{4.6}
\]
while the corresponding sequence of policies {µ_k} satisfies
\[
\limsup_{k \to \infty} \|J_{\mu_k} - J^*\| \le \frac{\epsilon}{1-\alpha} + \frac{2\alpha\delta}{(1-\alpha)^2}. \tag{4.7}
\]
Proof: Arguing as in the proof of Prop. 3.2, we have
\[
\|J_k - T^k J_0\| \le \frac{\delta(1-\alpha^k)}{1-\alpha}, \qquad k = 0, 1, \ldots
\]
[cf. Eq. (3.10)]. By taking the limit as k → ∞ and by using the fact lim_{k→∞} T^k J_0 = J*, we obtain Eq. (4.6).
We also have, using the triangle inequality and the contraction property of T_{µ_k} and T,
\[
\|T_{\mu_k} J^* - J^*\| \le \|T_{\mu_k} J^* - T_{\mu_k} J_k\| + \|T_{\mu_k} J_k - TJ_k\| + \|TJ_k - J^*\| \le \alpha \|J^* - J_k\| + \epsilon + \alpha \|J_k - J^*\|,
\]
while by using also Prop. 2.1(e), we obtain
\[
\|J_{\mu_k} - J^*\| \le \frac{1}{1-\alpha} \|T_{\mu_k} J^* - J^*\| \le \frac{\epsilon}{1-\alpha} + \frac{2\alpha}{1-\alpha} \|J_k - J^*\|.
\]
By combining this relation with Eq. (4.6), we obtain Eq. (4.7). Q.E.D.
The error bound (4.7) relates to stationary policies obtained from the functions J_k by one-step looka-
head. We may also obtain an m-step periodic policy π from J_k by using m-step lookahead. Then Prop. 3.2
shows that the corresponding bound for ‖J_π − J*‖ tends to (ε + 2αδ)/(1 − α) as m → ∞, which improves on
the error bound (4.7) by a factor 1/(1 − α). This is a remarkable and surprising fact, which was first shown
by Scherrer [Sch12] in the context of discounted MDP.
Finally, let us note that the error bound of Prop. 4.2 is predicated upon generating a sequence {J_k}
satisfying ‖J_{k+1} − TJ_k‖ ≤ δ for all k [cf. Eq. (4.4)]. Unfortunately, some practical approximation schemes
guarantee the existence of such a δ only if {J_k} is a bounded sequence. The following simple example from
[BeT96], Section 6.5.3, shows that boundedness of the iterates is not automatically guaranteed, and is a
serious issue that should be addressed in approximate VI schemes.
Example 4.1 (Error Amplification in Approximate Value Iteration)
Consider a two-state discounted MDP with states 1 and 2, and a single policy. The transitions are deterministic:
from state 1 to state 2, and from state 2 to state 2. These transitions are also cost-free. Thus we have
J*(1) = J*(2) = 0.
We consider a VI scheme that approximates cost functions within the one-dimensional subspace of linear
functions S = {(r, 2r) | r ∈ ℜ} by using a weighted least squares minimization; i.e., we approximate a vector J
by its weighted Euclidean projection onto S. In particular, given J_k = (r_k, 2r_k), we find J_{k+1} = (r_{k+1}, 2r_{k+1}),
where for weights w_1, w_2 > 0, r_{k+1} is obtained as
\[
r_{k+1} = \arg\min_{r} \Bigl[ w_1 \bigl( r - (TJ_k)(1) \bigr)^2 + w_2 \bigl( 2r - (TJ_k)(2) \bigr)^2 \Bigr].
\]
Since for a zero cost per stage and the given deterministic transitions we have TJ_k = (2αr_k, 2αr_k), the preceding
minimization is written as
\[
r_{k+1} = \arg\min_{r} \Bigl[ w_1 (r - 2\alpha r_k)^2 + w_2 (2r - 2\alpha r_k)^2 \Bigr],
\]
which by writing the corresponding optimality condition yields r_{k+1} = αβ r_k, where β = 2(w_1 + 2w_2)/(w_1 + 4w_2) >
1. Thus if αβ > 1, the sequence {r_k} diverges and so does {J_k}. Note that in this example the optimal cost
function J* = (0, 0) belongs to the subspace S. The difficulty here is that the approximate VI mapping that
generates J_{k+1} by a least squares-based approximation of TJ_k is not a contraction. At the same time there is
no δ such that ‖J_{k+1} − TJ_k‖ ≤ δ for all k, because of error amplification in each approximate VI.
5. GENERALIZED POLICY ITERATION
In generalized policy iteration (PI), we maintain and update a policy µ_k, starting from some initial policy
µ_0. The (k + 1)st iteration has the following form.
Generalized Policy Iteration
Policy Evaluation: We compute J_{µ_k} as the unique solution of the equation J_{µ_k} = T_{µ_k} J_{µ_k}.
Policy Improvement: We obtain an improved policy µ_{k+1} that satisfies T_{µ_{k+1}} J_{µ_k} = T J_{µ_k}.
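The following sketch (hypothetical finite discounted MDP) implements the two steps above, with policy evaluation done exactly by solving the linear fixed-point equation.

```python
import numpy as np

# Sketch of generalized PI for a hypothetical finite discounted MDP: policy evaluation
# solves J_mu = T_mu J_mu exactly (a linear system), and policy improvement takes the
# minimizing control of H(x, u, J_mu).
ALPHA = 0.9
g = np.array([[1.0, 4.0], [2.0, 0.5]])                 # g[x, u]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],                # P[x, u, y]
              [[0.3, 0.7], [0.6, 0.4]]])
n = g.shape[0]

def evaluate(mu):
    Pmu = P[np.arange(n), mu]                          # row x: P[x, mu[x], :]
    gmu = g[np.arange(n), mu]
    return np.linalg.solve(np.eye(n) - ALPHA * Pmu, gmu)

def improve(J):
    return np.argmin(g + ALPHA * P @ J, axis=1)

mu = np.zeros(n, dtype=int)
while True:
    J_mu = evaluate(mu)                                # policy evaluation
    mu_next = improve(J_mu)                            # policy improvement
    if np.array_equal(mu_next, mu):
        break
    mu = mu_next
print(mu, J_mu)                                        # optimal policy and J* (finite M)
```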
The algorithm requires the Monotonicity Assumption 2.2, in addition to the Contraction Assumption
2.1, so we assume these two conditions throughout this section. Moreover we assume that the minimum
of H(x, u, J_{µ_k}) over u ∈ U(x) is attained for all x ∈ X, so that the improved policy µ_{k+1} is defined. The
following proposition establishes a basic cost improvement property, as well as finite convergence for the case
where the set of policies is finite.
Proposition 5.1: (Convergence of Generalized PI) Let Assumptions 2.1 and 2.2 hold, and let
{µ_k} be a sequence generated by the generalized PI algorithm. Then for all k, we have J_{µ_{k+1}} ≤ J_{µ_k},
with equality if and only if J_{µ_k} = J*. Moreover,
\[
\lim_{k \to \infty} \|J_{\mu_k} - J^*\| = 0,
\]
and if the set of policies is finite, we have J_{µ_k} = J* for some k.
Proof: We have
\[
T_{\mu_{k+1}} J_{\mu_k} = T J_{\mu_k} \le T_{\mu_k} J_{\mu_k} = J_{\mu_k}.
\]
Applying T_{µ_{k+1}} to this inequality while using the Monotonicity Assumption 2.2, we obtain
\[
T_{\mu_{k+1}}^2 J_{\mu_k} \le T_{\mu_{k+1}} J_{\mu_k} = T J_{\mu_k} \le T_{\mu_k} J_{\mu_k} = J_{\mu_k}.
\]
Similarly, we have for all m > 0,
\[
T_{\mu_{k+1}}^m J_{\mu_k} \le T J_{\mu_k} \le J_{\mu_k},
\]
and by taking the limit as m → ∞, we obtain
\[
J_{\mu_{k+1}} \le T J_{\mu_k} \le J_{\mu_k}, \qquad k = 0, 1, \ldots. \tag{5.1}
\]
If J_{µ_{k+1}} = J_{µ_k}, it follows that TJ_{µ_k} = J_{µ_k}, so J_{µ_k} is a fixed point of T and must be equal to J*. Moreover
by using induction, Eq. (5.1) implies that
\[
J_{\mu_k} \le T^k J_{\mu_0}, \qquad k = 0, 1, \ldots.
\]
Since
\[
J^* \le J_{\mu_k}, \qquad \lim_{k \to \infty} \|T^k J_{\mu_0} - J^*\| = 0,
\]
it follows that lim_{k→∞} ‖J_{µ_k} − J*‖ = 0. Finally, if the number of policies is finite, Eq. (5.1) implies that
there can be only a finite number of iterations for which J_{µ_{k+1}}(x) < J_{µ_k}(x) for some x, so we must have
J_{µ_{k+1}} = J_{µ_k} for some k, at which time J_{µ_k} = J* as shown earlier. Q.E.D.
In the case where the set of policies is infinite, we may assert the convergence of the sequence of
generated policies under some compactness and continuity conditions. In particular, we will assume that
the state space is finite, X = {1, . . . , n}, and that each control constraint set U(x) is a compact subset of
ℜ^m. We will view a cost vector J as an element of ℜ^n, and a policy µ as an element of the compact set
U(1) × · · · × U(n) ⊂ ℜ^{mn}. Then {µ_k} has at least one limit point µ̄, which must be an admissible policy.
The following proposition guarantees, under an additional continuity assumption for H(x, ·, ·), that every
limit point µ̄ is optimal.
Assumption 5.1: (Compactness and Continuity)
(a) The state space is finite, X = {1, . . . , n}.
(b) Each control constraint set U(x), x = 1, . . . , n, is a compact subset of ℜ^m.
(c) Each function H(x, ·, ·), x = 1, . . . , n, is continuous over U(x) × ℜ^n.
Proposition 5.2: Let Assumptions 2.1, 2.2, and 5.1 hold, and let {µ_k} be a sequence generated
by the generalized PI algorithm. Then for every limit point µ̄ of {µ_k}, we have J_µ̄ = J*.
Proof: We have J_{µ_k} → J* by Prop. 5.1. Let µ̄ be the limit of a subsequence {µ_k}_{k∈K}. We will show that
T_µ̄ J* = TJ*, from which it follows that J_µ̄ = J* [cf. Prop. 2.1(c)]. Indeed, we have T_µ̄ J* ≥ TJ*, so we focus
on showing the reverse inequality. From the equation T_{µ_k} J_{µ_{k−1}} = TJ_{µ_{k−1}} we have
\[
H\bigl(x, \mu_k(x), J_{\mu_{k-1}}\bigr) \le H(x, u, J_{\mu_{k-1}}), \qquad x = 1, \ldots, n,\ u \in U(x).
\]
By taking the limit in this relation as k → ∞, k ∈ K, and by using the continuity of H(x, ·, ·) [cf. Assumption
5.1(c)], we obtain
\[
H\bigl(x, \bar{\mu}(x), J^*\bigr) \le H(x, u, J^*), \qquad x = 1, \ldots, n,\ u \in U(x).
\]
By taking the minimum of the right-hand side over u ∈ U(x), we obtain T_µ̄ J* ≤ TJ*. Q.E.D.
5.1 Approximate Policy Iteration
We now consider the PI method where the policy evaluation step and/or the policy improvement step of the
method are implemented through approximations. This method generates a sequence of policies {µ_k} and a
corresponding sequence of approximate cost functions {J_k} satisfying
\[
\|J_k - J_{\mu_k}\| \le \delta, \qquad \|T_{\mu_{k+1}} J_k - TJ_k\| \le \epsilon, \qquad k = 0, 1, \ldots, \tag{5.2}
\]
where ‖ · ‖ denotes the sup-norm and v is the weight function of the weighted sup-norm (it is important to use
v rather than the unit function in the above equation, in order for the bounds obtained to have a clean form).
The following proposition provides an error bound for this algorithm, which extends a corresponding result
of [BeT96], shown for discounted MDP.
Proposition 5.3: (Error Bound for Approximate PI) Let Assumptions 2.1 and 2.2 hold. The
sequence {µ_k} generated by the approximate PI algorithm (5.2) satisfies
\[
\limsup_{k \to \infty} \|J_{\mu_k} - J^*\| \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}. \tag{5.3}
\]
The essence of the proof is contained in the following proposition, which quantifies the amount of
approximate policy improvement at each iteration.
Proposition 5.4: Let Assumptions 2.1 and 2.2 hold. Let J, µ̄, and µ satisfy
\[
\|J - J_\mu\| \le \delta, \qquad \|T_{\bar{\mu}} J - TJ\| \le \epsilon,
\]
where δ and ε are some scalars. Then
\[
\|J_{\bar{\mu}} - J^*\| \le \alpha \|J_\mu - J^*\| + \frac{\epsilon + 2\alpha\delta}{1-\alpha}. \tag{5.4}
\]
Proof: Using the hypothesis and the contraction property of T and T_µ̄, which implies that ‖T_µ̄ J_µ − T_µ̄ J‖ ≤ αδ
and ‖TJ − TJ_µ‖ ≤ αδ, and hence T_µ̄ J_µ ≤ T_µ̄ J + αδ v and TJ ≤ TJ_µ + αδ v, we have
\[
T_{\bar{\mu}} J_\mu \le T_{\bar{\mu}} J + \alpha\delta\, v \le TJ + (\epsilon + \alpha\delta)\, v \le TJ_\mu + (\epsilon + 2\alpha\delta)\, v. \tag{5.5}
\]
Since TJ_µ ≤ T_µ J_µ = J_µ, this relation yields
\[
T_{\bar{\mu}} J_\mu \le J_\mu + (\epsilon + 2\alpha\delta)\, v,
\]
and applying Prop. 2.4(b) with µ = µ̄, J = J_µ, and c = ε + 2αδ, we obtain
\[
J_{\bar{\mu}} \le J_\mu + \frac{\epsilon + 2\alpha\delta}{1-\alpha}\, v. \tag{5.6}
\]
Using this relation, we have
\[
J_{\bar{\mu}} = T_{\bar{\mu}} J_{\bar{\mu}} = T_{\bar{\mu}} J_\mu + (T_{\bar{\mu}} J_{\bar{\mu}} - T_{\bar{\mu}} J_\mu) \le T_{\bar{\mu}} J_\mu + \frac{\alpha(\epsilon + 2\alpha\delta)}{1-\alpha}\, v,
\]
where the inequality follows by using Prop. 2.3 and Eq. (5.6). Subtracting J* from both sides, we have
\[
J_{\bar{\mu}} - J^* \le T_{\bar{\mu}} J_\mu - J^* + \frac{\alpha(\epsilon + 2\alpha\delta)}{1-\alpha}\, v. \tag{5.7}
\]
Also by subtracting J* from both sides of Eq. (5.5), and using the contraction property
\[
TJ_\mu - J^* = TJ_\mu - TJ^* \le \alpha \|J_\mu - J^*\|\, v,
\]
we obtain
\[
T_{\bar{\mu}} J_\mu - J^* \le TJ_\mu - J^* + (\epsilon + 2\alpha\delta)\, v \le \alpha \|J_\mu - J^*\|\, v + (\epsilon + 2\alpha\delta)\, v.
\]
Combining this relation with Eq. (5.7), we obtain
\[
J_{\bar{\mu}} - J^* \le \alpha \|J_\mu - J^*\|\, v + \frac{\alpha(\epsilon + 2\alpha\delta)}{1-\alpha}\, v + (\epsilon + 2\alpha\delta)\, v = \alpha \|J_\mu - J^*\|\, v + \frac{\epsilon + 2\alpha\delta}{1-\alpha}\, v,
\]
which is equivalent to the desired relation (5.4). Q.E.D.
Proof of Prop. 5.3: Applying Prop. 5.4, we have
\[
\|J_{\mu_{k+1}} - J^*\| \le \alpha \|J_{\mu_k} - J^*\| + \frac{\epsilon + 2\alpha\delta}{1-\alpha},
\]
which by taking the lim sup of both sides as k → ∞ yields the desired result. Q.E.D.
We note that the error bound of Prop. 5.3 is tight, as can be shown with an example from [BeT96],
Section 6.2.3. The error bound is comparable to the one for approximate VI, derived earlier in Prop. 4.2. In
particular, the error ‖J_{µ_k} − J*‖ is asymptotically proportional to 1/(1 − α)^2 and to the approximation error
in policy evaluation or value iteration, respectively. This is noteworthy, as it indicates that contrary to the
case of exact implementation, approximate PI need not hold a convergence rate advantage over approximate
VI, despite its greater overhead per iteration.
On the other hand, approximate PI does not have as much difficulty with the kind of iteration instability
that was illustrated by Example 4.1 for approximate VI. In particular, if the set of policies is finite, so that
the sequence {J_{µ_k}} is guaranteed to be bounded, the assumption of Eq. (5.2) is not hard to satisfy in practice
with cost function approximation methods.
Note that when δ = ε = 0, Eq. (5.4) yields
\[
\|J_{\mu_{k+1}} - J^*\| \le \alpha \|J_{\mu_k} - J^*\|.
\]
Thus in the case of an infinite state space and/or control space, exact PI converges at a geometric rate
under the contraction and monotonicity assumptions of this section. This rate is the same as the rate of
convergence of exact VI.
The Case Where Policies Converge
Generally, the policy sequence {µ_k} generated by approximate PI may oscillate between several policies.
However, under some circumstances this sequence may be guaranteed to converge to some µ̄, in the sense
that
\[
\mu_{\bar{k}+1} = \mu_{\bar{k}} = \bar{\mu} \quad \text{for some } \bar{k}. \tag{5.8}
\]
An example arises when the policy sequence {µ_k} is generated by exact PI applied with a different mapping
H̃ in place of H, but the bounds of Eq. (5.2) are satisfied. The mapping H̃ may for example correspond to
an approximation of the original problem, as in aggregation methods. In this case we can show the following
bound, which is much more favorable than the one of Prop. 5.3.
Proposition 5.5: (Error Bound for Approximate PI when Policies Converge) Let As-
sumptions 2.1 and 2.2 hold, and let µ̄ be a policy generated by the approximate PI algorithm (5.2)
and satisfying condition (5.8). Then we have
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{\epsilon + 2\alpha\delta}{1-\alpha}. \tag{5.9}
\]
Proof: Let J̄ be the cost vector obtained by approximate policy evaluation of µ̄ [i.e., J̄ = J_k̄, where k̄
satisfies the condition (5.8)]. Then we have
\[
\|\bar{J} - J_{\bar{\mu}}\| \le \delta, \qquad \|T_{\bar{\mu}} \bar{J} - T\bar{J}\| \le \epsilon, \tag{5.10}
\]
where the latter inequality holds since we have
\[
\|T_{\bar{\mu}} \bar{J} - T\bar{J}\| = \|T_{\mu_{\bar{k}+1}} J_{\bar{k}} - TJ_{\bar{k}}\| \le \epsilon,
\]
cf. Eq. (5.2). Using Eq. (5.10) and the fact J_µ̄ = T_µ̄ J_µ̄, we have
\[
\begin{aligned}
\|TJ_{\bar{\mu}} - J_{\bar{\mu}}\| &\le \|TJ_{\bar{\mu}} - T\bar{J}\| + \|T\bar{J} - T_{\bar{\mu}} \bar{J}\| + \|T_{\bar{\mu}} \bar{J} - J_{\bar{\mu}}\| \\
&= \|TJ_{\bar{\mu}} - T\bar{J}\| + \|T\bar{J} - T_{\bar{\mu}} \bar{J}\| + \|T_{\bar{\mu}} \bar{J} - T_{\bar{\mu}} J_{\bar{\mu}}\| \\
&\le \alpha \|J_{\bar{\mu}} - \bar{J}\| + \epsilon + \alpha \|\bar{J} - J_{\bar{\mu}}\| \\
&\le \epsilon + 2\alpha\delta.
\end{aligned} \tag{5.11}
\]
Using Prop. 2.1(d) with J = J_µ̄, we obtain the error bound (5.9). Q.E.D.
The preceding error bound can be generalized to the case where two successive policies generated by
the approximate PI algorithm are “not too different” rather than being identical. In particular, suppose that
µ and µ̄ are successive policies, which in addition to
\[
\|\bar{J} - J_\mu\| \le \delta, \qquad \|T_{\bar{\mu}} \bar{J} - T\bar{J}\| \le \epsilon,
\]
[cf. Eq. (5.2)], also satisfy
\[
\|T_\mu \bar{J} - T_{\bar{\mu}} \bar{J}\| \le \zeta,
\]
where ζ is some scalar (instead of µ = µ̄, which is the case where policies converge exactly). Then we also
have
\[
\|T\bar{J} - T_{\bar{\mu}} \bar{J}\| \le \|T\bar{J} - T_\mu \bar{J}\| + \|T_\mu \bar{J} - T_{\bar{\mu}} \bar{J}\| \le \epsilon + \zeta,
\]
and by replacing ε with ε + ζ in Eq. (5.11), we obtain
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{\epsilon + \zeta + 2\alpha\delta}{1-\alpha}.
\]
When ζ is small enough to be of the order of max{δ, ε}, this error bound is comparable to the one for the
case where policies converge.
6. OPTIMISTIC POLICY ITERATION
In optimistic PI (also called “modified” PI, see e.g., [Put94]) each policy µ_k is evaluated by solving the
equation J_{µ_k} = T_{µ_k} J_{µ_k} approximately, using a finite number of VI. Thus, starting with a function J_0 ∈ B(X),
we generate sequences {J_k} and {µ_k} with the algorithm
\[
T_{\mu_k} J_k = TJ_k, \qquad J_{k+1} = T_{\mu_k}^{m_k} J_k, \qquad k = 0, 1, \ldots, \tag{6.1}
\]
where {m_k} is a sequence of positive integers.
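The following sketch implements iteration (6.1) for a hypothetical finite discounted MDP, with m_k held constant for simplicity.

```python
import numpy as np

# Sketch of optimistic PI (cf. Eq. (6.1)) for a hypothetical finite discounted MDP:
# each policy is obtained greedily from J_k, then evaluated approximately by m_k
# applications of T_mu instead of solving J_mu = T_mu J_mu exactly.
ALPHA = 0.9
g = np.array([[1.0, 4.0], [2.0, 0.5]])                 # g[x, u]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],                # P[x, u, y]
              [[0.3, 0.7], [0.6, 0.4]]])
n = g.shape[0]
m_k = 5                                                # number of VI per policy

J = np.zeros(n)
for k in range(100):
    mu = np.argmin(g + ALPHA * P @ J, axis=1)          # T_mu J = T J
    Pmu, gmu = P[np.arange(n), mu], g[np.arange(n), mu]
    for _ in range(m_k):                               # J_{k+1} = T_mu^{m_k} J_k
        J = gmu + ALPHA * Pmu @ J
print(mu, J)                                           # J converges to J* (Prop. 6.1)
```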
A more general form of optimistic policy iteration, considered by Thiery and Scherrer [ThS10b], is
\[
T_{\mu_k} J_k = TJ_k, \qquad J_{k+1} = \sum_{\ell=1}^{\infty} \lambda_\ell\, T_{\mu_k}^\ell J_k, \qquad k = 0, 1, \ldots, \tag{6.2}
\]
where {λ_ℓ} is a sequence of nonnegative scalars such that
\[
\sum_{\ell=1}^{\infty} \lambda_\ell = 1.
\]
An example is the λ-policy iteration method (Bertsekas and Ioffe [BeI96], Thiery and Scherrer [ThS10a],
Bertsekas [Ber11], Scherrer [Sch11]), where λ_ℓ = (1 − λ)λ^{ℓ−1}, with λ being a scalar in (0, 1). For simplicity,
we will not discuss the more general type of algorithm (6.2) in this paper, but some of our results admit
straightforward extensions to this case, particularly the analysis of Section 6.2, and the SSP analysis of
Section 7.
6.1 Convergence of Optimistic Policy Iteration
The following two propositions provide the convergence properties of the algorithm (6.1). These propositions
have been proved by Rothblum [Rot79] within the framework of Denardo’s model [Den67], i.e., the case of an
unweighted sup-norm where v is the unit function; see also Canbolat and Rothblum [CaR11], which considers
optimistic PI methods where the minimization in the policy improvement (but not the policy evaluation)
operation is approximate, within some ε > 0. Our proof follows closely the one of Rothblum [Rot79].
Proposition 6.1: (Convergence of Optimistic Generalized PI) Let Assumptions 2.1 and 2.2
hold, and let {(J_k, µ_k)} be a sequence generated by the optimistic generalized PI algorithm (6.1).
Then
\[
\lim_{k \to \infty} \|J_k - J^*\| = 0,
\]
and if the number of policies is finite, we have J_{µ_k} = J* for all k greater than some index k̄.
Proposition 6.2: Let Assumptions 2.1, 2.2, and 5.1 hold, and let {(J_k, µ_k)} be a sequence gen-
erated by the optimistic generalized PI algorithm (6.1). Then every limit point µ̄ of {µ_k} satisfies
J_µ̄ = J*.
We develop the proofs of the propositions through four lemmas. The first lemma collects some generic
properties of monotone weighted sup-norm contractions, variants of which we have noted earlier, and which we
restate here for convenience.
Lemma 6.1: Let W : B(X) ↦ B(X) be a mapping that satisfies the monotonicity assumption
\[
J \le J' \quad \Rightarrow \quad WJ \le WJ', \qquad J, J' \in B(X),
\]
and the contraction assumption
\[
\|WJ - WJ'\| \le \alpha \|J - J'\|, \qquad J, J' \in B(X),
\]
for some α ∈ (0, 1).
(a) For all J, J' ∈ B(X) and scalar c ≥ 0, we have
\[
J \ge J' - c\, v \quad \Rightarrow \quad WJ \ge WJ' - \alpha c\, v. \tag{6.3}
\]
(b) For all J ∈ B(X), c ≥ 0, and k = 0, 1, . . ., we have
\[
J \ge WJ - c\, v \quad \Rightarrow \quad W^k J \ge J^* - \frac{\alpha^k}{1-\alpha}\, c\, v, \tag{6.4}
\]
\[
WJ \ge J - c\, v \quad \Rightarrow \quad J^* \ge W^k J - \frac{\alpha^k}{1-\alpha}\, c\, v, \tag{6.5}
\]
where J* is the fixed point of W.
Proof: Part (a) essentially follows from Prop. 2.3, while part (b) essentially follows from Prop. 2.4(c).
Q.E.D.
Lemma 6.2: Let Assumptions 2.1 and 2.2 hold, and let J ∈ B(X) and c ≥ 0 satisfy
\[
J \ge TJ - c\, v,
\]
and let µ ∈ M be such that T_µ J = TJ. Then for all k > 0, we have
\[
TJ \ge T_\mu^k J - \frac{\alpha}{1-\alpha}\, c\, v, \tag{6.6}
\]
and
\[
T_\mu^k J \ge T(T_\mu^k J) - \alpha^k c\, v. \tag{6.7}
\]
Proof: Since J ≥ TJ − c v = T_µ J − c v, by using Lemma 6.1(a) with W = T_µ^j and J' = T_µ J, we have for
all j ≥ 1,
\[
T_\mu^j J \ge T_\mu^{j+1} J - \alpha^j c\, v. \tag{6.8}
\]
By adding this relation over j = 1, . . . , k − 1, we have
\[
TJ = T_\mu J \ge T_\mu^k J - \sum_{j=1}^{k-1} \alpha^j c\, v = T_\mu^k J - \frac{\alpha - \alpha^k}{1-\alpha}\, c\, v \ge T_\mu^k J - \frac{\alpha}{1-\alpha}\, c\, v,
\]
showing Eq. (6.6). From Eq. (6.8) for j = k, we obtain
\[
T_\mu^k J \ge T_\mu^{k+1} J - \alpha^k c\, v = T_\mu(T_\mu^k J) - \alpha^k c\, v \ge T(T_\mu^k J) - \alpha^k c\, v,
\]
showing Eq. (6.7). Q.E.D.
The next lemma applies to the optimistic generalized PI algorithm (6.1) and proves a preliminary
bound.
Lemma 6.3: Let Assumptions 2.1 and 2.2 hold, let $\{(J_k, \mu^k)\}$ be a sequence generated by the PI algorithm (6.1), and assume that for some $c \ge 0$ we have
$$
J_0 \ge T J_0 - c\, v.
$$
Then for all $k \ge 0$,
$$
T J_k + \frac{\alpha}{1-\alpha}\, \beta_k\, c\, v \ \ge\ J_{k+1} \ \ge\ T J_{k+1} - \beta_{k+1}\, c\, v, \tag{6.9}
$$
where $\beta_k$ is the scalar given by
$$
\beta_k = \begin{cases} 1 & \text{if } k = 0, \\ \alpha^{m_0 + \cdots + m_{k-1}} & \text{if } k > 0, \end{cases} \tag{6.10}
$$
with $m_j$, $j = 0, 1, \ldots$, being the integers used in the algorithm (6.1).
Proof: We prove Eq. (6.9) by induction on $k$, using Lemma 6.2. For $k = 0$, using Eq. (6.6) with $J = J_0$, $\mu = \mu^0$, and $k = m_0$, we have
$$
T J_0 \ge J_1 - \frac{\alpha}{1-\alpha}\, c\, v = J_1 - \frac{\alpha}{1-\alpha}\, \beta_0\, c\, v,
$$
showing the left-hand side of Eq. (6.9) for $k = 0$. Also by Eq. (6.7) with $\mu = \mu^0$ and $k = m_0$, we have
$$
J_1 \ge T J_1 - \alpha^{m_0} c\, v = T J_1 - \beta_1\, c\, v,
$$
showing the right-hand side of Eq. (6.9) for $k = 0$.

Assuming that Eq. (6.9) holds for $k - 1 \ge 0$, we will show that it holds for $k$. Indeed, the right-hand side of the induction hypothesis yields
$$
J_k \ge T J_k - \beta_k\, c\, v.
$$
Using Eqs. (6.6) and (6.7) with $J = J_k$, $\mu = \mu^k$, and $k = m_k$, we obtain
$$
T J_k \ge J_{k+1} - \frac{\alpha}{1-\alpha}\, \beta_k\, c\, v,
$$
and
$$
J_{k+1} \ge T J_{k+1} - \alpha^{m_k} \beta_k\, c\, v = T J_{k+1} - \beta_{k+1}\, c\, v,
$$
respectively. This completes the induction. Q.E.D.
The next lemma essentially proves the convergence of the generalized optimistic PI (Prop. 6.1) and
provides associated error bounds.
Lemma 6.4: Let Assumptions 2.1 and 2.2 hold, let $\{(J_k, \mu^k)\}$ be a sequence generated by the PI algorithm (6.1), and let $c \ge 0$ be a scalar such that
$$
\|J_0 - T J_0\| \le c. \tag{6.11}
$$
Then for all $k \ge 0$,
$$
J_k + \frac{\alpha^k}{1-\alpha}\, c\, v \ \ge\ J_k + \frac{\beta_k}{1-\alpha}\, c\, v \ \ge\ J^* \ \ge\ J_k - \frac{(k+1)\alpha^k}{1-\alpha}\, c\, v, \tag{6.12}
$$
where $\beta_k$ is defined by Eq. (6.10).
Proof: Using the relation $J_0 \ge T J_0 - c\, v$ [cf. Eq. (6.11)] and Lemma 6.3, we have
$$
J_k \ge T J_k - \beta_k\, c\, v, \qquad k = 0, 1, \ldots.
$$
Using this relation in Lemma 6.1(b) with $W = T$ and $k = 0$, we obtain
$$
J_k \ge J^* - \frac{\beta_k}{1-\alpha}\, c\, v,
$$
which together with the fact $\alpha^k \ge \beta_k$, shows the left-hand side of Eq. (6.12).

Using the relation $T J_0 \ge J_0 - c\, v$ [cf. Eq. (6.11)] and Lemma 6.1(b) with $W = T$, we have
$$
J^* \ge T^k J_0 - \frac{\alpha^k}{1-\alpha}\, c\, v, \qquad k = 0, 1, \ldots. \tag{6.13}
$$
Using again the relation $J_0 \ge T J_0 - c\, v$ in conjunction with Lemma 6.3, we also have
$$
T J_j \ge J_{j+1} - \frac{\alpha}{1-\alpha}\, \beta_j\, c\, v, \qquad j = 0, \ldots, k-1.
$$
Applying $T^{k-j-1}$ to both sides of this inequality and using the monotonicity and contraction properties of $T^{k-j-1}$, we obtain
$$
T^{k-j} J_j \ge T^{k-j-1} J_{j+1} - \frac{\alpha^{k-j}}{1-\alpha}\, \beta_j\, c\, v, \qquad j = 0, \ldots, k-1,
$$
cf. Lemma 6.1(a). By adding this relation over $j = 0, \ldots, k-1$, and using the fact $\beta_j \le \alpha^j$, it follows that
$$
T^k J_0 \ge J_k - \sum_{j=0}^{k-1} \frac{\alpha^{k-j}}{1-\alpha}\, \alpha^j\, c\, v = J_k - \frac{k\,\alpha^k}{1-\alpha}\, c\, v. \tag{6.14}
$$
Finally, by combining Eqs. (6.13) and (6.14), we obtain the right-hand side of Eq. (6.12). Q.E.D.
Proof of Props. 6.1 and 6.2: Let $c$ be a scalar satisfying Eq. (6.11). Then the error bounds (6.12) show that $\lim_{k\to\infty} \|J_k - J^*\| = 0$, i.e., the first part of Prop. 6.1. The second part (finite termination when the number of policies is finite) follows similarly to Prop. 5.1. The proof of Prop. 6.2 follows using the Compactness and Continuity Assumption 5.1, and the convergence argument of Prop. 5.2. Q.E.D.
Convergence Rate Issues

Let us consider the convergence rate bounds of Lemma 6.4 for generalized optimistic PI, and write them in the form
$$
\|J_0 - T J_0\| \le c \quad \Rightarrow \quad J_k - \frac{(k+1)\alpha^k}{1-\alpha}\, c\, v \ \le\ J^* \ \le\ J_k + \frac{\alpha^{m_0 + \cdots + m_{k-1}}}{1-\alpha}\, c\, v. \tag{6.15}
$$
We may contrast these bounds with the ones for generalized VI, where
$$
\|J_0 - T J_0\| \le c \quad \Rightarrow \quad T^k J_0 - \frac{\alpha^k}{1-\alpha}\, c\, v \ \le\ J^* \ \le\ T^k J_0 + \frac{\alpha^k}{1-\alpha}\, c\, v \tag{6.16}
$$
[cf. Prop. 2.4(c)].
In comparing the bounds (6.15) and (6.16), we should also take into account the associated overhead for a single iteration of each method: optimistic PI requires at iteration $k$ a single application of $T$ and $m_k - 1$ applications of $T_{\mu^k}$ (each being less time-consuming than an application of $T$), while VI requires a single application of $T$. It can then be seen that the upper bound for optimistic PI is better than the one for VI (same bound for less overhead), while the lower bound for optimistic PI is worse than the one for VI (worse bound for more overhead). This suggests that the choice of the initial condition $J_0$ is important in optimistic PI, and in particular it is preferable to have $J_0 \ge T J_0$ (implying convergence to $J^*$ from above) rather than $J_0 \le T J_0$ (implying convergence to $J^*$ from below). This is consistent with the results of other works, which indicate that the convergence properties of the method are fragile when the condition $J_0 \ge T J_0$ does not hold (see [WiB93], [BeT96], [BeY10a], [BeY10b], [YuB11]).
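As a rough numerical illustration of this comparison (with arbitrarily chosen values of $\alpha$, $c$, and $m_k$, not tied to any example in the paper), the upper bound in (6.15) after $k$ optimistic PI iterations coincides with the VI upper bound in (6.16) obtained with $m_0 + \cdots + m_{k-1}$ full applications of $T$, while only $k$ of the optimistic PI applications are full applications of $T$:

```python
# Compare the upper bounds in (6.15) and (6.16) for one arbitrary setting.
alpha, c, k = 0.9, 1.0, 10
m = [5] * k                                   # m_0, ..., m_{k-1}

exponent = sum(m)                             # m_0 + ... + m_{k-1}
bound_opi = alpha ** exponent / (1 - alpha) * c   # optimistic PI after k iterations
bound_vi  = alpha ** exponent / (1 - alpha) * c   # VI after sum(m) iterations: same bound

# Both methods perform sum(m) mapping applications in total, but only k of the
# optimistic PI applications are full T's; the rest are cheaper T_mu applications.
work_opi_full_T = k
work_vi_full_T = exponent
print(bound_opi, bound_vi, work_opi_full_T, work_vi_full_T)
```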
6.2 Approximate Optimistic Policy Iteration

We now consider error bounds for the case where the policy evaluation and policy improvement operations are approximate, similar to the nonoptimistic PI case of Section 5.1. In particular, we consider a method that generates a sequence of policies $\{\mu^k\}$ and a corresponding sequence of approximate cost functions $\{J_k\}$ satisfying
$$
\|J_k - T_{\mu^k}^{m_k} J_{k-1}\| \le \delta, \qquad \|T_{\mu^{k+1}} J_k - T J_k\| \le \epsilon, \qquad k = 0, 1, \ldots, \tag{6.17}
$$
[cf. Eq. (5.2)]. For example, we may compute (perhaps approximately, by simulation) the values $\big(T_{\mu^k}^{m_k} J_{k-1}\big)(x)$ for a subset of states $x$, and use a least squares fit of these values to select $J_k$ from some parametric class of functions.
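As an illustration of the preceding remark (with a hypothetical linear parametric class $J(x) = \phi(x)'\theta$ and a hypothetical sampling scheme, neither of which is specified in the paper), the fitting step may look as follows.

```python
# Least squares fit of J_k to sampled values of (T_{mu^k}^{m_k} J_{k-1})(x).
import numpy as np

def fit_cost_function(sample_states, target_values, phi):
    # sample_states: states x at which (T_{mu^k}^{m_k} J_{k-1})(x) was estimated,
    # target_values[x]: the (possibly simulation-based) estimate at x,
    # phi(x): feature vector defining the parametric class J(x) = phi(x) @ theta.
    Phi = np.array([phi(x) for x in sample_states])
    y = np.array([target_values[x] for x in sample_states])
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return lambda x: phi(x) @ theta           # the fitted J_k
```

The fitted $J_k$ then enters Eq. (6.17), with $\delta$ accounting for both the fitting error and the simulation noise.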
We will prove the same error bound as for the nonoptimistic case, cf. Eq. (5.3). However, for this we
will need the following condition, which is stronger than the contraction and monotonicity conditions that
we have been using so far.
Assumption 6.1: (Semilinear Monotonic Contraction) For all $J \in B(X)$ and $\mu \in \mathcal{M}$, the functions $T_\mu J$ and $T J$ belong to $B(X)$. Furthermore, for some $\alpha \in (0,1)$, we have
$$
\frac{(T_\mu J')(x) - (T_\mu J)(x)}{v(x)} \ \le\ \alpha \sup_{y \in X} \frac{J'(y) - J(y)}{v(y)}, \qquad \forall\ J, J' \in B(X),\ \mu \in \mathcal{M},\ x \in X. \tag{6.18}
$$
This assumption implies both the Contraction and Monotonicity Assumptions 2.1 and 2.2, as can be easily verified. Moreover the assumption is satisfied in all of the discounted DP examples of Section 2, as well as the SSP problem of the next section. It holds if $T_\mu$ is a linear mapping involving a matrix with nonnegative components that has spectral radius less than 1 (or more generally if $T_\mu$ is the minimum or the maximum of a finite number of such linear mappings).
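As a small numeric check of the last claim (not from the paper), consider a linear mapping $T_\mu J = b + A J$ with a strictly positive matrix $A$ of spectral radius less than 1. Taking the weight vector $v$ to be the Perron eigenvector of $A$, condition (6.18) holds with $\alpha$ equal to the spectral radius, since $A v = \alpha v$ and $A$ has nonnegative entries.

```python
# Verify M(T_mu J - T_mu J') <= alpha * M(J - J') for a random positive matrix,
# where M(y) = max_x y(x)/v(x) is the quantity introduced in Eq. (6.19) below.
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n)) + 0.1                          # strictly positive matrix
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))       # rescale: spectral radius = 0.9

eigvals, eigvecs = np.linalg.eig(A)
i = np.argmax(eigvals.real)
alpha = eigvals[i].real                               # Perron root = spectral radius
v = np.abs(eigvecs[:, i].real)                        # Perron eigenvector, positive

M = lambda y: np.max(y / v)
J, Jp = rng.standard_normal(n), rng.standard_normal(n)
# T_mu J - T_mu J' = A (J - J'), so condition (6.20) reads M(A d) <= alpha * M(d).
d = J - Jp
assert M(A @ d) <= alpha * M(d) + 1e-9
```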
For any function $y \in B(X)$, let us use the notation
$$
M(y) = \sup_{x \in X} \frac{y(x)}{v(x)}. \tag{6.19}
$$
Then the condition (6.18) can be written as
$$
M(T_\mu J - T_\mu J') \ \le\ \alpha\, M(J - J'), \qquad \forall\ J, J' \in B(X),\ \mu \in \mathcal{M}, \tag{6.20}
$$
and also implies the following multistep versions,
$$
T_\mu^\ell J - T_\mu^\ell J' \ \le\ \alpha^\ell M(J - J')\, v, \qquad M\big(T_\mu^\ell J - T_\mu^\ell J'\big) \ \le\ \alpha^\ell M(J - J'), \qquad \forall\ J, J' \in B(X),\ \mu \in \mathcal{M},\ \ell \ge 1, \tag{6.21}
$$
which can be proved by induction using Eq. (6.20). We have the following proposition, whose proof follows closely the original one by Thiery and Scherrer [ThS10b], given for the case of a discounted MDP.
Proposition 6.3: (Error Bound for Optimistic Approximate PI) Let Assumption 6.1 hold. Then the sequence $\{\mu^k\}$ generated by the optimistic approximate PI algorithm (6.17) satisfies
$$
\limsup_{k\to\infty} \|J_{\mu^k} - J^*\| \ \le\ \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}. \tag{6.22}
$$
Proof: Let us fix $k \ge 1$ and for simplicity let us denote
$$
J = J_{k-1}, \qquad \bar{J} = J_k,
$$
$$
\mu = \mu^k, \qquad \bar{\mu} = \mu^{k+1}, \qquad m = m_k, \qquad \bar{m} = m_{k+1},
$$
$$
s = J_\mu - T_\mu^m J, \qquad \bar{s} = J_{\bar{\mu}} - T_{\bar{\mu}}^{\bar{m}} \bar{J}, \qquad t = T_\mu^m J - J^*, \qquad \bar{t} = T_{\bar{\mu}}^{\bar{m}} \bar{J} - J^*.
$$
We have
$$
J_\mu - J^* = J_\mu - T_\mu^m J + T_\mu^m J - J^* = s + t. \tag{6.23}
$$
We will derive recursive relations for $s$ and $t$, which will also involve the residual functions
$$
r = T_\mu J - J, \qquad \bar{r} = T_{\bar{\mu}} \bar{J} - \bar{J}.
$$
We first obtain a relation between $r$ and $\bar{r}$. We have
$$
\begin{aligned}
\bar{r} &= T_{\bar{\mu}} \bar{J} - \bar{J} \\
&= \big(T_{\bar{\mu}} \bar{J} - T_\mu \bar{J}\big) + \big(T_\mu \bar{J} - \bar{J}\big) \\
&\le \big(T_{\bar{\mu}} \bar{J} - T \bar{J}\big) + \big(T_\mu \bar{J} - T_\mu (T_\mu^m J)\big) + \big(T_\mu^m J - \bar{J}\big) + \big(T_\mu^m (T_\mu J) - T_\mu^m J\big) \\
&\le \epsilon\, v + \alpha M\big(\bar{J} - T_\mu^m J\big)\, v + \delta\, v + \alpha^m M\big(T_\mu J - J\big)\, v \\
&\le (\epsilon + \delta)\, v + \alpha\delta\, v + \alpha^m M(r)\, v,
\end{aligned}
$$
where the first inequality follows from $T \bar{J} \le T_\mu \bar{J}$, and the second and third inequalities follow from Eqs. (6.17) and (6.21). From this relation we have
$$
M(\bar{r}) \ \le\ \epsilon + (1+\alpha)\delta + \beta\, M(r),
$$
where $\beta = \alpha^m$. Taking $\limsup$ as $k \to \infty$ in this relation, we obtain
$$
\limsup_{k\to\infty} M(r) \ \le\ \frac{\epsilon + (1+\alpha)\delta}{1 - \hat{\beta}}, \tag{6.24}
$$
where
$$
\hat{\beta} = \alpha^{\liminf_{k\to\infty} m_k}.
$$
Next we derive a relation between $s$ and $r$. We have
$$
\begin{aligned}
s &= J_\mu - T_\mu^m J \\
&= T_\mu^m J_\mu - T_\mu^m J \\
&\le \alpha^m M(J_\mu - J)\, v \\
&\le \frac{\alpha^m}{1-\alpha}\, M(T_\mu J - J)\, v \\
&= \frac{\alpha^m}{1-\alpha}\, M(r)\, v,
\end{aligned}
$$
where the first inequality follows from Eq. (6.21) and the second inequality follows by using Prop. 2.4(b). Thus we have $M(s) \le \frac{\alpha^m}{1-\alpha}\, M(r)$, from which by taking $\limsup$ of both sides and using Eq. (6.24), we obtain
$$
\limsup_{k\to\infty} M(s) \ \le\ \frac{\hat{\beta}\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)(1-\hat{\beta})}. \tag{6.25}
$$
Finally we derive a relation between $t$, $\bar{t}$, and $r$. We first note that
$$
\begin{aligned}
T \bar{J} - T J^* &\le \alpha M(\bar{J} - J^*)\, v \\
&= \alpha M\big(\bar{J} - T_\mu^m J + T_\mu^m J - J^*\big)\, v \\
&\le \alpha M\big(\bar{J} - T_\mu^m J\big)\, v + \alpha M\big(T_\mu^m J - J^*\big)\, v \\
&\le \alpha\delta\, v + \alpha M(t)\, v.
\end{aligned}
$$
Using this relation, and Eqs. (6.17) and (6.21), we have
$$
\begin{aligned}
\bar{t} &= T_{\bar{\mu}}^{\bar{m}} \bar{J} - J^* \\
&= \big(T_{\bar{\mu}}^{\bar{m}} \bar{J} - T_{\bar{\mu}}^{\bar{m}-1} \bar{J}\big) + \cdots + \big(T_{\bar{\mu}}^{2} \bar{J} - T_{\bar{\mu}} \bar{J}\big) + \big(T_{\bar{\mu}} \bar{J} - T \bar{J}\big) + \big(T \bar{J} - T J^*\big) \\
&\le \big(\alpha^{\bar{m}-1} + \cdots + \alpha\big) M\big(T_{\bar{\mu}} \bar{J} - \bar{J}\big)\, v + \epsilon\, v + \alpha\delta\, v + \alpha M(t)\, v,
\end{aligned}
$$
so finally
$$
M(\bar{t}) \ \le\ \frac{\alpha - \alpha^{\bar{m}}}{1-\alpha}\, M(\bar{r}) + (\epsilon + \alpha\delta) + \alpha M(t).
$$
By taking $\limsup$ of both sides and using Eq. (6.24), it follows that
$$
\limsup_{k\to\infty} M(t) \ \le\ \frac{(\alpha - \hat{\beta})\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)^2(1-\hat{\beta})} + \frac{\epsilon + \alpha\delta}{1-\alpha}. \tag{6.26}
$$
We now combine Eqs. (6.23), (6.25), and (6.26). We obtain
$$
\begin{aligned}
\limsup_{k\to\infty} M(J_{\mu^k} - J^*) &\le \limsup_{k\to\infty} M(s) + \limsup_{k\to\infty} M(t) \\
&\le \frac{\hat{\beta}\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)(1-\hat{\beta})} + \frac{(\alpha - \hat{\beta})\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)^2(1-\hat{\beta})} + \frac{\epsilon + \alpha\delta}{1-\alpha} \\
&= \frac{\big(\hat{\beta}(1-\alpha) + (\alpha - \hat{\beta})\big)\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)^2(1-\hat{\beta})} + \frac{\epsilon + \alpha\delta}{1-\alpha} \\
&= \frac{\alpha\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)^2} + \frac{\epsilon + \alpha\delta}{1-\alpha} \\
&= \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}.
\end{aligned}
$$
This proves the result, since in view of $J_{\mu^k} \ge J^*$, we have $M(J_{\mu^k} - J^*) = \|J_{\mu^k} - J^*\|$. Q.E.D.
Note that generally, optimistic PI with approximations is susceptible to the instability phenomenon illustrated by Example 4.1. In particular, when $m_k = 1$ for all $k$ in Eq. (6.17), the method becomes essentially identical to approximate VI. However, it appears that choices of $m_k$ that are significantly larger than 1 should be helpful in connection with this difficulty. In particular, it can be verified that in Example 4.1, the method converges to the optimal cost function if $m_k$ is sufficiently large.

A remarkable fact is that approximate VI, approximate PI, and approximate optimistic PI have very similar error bounds (cf. Props. 4.2, 5.3, and 6.3). Approximate VI has a slightly better bound, but insignificantly so in practical terms.
7. STOCHASTIC SHORTEST PATH PROBLEMS
The SSP problem is a total cost infinite horizon DP problem where:

(a) There is no discounting ($\alpha = 1$).

(b) The state space is $X = \{0, 1, \ldots, n\}$ and we are given transition probabilities, denoted by
$$
p_{ij}(u) = P\big(x_{k+1} = j \mid x_k = i,\ u_k = u\big), \qquad i, j \in X,\ u \in U(i).
$$

(c) The control constraint set $U(i)$ is a compact subset of a metric space for all $i \in X$.

(d) A cost $g(i, u)$ is incurred when control $u \in U(i)$ is selected at state $i$.

(e) State 0 is a special destination state, which is absorbing and cost-free, i.e.,
$$
p_{00}(u) = 1,
$$
and for all $u \in U(0)$, $g(0, u) = 0$.
We have assumed for convenience that the cost per stage does not depend on the successor state. This amounts to using expected cost per stage in all calculations. In particular, if the cost of applying control $u$ at state $i$ and moving to state $j$ is $\tilde{g}(i, u, j)$, we use as cost per stage the expected cost
$$
g(i, u) = \sum_{j=0}^{n} p_{ij}(u)\, \tilde{g}(i, u, j),
$$
and the subsequent analysis goes through with no change.
Since the destination 0 is cost-free and absorbing, the cost starting from 0 is zero for every policy. Accordingly, for all cost functions, we ignore the component that corresponds to 0, and define
$$
H(i, u, J) = g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J(j), \qquad i = 1, \ldots, n,\ u \in U(i),\ J \in \Re^n.
$$
The mappings $T$ and $T_\mu$ are defined by
$$
(T J)(i) = \min_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J(j) \Big], \qquad i = 1, \ldots, n,
$$
$$
(T_\mu J)(i) = g\big(i, \mu(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu(i)\big) J(j), \qquad i = 1, \ldots, n.
$$
Note that $H$ satisfies the Monotonicity Assumption 2.2.
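For concreteness (not part of the paper's development), the two mappings may be coded as follows, with the destination state 0 dropped. The arrays P and g are assumed inputs, and the rows of P may sum to less than 1 because the probabilities of moving to the destination are omitted.

```python
# SSP mappings T and T_mu over states 1,...,n (destination 0 dropped, no discounting).
# P[u][i][j] ~ p_{ij}(u) restricted to nondestination states; g[u][i] ~ g(i, u).
import numpy as np

def T(J, P, g):
    Q = g + P @ J                 # Q[u, i] = g(i, u) + sum_j p_ij(u) J(j)
    return Q.min(axis=0)

def T_mu(J, P, g, mu):
    idx = np.arange(len(J))
    return g[mu, idx] + P[mu, idx, :] @ J
```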
We say that a policy $\mu$ is proper if, when using this policy, there is positive probability that the destination will be reached after at most $n$ stages, regardless of the initial state; i.e., if
$$
\rho_\mu = \max_{i=1,\ldots,n} P\{x_n \ne 0 \mid x_0 = i, \mu\} < 1. \tag{7.1}
$$
It can be seen that $\mu$ is proper if and only if in the Markov chain corresponding to $\mu$, each state $i$ is connected to the destination with a path of positive probability transitions.
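A direct way to check condition (7.1) for a given policy (a sketch under the assumptions of this section, not an algorithm from the paper) is to take the $n$-th power of the substochastic matrix of transition probabilities among the nondestination states:

```python
# Check whether a policy mu is proper, i.e., rho_mu < 1 in Eq. (7.1).
import numpy as np

def is_proper(P_mu):
    # P_mu[i][j] ~ p_{ij}(mu(i)) for i, j in {1,...,n}; the destination column is
    # dropped, so row sums give the probability of not having reached the destination.
    n = P_mu.shape[0]
    Q = np.linalg.matrix_power(P_mu, n)     # Q[i, :].sum() = P{x_n != 0 | x_0 = i, mu}
    rho_mu = Q.sum(axis=1).max()
    return rho_mu < 1, rho_mu
```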
Throughout this section we assume that all policies are proper. Without this assumption, the mapping
T need not be a contraction, as is well known (see [BeT91]). On the other hand, we have the following
proposition [see e.g., [BeT96], Prop. 2.2; earlier proofs were given by Veinott [Vei69] (who attributes the
result to A. J. Hoffman), and Tseng [Tse90]]. Note that while [BeT96] assumes that U(i) is finite for all i,
the proof given there, in conjunction with the results of [BeT91], extends to the more general case considered
here.
Proposition 7.1: Assume that all policies are proper. Then, there exists a vector $v = \big(v(1), \ldots, v(n)\big)$ with positive components such that the Contraction Assumption 2.1 holds, and the modulus of contraction is given by
$$
\alpha = \max_{i=1,\ldots,n} \frac{v(i) - 1}{v(i)}.
$$
There is a generalization of the preceding proposition to SSP problems with a destination 0 and a countable number of other states, denoted $1, 2, \ldots$. Let $v(i)$ be the maximum (over all policies) expected number of stages up to termination, starting from state $i$. Then if $v(i)$ is finite and bounded over $i$, the mappings $T$ and $T_\mu$ are contraction mappings with respect to the weighted sup-norm with weight vector $v = \big(v(1), v(2), \ldots\big)$. The proof is similar to the proof of Prop. 7.1, and is given in [Ber12], Section 3.5 and Exercise 2.11.
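One way to obtain a pair $(v, \alpha)$ as in Prop. 7.1 (a sketch under the assumptions of this section, not taken verbatim from the cited proofs) is to let $v(i)$ be the maximal expected number of stages to termination, i.e., the solution of $v(i) = 1 + \max_{u \in U(i)} \sum_j p_{ij}(u) v(j)$. Since then $\sum_j p_{ij}(u) v(j) \le v(i) - 1 \le \alpha\, v(i)$ for every $u$, with $\alpha = \max_i \big(v(i)-1\big)/v(i)$, the weighted sup-norm contraction property follows.

```python
# Compute the weight vector v and modulus alpha of Prop. 7.1 for a finite SSP
# in which all policies are proper. v is found by value iteration on
# v(i) = 1 + max_u sum_j p_ij(u) v(j); convergence relies on all policies being proper.
import numpy as np

def weight_vector(P, tol=1e-10, max_iter=10**6):
    # P[u][i][j] ~ p_{ij}(u) among nondestination states (destination column dropped).
    n = P.shape[1]
    v = np.ones(n)
    for _ in range(max_iter):
        v_new = 1.0 + (P @ v).max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    alpha = np.max((v - 1.0) / v)
    return v, alpha
```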
We finally note that the weighted sup-norm contraction property of Prop. 7.1 is algorithmically significant, because it brings to bear the preceding algorithms and analysis. In particular, the error bounds of Sections 4 and 5 for approximate VI and PI are valid when specialized to SSP problems, and the optimistic PI method of Section 6.1 is convergent. We provide the corresponding analysis in the next two subsections.
7.1 Approximate Policy Iteration
Consider an approximate PI algorithm that generates a sequence of stationary policies $\{\mu^k\}$ and a corresponding sequence of approximate cost vectors $\{J_k\}$ satisfying for all $k$
$$
\|J_k - J_{\mu^k}\| \le \delta, \qquad \|T_{\mu^{k+1}} J_k - T J_k\| \le \epsilon, \tag{7.2}
$$
where $\delta$ and $\epsilon$ are some positive scalars, and $\|\cdot\|$ is the weighted sup-norm
$$
\|J\| = \max_{i=1,\ldots,n} \frac{|J(i)|}{v(i)}.
$$
The following proposition provides error bounds that are special cases of the ones of Props. 5.3 and
5.5.
Proposition 7.2: (Error Bound for Approximate PI) The sequence $\{\mu^k\}$ generated by the approximate PI algorithm (7.2) satisfies
$$
\|J_{\mu^{k+1}} - J^*\| \ \le\ \alpha \|J_{\mu^k} - J^*\| + \frac{\epsilon + 2\alpha\delta}{1-\alpha}, \tag{7.3}
$$
and
$$
\limsup_{k\to\infty} \|J_{\mu^k} - J^*\| \ \le\ \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}, \tag{7.4}
$$
where $\|\cdot\|$ is the weighted sup-norm of Prop. 7.1, and $\alpha$ is the associated contraction modulus. Moreover, when $\{\mu^k\}$ converges to some $\bar{\mu}$, in the sense that
$$
\mu^{\bar{k}+1} = \mu^{\bar{k}} = \bar{\mu} \quad \text{for some } \bar{k},
$$
we have
$$
\|J_{\bar{\mu}} - J^*\| \ \le\ \frac{\epsilon + 2\alpha\delta}{1-\alpha}.
$$
The standard result in the literature for the approximate PI method of this section has the form
$$
\limsup_{k\to\infty}\ \max_{i=1,\ldots,n} \big|J_{\mu^k}(i) - J^*(i)\big| \ \le\ \frac{n(1 - \alpha + n)(\epsilon + 2\delta)}{(1-\rho)^2}, \tag{7.5}
$$
where $\rho$ is defined as the maximal (over all initial states and policies) probability of the Markov chain not having terminated after $n$ transitions (see [BeT96], Prop. 6.3). While the bounds (7.4) and (7.5) involve different norms (one is weighted and the other is unweighted) and different denominators, the error bound (7.4) seems stronger, particularly for large $n$.
7.2 Optimistic Policy Iteration
Consider the optimistic PI method, whereby starting with a vector $J_0 \in \Re^n$, we generate sequences $\{J_k\}$ and $\{\mu^k\}$ with the algorithm
$$
T_{\mu^k} J_k = T J_k, \qquad J_{k+1} = T_{\mu^k}^{m_k} J_k, \qquad k = 0, 1, \ldots, \tag{7.6}
$$
where $\{m_k\}$ is a sequence of positive integers. Then the convergence result and convergence rate estimates of Section 6 hold. In particular, we have the following.
Proposition 7.3: (Convergence of Optimistic PI) Assume that all stationary policies are proper, let $\{(J_k, \mu^k)\}$ be a sequence generated by the optimistic PI algorithm (7.6), and assume that for some $c \ge 0$ we have
$$
\|J_0 - T J_0\| \le c,
$$
where $\|\cdot\|$ is the weighted sup-norm of Prop. 7.1. Then for all $k \ge 0$,
$$
J_k + \frac{\alpha^k}{1-\alpha}\, c\, v \ \ge\ J_k + \frac{\beta_k}{1-\alpha}\, c\, v \ \ge\ J^* \ \ge\ J_k - \frac{(k+1)\alpha^k}{1-\alpha}\, c\, v, \tag{7.7}
$$
where $v$ and $\alpha$ are the weight vector and the contraction modulus of Prop. 7.1, and $\beta_k$ is defined by
$$
\beta_k = \begin{cases} 1 & \text{if } k = 0, \\ \alpha^{m_0 + \cdots + m_{k-1}} & \text{if } k > 0, \end{cases} \tag{7.8}
$$
with $m_j$, $j = 0, 1, \ldots$, being the integers used in the algorithm (7.6). Moreover, we have
$$
\lim_{k\to\infty} \|J_k - J^*\| = 0,
$$
and $J_{\mu^k} = J^*$ for all $k$ greater than some index $\bar{k}$.
To our knowledge, this is the first convergence result for optimistic PI applied to SSP problems (earlier results by Williams and Baird [WiB93] for discounted MDP may be easily extended to SSP, but require restrictive conditions, such as $T J_0 \le J_0$). A similar result can be obtained when the mappings $T$ and $T_\mu$ are replaced in the algorithm (7.6) by any monotone mappings $W$ and $W_\mu$ that are contractions with respect to a common weighted sup-norm, and have $J^*$ and $J_\mu$ as their unique fixed points, respectively. This latter property is true in particular if $W$ is the Gauss-Seidel mapping based on $T$ [this is the mapping where $(W J)(i)$ is computed by the same equation as $(T J)(i)$ except that the previously calculated values $(W J)(1), \ldots, (W J)(i-1)$ are used in place of $J(1), \ldots, J(i-1)$].
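A sketch of such a Gauss-Seidel mapping $W$ based on $T$, in the notation of the SSP sketch given earlier in this section (the arrays P and g are the same assumed inputs), is as follows.

```python
# Gauss-Seidel sweep: (W J)(i) is computed as (T J)(i), but with the already
# updated values (W J)(1),...,(W J)(i-1) used in place of J(1),...,J(i-1).
import numpy as np

def gauss_seidel_T(J, P, g):
    W = J.copy()
    for i in range(len(J)):
        W[i] = (g[:, i] + P[:, i, :] @ W).min()
    return W
```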
7.3 Approximate Optimistic Policy Iteration
Consider an optimistic approximate PI algorithm that generates a sequence of stationary policies $\{\mu^k\}$ and a corresponding sequence of approximate cost vectors $\{J_k\}$ satisfying for all $k$
$$
\|J_k - T_{\mu^k}^{m_k} J_{k-1}\| \le \delta, \qquad \|T_{\mu^{k+1}} J_k - T J_k\| \le \epsilon, \qquad k = 0, 1, \ldots, \tag{7.9}
$$
[cf. Eq. (6.17)]. The following proposition provides an error bound that is a special case of the one of Prop. 6.3 (the Semilinear Monotonic Contraction Assumption 6.1 is clearly satisfied under our assumptions).
Proposition 7.4: (Error Bound for Optimistic Approximate PI) The sequence $\{\mu^k\}$ generated by the approximate PI algorithm (7.9) satisfies
$$
\limsup_{k\to\infty} \|J_{\mu^k} - J^*\| \ \le\ \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}. \tag{7.10}
$$
8. CONCLUSIONS
We have considered an abstract and broadly applicable DP model based on weighted sup-norm contractions,
and provided a review of classical results and extensions to approximation methods that are the focus
of current research. By virtue of its abstract character, the analysis provides insight into fundamental
convergence properties and error bounds of exact and approximate VI and PI algorithms. Moreover, it
allows simple proofs of results that would be difficult and/or tedious to obtain by alternative methods. The
power of our analysis was illustrated by its application to SSP problems, where substantial improvements of
the existing results on PI algorithms were obtained.
9. REFERENCES
[BeS78] Bertsekas, D. P., and Shreve, S. E., 1978. Stochastic Optimal Control: The Discrete Time Case, Academic
Press, N. Y.; may be downloaded from
http://web.mit.edu/dimitrib/www/home.html
[BeT91] Bertsekas, D. P., and Tsitsiklis, J. N., 1991. “An Analysis of Stochastic Shortest Path Problems,” Math.
Operations Research, Vol. 16, pp. 580-595.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.
[BeY10a] Bertsekas, D. P., and Yu, H., 2010. “Q-Learning and Enhanced Policy Iteration in Discounted Dynamic
Programming,” Lab. for Information and Decision Systems Report LIDS-P-2831, MIT; to appear in Mathematics of
Operations Research.
[BeY10b] Bertsekas, D. P., and Yu, H., 2010. “Asynchronous Distributed Policy Iteration in Dynamic Programming,”
Proc. of Allerton Conf. on Information Sciences and Systems.
[Ber77] Bertsekas, D. P., 1977. “Monotone Mappings with Application in Dynamic Programming,” SIAM J. on
Control and Optimization, Vol. 15, pp. 438-464.
[Ber07] Bertsekas, D. P., 2007. Dynamic Programming and Optimal Control, 3rd Edition, Vol. II, Athena Scientific,
Belmont, MA.
[Ber11] Bertsekas, D. P., 2011. “λ-Policy Iteration: A Review and a New Implementation,” Lab. for Information and Decision Systems Report LIDS-P-2874, MIT; to appear in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, by F. Lewis and D. Liu (eds.), IEEE Press Computational Intelligence Series.
[CaR11] Canbolat, P. G., and Rothblum, U. G., 2011. “(Approximate) Iterated Successive Approximations Algo-
rithm for Sequential Decision Processes,” Technical Report, The Technion - Israel Institute of Technology; Annals of
Operations Research, to appear.
[Den67] Denardo, E. V., 1967. “Contraction Mappings in the Theory Underlying Dynamic Programming,” SIAM
Review, Vol. 9, pp. 165-177.
[Har72] Harrison, J. M., 1972. “Discrete Dynamic Programming with Unbounded Rewards,” Ann. Math. Stat., Vol.
43, pp. 636-644.
[Lip73] Lippman, S. A., 1973. “Semi-Markov Decision Processes with Unbounded Rewards,” Management Sci., Vol.
21, pp. 717-731.
[Lip75] Lippman, S. A., 1975. “On Dynamic Programming with Unbounded Rewards,” Management Sci., Vol. 19,
pp. 1225-1233.
[Put94] Puterman, M. L., 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming, J. Wiley,
N.Y.
[Rot79] Rothblum, U. G., 1979. “Iterated Successive Approximation for Sequential Decision Processes,” in Stochastic
Control and Optimization, by J. W. B. van Overhagen and H. C. Tijms (eds), Vrije University, Amsterdam.
[Sch11] Scherrer, B., 2011. “Performance Bounds for λ-Policy Iteration and Application to the Game of Tetris,”
INRIA Lorraine Report, France.
[Sch12] Scherrer, B., 2012. “On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision
Processes,” INRIA Lorraine Report, France.
[ThS10a] Thiery, C., and Scherrer, B., 2010. “Least-Squares λ-Policy Iteration: Bias-Variance Trade-off in Control
Problems,” in ICML’10: Proc. of the 27th Annual International Conf. on Machine Learning.
[ThS10b] Thiery, C., and Scherrer, B., 2010. “Performance Bound for Approximate Optimistic Policy Iteration,”
Technical Report, INRIA.
[Tse90] Tseng, P., 1990. “Solving H-Horizon, Stationary Markov Decision Problems in Time Proportional to log(H),”
Operations Research Letters, Vol. 9, pp. 287-297.
[Vei69] Veinott, A. F., Jr., 1969. “Discrete Dynamic Programming with Sensitive Discount Optimality Criteria,” Ann.
Math. Statist., Vol. 40, pp. 1635-1660.
[VeP84] Verdú, S., and Poor, H. V., 1984. “Backward, Forward, and Backward-Forward Dynamic Programming Models under Commutativity Conditions,” Proc. 1984 IEEE Decision and Control Conference, Las Vegas, NV, pp. 1081-1086.
[VeP87] Verdú, S., and Poor, H. V., 1987. “Abstract Dynamic Programming Models under Commutativity Conditions,” SIAM J. on Control and Optimization, Vol. 25, pp. 990-1006.
[WiB93] Williams, R. J., and Baird, L. C., 1993. “Analysis of Some Incremental Variants of Policy Iteration: First
Steps Toward Understanding Actor-Critic Learning Systems,” Report NU-CCS-93-11, College of Computer Science,
Northeastern University, Boston, MA.
[YuB11] Yu, H., and Bertsekas, D. P., 2011. “Q-Learning and Policy Iteration Algorithms for Stochastic Shortest
Path Problems,” Lab. for Information and Decision Systems Report LIDS-P-2871, MIT; to appear in Annals of OR.
[YuB12] Yu, H., and Bertsekas, D. P., 2012. “Weighted Bellman Equations and their Applications in Dynamic Programming,” Lab. for Information and Decision Systems Report LIDS-P-2876, MIT.