May 2012 Report LIDS - 2884
Weighted Sup-Norm Contractions in Dynamic Programming:
A Review and Some New Applications
Dimitri P. Bertsekas
Abstract
We consider a class of generalized dynamic programming models based on weighted sup-norm contrac-
tions. We provide an analysis that parallels the one available for discounted MDP and for generalized models
based on unweighted sup-norm contractions. In particular, we discuss the main properties and associated
algorithms of these models, including value iteration, policy iteration, and their optimistic and approximate
variants. The analysis relies on several earlier works that use more specialized assumptions. In particular,
we review and extend the classical results of Denardo [Den67] for unweighted sup-norm contraction models,
as well as more recent results relating to approximation methods for discounted MDP. We also apply the
analysis to stochastic shortest path problems where all policies are assumed proper. For these problems we
extend three results that are known for discounted MDP. The first relates to the convergence of optimistic
policy iteration and extends a result of Rothblum [Rot79], the second relates to error bounds for approxi-
mate policy iteration and extends a result of Bertsekas and Tsitsiklis [BeT96], and the third relates to error
bounds for approximate optimistic policy iteration and extends a result of Thiery and Scherrer [ThS10b].
Dimitri Bertsekas is with the Dept. of Electr. Engineering and Comp. Science, M.I.T., Cambridge, Mass., 02139.
His research was supported by NSF Grant ECCS-0801549, and by the Air Force Grant FA9550-10-1-0412.
1. INTRODUCTION
Two key structural properties of total cost dynamic programming (DP) models are responsible for most of
the mathematical results one can prove about them. The first is the monotonicity property of the mappings
associated with Bellman’s equation. In many models, however, these mappings have another property
that strengthens the effects of monotonicity: they are contraction mappings with respect to a sup-norm,
unweighted in many models, such as discounted finite-spaces Markovian decision problems (MDP), but also
weighted in some other models, discounted or undiscounted. An important case of the latter is the class of
stochastic shortest path (SSP) problems, under certain conditions to be discussed in Section 7.
The role of contraction mappings in discounted DP was first recognized and exploited by Shapley
[Sha53], who considered two-player dynamic games. Since that time the underlying contraction properties
of discounted DP problems have been explicitly or implicitly used by most authors that have dealt with the
subject. An abstract DP model, based on unweighted sup-norm contraction assumptions, was introduced
in an important paper by Denardo [Den67]. This model provided generality and insight into the principal
analytical and algorithmic ideas underlying the discounted DP research up to that time. Denardo’s model
motivated a related model by the author [Ber77], which relies only on monotonicity properties, and not
on contraction assumptions. These two models were used extensively in the book by Bertsekas and Shreve
[BeS78] for the analysis of both discounted and undiscounted DP problems, ranging over MDP, minimax,
risk sensitive, Borel space models, and models based on outer integration. Related analysis, motivated by
problems in communications, was given by Verdú and Poor [VeP84], [VeP87]. See also Bertsekas and Yu
[BeY10b], which considers policy iteration methods using the abstract DP model of [Ber77].
In this paper, we extend Denardo’s model to weighted sup-norm contractions, and we provide a full
set of analytical and algorithmic results that parallel the classical ones for finite-spaces discounted MDP,
as well as some of the corresponding results for unweighted sup-norm contractions. These results include
extensions of relatively recent research on approximation methods, which have been shown for discounted
MDP with bounded cost per stage. Our motivation stems from the fact that there are important discounted
DP models with unbounded cost per stage, as well as undiscounted DP models of the SSP type, where there
is contraction structure that requires, however, a weighted sup-norm. We obtain, among others, three new
algorithmic results for SSP problems, which are given in Section 7. The first relates to the convergence of
optimistic (also commonly referred to as “modified” [Put94]) policy iteration, and extends the one originally
proved by Rothblum [Rot79] within Denardo’s unweighted sup-norm contraction framework. The second
relates to error bounds for approximate policy iteration, and extends a result of Bertsekas and Tsitsiklis
[BeT96] (Prop. 6.2), given for discounted MDP, and improves on another result of [BeT96] (Prop. 6.3) for
SSP. The third relates to error bounds for approximate optimistic policy iteration, and extends a result of
Thiery and Scherrer [ThS10a], [ThS10b], given for discounted MDP. A recently derived error bound for a
Q-learning framework for optimistic policy iteration in SSP problems, due to Yu and Bertsekas [YuB11], can
also be proved using our framework.
2. A WEIGHTED SUP-NORM CONTRACTION FRAMEWORK FOR DP
Let X and U be two sets, which in view of connections to DP that will become apparent shortly, we will
loosely refer to as a set of “states” and a set of “controls.” For each x ∈ X, let U(x) ⊂ U be a nonempty
subset of controls that are feasible at state x. Consistent with the DP context, we refer to a function
µ : X ↦ U with µ(x) ∈ U(x), for all x ∈ X, as a “policy.” We denote by M the set of all policies.
Let R(X) be the set of real-valued functions J : X ↦ ℜ, and let H : X × U × R(X) ↦ ℜ be a given
mapping. We consider the mapping T defined by
\[
(TJ)(x) = \inf_{u \in U(x)} H(x, u, J), \qquad x \in X.
\]
We assume that (TJ)(x) > −∞ for all x ∈ X, so that T maps R(X) into R(X). For each policy µ ∈ M, we
consider the mapping T_µ : R(X) ↦ R(X) defined by
\[
(T_\mu J)(x) = H\bigl(x, \mu(x), J\bigr), \qquad x \in X.
\]
We want to find a function J* ∈ R(X) such that
\[
J^*(x) = \inf_{u \in U(x)} H(x, u, J^*), \qquad x \in X,
\]
i.e., find a fixed point of T. We also want to obtain a policy µ* such that T_{µ*} J* = T J*.
Note that in view of the preceding definitions, H may be alternatively defined by first specifying T_µ
for all µ ∈ M [for any (x, u, J), H(x, u, J) is equal to (T_µ J)(x) for any µ such that µ(x) = u]. Moreover T may
be defined by
\[
(TJ)(x) = \inf_{\mu \in \mathcal{M}} (T_\mu J)(x), \qquad x \in X,\ J \in R(X).
\]
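For concreteness, the following Python sketch (an illustration with hypothetical data, not part of the paper's analysis) instantiates H, T_µ, and T for a small finite-state, finite-control problem, exactly as in the definitions above.

```python
# A minimal sketch of the abstract framework: X and U(x) are small finite sets,
# H is supplied as a function, and T, T_mu are built from it as defined above.
# The concrete H used here (a 2-state discounted example) is only an illustration.
ALPHA = 0.9
X = [0, 1]
U = {0: ['a', 'b'], 1: ['a']}                      # feasible controls U(x)

def H(x, u, J):
    # An illustrative discounted-cost mapping: cost g(x,u) plus ALPHA * J(next state).
    g = {(0, 'a'): 1.0, (0, 'b'): 2.0, (1, 'a'): 0.5}
    nxt = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1}
    return g[(x, u)] + ALPHA * J[nxt[(x, u)]]

def T_mu(mu, J):
    # (T_mu J)(x) = H(x, mu(x), J)
    return {x: H(x, mu[x], J) for x in X}

def T(J):
    # (T J)(x) = min over u in U(x) of H(x, u, J)
    return {x: min(H(x, u, J) for u in U[x]) for x in X}

J = {x: 0.0 for x in X}
print(T(J))                                        # one application of T
print(T_mu({0: 'b', 1: 'a'}, J))                   # one application of T_mu
```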
We give a few examples.
Example 2.1 (Discounted DP Problems)
Consider an α-discounted total cost DP problem. Here
\[
H(x, u, J) = E\bigl\{ g(x, u, w) + \alpha J\bigl(f(x, u, w)\bigr) \bigr\},
\]
where α ∈ (0, 1), g is a uniformly bounded function representing cost per stage, w is random with distribution
that may depend on (x, u), and the expected value is taken with respect to that distribution. The equation J = TJ, i.e.,
\[
J(x) = \inf_{u \in U(x)} H(x, u, J) = \inf_{u \in U(x)} E\bigl\{ g(x, u, w) + \alpha J\bigl(f(x, u, w)\bigr) \bigr\}, \qquad x \in X,
\]
is Bellman’s equation, and it is known to have unique solution J*. Variants of the above mapping H are
\[
H(x, u, J) = \min\Bigl[ V(x),\ E\bigl\{ g(x, u, w) + \alpha J\bigl(f(x, u, w)\bigr) \bigr\} \Bigr],
\]
and
\[
H(x, u, J) = E\Bigl\{ g(x, u, w) + \alpha \min\bigl[ V\bigl(f(x, u, w)\bigr),\ J\bigl(f(x, u, w)\bigr) \bigr] \Bigr\},
\]
where V is a known function that satisfies V(x) ≥ J*(x) for all x ∈ X. While the use of V in these variants
of H does not affect the solution J*, it may affect favorably the value and policy iteration algorithms to be
discussed in subsequent sections.
Example 2.2 (Discounted Semi-Markov Problems)
With x, y, u as in Example 2.1, consider the mapping
\[
H(x, u, J) = G(x, u) + \sum_{y=1}^{n} m_{xy}(u)\, J(y),
\]
where G is some function representing cost per stage, and m_{xy}(u) are nonnegative numbers with
\[
\sum_{y=1}^{n} m_{xy}(u) < 1, \qquad x \in X,\ u \in U(x).
\]
The equation J = TJ is Bellman’s equation for a continuous-time semi-Markov
decision problem, after it is converted into an equivalent discrete-time problem.
Example 2.3 (Minimax Problems)
Consider a minimax version of Example 2.1, where an antagonistic player chooses v from a set V(x, u), and let
\[
H(x, u, J) = \sup_{v \in V(x,u)} \bigl[ g(x, u, v) + \alpha J\bigl(f(x, u, v)\bigr) \bigr].
\]
Then the equation J = TJ is Bellman’s equation for an infinite horizon minimax DP problem. A generalization
is a mapping of the form
\[
H(x, u, J) = \sup_{v \in V(x,u)} E\bigl\{ g(x, u, v, w) + \alpha J\bigl(f(x, u, v, w)\bigr) \bigr\},
\]
where w is random with given distribution, and the expected value is with respect to that distribution. This
form appears in zero-sum sequential games [Sha53].
Example 2.4 (Deterministic and Stochastic Shortest Path Problems)
Consider a classical deterministic shortest path problem involving a graph of n nodes x = 1, . . . , n, plus a
destination 0, an arc length a_{xu} for each arc (x, u), and the mapping
\[
H(x, u, J) =
\begin{cases}
a_{xu} + J(u) & \text{if } u \neq 0, \\
a_{x0} & \text{if } u = 0,
\end{cases}
\qquad x = 1, \ldots, n,\ u = 0, 1, \ldots, n.
\]
Then the equation J = TJ is Bellman’s equation for the shortest distances J*(x) from the nodes x to node 0.
A generalization is a mapping of the form
\[
H(x, u, J) = p_{x0}(u)\, g(x, u, 0) + \sum_{y=1}^{n} p_{xy}(u) \bigl( g(x, u, y) + J(y) \bigr), \qquad x = 1, \ldots, n.
\]
It corresponds to an SSP problem, which is described in Section 7. A special case is stochastic finite-horizon,
finite-state DP problems.
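As an illustration of the last mapping above, the following sketch (with hypothetical transition data and unit costs, not taken from the paper) forms H for a small SSP problem and applies T repeatedly.

```python
import numpy as np

# Sketch of the SSP mapping H(x,u,J) = p_x0(u) g + sum_y p_xy(u) (g + J(y)), for a
# hypothetical 2-state problem with destination 0, two controls, and unit costs.
n = 2
# p[u][x][y]: probability of moving from state x to y under control u; y = 0 is the destination
p = {0: np.array([[0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]),
     1: np.array([[0.9, 0.1, 0.0], [0.2, 0.6, 0.2]])}
g = 1.0                                          # unit cost per transition, for simplicity

def H(x, u, J):
    probs = p[u][x]                              # [p_x0(u), p_x1(u), p_x2(u)]
    return probs[0] * g + sum(probs[y + 1] * (g + J[y]) for y in range(n))

def T(J):
    return np.array([min(H(x, u, J) for u in (0, 1)) for x in range(n)])

J = np.zeros(n)
for _ in range(200):                             # value iteration; converges when all
    J = T(J)                                     # policies are proper (see Section 7)
print(J)                                         # approximate shortest expected costs
```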
Example 2.5 (Q-Learning I)
Consider the case where X is the set of state-control pairs (i, w), i = 1, . . . , n, w ∈ W(i), of an MDP with
controls w taking values at state i from a finite set W(i). Let T_µ map a Q-factor vector
\[
Q = \bigl\{ Q(i, w) \mid i = 1, \ldots, n,\ w \in W(i) \bigr\}
\]
into the Q-factor vector
\[
\bar{Q}_\mu = \bigl\{ \bar{Q}_\mu(i, w) \mid i = 1, \ldots, n,\ w \in W(i) \bigr\}
\]
with components given by
\[
\bar{Q}_\mu(i, w) = g(i, w) + \alpha \sum_{j=1}^{n} p_{ij}\bigl(\mu(i)\bigr) \min_{v \in W(j)} Q(j, v), \qquad i = 1, \ldots, n,\ w \in W(i).
\]
This mapping corresponds to the classical Q-learning mapping of a finite-state MDP [in relation to the stan-
dard Q-learning framework, [Tsi94], [BeT96], [SuB98], µ applies a control µ(i) from the set U(i, w) = W(i)
independently of the value of w ∈ W(i)]. If α ∈ (0, 1), the MDP is discounted, while if α = 1, the MDP is
undiscounted and when there is a cost-free and absorbing state, it has the character of the SSP problem of the
preceding example.
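A short sketch of the Q-factor mapping above follows, with hypothetical transition probabilities and costs; it simply iterates the mapping for a fixed policy µ.

```python
import numpy as np

# Sketch of the Q-factor mapping of Example 2.5 for a hypothetical 2-state MDP:
# Q_bar_mu(i,w) = g(i,w) + alpha * sum_j p_ij(mu(i)) * min_v Q(j,v).
ALPHA = 0.95
n, m = 2, 2                                   # n states, m controls per state
g = np.array([[1.0, 2.0], [0.5, 0.3]])        # g[i, w]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],       # P[i, w, j] = p_ij(w)
              [[0.5, 0.5], [0.3, 0.7]]])

def T_mu_Q(mu, Q):
    Qmin = Q.min(axis=1)                      # min_v Q(j, v) for each j
    Qbar = np.empty_like(Q)
    for i in range(n):
        for w in range(m):
            Qbar[i, w] = g[i, w] + ALPHA * P[i, mu[i]] @ Qmin
    return Qbar

Q = np.zeros((n, m))
mu = [0, 1]                                   # a fixed policy mu(i)
for _ in range(50):
    Q = T_mu_Q(mu, Q)
print(Q)
```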
Example 2.6 (Q-Learning II)
Consider an alternative Q-learning framework introduced in [BeY10a] for discounted MDP and in [YuB11] for
SSP, where T_µ operates on pairs (Q, V ), and using the notation of the preceding example, Q is a Q-factor vector and
V is a cost vector of the forms
\[
\bigl\{ Q(i, w) \mid i = 1, \ldots, n,\ w \in W(i) \bigr\}, \qquad \bigl\{ V(i) \mid i = 1, \ldots, n \bigr\}.
\]
Let T_µ map a pair (Q, V ) into the pair (\bar{Q}_µ, \bar{V}_µ) with components given by
\[
\bar{Q}_\mu(i, w) = g(i, w) + \alpha \sum_{j=1}^{n} p_{ij}\bigl(\mu(i)\bigr) \sum_{v \in W(j)} \nu(v \mid j) \min\bigl[ V(j),\ Q(j, v) \bigr], \qquad i = 1, \ldots, n,\ w \in W(i),
\]
\[
\bar{V}_\mu(i) = \min_{w \in W(i)} \bar{Q}_\mu(i, w), \qquad i = 1, \ldots, n,
\]
where ν(· | j) is a given conditional distribution over W(j), and α ∈ (0, 1) for a discounted MDP and α = 1 for
an SSP problem.
We also note a variety of discounted countable-state MDP models with unbounded cost per stage,
whose Bellman equation mapping involves a weighted sup-norm contraction. Such models are described in
several sources, starting with works of Harrison [Har72], and Lippman [Lip73], [Lip75] (see also [Ber12],
Section 1.5, and [Put94], and the references quoted there).
Consider a function v : X ↦ ℜ with
\[
v(x) > 0, \qquad x \in X,
\]
denote by B(X) the space of real-valued functions J on X such that J(x)/v(x) is bounded as x ranges over
X, and consider the weighted sup-norm
\[
\|J\| = \sup_{x \in X} \frac{\bigl|J(x)\bigr|}{v(x)}
\]
on B(X). We will use the following assumption.
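In code, the weighted sup-norm over a finite state space is a one-liner; the following sketch uses a hypothetical weight function v.

```python
import numpy as np

# Sketch: the weighted sup-norm ||J|| = sup_x |J(x)| / v(x) over a finite state space,
# with a hypothetical positive weight function v.
v = np.array([1.0, 2.0, 5.0])          # v(x) > 0 for all x

def weighted_sup_norm(J, v):
    return np.max(np.abs(J) / v)

J1 = np.array([0.5, -3.0, 10.0])
J2 = np.array([0.4, -2.0, 12.0])
print(weighted_sup_norm(J1, v))        # ||J1||
print(weighted_sup_norm(J1 - J2, v))   # distance used in the contraction condition (2.1)
```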
Assumption 2.1: (Contraction) For all J ∈ B(X) and µ ∈ M, the functions T_µJ and TJ belong
to B(X). Furthermore, for some α ∈ (0, 1), we have
\[
\|T_\mu J - T_\mu J'\| \le \alpha \|J - J'\|, \qquad J, J' \in B(X),\ \mu \in \mathcal{M}. \tag{2.1}
\]
An equivalent way to state the condition (2.1) is
\[
\frac{\bigl| H(x, u, J) - H(x, u, J') \bigr|}{v(x)} \le \alpha \|J - J'\|, \qquad x \in X,\ u \in U(x),\ J, J' \in B(X).
\]
Note that Eq. (2.1) implies that
\[
\|TJ - TJ'\| \le \alpha \|J - J'\|, \qquad J, J' \in B(X). \tag{2.2}
\]
To see this we write
\[
(T_\mu J)(x) \le (T_\mu J')(x) + \alpha \|J - J'\|\, v(x), \qquad x \in X,
\]
from which, by taking the infimum of both sides over µ ∈ M, we have
\[
\frac{(TJ)(x) - (TJ')(x)}{v(x)} \le \alpha \|J - J'\|, \qquad x \in X.
\]
Reversing the roles of J and J', we also have
\[
\frac{(TJ')(x) - (TJ)(x)}{v(x)} \le \alpha \|J - J'\|, \qquad x \in X,
\]
and combining the preceding two relations, and taking the supremum of the left side over x ∈ X, we obtain
Eq. (2.2).
It can be seen that the Contraction Assumption 2.1 is satisfied for the mapping H in Examples 2.1-
2.3, and the discounted cases of Examples 2.5-2.6, with v equal to the unit function, i.e., v(x) ≡ 1. Generally, the
assumption is not satisfied in Example 2.4, and in the undiscounted cases of Examples 2.5-2.6, but it will be
seen later that it is satisfied for the special case of the SSP problem under the assumption that all stationary
policies are proper (lead to the destination with probability 1, starting from every state). In that case,
however, we cannot take v(x) ≡ 1, and this is one of our motivations for considering the more general case
where v is not the unit function.
The next two examples show how starting with mappings satisfying the contraction assumption, we
can obtain multistep mappings with the same fixed points and a stronger contraction modulus. For any
J ∈ R(X), we denote by T_{µ_0} · · · T_{µ_k} J the composition of the mappings T_{µ_0}, . . . , T_{µ_k} applied to J, i.e.,
\[
T_{\mu_0} \cdots T_{\mu_k} J = T_{\mu_0}\bigl( T_{\mu_1} \cdots ( T_{\mu_{k-1}} ( T_{\mu_k} J ) ) \cdots \bigr).
\]
Example 2.7 (Multistep Mappings)
Consider a set of mappings T_µ : ℜ^n ↦ ℜ^n, µ ∈ M, satisfying Assumption 2.1, let m be a positive integer,
and let M̄ be the set of m-tuples ν = (µ_0, . . . , µ_{m−1}), where µ_k ∈ M, k = 0, 1, . . . , m − 1. For each ν =
(µ_0, . . . , µ_{m−1}) ∈ M̄, define the mapping T̄_ν by
\[
\bar{T}_\nu J = T_{\mu_0} \cdots T_{\mu_{m-1}} J, \qquad J \in B(X).
\]
Then we have the contraction properties
\[
\|\bar{T}_\nu J - \bar{T}_\nu J'\| \le \alpha^m \|J - J'\|, \qquad J, J' \in B(X),
\]
and
\[
\|\bar{T} J - \bar{T} J'\| \le \alpha^m \|J - J'\|, \qquad J, J' \in B(X),
\]
where T̄ is defined by
\[
(\bar{T} J)(x) = \inf_{(\mu_0, \ldots, \mu_{m-1}) \in \bar{\mathcal{M}}} (T_{\mu_0} \cdots T_{\mu_{m-1}} J)(x), \qquad J \in B(X),\ x \in X.
\]
Thus the mappings T̄_ν, ν ∈ M̄, satisfy Assumption 2.1, and have contraction modulus α^m.
The following example considers mappings underlying weighted Bellman equations that arise in various
computational contexts in approximate DP; see Yu and Bertsekas [YuB12] for analysis, algorithms, and
related applications.
Example 2.8 (Weighted Multistep Mappings)
Consider a set of mappings L_µ : B(X) ↦ B(X), µ ∈ M, satisfying Assumption 2.1, i.e., for some α ∈ (0, 1),
\[
\|L_\mu J - L_\mu J'\| \le \alpha \|J - J'\|, \qquad J, J' \in B(X),\ \mu \in \mathcal{M}.
\]
Consider also the mappings T_µ : B(X) ↦ B(X) defined by
\[
(T_\mu J)(x) = \sum_{\ell=1}^{\infty} w_\ell(x)\, (L_\mu^\ell J)(x), \qquad x \in X,\ J \in \Re^n,
\]
where w_ℓ(x) are nonnegative scalars such that for all x ∈ X,
\[
\sum_{\ell=1}^{\infty} w_\ell(x) = 1.
\]
Then it follows that
\[
\|T_\mu J - T_\mu J'\| \le \sum_{\ell=1}^{\infty} w_\ell(x)\, \alpha^\ell\, \|J - J'\|,
\]
showing that T_µ is a contraction with modulus
\[
\bar{\alpha} = \max_{x \in X} \sum_{\ell=1}^{\infty} w_\ell(x)\, \alpha^\ell \le \alpha.
\]
Moreover L_µ and T_µ have a common fixed point for all µ ∈ M, and the same is true for the corresponding
mappings L and T.
We will now consider some general questions, first under the Contraction Assumption 2.1, and then
under an additional monotonicity assumption. Most of the results of this section are straightforward ex-
tensions of results that appear in Denardo’s paper [Den67] for the case where the sup-norm is unweighted
[v(x) ≡ 1].
2.1 Basic Results Under the Contraction Assumption
The contraction property of T_µ and T can be used to show the following proposition.
Proposition 2.1: Let Assumption 2.1 hold. Then:
(a) The mappings T_µ and T are contraction mappings with modulus α over B(X), and have unique
fixed points in B(X), denoted J_µ and J*, respectively.
(b) For any J ∈ B(X) and µ ∈ M,
\[
\lim_{k \to \infty} T_\mu^k J = J_\mu, \qquad \lim_{k \to \infty} T^k J = J^*.
\]
(c) We have T_µJ* = TJ* if and only if J_µ = J*.
(d) For any J ∈ B(X),
\[
\|J^* - J\| \le \frac{1}{1-\alpha} \|TJ - J\|, \qquad \|J^* - TJ\| \le \frac{\alpha}{1-\alpha} \|TJ - J\|.
\]
(e) For any J ∈ B(X) and µ ∈ M,
\[
\|J_\mu - J\| \le \frac{1}{1-\alpha} \|T_\mu J - J\|, \qquad \|J_\mu - T_\mu J\| \le \frac{\alpha}{1-\alpha} \|T_\mu J - J\|.
\]
Proof: We have already shown that T_µ and T are contractions with modulus α over B(X) [cf. Eqs. (2.1)
and (2.2)]. Parts (a) and (b) follow from the classical contraction mapping fixed point theorem. To show part
(c), note that if T_µJ* = TJ*, then in view of TJ* = J*, we have T_µJ* = J*, which implies that J* = J_µ,
since J_µ is the unique fixed point of T_µ. Conversely, if J_µ = J*, we have T_µJ* = T_µJ_µ = J_µ = J* = TJ*.
To show part (d), we use the triangle inequality to write for every k,
\[
\|T^k J - J\| \le \sum_{\ell=1}^{k} \|T^\ell J - T^{\ell-1} J\| \le \sum_{\ell=1}^{k} \alpha^{\ell-1} \|TJ - J\|.
\]
Taking the limit as k → ∞ and using part (b), the left-hand side inequality follows. The right-hand side
inequality follows from the left-hand side and the contraction property of T. The proof of part (e) is similar
to part (d) [indeed part (e) is the special case of part (d) where T is equal to T_µ, i.e., when U(x) = {µ(x)}
for all x ∈ X]. Q.E.D.
Part (c) of the preceding proposition shows that there exists a µ ∈ M such that J_µ = J* if and only if
the minimum of H(x, u, J*) over U(x) is attained for all x ∈ X. Of course the minimum is attained if U(x)
is finite for every x, but otherwise this is not guaranteed in the absence of additional assumptions. Part (d)
provides a useful error bound: we can evaluate the proximity of any function J ∈ B(X) to the fixed point
J* by applying T to J and computing ‖TJ − J‖. The left-hand side inequality of part (e) (with J = J*)
shows that for every ε > 0, there exists a µ_ε ∈ M such that ‖J_{µ_ε} − J*‖ ≤ ε, which may be obtained by
letting µ_ε(x) minimize H(x, u, J*) over U(x) within an error of (1 − α) ε v(x), for all x ∈ X.
2.2 The Role of Monotonicity
Our analysis so far in this section relies only on the contraction assumption. We now introduce a monotonicity
property of a type that is common in DP.
Assumption 2.2: (Monotonicity) If J, J' ∈ R(X) and J ≤ J', then
\[
H(x, u, J) \le H(x, u, J'), \qquad x \in X,\ u \in U(x). \tag{2.3}
\]
Note that the assumption is equivalent to
\[
J \le J' \quad \Rightarrow \quad T_\mu J \le T_\mu J', \qquad \mu \in \mathcal{M},
\]
and implies that
\[
J \le J' \quad \Rightarrow \quad TJ \le TJ'.
\]
An important consequence of monotonicity of H, when it holds in addition to contraction, is that it implies
an optimality property of J*.
Proposition 2.2: Let Assumptions 2.1 and 2.2 hold. Then
\[
J^*(x) = \inf_{\mu \in \mathcal{M}} J_\mu(x), \qquad x \in X. \tag{2.4}
\]
Furthermore, for every ε > 0, there exists µ_ε ∈ M such that
\[
J^*(x) \le J_{\mu_\epsilon}(x) \le J^*(x) + \epsilon\, v(x), \qquad x \in X. \tag{2.5}
\]
Proof: We note that the right-hand side of Eq. (2.5) holds by Prop. 2.1(e) (see the remark following its
proof). Thus inf_{µ∈M} J_µ(x) ≤ J*(x) for all x ∈ X. To show the reverse inequality as well as the left-hand
side of Eq. (2.5), we note that for all µ ∈ M, we have TJ* ≤ T_µJ*, and since J* = TJ*, it follows that
J* ≤ T_µJ*. By applying repeatedly T_µ to both sides of this inequality and by using the Monotonicity
Assumption 2.2, we obtain J* ≤ T_µ^k J* for all k > 0. Taking the limit as k → ∞, we see that J* ≤ J_µ for all
µ ∈ M. Q.E.D.
Propositions 2.1 and 2.2 collectively address the problem of finding µ ∈ M that minimizes J_µ(x)
simultaneously for all x ∈ X, consistently with DP theory. The optimal value of this problem is J*(x), and
µ is optimal for all x if and only if T_µJ* = TJ*. For this we just need the contraction and monotonicity
assumptions. We do not need any additional structure of H, such as for example a discrete-time dynamic
system, transition probabilities, etc. While identifying the proper structure of H and verifying its contraction
and monotonicity properties may require some analysis that is specific to each type of problem, once this is
done significant results are obtained quickly.
Note that without monotonicity, we may have inf_{µ∈M} J_µ(x) < J*(x) for some x. As an example, let
X = {x_1, x_2}, U = {u_1, u_2}, and let
\[
H(x_1, u, J) =
\begin{cases}
-\alpha J(x_2) & \text{if } u = u_1, \\
-1 + \alpha J(x_1) & \text{if } u = u_2,
\end{cases}
\qquad
H(x_2, u, J) =
\begin{cases}
0 & \text{if } u = u_1, \\
B & \text{if } u = u_2,
\end{cases}
\]
where B is a positive scalar. Then it can be seen that
\[
J^*(x_1) = -\frac{1}{1-\alpha}, \qquad J^*(x_2) = 0,
\]
and J_{µ*} = J*, where µ*(x_1) = u_2 and µ*(x_2) = u_1. On the other hand, for µ(x_1) = u_1 and µ(x_2) = u_2, we
have
\[
J_\mu(x_1) = -\alpha B, \qquad J_\mu(x_2) = B,
\]
so J_µ(x_1) < J*(x_1) for B sufficiently large.
Nonstationary Policies
The connection with DP motivates us to consider the set Π of all sequences π = {µ_0, µ_1, . . .} with µ_k ∈ M
for all k (nonstationary policies in the DP context), and define
\[
J_\pi(x) = \liminf_{k \to \infty} (T_{\mu_0} \cdots T_{\mu_k} J)(x), \qquad x \in X,
\]
with J being any function in B(X), where as earlier, T_{µ_0} · · · T_{µ_k} J denotes the composition of the mappings
T_{µ_0}, . . . , T_{µ_k} applied to J. Note that the choice of J in the definition of J_π does not matter, since for any
two J, J' ∈ B(X), we have from the Contraction Assumption 2.1,
\[
\|T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J - T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J'\| \le \alpha^{k+1} \|J - J'\|,
\]
so the value of J_π(x) is independent of J. Since by Prop. 2.1(b), J_µ(x) = lim_{k→∞} (T_µ^k J)(x) for all µ ∈ M,
J ∈ B(X), and x ∈ X, in the DP context we recognize J_µ as the cost function of the stationary policy
{µ, µ, . . .}.
We now claim that under our Assumptions 2.1 and 2.2, J*, the fixed point of T, is equal to the optimal
value of J_π, i.e.,
\[
J^*(x) = \inf_{\pi \in \Pi} J_\pi(x), \qquad x \in X.
\]
Indeed, since M defines a subset of Π, we have from Prop. 2.2,
\[
J^*(x) = \inf_{\mu \in \mathcal{M}} J_\mu(x) \ge \inf_{\pi \in \Pi} J_\pi(x), \qquad x \in X,
\]
while for every π ∈ Π and x ∈ X, we have
\[
J_\pi(x) = \liminf_{k \to \infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J)(x) \ge \lim_{k \to \infty} (T^{k+1} J)(x) = J^*(x)
\]
[the Monotonicity Assumption 2.2 can be used to show that
\[
T_{\mu_0} T_{\mu_1} \cdots T_{\mu_k} J \ge T^{k+1} J,
\]
and the last equality holds by Prop. 2.1(b)]. Combining the preceding relations, we obtain J*(x) =
inf_{π∈Π} J_π(x).
Thus, in DP terms, we may view J* as an optimal cost function over all nonstationary policies. At the
same time, Prop. 2.2 states that stationary policies are sufficient in the sense that the optimal cost can be
attained to within arbitrary accuracy with a stationary policy [uniformly for all x ∈ X, as Eq. (2.5) shows].
Periodic Policies
Consider the multistep mappings T̄_ν = T_{µ_0} · · · T_{µ_{m−1}}, ν ∈ M̄, defined in Example 2.7, where M̄ is the set
of m-tuples ν = (µ_0, . . . , µ_{m−1}), with µ_k ∈ M, k = 0, 1, . . . , m − 1. Assuming that the mappings T_µ satisfy
Assumptions 2.1 and 2.2, the same is true for the mappings T̄_ν (with the contraction modulus of T̄_ν being
α^m). Thus the unique fixed point of T̄_ν is J_π, where π is the nonstationary but periodic policy
\[
\pi = \{\mu_0, \ldots, \mu_{m-1}, \mu_0, \ldots, \mu_{m-1}, \ldots\}.
\]
Moreover the mappings T_{µ_0} · · · T_{µ_{m−1}}, T_{µ_1} · · · T_{µ_{m−1}} T_{µ_0}, . . . , T_{µ_{m−1}} T_{µ_0} · · · T_{µ_{m−2}}, have unique corresponding
fixed points J_0, J_1, . . . , J_{m−1}, which satisfy
\[
J_0 = T_{\mu_0} J_1, \quad J_1 = T_{\mu_1} J_2, \quad \ldots, \quad J_{m-2} = T_{\mu_{m-2}} J_{m-1}, \quad J_{m-1} = T_{\mu_{m-1}} J_0.
\]
To verify the above equations, multiply the relation J_1 = T_{µ_1} · · · T_{µ_{m−1}} T_{µ_0} J_1 with T_{µ_0} to show that T_{µ_0} J_1 is
the fixed point of T_{µ_0} · · · T_{µ_{m−1}}, i.e., is equal to J_0, etc. Note that even though T̄_ν defines the cost functions
of periodic policies, T̄ has the same fixed point as T, namely J*. This gives rise to the computational
possibility of working with T̄_ν in place of T_µ in an effort to find J*. Moreover, periodic policies obtained
through approximation methods, such as the ones to be discussed in what follows, may hold an advantage
over stationary policies, as first shown by Scherrer [Sch12] in the context of approximate value iteration (see
also the discussion of Section 4).
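The following sketch (hypothetical affine monotone contractions T_µ J = g_µ + α P_µ J, not from the paper) checks numerically that the fixed point of the composition T_{µ_0} T_{µ_1} coincides with the cost of the periodic policy {µ_0, µ_1, µ_0, µ_1, . . .}.

```python
import numpy as np

# Sketch: two affine monotone contractions T_mu J = g_mu + ALPHA * P_mu J on a
# 2-state space (hypothetical data).  The fixed point of the composition T0 T1,
# found by iteration, matches the cost of the periodic policy {mu0, mu1, ...}.
ALPHA = 0.9
P0 = np.array([[0.7, 0.3], [0.4, 0.6]]); g0 = np.array([1.0, 2.0])
P1 = np.array([[0.2, 0.8], [0.5, 0.5]]); g1 = np.array([0.5, 3.0])
T0 = lambda J: g0 + ALPHA * P0 @ J
T1 = lambda J: g1 + ALPHA * P1 @ J

J = np.zeros(2)
for _ in range(500):                  # fixed point of the two-step mapping T0 T1
    J = T0(T1(J))

Jp = np.zeros(2)
for k in range(1000):                 # apply ... T0 T1 T0 T1 to the zero function,
    Jp = (T0 if k % 2 else T1)(Jp)    # with T0 outermost after an even number of steps

print(J, Jp)                          # the two limits agree (up to iteration error)
```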
Error Bounds Under Monotonicity
The assumptions of contraction and monotonicity together can be characterized in a form that is useful for
the derivation of error bounds.
Proposition 2.3: The Contraction and Monotonicity Assumptions 2.1 and 2.2 hold if and only if
for all J, J' ∈ B(X), µ ∈ M, and scalar c ≥ 0, we have
\[
J' \le J + c\, v \quad \Rightarrow \quad T_\mu J' \le T_\mu J + \alpha c\, v, \tag{2.6}
\]
where v is the weight function of the weighted sup-norm ‖ · ‖.
Proof: Let the contraction and monotonicity assumptions hold. If J' ≤ J + c v, we have
\[
H(x, u, J') \le H(x, u, J + c\, v) \le H(x, u, J) + \alpha c\, v(x), \qquad x \in X,\ u \in U(x), \tag{2.7}
\]
where the left-side inequality follows from the monotonicity assumption and the right-side inequality follows
from the contraction assumption, which together with ‖v‖ = 1, implies that
\[
\frac{H(x, u, J + c\, v) - H(x, u, J)}{v(x)} \le \alpha \|J + c\, v - J\| = \alpha c.
\]
The condition (2.7) implies the desired condition (2.6). Conversely, condition (2.6) for c = 0 yields the
monotonicity assumption, while for c = ‖J' − J‖ it yields the contraction assumption. Q.E.D.
We can use Prop. 2.3 to derive some useful variants of parts (d) and (e) of Prop. 2.1 (which assumes only
the contraction assumption). These variants will be used in the derivation of error bounds for computational
methods to be discussed in Sections 4-6.
Proposition 2.4: (Error Bounds Under Contraction and Monotonicity) Let Assumptions
2.1 and 2.2 hold.
(a) For any J ∈ B(X) and c ≥ 0, we have
\[
TJ \le J + c\, v \quad \Rightarrow \quad J^* \le J + \frac{c}{1-\alpha}\, v,
\]
\[
J \le TJ + c\, v \quad \Rightarrow \quad J \le J^* + \frac{c}{1-\alpha}\, v.
\]
(b) For any J ∈ B(X), µ ∈ M, and c ≥ 0, we have
\[
T_\mu J \le J + c\, v \quad \Rightarrow \quad J_\mu \le J + \frac{c}{1-\alpha}\, v,
\]
\[
J \le T_\mu J + c\, v \quad \Rightarrow \quad J \le J_\mu + \frac{c}{1-\alpha}\, v.
\]
(c) For all J ∈ B(X), c ≥ 0, and k = 0, 1, . . ., we have
\[
TJ \le J + c\, v \quad \Rightarrow \quad J^* \le T^k J + \frac{\alpha^k c}{1-\alpha}\, v,
\]
\[
J \le TJ + c\, v \quad \Rightarrow \quad T^k J \le J^* + \frac{\alpha^k c}{1-\alpha}\, v.
\]
Proof: (a) We show the first relation. Applying Eq. (2.6) with J' replaced by TJ, and taking minimum
over u ∈ U(x) for all x ∈ X, we see that if TJ ≤ J + c v, then T^2 J ≤ TJ + αc v. Proceeding similarly, it
follows that
\[
T^\ell J \le T^{\ell-1} J + \alpha^{\ell-1} c\, v.
\]
We now write for every k,
\[
T^k J - J = \sum_{\ell=1}^{k} (T^\ell J - T^{\ell-1} J) \le \sum_{\ell=1}^{k} \alpha^{\ell-1} c\, v,
\]
from which, by taking the limit as k → ∞, we obtain J* ≤ J + (c/(1 − α)) v. The second relation follows
similarly.
(b) This part is the special case of part (a) where T is equal to T_µ.
(c) We show the first relation. From part (a), the inequality TJ ≤ J + c v implies that J* ≤ J + (c/(1 − α)) v.
Applying T^k to both sides of this inequality, and using the fact that T^k is a monotone sup-norm contraction
of modulus α^k, with fixed point J*, we obtain J* ≤ T^k J + (α^k c/(1 − α)) v. The second relation follows
similarly. Q.E.D.
Approximations
As part of our subsequent analysis, we will consider approximations in the implementation of various VI and
PI algorithms. In particular, we will assume that given any J ∈ B(X), we cannot compute exactly TJ, but
instead may compute J̄ ∈ B(X) and µ ∈ M such that
\[
\|\bar{J} - TJ\| \le \delta, \qquad \|T_\mu J - TJ\| \le \epsilon, \tag{2.8}
\]
where δ and ε are nonnegative scalars. These scalars may be unknown, so the resulting analysis will have a
mostly qualitative character.
The case δ > 0 arises when the state space is either infinite or it is finite but very large. Then instead
of calculating (TJ)(x) for all states x, one may do so only for some states and estimate (TJ)(x) for the
remaining states x by some form of interpolation. Alternatively, one may use simulation data [e.g., noisy
values of (TJ)(x) for some or all x] and some kind of least-squares error fit of (TJ)(x) with a function from
a suitable parametric class. The function J̄ thus obtained will satisfy a relation such as (2.8) with δ > 0.
Note that δ may not be small in this context, and the resulting performance degradation may be a primary
concern.
Cases where ε > 0 may arise when the control space is infinite or finite but large, and the minimization
involved in the calculation of (TJ)(x) cannot be done exactly. Note, however, that it is possible that
\[
\delta > 0, \qquad \epsilon = 0,
\]
and in fact this occurs in several types of practical methods. In an alternative scenario, we may first obtain
the policy µ subject to a restriction that it belongs to a certain subset of structured policies, so it satisfies
‖T_µJ − TJ‖ ≤ ε for some ε > 0, and then we may set J̄ = T_µJ. In this case we have ε = δ.
3. LIMITED LOOKAHEAD POLICIES
A frequently used suboptimal approach in DP is to use a policy obtained by solving a finite-horizon problem
with some given terminal cost function J̃ that approximates J*. The simplest possibility is a one-step
lookahead policy µ̄ defined by
\[
\bar{\mu}(x) \in \arg\min_{u \in U(x)} H(x, u, \tilde{J}), \qquad x \in X. \tag{3.1}
\]
In a variant of the method that aims to reduce the computation to obtain µ̄(x), the minimization in Eq. (3.1)
is done over a subset Ū(x) ⊂ U(x). Thus, the control µ̄(x) used in this variant satisfies
\[
\bar{\mu}(x) \in \arg\min_{u \in \bar{U}(x)} H(x, u, \tilde{J}), \qquad x \in X,
\]
rather than Eq. (3.1). This is attractive for example when by using some heuristic or approximate optimization,
we can identify a subset Ū(x) of promising controls, and to save computation, we restrict attention to this
subset in the one-step lookahead minimization.
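A brief sketch of this scheme follows (with a hypothetical 2-state discounted problem); it computes a one-step lookahead policy from a terminal cost approximation, optionally restricting the minimization to a subset Ū(x).

```python
import numpy as np

# Sketch of one-step lookahead (cf. Eq. (3.1)) for a hypothetical 2-state discounted
# problem: given a terminal cost approximation J_tilde, choose at each state a control
# minimizing H(x, u, J_tilde), optionally over a restricted subset U_bar(x).
ALPHA = 0.9
g = {('s0', 'stay'): 1.0, ('s0', 'go'): 2.0, ('s1', 'stay'): 0.0}
nxt = {('s0', 'stay'): 's0', ('s0', 'go'): 's1', ('s1', 'stay'): 's1'}
U = {'s0': ['stay', 'go'], 's1': ['stay']}

def H(x, u, J):
    return g[(x, u)] + ALPHA * J[nxt[(x, u)]]

def one_step_lookahead(J_tilde, U_bar=None):
    U_bar = U_bar or U                      # default: minimize over all of U(x)
    mu_bar, J_hat = {}, {}
    for x, controls in U_bar.items():
        u_best = min(controls, key=lambda u: H(x, u, J_tilde))
        mu_bar[x] = u_best
        J_hat[x] = H(x, u_best, J_tilde)    # the value (3.2) computed along the way
    return mu_bar, J_hat

J_tilde = {'s0': 5.0, 's1': 0.0}            # some approximation of J*
print(one_step_lookahead(J_tilde))
print(one_step_lookahead(J_tilde, {'s0': ['stay'], 's1': ['stay']}))  # restricted U_bar
```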
The following proposition gives some bounds for the performance of such a one-step lookahead policy.
The first bound [part (a) of the following proposition] is given in terms of the vector Ĵ given by
\[
\hat{J}(x) = \inf_{u \in \bar{U}(x)} H(x, u, \tilde{J}), \qquad x \in X, \tag{3.2}
\]
which is computed in the course of finding the one-step lookahead control at state x.
Proposition 3.1: (One-Step Lookahead Error Bounds) Let Assumptions 2.1 and 2.2 hold,
and let µ̄ be a one-step lookahead policy obtained by minimization in Eq. (3.2).
(a) Assume that Ĵ ≤ J̃. Then J_µ̄ ≤ Ĵ.
(b) Assume that Ū(x) = U(x) for all x. Then
\[
\|J_{\bar{\mu}} - \hat{J}\| \le \frac{\alpha}{1-\alpha} \|\hat{J} - \tilde{J}\|, \tag{3.3}
\]
where ‖ · ‖ denotes the sup-norm. Moreover
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{2\alpha}{1-\alpha} \|\tilde{J} - J^*\|, \tag{3.4}
\]
and
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{2}{1-\alpha} \|\hat{J} - \tilde{J}\|. \tag{3.5}
\]
Proof: (a) We have
\[
\tilde{J} \ge \hat{J} = T_{\bar{\mu}} \tilde{J},
\]
from which by using the monotonicity of T_µ̄, we obtain
\[
\tilde{J} \ge \hat{J} \ge T_{\bar{\mu}}^k \tilde{J} \ge T_{\bar{\mu}}^{k+1} \tilde{J}, \qquad k = 1, 2, \ldots
\]
By taking the limit as k → ∞, we have Ĵ ≥ J_µ̄.
(b) The proof of this part may rely on Prop. 2.1(e), but we will give a direct proof. Using the triangle
inequality we write for every k,
\[
\|T_{\bar{\mu}}^k \hat{J} - \hat{J}\| \le \sum_{\ell=1}^{k} \|T_{\bar{\mu}}^\ell \hat{J} - T_{\bar{\mu}}^{\ell-1} \hat{J}\| \le \sum_{\ell=1}^{k} \alpha^{\ell-1} \|T_{\bar{\mu}} \hat{J} - \hat{J}\|.
\]
By taking the limit as k → ∞ and using the fact T_µ̄^k Ĵ → J_µ̄, we obtain
\[
\|J_{\bar{\mu}} - \hat{J}\| \le \frac{1}{1-\alpha} \|T_{\bar{\mu}} \hat{J} - \hat{J}\|. \tag{3.6}
\]
Since Ĵ = T_µ̄ J̃, we have
\[
\|T_{\bar{\mu}} \hat{J} - \hat{J}\| = \|T_{\bar{\mu}} \hat{J} - T_{\bar{\mu}} \tilde{J}\| \le \alpha \|\hat{J} - \tilde{J}\|,
\]
and Eq. (3.3) follows by combining the last two relations.
By repeating the proof of Eq. (3.6) with Ĵ replaced by J*, we obtain
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{1}{1-\alpha} \|T_{\bar{\mu}} J^* - J^*\|.
\]
We have T J̃ = T_µ̄ J̃ and J* = TJ*, so
\[
\|T_{\bar{\mu}} J^* - J^*\| \le \|T_{\bar{\mu}} J^* - T_{\bar{\mu}} \tilde{J}\| + \|T \tilde{J} - TJ^*\| \le \alpha \|J^* - \tilde{J}\| + \alpha \|\tilde{J} - J^*\| = 2\alpha \|\tilde{J} - J^*\|,
\]
and Eq. (3.4) follows by combining the last two relations.
Also, by repeating the proof of Eq. (3.6) with Ĵ replaced by J̃ and T_µ̄ replaced by T, we have, using
also Ĵ = T J̃,
\[
\|J^* - \tilde{J}\| \le \frac{1}{1-\alpha} \|T\tilde{J} - \tilde{J}\| = \frac{1}{1-\alpha} \|\hat{J} - \tilde{J}\|.
\]
We use this relation to write
\[
\|J_{\bar{\mu}} - J^*\| \le \|J_{\bar{\mu}} - \hat{J}\| + \|\hat{J} - \tilde{J}\| + \|\tilde{J} - J^*\| \le \frac{\alpha}{1-\alpha} \|\hat{J} - \tilde{J}\| + \|\hat{J} - \tilde{J}\| + \frac{1}{1-\alpha} \|\hat{J} - \tilde{J}\| = \frac{2}{1-\alpha} \|\hat{J} - \tilde{J}\|,
\]
where the second inequality follows from Eq. (3.3). Q.E.D.
Part (b) of the preceding proposition gives bounds on J_µ̄(x), the performance of the one-step lookahead
policy µ̄; the value of Ĵ(x) is obtained while finding the one-step lookahead control at x. In particular, the
bound (3.4) says that if the one-step lookahead approximation J̃ is within ε of the optimal (in the weighted
sup-norm sense), the performance of the one-step lookahead policy is within 2αε/(1 − α) of the optimal.
Unfortunately, this is not very reassuring when α is close to 1, in which case the error bound is very large
relative to ε. Nonetheless, the following example from [BeT96], Section 6.1.1, shows that this error bound is
tight in the sense that for any α < 1, there is a problem with just two states where the error bound is satisfied
with equality. What is happening is that an O(ε) difference in single stage cost between two controls can
generate an O(ε/(1 − α)) difference in policy costs, yet it can be “nullified” in Bellman’s equation by an O(ε)
difference between J* and J̃.
Example 3.1
Consider a discounted problem with two states, 1 and 2, and deterministic transitions. State 2 is absorbing,
but at state 1 there are two possible decisions: move to state 2 (policy µ*) or stay at state 1 (policy µ). The
cost of each transition is 0 except for the transition from 1 to itself under policy µ, which has cost 2αε, where
ε is a positive scalar and α ∈ [0, 1) is the discount factor. The optimal policy µ* is to move from state 1 to
state 2, and the optimal cost-to-go function is J*(1) = J*(2) = 0. Consider the vector J̃ with J̃(1) = −ε and
J̃(2) = ε, so that
\[
\|\tilde{J} - J^*\| = \epsilon,
\]
as assumed in Eq. (3.4) [cf. Prop. 3.1(b)]. The policy µ that decides to stay at state 1 is a one-step lookahead
policy based on J̃, because
\[
2\alpha\epsilon + \alpha \tilde{J}(1) = \alpha\epsilon = 0 + \alpha \tilde{J}(2).
\]
We have
\[
J_\mu(1) = \frac{2\alpha\epsilon}{1-\alpha} = \frac{2\alpha}{1-\alpha} \|\tilde{J} - J^*\|,
\]
so the bound of Eq. (3.4) holds with equality.
3.1 Multistep Lookahead Policies with Approximations
Let us now consider a more general form of lookahead involving multiple stages with intermediate approx-
imations. In particular, we assume that given any J ∈ B(X), we cannot compute exactly TJ, but instead
may compute J̄ ∈ B(X) and µ ∈ M such that
\[
\|\bar{J} - TJ\| \le \delta, \qquad \|T_\mu J - TJ\| \le \epsilon, \tag{3.7}
\]
where δ and ε are nonnegative scalars.
In a multistep method with approximations, we are given a positive integer m and a lookahead function
J_m, and we successively compute (backwards in time) J_{m−1}, . . . , J_0 and policies µ_{m−1}, . . . , µ_0 satisfying
\[
\|J_k - TJ_{k+1}\| \le \delta, \qquad \|T_{\mu_k} J_{k+1} - TJ_{k+1}\| \le \epsilon, \qquad k = 0, \ldots, m-1. \tag{3.8}
\]
Note that in the context of MDP, J_k can be viewed as an approximation to the optimal cost function of an
(m − k)-stage problem with terminal cost function J_m. We have the following proposition, which is based
on the recent work of Scherrer [Sch12].
Proposition 3.2: (Multistep Lookahead Error Bound) Let Assumption 2.1 hold. The periodic
policy
\[
\pi = \{\mu_0, \ldots, \mu_{m-1}, \mu_0, \ldots, \mu_{m-1}, \ldots\}
\]
generated by the method of Eq. (3.8) satisfies
\[
\|J_\pi - J^*\| \le \frac{2\alpha^m}{1-\alpha^m} \|J_m - J^*\| + \frac{\epsilon}{1-\alpha^m} + \frac{\alpha(\epsilon + 2\delta)(1-\alpha^{m-1})}{(1-\alpha)(1-\alpha^m)}. \tag{3.9}
\]
Proof: Using the triangle inequality, Eq. (3.8), and the contraction property of T, we have for all k
\[
\|J_{m-k} - T^k J_m\| \le \|J_{m-k} - TJ_{m-k+1}\| + \|TJ_{m-k+1} - T^2 J_{m-k+2}\| + \cdots + \|T^{k-1} J_{m-1} - T^k J_m\| \le \delta + \alpha\delta + \cdots + \alpha^{k-1}\delta, \tag{3.10}
\]
showing that
\[
\|J_{m-k} - T^k J_m\| \le \frac{\delta(1-\alpha^k)}{1-\alpha}, \qquad k = 1, \ldots, m. \tag{3.11}
\]
From Eq. (3.8), we have ‖J_k − T_{µ_k} J_{k+1}‖ ≤ δ + ε, so for all k
\[
\begin{aligned}
\|J_{m-k} - T_{\mu_{m-k}} \cdots T_{\mu_{m-1}} J_m\| \le{} & \|J_{m-k} - T_{\mu_{m-k}} J_{m-k+1}\| \\
& + \|T_{\mu_{m-k}} J_{m-k+1} - T_{\mu_{m-k}} T_{\mu_{m-k+1}} J_{m-k+2}\| + \cdots \\
& + \|T_{\mu_{m-k}} \cdots T_{\mu_{m-2}} J_{m-1} - T_{\mu_{m-k}} \cdots T_{\mu_{m-1}} J_m\| \\
\le{} & (\delta + \epsilon) + \alpha(\delta + \epsilon) + \cdots + \alpha^{k-1}(\delta + \epsilon),
\end{aligned}
\]
showing that
\[
\|J_{m-k} - T_{\mu_{m-k}} \cdots T_{\mu_{m-1}} J_m\| \le \frac{(\delta + \epsilon)(1-\alpha^k)}{1-\alpha}, \qquad k = 1, \ldots, m. \tag{3.12}
\]
Using the fact ‖T_{µ_0} J_1 − TJ_1‖ ≤ ε [cf. Eq. (3.8)], we obtain
\[
\begin{aligned}
\|T_{\mu_0} \cdots T_{\mu_{m-1}} J_m - T^m J_m\| &\le \|T_{\mu_0} \cdots T_{\mu_{m-1}} J_m - T_{\mu_0} J_1\| + \|T_{\mu_0} J_1 - TJ_1\| + \|TJ_1 - T^m J_m\| \\
&\le \alpha \|T_{\mu_1} \cdots T_{\mu_{m-1}} J_m - J_1\| + \epsilon + \alpha \|J_1 - T^{m-1} J_m\| \\
&\le \epsilon + \frac{\alpha(\epsilon + 2\delta)(1-\alpha^{m-1})}{1-\alpha},
\end{aligned}
\]
where the last inequality follows from Eqs. (3.11) and (3.12) for k = m − 1.
From this relation and the fact that T_{µ_0} · · · T_{µ_{m−1}} and T^m are contractions with modulus α^m, we obtain
\[
\begin{aligned}
\|T_{\mu_0} \cdots T_{\mu_{m-1}} J^* - J^*\| &\le \|T_{\mu_0} \cdots T_{\mu_{m-1}} J^* - T_{\mu_0} \cdots T_{\mu_{m-1}} J_m\| + \|T_{\mu_0} \cdots T_{\mu_{m-1}} J_m - T^m J_m\| + \|T^m J_m - J^*\| \\
&\le 2\alpha^m \|J^* - J_m\| + \epsilon + \frac{\alpha(\epsilon + 2\delta)(1-\alpha^{m-1})}{1-\alpha}.
\end{aligned}
\]
We also have, using Prop. 2.1(e), applied in the context of the multistep mapping of Example 2.7,
\[
\|J_\pi - J^*\| \le \frac{1}{1-\alpha^m} \|T_{\mu_0} \cdots T_{\mu_{m-1}} J^* - J^*\|.
\]
Combining the last two relations, we obtain the desired result. Q.E.D.
Note that for m = 1 and δ = ε = 0, i.e., the case of a one-step lookahead policy µ̄ with lookahead function
J_1 and no approximation error in the minimization involved in TJ_1, Eq. (3.9) yields the bound
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{2\alpha}{1-\alpha} \|J_1 - J^*\|,
\]
which coincides with the bound (3.4) derived earlier.
Also, in the special case where ε = δ and J_k = T_{µ_k} J_{k+1} (cf. the discussion preceding Prop. 3.2), the
bound (3.9) can be strengthened somewhat. In particular, we have for all k, J_{m−k} = T_{µ_{m−k}} · · · T_{µ_{m−1}} J_m, so
the right-hand side of Eq. (3.12) becomes 0 and the preceding proof yields, with some calculation,
\[
\|J_\pi - J^*\| \le \frac{2\alpha^m}{1-\alpha^m} \|J_m - J^*\| + \frac{\delta}{1-\alpha^m} + \frac{\alpha\delta(1-\alpha^{m-1})}{(1-\alpha)(1-\alpha^m)} = \frac{2\alpha^m}{1-\alpha^m} \|J_m - J^*\| + \frac{\delta}{1-\alpha}.
\]
We finally note that Prop. 3.2 shows that as m → ∞, the corresponding bound for ‖J_π − J*‖ tends to
ε + α(ε + 2δ)/(1 − α), or
\[
\limsup_{m \to \infty} \|J_\pi - J^*\| \le \frac{\epsilon + 2\alpha\delta}{1-\alpha}.
\]
We will see that this error bound is superior to corresponding error bounds for approximate versions of VI
and PI by essentially a factor 1/(1 − α). This is an interesting fact, which was first shown by Scherrer [Sch12]
in the context of discounted MDP.
4. GENERALIZED VALUE ITERATION
Generalized value iteration (VI) is the algorithm that starts from some J ∈ B(X), and generates TJ, T^2J, . . ..
Since T is a weighted sup-norm contraction under Assumption 2.1, the algorithm converges to J*, and the
rate of convergence is governed by
\[
\|T^k J - J^*\| \le \alpha^k \|J - J^*\|, \qquad k = 0, 1, \ldots.
\]
Similarly, for a given policy µ ∈ M, we have
\[
\|T_\mu^k J - J_\mu\| \le \alpha^k \|J - J_\mu\|, \qquad k = 0, 1, \ldots.
\]
From Prop. 2.1(d), we also have the error bound
\[
\|T^{k+1} J - J^*\| \le \frac{\alpha}{1-\alpha} \|T^{k+1} J - T^k J\|, \qquad k = 0, 1, \ldots.
\]
This bound does not rely on the Monotonicity Assumption 2.2.
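For illustration, the following sketch runs generalized VI on a hypothetical discounted MDP and evaluates the a posteriori bound above at each iteration (here with weight v ≡ 1, so the weighted sup-norm is the ordinary sup-norm).

```python
import numpy as np

# Sketch of generalized VI together with the bound
# ||T^{k+1}J - J*|| <= alpha/(1-alpha) * ||T^{k+1}J - T^k J||  (weight v = 1).
ALPHA = 0.9
g = np.array([[1.0, 4.0], [2.0, 0.5]])               # g[x, u]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],              # P[x, u, y]
              [[0.3, 0.7], [0.6, 0.4]]])

def T(J):
    return np.min(g + ALPHA * P @ J, axis=1)

J = np.zeros(2)
for k in range(100):
    TJ = T(J)
    bound = ALPHA / (1 - ALPHA) * np.max(np.abs(TJ - J))
    J = TJ
print(J, bound)                                      # iterate and bound on ||J - J*||
```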
Suppose now that we use generalized VI to compute an approximation J̃ to J*, and then we obtain a
policy µ̄ by minimization of H(x, u, J̃) over u ∈ U(x) for each x ∈ X. In other words J̃ and µ̄ satisfy
\[
\|\tilde{J} - J^*\| \le \gamma, \qquad T_{\bar{\mu}} \tilde{J} = T\tilde{J},
\]
where γ is some positive scalar. Then, with a proof essentially identical to that of the bound (3.4) of Prop. 3.1(b), we have
\[
J_{\bar{\mu}} \le J^* + \frac{2\alpha\gamma}{1-\alpha}\, v, \tag{4.1}
\]
which can be viewed as an error bound for the performance of a policy obtained by generic one-step lookahead.
We use this bound in the following proposition, which shows that if the set of policies is finite, then a
policy µ* with J_{µ*} = J* may be obtained after a finite number of VI.
Proposition 4.1: Let Assumption 2.1 hold and let J ∈ B(X). If the set of policies M is finite,
there exists an integer k̄ ≥ 0 such that J_{µ*} = J* for all µ* and k ≥ k̄ with T_{µ*} T^k J = T^{k+1} J.
Proof: Let M̄ be the set of nonoptimal policies, i.e., all µ such that J_µ ≠ J*. Since M̄ is finite, we have
\[
\min_{\mu \in \bar{\mathcal{M}}} \|J_\mu - J^*\| > 0,
\]
so by Eq. (4.1), there exists sufficiently small β > 0 such that
\[
\|\tilde{J} - J^*\| \le \beta \ \text{ and } \ T_\mu \tilde{J} = T\tilde{J} \quad \Rightarrow \quad \|J_\mu - J^*\| = 0 \quad \Rightarrow \quad \mu \notin \bar{\mathcal{M}}. \tag{4.2}
\]
It follows that if k is sufficiently large so that ‖T^kJ − J*‖ ≤ β, then T_{µ*} T^k J = T^{k+1} J implies that µ* ∉ M̄,
so J_{µ*} = J*. Q.E.D.
4.1 Approximate Value Iteration
Let us consider situations where the VI method may be implementable only through approximations. In
particular, given a function J, assume that we may only be able to calculate an approximation J̃ to TJ such
that
\[
\|\tilde{J} - TJ\| \le \delta, \tag{4.3}
\]
where δ is a given positive scalar. In the corresponding approximate VI method, we start from an arbitrary
bounded function J_0, and we generate a sequence {J_k} satisfying
\[
\|J_{k+1} - TJ_k\| \le \delta, \qquad k = 0, 1, \ldots. \tag{4.4}
\]
This approximation may be the result of representing J_{k+1} compactly, as a linear combination of basis
functions, through a projection or aggregation process, as is common in approximate DP.
We may also simultaneously generate a sequence of policies {µ_k} such that
\[
\|T_{\mu_k} J_k - TJ_k\| \le \epsilon, \qquad k = 0, 1, \ldots, \tag{4.5}
\]
where ε is some scalar [which could be equal to 0, as in the case of Eq. (3.8), considered earlier]. The following
proposition shows that the corresponding cost vectors J_{µ_k} “converge” to J* to within an error of order
O(δ/(1 − α)^2) [plus a less significant error of order O(ε/(1 − α))].
Proposition 4.2: (Error Bounds for Approximate VI) Let Assumption 2.1 hold. A sequence
{J_k} generated by the approximate VI method (4.4)-(4.5) satisfies
\[
\limsup_{k \to \infty} \|J_k - J^*\| \le \frac{\delta}{1-\alpha}, \tag{4.6}
\]
while the corresponding sequence of policies {µ_k} satisfies
\[
\limsup_{k \to \infty} \|J_{\mu_k} - J^*\| \le \frac{\epsilon}{1-\alpha} + \frac{2\alpha\delta}{(1-\alpha)^2}. \tag{4.7}
\]
Proof: Arguing as in the proof of Prop. 3.2, we have
\[
\|J_k - T^k J_0\| \le \frac{\delta(1-\alpha^k)}{1-\alpha}, \qquad k = 0, 1, \ldots
\]
[cf. Eq. (3.10)]. By taking the limit as k → ∞ and by using the fact lim_{k→∞} T^k J_0 = J*, we obtain Eq. (4.6).
We also have, using the triangle inequality and the contraction property of T_{µ_k} and T,
\[
\|T_{\mu_k} J^* - J^*\| \le \|T_{\mu_k} J^* - T_{\mu_k} J_k\| + \|T_{\mu_k} J_k - TJ_k\| + \|TJ_k - J^*\| \le \alpha \|J^* - J_k\| + \epsilon + \alpha \|J_k - J^*\|,
\]
while by using also Prop. 2.1(e), we obtain
\[
\|J_{\mu_k} - J^*\| \le \frac{1}{1-\alpha} \|T_{\mu_k} J^* - J^*\| \le \frac{\epsilon}{1-\alpha} + \frac{2\alpha}{1-\alpha} \|J_k - J^*\|.
\]
By combining this relation with Eq. (4.6), we obtain Eq. (4.7). Q.E.D.
The error bound (4.7) relates to stationary policies obtained from the functions J_k by one-step looka-
head. We may also obtain an m-step periodic policy π from J_k by using m-step lookahead. Then Prop. 3.2
shows that the corresponding bound for ‖J_π − J*‖ tends to (ε + 2αδ)/(1 − α) as m → ∞, which improves on
the error bound (4.7) by a factor 1/(1 − α). This is a remarkable and surprising fact, which was first shown
by Scherrer [Sch12] in the context of discounted MDP.
Finally, let us note that the error bound of Prop. 4.2 is predicated upon generating a sequence {J_k}
satisfying ‖J_{k+1} − TJ_k‖ ≤ δ for all k [cf. Eq. (4.4)]. Unfortunately, some practical approximation schemes
guarantee the existence of such a δ only if {J_k} is a bounded sequence. The following simple example from
[BeT96], Section 6.5.3, shows that boundedness of the iterates is not automatically guaranteed, and is a
serious issue that should be addressed in approximate VI schemes.
Example 4.1 (Error Amplification in Approximate Value Iteration)
Consider a two-state discounted MDP with states 1 and 2, and a single policy. The transitions are deterministic:
from state 1 to state 2, and from state 2 to state 2. These transitions are also cost-free. Thus we have
J*(1) = J*(2) = 0.
We consider a VI scheme that approximates cost functions within the one-dimensional subspace of linear
functions S = {(r, 2r) | r ∈ ℜ} by using a weighted least squares minimization; i.e., we approximate a vector J
by its weighted Euclidean projection onto S. In particular, given J_k = (r_k, 2r_k), we find J_{k+1} = (r_{k+1}, 2r_{k+1}),
where for weights w_1, w_2 > 0, r_{k+1} is obtained as
\[
r_{k+1} = \arg\min_{r} \Bigl[ w_1 \bigl( r - (TJ_k)(1) \bigr)^2 + w_2 \bigl( 2r - (TJ_k)(2) \bigr)^2 \Bigr].
\]
Since for a zero cost per stage and the given deterministic transitions we have TJ_k = (2αr_k, 2αr_k), the preceding
minimization is written as
\[
r_{k+1} = \arg\min_{r} \Bigl[ w_1 (r - 2\alpha r_k)^2 + w_2 (2r - 2\alpha r_k)^2 \Bigr],
\]
which by writing the corresponding optimality condition yields r_{k+1} = αβ r_k, where β = 2(w_1 + 2w_2)/(w_1 + 4w_2) >
1. Thus if αβ > 1, the sequence {r_k} diverges and so does {J_k}. Note that in this example the optimal cost
function J* = (0, 0) belongs to the subspace S. The difficulty here is that the approximate VI mapping that
generates J_{k+1} by a least squares-based approximation of TJ_k is not a contraction. At the same time there is
no δ such that ‖J_{k+1} − TJ_k‖ ≤ δ for all k, because of error amplification in each approximate VI.
5. GENERALIZED POLICY ITERATION
In generalized policy iteration (PI), we maintain and update a policy µ_k, starting from some initial policy
µ_0. The (k + 1)st iteration has the following form.
Generalized Policy Iteration
Policy Evaluation: We compute J_{µ_k} as the unique solution of the equation J_{µ_k} = T_{µ_k} J_{µ_k}.
Policy Improvement: We obtain an improved policy µ_{k+1} that satisfies T_{µ_{k+1}} J_{µ_k} = T J_{µ_k}.
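The following sketch (hypothetical finite discounted MDP) implements the two steps above, with policy evaluation done exactly by solving the linear fixed-point equation.

```python
import numpy as np

# Sketch of generalized PI for a hypothetical finite discounted MDP: policy evaluation
# solves J_mu = T_mu J_mu exactly (a linear system), and policy improvement takes the
# minimizing control of H(x, u, J_mu).
ALPHA = 0.9
g = np.array([[1.0, 4.0], [2.0, 0.5]])                 # g[x, u]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],                # P[x, u, y]
              [[0.3, 0.7], [0.6, 0.4]]])
n = g.shape[0]

def evaluate(mu):
    Pmu = P[np.arange(n), mu]                          # row x: P[x, mu[x], :]
    gmu = g[np.arange(n), mu]
    return np.linalg.solve(np.eye(n) - ALPHA * Pmu, gmu)

def improve(J):
    return np.argmin(g + ALPHA * P @ J, axis=1)

mu = np.zeros(n, dtype=int)
while True:
    J_mu = evaluate(mu)                                # policy evaluation
    mu_next = improve(J_mu)                            # policy improvement
    if np.array_equal(mu_next, mu):
        break
    mu = mu_next
print(mu, J_mu)                                        # optimal policy and J* (finite M)
```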
The algorithm requires the Monotonicity Assumption 2.2, in addition to the Contraction Assumption
2.1, so we assume these two conditions throughout this section. Moreover we assume that the minimum
of H(x, u, J_{µ_k}) over u ∈ U(x) is attained for all x ∈ X, so that the improved policy µ_{k+1} is defined. The
following proposition establishes a basic cost improvement property, as well as finite convergence for the case
where the set of policies is finite.
Proposition 5.1: (Convergence of Generalized PI) Let Assumptions 2.1 and 2.2 hold, and let
{µ_k} be a sequence generated by the generalized PI algorithm. Then for all k, we have J_{µ_{k+1}} ≤ J_{µ_k},
with equality if and only if J_{µ_k} = J*. Moreover,
\[
\lim_{k \to \infty} \|J_{\mu_k} - J^*\| = 0,
\]
and if the set of policies is finite, we have J_{µ_k} = J* for some k.
Proof: We have
\[
T_{\mu_{k+1}} J_{\mu_k} = T J_{\mu_k} \le T_{\mu_k} J_{\mu_k} = J_{\mu_k}.
\]
Applying T_{µ_{k+1}} to this inequality while using the Monotonicity Assumption 2.2, we obtain
\[
T_{\mu_{k+1}}^2 J_{\mu_k} \le T_{\mu_{k+1}} J_{\mu_k} = T J_{\mu_k} \le T_{\mu_k} J_{\mu_k} = J_{\mu_k}.
\]
Similarly, we have for all m > 0,
\[
T_{\mu_{k+1}}^m J_{\mu_k} \le T J_{\mu_k} \le J_{\mu_k},
\]
and by taking the limit as m → ∞, we obtain
\[
J_{\mu_{k+1}} \le T J_{\mu_k} \le J_{\mu_k}, \qquad k = 0, 1, \ldots. \tag{5.1}
\]
If J_{µ_{k+1}} = J_{µ_k}, it follows that TJ_{µ_k} = J_{µ_k}, so J_{µ_k} is a fixed point of T and must be equal to J*. Moreover
by using induction, Eq. (5.1) implies that
\[
J_{\mu_k} \le T^k J_{\mu_0}, \qquad k = 0, 1, \ldots.
\]
Since
\[
J^* \le J_{\mu_k}, \qquad \lim_{k \to \infty} \|T^k J_{\mu_0} - J^*\| = 0,
\]
it follows that lim_{k→∞} ‖J_{µ_k} − J*‖ = 0. Finally, if the number of policies is finite, Eq. (5.1) implies that
there can be only a finite number of iterations for which J_{µ_{k+1}}(x) < J_{µ_k}(x) for some x, so we must have
J_{µ_{k+1}} = J_{µ_k} for some k, at which time J_{µ_k} = J* as shown earlier. Q.E.D.
In the case where the set of policies is infinite, we may assert the convergence of the sequence of
generated policies under some compactness and continuity conditions. In particular, we will assume that
the state space is finite, X = {1, . . . , n}, and that each control constraint set U(x) is a compact subset of
ℜ^m. We will view a cost vector J as an element of ℜ^n, and a policy µ as an element of the compact set
U(1) × · · · × U(n) ⊂ ℜ^{mn}. Then {µ_k} has at least one limit point µ̄, which must be an admissible policy.
The following proposition guarantees, under an additional continuity assumption for H(x, ·, ·), that every
limit point µ̄ is optimal.
Assumption 5.1: (Compactness and Continuity)
(a) The state space is finite, X = {1, . . . , n}.
(b) Each control constraint set U(x), x = 1, . . . , n, is a compact subset of ℜ^m.
(c) Each function H(x, ·, ·), x = 1, . . . , n, is continuous over U(x) × ℜ^n.
Proposition 5.2: Let Assumptions 2.1, 2.2, and 5.1 hold, and let {µ_k} be a sequence generated
by the generalized PI algorithm. Then for every limit point µ̄ of {µ_k}, we have J_µ̄ = J*.
Proof: We have J_{µ_k} → J* by Prop. 5.1. Let µ̄ be the limit of a subsequence {µ_k}_{k∈K}. We will show that
T_µ̄ J* = TJ*, from which it follows that J_µ̄ = J* [cf. Prop. 2.1(c)]. Indeed, we have T_µ̄ J* ≥ TJ*, so we focus
on showing the reverse inequality. From the equation T_{µ_k} J_{µ_{k−1}} = TJ_{µ_{k−1}} we have
\[
H\bigl(x, \mu_k(x), J_{\mu_{k-1}}\bigr) \le H(x, u, J_{\mu_{k-1}}), \qquad x = 1, \ldots, n,\ u \in U(x).
\]
By taking the limit in this relation as k → ∞, k ∈ K, and by using the continuity of H(x, ·, ·) [cf. Assumption
5.1(c)], we obtain
\[
H\bigl(x, \bar{\mu}(x), J^*\bigr) \le H(x, u, J^*), \qquad x = 1, \ldots, n,\ u \in U(x).
\]
By taking the minimum of the right-hand side over u ∈ U(x), we obtain T_µ̄ J* ≤ TJ*. Q.E.D.
5.1 Approximate Policy Iteration
We now consider the PI method where the policy evaluation step and/or the policy improvement step of the
method are implemented through approximations. This method generates a sequence of policies {µ_k} and a
corresponding sequence of approximate cost functions {J_k} satisfying
\[
\|J_k - J_{\mu_k}\| \le \delta, \qquad \|T_{\mu_{k+1}} J_k - TJ_k\| \le \epsilon, \qquad k = 0, 1, \ldots, \tag{5.2}
\]
where ‖ · ‖ denotes the sup-norm and v is the weight function of the weighted sup-norm (it is important to use
v rather than the unit function in the above equation, in order for the bounds obtained to have a clean form).
The following proposition provides an error bound for this algorithm, which extends a corresponding result
of [BeT96], shown for discounted MDP.
Proposition 5.3: (Error Bound for Approximate PI) Let Assumptions 2.1 and 2.2 hold. The
sequence {µ_k} generated by the approximate PI algorithm (5.2) satisfies
\[
\limsup_{k \to \infty} \|J_{\mu_k} - J^*\| \le \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}. \tag{5.3}
\]
The essence of the proof is contained in the following proposition, which quantifies the amount of
approximate policy improvement at each iteration.
Proposition 5.4: Let Assumptions 2.1 and 2.2 hold. Let J, µ̄, and µ satisfy
\[
\|J - J_\mu\| \le \delta, \qquad \|T_{\bar{\mu}} J - TJ\| \le \epsilon,
\]
where δ and ε are some scalars. Then
\[
\|J_{\bar{\mu}} - J^*\| \le \alpha \|J_\mu - J^*\| + \frac{\epsilon + 2\alpha\delta}{1-\alpha}. \tag{5.4}
\]
Proof: Using the hypothesis and the contraction property of T and T_µ̄, which implies that ‖T_µ̄ J_µ − T_µ̄ J‖ ≤ αδ
and ‖TJ − TJ_µ‖ ≤ αδ, and hence T_µ̄ J_µ ≤ T_µ̄ J + αδ v and TJ ≤ TJ_µ + αδ v, we have
\[
T_{\bar{\mu}} J_\mu \le T_{\bar{\mu}} J + \alpha\delta\, v \le TJ + (\epsilon + \alpha\delta)\, v \le TJ_\mu + (\epsilon + 2\alpha\delta)\, v. \tag{5.5}
\]
Since TJ_µ ≤ T_µ J_µ = J_µ, this relation yields
\[
T_{\bar{\mu}} J_\mu \le J_\mu + (\epsilon + 2\alpha\delta)\, v,
\]
and applying Prop. 2.4(b) with µ = µ̄, J = J_µ, and c = ε + 2αδ, we obtain
\[
J_{\bar{\mu}} \le J_\mu + \frac{\epsilon + 2\alpha\delta}{1-\alpha}\, v. \tag{5.6}
\]
Using this relation, we have
\[
J_{\bar{\mu}} = T_{\bar{\mu}} J_{\bar{\mu}} = T_{\bar{\mu}} J_\mu + (T_{\bar{\mu}} J_{\bar{\mu}} - T_{\bar{\mu}} J_\mu) \le T_{\bar{\mu}} J_\mu + \frac{\alpha(\epsilon + 2\alpha\delta)}{1-\alpha}\, v,
\]
where the inequality follows by using Prop. 2.3 and Eq. (5.6). Subtracting J* from both sides, we have
\[
J_{\bar{\mu}} - J^* \le T_{\bar{\mu}} J_\mu - J^* + \frac{\alpha(\epsilon + 2\alpha\delta)}{1-\alpha}\, v. \tag{5.7}
\]
Also by subtracting J* from both sides of Eq. (5.5), and using the contraction property
\[
TJ_\mu - J^* = TJ_\mu - TJ^* \le \alpha \|J_\mu - J^*\|\, v,
\]
we obtain
\[
T_{\bar{\mu}} J_\mu - J^* \le TJ_\mu - J^* + (\epsilon + 2\alpha\delta)\, v \le \alpha \|J_\mu - J^*\|\, v + (\epsilon + 2\alpha\delta)\, v.
\]
Combining this relation with Eq. (5.7), we obtain
\[
J_{\bar{\mu}} - J^* \le \alpha \|J_\mu - J^*\|\, v + \frac{\alpha(\epsilon + 2\alpha\delta)}{1-\alpha}\, v + (\epsilon + 2\alpha\delta)\, v = \alpha \|J_\mu - J^*\|\, v + \frac{\epsilon + 2\alpha\delta}{1-\alpha}\, v,
\]
which is equivalent to the desired relation (5.4). Q.E.D.
Proof of Prop. 5.3: Applying Prop. 5.4, we have
\[
\|J_{\mu_{k+1}} - J^*\| \le \alpha \|J_{\mu_k} - J^*\| + \frac{\epsilon + 2\alpha\delta}{1-\alpha},
\]
which by taking the lim sup of both sides as k → ∞ yields the desired result. Q.E.D.
We note that the error bound of Prop. 5.3 is tight, as can be shown with an example from [BeT96],
Section 6.2.3. The error bound is comparable to the one for approximate VI, derived earlier in Prop. 4.2. In
particular, the error ‖J_{µ_k} − J*‖ is asymptotically proportional to 1/(1 − α)^2 and to the approximation error
in policy evaluation or value iteration, respectively. This is noteworthy, as it indicates that contrary to the
case of exact implementation, approximate PI need not hold a convergence rate advantage over approximate
VI, despite its greater overhead per iteration.
On the other hand, approximate PI does not have as much difficulty with the kind of iteration instability
that was illustrated by Example 4.1 for approximate VI. In particular, if the set of policies is finite, so that
the sequence {J_{µ_k}} is guaranteed to be bounded, the assumption of Eq. (5.2) is not hard to satisfy in practice
with cost function approximation methods.
Note that when δ = ε = 0, Eq. (5.4) yields
\[
\|J_{\mu_{k+1}} - J^*\| \le \alpha \|J_{\mu_k} - J^*\|.
\]
Thus in the case of an infinite state space and/or control space, exact PI converges at a geometric rate
under the contraction and monotonicity assumptions of this section. This rate is the same as the rate of
convergence of exact VI.
The Case Where Policies Converge
Generally, the policy sequence {µ_k} generated by approximate PI may oscillate between several policies.
However, under some circumstances this sequence may be guaranteed to converge to some µ̄, in the sense
that
\[
\mu_{\bar{k}+1} = \mu_{\bar{k}} = \bar{\mu} \quad \text{for some } \bar{k}. \tag{5.8}
\]
An example arises when the policy sequence {µ_k} is generated by exact PI applied with a different mapping
H̃ in place of H, but the bounds of Eq. (5.2) are satisfied. The mapping H̃ may for example correspond to
an approximation of the original problem, as in aggregation methods. In this case we can show the following
bound, which is much more favorable than the one of Prop. 5.3.
Proposition 5.5: (Error Bound for Approximate PI when Policies Converge) Let As-
sumptions 2.1 and 2.2 hold, and let µ̄ be a policy generated by the approximate PI algorithm (5.2)
and satisfying condition (5.8). Then we have
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{\epsilon + 2\alpha\delta}{1-\alpha}. \tag{5.9}
\]
Proof: Let J̄ be the cost vector obtained by approximate policy evaluation of µ̄ [i.e., J̄ = J_k̄, where k̄
satisfies the condition (5.8)]. Then we have
\[
\|\bar{J} - J_{\bar{\mu}}\| \le \delta, \qquad \|T_{\bar{\mu}} \bar{J} - T\bar{J}\| \le \epsilon, \tag{5.10}
\]
where the latter inequality holds since we have
\[
\|T_{\bar{\mu}} \bar{J} - T\bar{J}\| = \|T_{\mu_{\bar{k}+1}} J_{\bar{k}} - TJ_{\bar{k}}\| \le \epsilon,
\]
cf. Eq. (5.2). Using Eq. (5.10) and the fact J_µ̄ = T_µ̄ J_µ̄, we have
\[
\begin{aligned}
\|TJ_{\bar{\mu}} - J_{\bar{\mu}}\| &\le \|TJ_{\bar{\mu}} - T\bar{J}\| + \|T\bar{J} - T_{\bar{\mu}} \bar{J}\| + \|T_{\bar{\mu}} \bar{J} - J_{\bar{\mu}}\| \\
&= \|TJ_{\bar{\mu}} - T\bar{J}\| + \|T\bar{J} - T_{\bar{\mu}} \bar{J}\| + \|T_{\bar{\mu}} \bar{J} - T_{\bar{\mu}} J_{\bar{\mu}}\| \\
&\le \alpha \|J_{\bar{\mu}} - \bar{J}\| + \epsilon + \alpha \|\bar{J} - J_{\bar{\mu}}\| \\
&\le \epsilon + 2\alpha\delta.
\end{aligned} \tag{5.11}
\]
Using Prop. 2.1(d) with J = J_µ̄, we obtain the error bound (5.9). Q.E.D.
The preceding error bound can be generalized to the case where two successive policies generated by
the approximate PI algorithm are “not too different” rather than being identical. In particular, suppose that
µ and µ̄ are successive policies, which in addition to
\[
\|\bar{J} - J_\mu\| \le \delta, \qquad \|T_{\bar{\mu}} \bar{J} - T\bar{J}\| \le \epsilon,
\]
[cf. Eq. (5.2)], also satisfy
\[
\|T_\mu \bar{J} - T_{\bar{\mu}} \bar{J}\| \le \zeta,
\]
where ζ is some scalar (instead of µ = µ̄, which is the case where policies converge exactly). Then we also
have
\[
\|T\bar{J} - T_{\bar{\mu}} \bar{J}\| \le \|T\bar{J} - T_\mu \bar{J}\| + \|T_\mu \bar{J} - T_{\bar{\mu}} \bar{J}\| \le \epsilon + \zeta,
\]
and by replacing ε with ε + ζ in Eq. (5.11), we obtain
\[
\|J_{\bar{\mu}} - J^*\| \le \frac{\epsilon + \zeta + 2\alpha\delta}{1-\alpha}.
\]
When ζ is small enough to be of the order of max{δ, ε}, this error bound is comparable to the one for the
case where policies converge.
6. OPTIMISTIC POLICY ITERATION
In optimistic PI (also called “modified” PI, see e.g., [Put94]) each policy µ_k is evaluated by solving the
equation J_{µ_k} = T_{µ_k} J_{µ_k} approximately, using a finite number of VI. Thus, starting with a function J_0 ∈ B(X),
we generate sequences {J_k} and {µ_k} with the algorithm
\[
T_{\mu_k} J_k = TJ_k, \qquad J_{k+1} = T_{\mu_k}^{m_k} J_k, \qquad k = 0, 1, \ldots, \tag{6.1}
\]
where {m_k} is a sequence of positive integers.
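The following sketch implements iteration (6.1) for a hypothetical finite discounted MDP, with m_k held constant for simplicity.

```python
import numpy as np

# Sketch of optimistic PI (cf. Eq. (6.1)) for a hypothetical finite discounted MDP:
# each policy is obtained greedily from J_k, then evaluated approximately by m_k
# applications of T_mu instead of solving J_mu = T_mu J_mu exactly.
ALPHA = 0.9
g = np.array([[1.0, 4.0], [2.0, 0.5]])                 # g[x, u]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],                # P[x, u, y]
              [[0.3, 0.7], [0.6, 0.4]]])
n = g.shape[0]
m_k = 5                                                # number of VI per policy

J = np.zeros(n)
for k in range(100):
    mu = np.argmin(g + ALPHA * P @ J, axis=1)          # T_mu J = T J
    Pmu, gmu = P[np.arange(n), mu], g[np.arange(n), mu]
    for _ in range(m_k):                               # J_{k+1} = T_mu^{m_k} J_k
        J = gmu + ALPHA * Pmu @ J
print(mu, J)                                           # J converges to J* (Prop. 6.1)
```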
A more general form of optimistic policy iteration, considered by Thiery and Scherrer [ThS10b], is
\[
T_{\mu_k} J_k = TJ_k, \qquad J_{k+1} = \sum_{\ell=1}^{\infty} \lambda_\ell\, T_{\mu_k}^\ell J_k, \qquad k = 0, 1, \ldots, \tag{6.2}
\]
where {λ_ℓ} is a sequence of nonnegative scalars such that
\[
\sum_{\ell=1}^{\infty} \lambda_\ell = 1.
\]
An example is the λ-policy iteration method (Bertsekas and Ioffe [BeI96], Thiery and Scherrer [ThS10a],
Bertsekas [Ber11], Scherrer [Sch11]), where λ_ℓ = (1 − λ)λ^{ℓ−1}, with λ being a scalar in (0, 1). For simplicity,
we will not discuss the more general type of algorithm (6.2) in this paper, but some of our results admit
straightforward extensions to this case, particularly the analysis of Section 6.2, and the SSP analysis of
Section 7.
6.1 Convergence of Optimistic Policy Iteration
The following two propositions provide the convergence properties of the algorithm (6.1). These propositions
have been proved by Rothblum [Rot79] within the framework of Denardo’s model [Den67], i.e., the case of an
unweighted sup-norm where v is the unit function; see also Canbolat and Rothblum [CaR11], which considers
optimistic PI methods where the minimization in the policy improvement (but not the policy evaluation)
operation is approximate, within some ε > 0. Our proof follows closely the one of Rothblum [Rot79].
Proposition 6.1: (Convergence of Optimistic Generalized PI) Let Assumptions 2.1 and 2.2
hold, and let {(J_k, µ_k)} be a sequence generated by the optimistic generalized PI algorithm (6.1).
Then
\[
\lim_{k \to \infty} \|J_k - J^*\| = 0,
\]
and if the number of policies is finite, we have J_{µ_k} = J* for all k greater than some index k̄.
Proposition 6.2: Let Assumptions 2.1, 2.2, and 5.1 hold, and let {(J_k, µ_k)} be a sequence gen-
erated by the optimistic generalized PI algorithm (6.1). Then every limit point µ̄ of {µ_k} satisfies
J_µ̄ = J*.
We develop the proofs of the propositions through four lemmas. The first lemma collects some generic
properties of monotone weighted sup-norm contractions, variants of which we have noted earlier, and which we
restate here for convenience.
Lemma 6.1: Let W : B(X) ↦ B(X) be a mapping that satisfies the monotonicity assumption
\[
J \le J' \quad \Rightarrow \quad WJ \le WJ', \qquad J, J' \in B(X),
\]
and the contraction assumption
\[
\|WJ - WJ'\| \le \alpha \|J - J'\|, \qquad J, J' \in B(X),
\]
for some α ∈ (0, 1).
(a) For all J, J' ∈ B(X) and scalar c ≥ 0, we have
\[
J \ge J' - c\, v \quad \Rightarrow \quad WJ \ge WJ' - \alpha c\, v. \tag{6.3}
\]
(b) For all J ∈ B(X), c ≥ 0, and k = 0, 1, . . ., we have
\[
J \ge WJ - c\, v \quad \Rightarrow \quad W^k J \ge J^* - \frac{\alpha^k}{1-\alpha}\, c\, v, \tag{6.4}
\]
\[
WJ \ge J - c\, v \quad \Rightarrow \quad J^* \ge W^k J - \frac{\alpha^k}{1-\alpha}\, c\, v, \tag{6.5}
\]
where J* is the fixed point of W.
Proof: Part (a) essentially follows from Prop. 2.3, while part (b) essentially follows from Prop. 2.4(c).
Q.E.D.
Lemma 6.2: Let Assumptions 2.1 and 2.2 hold, and let J ∈ B(X) and c ≥ 0 satisfy
\[
J \ge TJ - c\, v,
\]
and let µ ∈ M be such that T_µ J = TJ. Then for all k > 0, we have
\[
TJ \ge T_\mu^k J - \frac{\alpha}{1-\alpha}\, c\, v, \tag{6.6}
\]
and
\[
T_\mu^k J \ge T(T_\mu^k J) - \alpha^k c\, v. \tag{6.7}
\]
Proof: Since J ≥ TJ − c v = T_µ J − c v, by using Lemma 6.1(a) with W = T_µ^j and J' = T_µ J, we have for
all j ≥ 1,
\[
T_\mu^j J \ge T_\mu^{j+1} J - \alpha^j c\, v. \tag{6.8}
\]
By adding this relation over j = 1, . . . , k − 1, we have
\[
TJ = T_\mu J \ge T_\mu^k J - \sum_{j=1}^{k-1} \alpha^j c\, v = T_\mu^k J - \frac{\alpha - \alpha^k}{1-\alpha}\, c\, v \ge T_\mu^k J - \frac{\alpha}{1-\alpha}\, c\, v,
\]
showing Eq. (6.6). From Eq. (6.8) for j = k, we obtain
\[
T_\mu^k J \ge T_\mu^{k+1} J - \alpha^k c\, v = T_\mu(T_\mu^k J) - \alpha^k c\, v \ge T(T_\mu^k J) - \alpha^k c\, v,
\]
showing Eq. (6.7). Q.E.D.
The next lemma applies to the optimistic generalized PI algorithm (6.1) and proves a preliminary
bound.
Lemma 6.3: Let Assumptions 2.1 and 2.2 hold, let $\{(J_k, \mu^k)\}$ be a sequence generated by the PI algorithm (6.1), and assume that for some $c \ge 0$ we have
$$
J_0 \ge T J_0 - c\, v.
$$
Then for all $k \ge 0$,
$$
T J_k + \frac{\alpha}{1-\alpha}\, \beta_k\, c\, v \ \ge\ J_{k+1} \ \ge\ T J_{k+1} - \beta_{k+1}\, c\, v, \tag{6.9}
$$
where $\beta_k$ is the scalar given by
$$
\beta_k = \begin{cases} 1 & \text{if } k = 0, \\ \alpha^{m_0 + \cdots + m_{k-1}} & \text{if } k > 0, \end{cases} \tag{6.10}
$$
with $m_j$, $j = 0, 1, \ldots$, being the integers used in the algorithm (6.1).
Proof: We prove Eq. (6.9) by induction on $k$, using Lemma 6.2. For $k = 0$, using Eq. (6.6) with $J = J_0$, $\mu = \mu^0$, and $k = m_0$, we have
$$
T J_0 \ge J_1 - \frac{\alpha}{1-\alpha}\, c\, v = J_1 - \frac{\alpha}{1-\alpha}\, \beta_0\, c\, v,
$$
showing the left-hand side of Eq. (6.9) for $k = 0$. Also by Eq. (6.7) with $\mu = \mu^0$ and $k = m_0$, we have
$$
J_1 \ge T J_1 - \alpha^{m_0} c\, v = T J_1 - \beta_1\, c\, v,
$$
showing the right-hand side of Eq. (6.9) for $k = 0$.

Assuming that Eq. (6.9) holds for $k - 1 \ge 0$, we will show that it holds for $k$. Indeed, the right-hand side of the induction hypothesis yields
$$
J_k \ge T J_k - \beta_k\, c\, v.
$$
Using Eqs. (6.6) and (6.7) with $J = J_k$, $\mu = \mu^k$, and $k = m_k$, we obtain
$$
T J_k \ge J_{k+1} - \frac{\alpha}{1-\alpha}\, \beta_k\, c\, v,
$$
and
$$
J_{k+1} \ge T J_{k+1} - \alpha^{m_k} \beta_k\, c\, v = T J_{k+1} - \beta_{k+1}\, c\, v,
$$
respectively. This completes the induction. Q.E.D.
The next lemma essentially proves the convergence of the generalized optimistic PI (Prop. 6.1) and
provides associated error bounds.
Lemma 6.4: Let Assumptions 2.1 and 2.2 hold, let $\{(J_k, \mu^k)\}$ be a sequence generated by the PI algorithm (6.1), and let $c \ge 0$ be a scalar such that
$$
\|J_0 - T J_0\| \le c. \tag{6.11}
$$
Then for all $k \ge 0$,
$$
J_k + \frac{\alpha^k}{1-\alpha}\, c\, v \ \ge\ J_k + \frac{\beta_k}{1-\alpha}\, c\, v \ \ge\ J^* \ \ge\ J_k - \frac{(k+1)\alpha^k}{1-\alpha}\, c\, v, \tag{6.12}
$$
where $\beta_k$ is defined by Eq. (6.10).
Proof: Using the relation $J_0 \ge T J_0 - c\, v$ [cf. Eq. (6.11)] and Lemma 6.3, we have
$$
J_k \ge T J_k - \beta_k\, c\, v, \qquad k = 0, 1, \ldots.
$$
Using this relation in Lemma 6.1(b) with $W = T$ and $k = 0$, we obtain
$$
J_k \ge J^* - \frac{\beta_k}{1-\alpha}\, c\, v,
$$
which together with the fact $\alpha^k \ge \beta_k$, shows the left-hand side of Eq. (6.12).

Using the relation $T J_0 \ge J_0 - c\, v$ [cf. Eq. (6.11)] and Lemma 6.1(b) with $W = T$, we have
$$
J^* \ge T^k J_0 - \frac{\alpha^k}{1-\alpha}\, c\, v, \qquad k = 0, 1, \ldots. \tag{6.13}
$$
Using again the relation $J_0 \ge T J_0 - c\, v$ in conjunction with Lemma 6.3, we also have
$$
T J_j \ge J_{j+1} - \frac{\alpha}{1-\alpha}\, \beta_j\, c\, v, \qquad j = 0, \ldots, k-1.
$$
Applying $T^{k-j-1}$ to both sides of this inequality and using the monotonicity and contraction properties of $T^{k-j-1}$, we obtain
$$
T^{k-j} J_j \ge T^{k-j-1} J_{j+1} - \frac{\alpha^{k-j}}{1-\alpha}\, \beta_j\, c\, v, \qquad j = 0, \ldots, k-1,
$$
cf. Lemma 6.1(a). By adding this relation over $j = 0, \ldots, k-1$, and using the fact $\beta_j \le \alpha^j$, it follows that
$$
T^k J_0 \ge J_k - \sum_{j=0}^{k-1} \frac{\alpha^{k-j}}{1-\alpha}\, \alpha^j\, c\, v = J_k - \frac{k\,\alpha^k}{1-\alpha}\, c\, v. \tag{6.14}
$$
Finally, by combining Eqs. (6.13) and (6.14), we obtain the right-hand side of Eq. (6.12). Q.E.D.
Proof of Props. 6.1 and 6.2: Let $c$ be a scalar satisfying Eq. (6.11). Then the error bounds (6.12) show that $\lim_{k\to\infty} \|J_k - J^*\| = 0$, i.e., the first part of Prop. 6.1. The second part (finite termination when the number of policies is finite) follows similarly to Prop. 5.1. The proof of Prop. 6.2 follows using the Compactness and Continuity Assumption 5.1, and the convergence argument of Prop. 5.2. Q.E.D.
Convergence Rate Issues

Let us consider the convergence rate bounds of Lemma 6.4 for generalized optimistic PI, and write them in the form
$$
\|J_0 - T J_0\| \le c \quad \Rightarrow \quad J_k - \frac{(k+1)\alpha^k}{1-\alpha}\, c\, v \ \le\ J^* \ \le\ J_k + \frac{\alpha^{m_0 + \cdots + m_{k-1}}}{1-\alpha}\, c\, v. \tag{6.15}
$$
We may contrast these bounds with the ones for generalized VI, where
$$
\|J_0 - T J_0\| \le c \quad \Rightarrow \quad T^k J_0 - \frac{\alpha^k}{1-\alpha}\, c\, v \ \le\ J^* \ \le\ T^k J_0 + \frac{\alpha^k}{1-\alpha}\, c\, v \tag{6.16}
$$
[cf. Prop. 2.4(c)].
In comparing the bounds (6.15) and (6.16), we should also take into account the associated overhead for a single iteration of each method: optimistic PI requires at iteration $k$ a single application of $T$ and $m_k - 1$ applications of $T_{\mu^k}$ (each being less time-consuming than an application of $T$), while VI requires a single application of $T$. It can then be seen that the upper bound for optimistic PI is better than the one for VI (same bound for less overhead), while the lower bound for optimistic PI is worse than the one for VI (worse bound for more overhead). This suggests that the choice of the initial condition $J_0$ is important in optimistic PI, and in particular it is preferable to have $J_0 \ge T J_0$ (implying convergence to $J^*$ from above) rather than $J_0 \le T J_0$ (implying convergence to $J^*$ from below). This is consistent with the results of other works, which indicate that the convergence properties of the method are fragile when the condition $J_0 \ge T J_0$ does not hold (see [WiB93], [BeT96], [BeY10a], [BeY10b], [YuB11]).
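As a rough numerical illustration of this comparison (with arbitrarily chosen values of $\alpha$, $c$, and $m_k$, not tied to any example in the paper), the upper bound in (6.15) after $k$ optimistic PI iterations coincides with the VI upper bound in (6.16) obtained with $m_0 + \cdots + m_{k-1}$ full applications of $T$, while only $k$ of the optimistic PI applications are full applications of $T$:

```python
# Compare the upper bounds in (6.15) and (6.16) for one arbitrary setting.
alpha, c, k = 0.9, 1.0, 10
m = [5] * k                                   # m_0, ..., m_{k-1}

exponent = sum(m)                             # m_0 + ... + m_{k-1}
bound_opi = alpha ** exponent / (1 - alpha) * c   # optimistic PI after k iterations
bound_vi  = alpha ** exponent / (1 - alpha) * c   # VI after sum(m) iterations: same bound

# Both methods perform sum(m) mapping applications in total, but only k of the
# optimistic PI applications are full T's; the rest are cheaper T_mu applications.
work_opi_full_T = k
work_vi_full_T = exponent
print(bound_opi, bound_vi, work_opi_full_T, work_vi_full_T)
```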
6.2 Approximate Optimistic Policy Iteration

We now consider error bounds for the case where the policy evaluation and policy improvement operations are approximate, similar to the nonoptimistic PI case of Section 5.1. In particular, we consider a method that generates a sequence of policies $\{\mu^k\}$ and a corresponding sequence of approximate cost functions $\{J_k\}$ satisfying
$$
\|J_k - T_{\mu^k}^{m_k} J_{k-1}\| \le \delta, \qquad \|T_{\mu^{k+1}} J_k - T J_k\| \le \epsilon, \qquad k = 0, 1, \ldots, \tag{6.17}
$$
[cf. Eq. (5.2)]. For example, we may compute (perhaps approximately, by simulation) the values $\big(T_{\mu^k}^{m_k} J_{k-1}\big)(x)$ for a subset of states $x$, and use a least squares fit of these values to select $J_k$ from some parametric class of functions.
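As an illustration of the preceding remark (with a hypothetical linear parametric class $J(x) = \phi(x)'\theta$ and a hypothetical sampling scheme, neither of which is specified in the paper), the fitting step may look as follows.

```python
# Least squares fit of J_k to sampled values of (T_{mu^k}^{m_k} J_{k-1})(x).
import numpy as np

def fit_cost_function(sample_states, target_values, phi):
    # sample_states: states x at which (T_{mu^k}^{m_k} J_{k-1})(x) was estimated,
    # target_values[x]: the (possibly simulation-based) estimate at x,
    # phi(x): feature vector defining the parametric class J(x) = phi(x) @ theta.
    Phi = np.array([phi(x) for x in sample_states])
    y = np.array([target_values[x] for x in sample_states])
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return lambda x: phi(x) @ theta           # the fitted J_k
```

The fitted $J_k$ then enters Eq. (6.17), with $\delta$ accounting for both the fitting error and the simulation noise.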
We will prove the same error bound as for the nonoptimistic case, cf. Eq. (5.3). However, for this we
will need the following condition, which is stronger than the contraction and monotonicity conditions that
we have been using so far.
Assumption 6.1: (Semilinear Monotonic Contraction) For all $J \in B(X)$ and $\mu \in \mathcal{M}$, the functions $T_\mu J$ and $T J$ belong to $B(X)$. Furthermore, for some $\alpha \in (0,1)$, we have
$$
\frac{(T_\mu J')(x) - (T_\mu J)(x)}{v(x)} \ \le\ \alpha \sup_{y \in X} \frac{J'(y) - J(y)}{v(y)}, \qquad \forall\ J, J' \in B(X),\ \mu \in \mathcal{M},\ x \in X. \tag{6.18}
$$
This assumption implies both the Contraction and Monotonicity Assumptions 2.1 and 2.2, as can be easily verified. Moreover the assumption is satisfied in all of the discounted DP examples of Section 2, as well as the SSP problem of the next section. It holds if $T_\mu$ is a linear mapping involving a matrix with nonnegative components that has spectral radius less than 1 (or more generally if $T_\mu$ is the minimum or the maximum of a finite number of such linear mappings).
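As a small numeric check of the last claim (not from the paper), consider a linear mapping $T_\mu J = b + A J$ with a strictly positive matrix $A$ of spectral radius less than 1. Taking the weight vector $v$ to be the Perron eigenvector of $A$, condition (6.18) holds with $\alpha$ equal to the spectral radius, since $A v = \alpha v$ and $A$ has nonnegative entries.

```python
# Verify M(T_mu J - T_mu J') <= alpha * M(J - J') for a random positive matrix,
# where M(y) = max_x y(x)/v(x) is the quantity introduced in Eq. (6.19) below.
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n)) + 0.1                          # strictly positive matrix
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))       # rescale: spectral radius = 0.9

eigvals, eigvecs = np.linalg.eig(A)
i = np.argmax(eigvals.real)
alpha = eigvals[i].real                               # Perron root = spectral radius
v = np.abs(eigvecs[:, i].real)                        # Perron eigenvector, positive

M = lambda y: np.max(y / v)
J, Jp = rng.standard_normal(n), rng.standard_normal(n)
# T_mu J - T_mu J' = A (J - J'), so condition (6.20) reads M(A d) <= alpha * M(d).
d = J - Jp
assert M(A @ d) <= alpha * M(d) + 1e-9
```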
For any function $y \in B(X)$, let us use the notation
$$
M(y) = \sup_{x \in X} \frac{y(x)}{v(x)}. \tag{6.19}
$$
Then the condition (6.18) can be written as
$$
M(T_\mu J - T_\mu J') \ \le\ \alpha\, M(J - J'), \qquad \forall\ J, J' \in B(X),\ \mu \in \mathcal{M}, \tag{6.20}
$$
and also implies the following multistep versions,
$$
T_\mu^\ell J - T_\mu^\ell J' \ \le\ \alpha^\ell M(J - J')\, v, \qquad M\big(T_\mu^\ell J - T_\mu^\ell J'\big) \ \le\ \alpha^\ell M(J - J'), \qquad \forall\ J, J' \in B(X),\ \mu \in \mathcal{M},\ \ell \ge 1, \tag{6.21}
$$
which can be proved by induction using Eq. (6.20). We have the following proposition, whose proof follows closely the original one by Thiery and Scherrer [ThS10b], given for the case of a discounted MDP.
Proposition 6.3: (Error Bound for Optimistic Approximate PI) Let Assumption 6.1 hold. Then the sequence $\{\mu^k\}$ generated by the optimistic approximate PI algorithm (6.17) satisfies
$$
\limsup_{k\to\infty} \|J_{\mu^k} - J^*\| \ \le\ \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}. \tag{6.22}
$$
Proof: Let us fix $k \ge 1$ and for simplicity let us denote
$$
J = J_{k-1}, \qquad \bar{J} = J_k,
$$
$$
\mu = \mu^k, \qquad \bar{\mu} = \mu^{k+1}, \qquad m = m_k, \qquad \bar{m} = m_{k+1},
$$
$$
s = J_\mu - T_\mu^m J, \qquad \bar{s} = J_{\bar{\mu}} - T_{\bar{\mu}}^{\bar{m}} \bar{J}, \qquad t = T_\mu^m J - J^*, \qquad \bar{t} = T_{\bar{\mu}}^{\bar{m}} \bar{J} - J^*.
$$
We have
$$
J_\mu - J^* = J_\mu - T_\mu^m J + T_\mu^m J - J^* = s + t. \tag{6.23}
$$
We will derive recursive relations for $s$ and $t$, which will also involve the residual functions
$$
r = T_\mu J - J, \qquad \bar{r} = T_{\bar{\mu}} \bar{J} - \bar{J}.
$$
We first obtain a relation between $r$ and $\bar{r}$. We have
$$
\begin{aligned}
\bar{r} &= T_{\bar{\mu}} \bar{J} - \bar{J} \\
&= \big(T_{\bar{\mu}} \bar{J} - T_\mu \bar{J}\big) + \big(T_\mu \bar{J} - \bar{J}\big) \\
&\le \big(T_{\bar{\mu}} \bar{J} - T \bar{J}\big) + \big(T_\mu \bar{J} - T_\mu (T_\mu^m J)\big) + \big(T_\mu^m J - \bar{J}\big) + \big(T_\mu^m (T_\mu J) - T_\mu^m J\big) \\
&\le \epsilon\, v + \alpha M\big(\bar{J} - T_\mu^m J\big)\, v + \delta\, v + \alpha^m M\big(T_\mu J - J\big)\, v \\
&\le (\epsilon + \delta)\, v + \alpha\delta\, v + \alpha^m M(r)\, v,
\end{aligned}
$$
where the first inequality follows from $T \bar{J} \le T_\mu \bar{J}$, and the second and third inequalities follow from Eqs. (6.17) and (6.21). From this relation we have
$$
M(\bar{r}) \ \le\ \epsilon + (1+\alpha)\delta + \beta\, M(r),
$$
where $\beta = \alpha^m$. Taking $\limsup$ as $k \to \infty$ in this relation, we obtain
$$
\limsup_{k\to\infty} M(r) \ \le\ \frac{\epsilon + (1+\alpha)\delta}{1 - \hat{\beta}}, \tag{6.24}
$$
where
$$
\hat{\beta} = \alpha^{\liminf_{k\to\infty} m_k}.
$$
Next we derive a relation between $s$ and $r$. We have
$$
\begin{aligned}
s &= J_\mu - T_\mu^m J \\
&= T_\mu^m J_\mu - T_\mu^m J \\
&\le \alpha^m M(J_\mu - J)\, v \\
&\le \frac{\alpha^m}{1-\alpha}\, M(T_\mu J - J)\, v \\
&= \frac{\alpha^m}{1-\alpha}\, M(r)\, v,
\end{aligned}
$$
where the first inequality follows from Eq. (6.21) and the second inequality follows by using Prop. 2.4(b). Thus we have $M(s) \le \frac{\alpha^m}{1-\alpha}\, M(r)$, from which by taking $\limsup$ of both sides and using Eq. (6.24), we obtain
$$
\limsup_{k\to\infty} M(s) \ \le\ \frac{\hat{\beta}\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)(1-\hat{\beta})}. \tag{6.25}
$$
Finally we derive a relation between $t$, $\bar{t}$, and $r$. We first note that
$$
\begin{aligned}
T \bar{J} - T J^* &\le \alpha M(\bar{J} - J^*)\, v \\
&= \alpha M\big(\bar{J} - T_\mu^m J + T_\mu^m J - J^*\big)\, v \\
&\le \alpha M\big(\bar{J} - T_\mu^m J\big)\, v + \alpha M\big(T_\mu^m J - J^*\big)\, v \\
&\le \alpha\delta\, v + \alpha M(t)\, v.
\end{aligned}
$$
Using this relation, and Eqs. (6.17) and (6.21), we have
$$
\begin{aligned}
\bar{t} &= T_{\bar{\mu}}^{\bar{m}} \bar{J} - J^* \\
&= \big(T_{\bar{\mu}}^{\bar{m}} \bar{J} - T_{\bar{\mu}}^{\bar{m}-1} \bar{J}\big) + \cdots + \big(T_{\bar{\mu}}^{2} \bar{J} - T_{\bar{\mu}} \bar{J}\big) + \big(T_{\bar{\mu}} \bar{J} - T \bar{J}\big) + \big(T \bar{J} - T J^*\big) \\
&\le \big(\alpha^{\bar{m}-1} + \cdots + \alpha\big) M\big(T_{\bar{\mu}} \bar{J} - \bar{J}\big)\, v + \epsilon\, v + \alpha\delta\, v + \alpha M(t)\, v,
\end{aligned}
$$
so finally
$$
M(\bar{t}) \ \le\ \frac{\alpha - \alpha^{\bar{m}}}{1-\alpha}\, M(\bar{r}) + (\epsilon + \alpha\delta) + \alpha M(t).
$$
By taking $\limsup$ of both sides and using Eq. (6.24), it follows that
$$
\limsup_{k\to\infty} M(t) \ \le\ \frac{(\alpha - \hat{\beta})\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)^2(1-\hat{\beta})} + \frac{\epsilon + \alpha\delta}{1-\alpha}. \tag{6.26}
$$
We now combine Eqs. (6.23), (6.25), and (6.26). We obtain
$$
\begin{aligned}
\limsup_{k\to\infty} M(J_{\mu^k} - J^*) &\le \limsup_{k\to\infty} M(s) + \limsup_{k\to\infty} M(t) \\
&\le \frac{\hat{\beta}\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)(1-\hat{\beta})} + \frac{(\alpha - \hat{\beta})\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)^2(1-\hat{\beta})} + \frac{\epsilon + \alpha\delta}{1-\alpha} \\
&= \frac{\big(\hat{\beta}(1-\alpha) + (\alpha - \hat{\beta})\big)\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)^2(1-\hat{\beta})} + \frac{\epsilon + \alpha\delta}{1-\alpha} \\
&= \frac{\alpha\big(\epsilon + (1+\alpha)\delta\big)}{(1-\alpha)^2} + \frac{\epsilon + \alpha\delta}{1-\alpha} \\
&= \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}.
\end{aligned}
$$
This proves the result, since in view of $J_{\mu^k} \ge J^*$, we have $M(J_{\mu^k} - J^*) = \|J_{\mu^k} - J^*\|$. Q.E.D.
Note that generally, optimistic PI with approximations is susceptible to the instability phenomenon illustrated by Example 4.1. In particular, when $m_k = 1$ for all $k$ in Eq. (6.17), the method becomes essentially identical to approximate VI. However, it appears that choices of $m_k$ that are significantly larger than 1 should be helpful in connection with this difficulty. In particular, it can be verified that in Example 4.1, the method converges to the optimal cost function if $m_k$ is sufficiently large.

A remarkable fact is that approximate VI, approximate PI, and approximate optimistic PI have very similar error bounds (cf. Props. 4.2, 5.3, and 6.3). Approximate VI has a slightly better bound, but insignificantly so in practical terms.
7. STOCHASTIC SHORTEST PATH PROBLEMS
The SSP problem is a total cost infinite horizon DP problem where:

(a) There is no discounting ($\alpha = 1$).

(b) The state space is $X = \{0, 1, \ldots, n\}$ and we are given transition probabilities, denoted by
$$
p_{ij}(u) = P\big(x_{k+1} = j \mid x_k = i,\ u_k = u\big), \qquad i, j \in X,\ u \in U(i).
$$

(c) The control constraint set $U(i)$ is a compact subset of a metric space for all $i \in X$.

(d) A cost $g(i, u)$ is incurred when control $u \in U(i)$ is selected at state $i$.

(e) State 0 is a special destination state, which is absorbing and cost-free, i.e.,
$$
p_{00}(u) = 1,
$$
and for all $u \in U(0)$, $g(0, u) = 0$.
We have assumed for convenience that the cost per stage does not depend on the successor state. This amounts to using expected cost per stage in all calculations. In particular, if the cost of applying control $u$ at state $i$ and moving to state $j$ is $\tilde{g}(i, u, j)$, we use as cost per stage the expected cost
$$
g(i, u) = \sum_{j=0}^{n} p_{ij}(u)\, \tilde{g}(i, u, j),
$$
and the subsequent analysis goes through with no change.
Since the destination 0 is cost-free and absorbing, the cost starting from 0 is zero for every policy. Accordingly, for all cost functions, we ignore the component that corresponds to 0, and define
$$
H(i, u, J) = g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J(j), \qquad i = 1, \ldots, n,\ u \in U(i),\ J \in \Re^n.
$$
The mappings $T$ and $T_\mu$ are defined by
$$
(T J)(i) = \min_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J(j) \Big], \qquad i = 1, \ldots, n,
$$
$$
(T_\mu J)(i) = g\big(i, \mu(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu(i)\big) J(j), \qquad i = 1, \ldots, n.
$$
Note that $H$ satisfies the Monotonicity Assumption 2.2.
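For concreteness (not part of the paper's development), the two mappings may be coded as follows, with the destination state 0 dropped. The arrays P and g are assumed inputs, and the rows of P may sum to less than 1 because the probabilities of moving to the destination are omitted.

```python
# SSP mappings T and T_mu over states 1,...,n (destination 0 dropped, no discounting).
# P[u][i][j] ~ p_{ij}(u) restricted to nondestination states; g[u][i] ~ g(i, u).
import numpy as np

def T(J, P, g):
    Q = g + P @ J                 # Q[u, i] = g(i, u) + sum_j p_ij(u) J(j)
    return Q.min(axis=0)

def T_mu(J, P, g, mu):
    idx = np.arange(len(J))
    return g[mu, idx] + P[mu, idx, :] @ J
```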
We say that a policy $\mu$ is proper if, when using this policy, there is positive probability that the destination will be reached after at most $n$ stages, regardless of the initial state; i.e., if
$$
\rho_\mu = \max_{i=1,\ldots,n} P\{x_n \ne 0 \mid x_0 = i, \mu\} < 1. \tag{7.1}
$$
It can be seen that $\mu$ is proper if and only if in the Markov chain corresponding to $\mu$, each state $i$ is connected to the destination with a path of positive probability transitions.
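A direct way to check condition (7.1) for a given policy (a sketch under the assumptions of this section, not an algorithm from the paper) is to take the $n$-th power of the substochastic matrix of transition probabilities among the nondestination states:

```python
# Check whether a policy mu is proper, i.e., rho_mu < 1 in Eq. (7.1).
import numpy as np

def is_proper(P_mu):
    # P_mu[i][j] ~ p_{ij}(mu(i)) for i, j in {1,...,n}; the destination column is
    # dropped, so row sums give the probability of not having reached the destination.
    n = P_mu.shape[0]
    Q = np.linalg.matrix_power(P_mu, n)     # Q[i, :].sum() = P{x_n != 0 | x_0 = i, mu}
    rho_mu = Q.sum(axis=1).max()
    return rho_mu < 1, rho_mu
```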
Throughout this section we assume that all policies are proper. Without this assumption, the mapping
T need not be a contraction, as is well known (see [BeT91]). On the other hand, we have the following
proposition [see e.g., [BeT96], Prop. 2.2; earlier proofs were given by Veinott [Vei69] (who attributes the
result to A. J. Hoffman), and Tseng [Tse90]]. Note that while [BeT96] assumes that U(i) is finite for all i,
the proof given there, in conjunction with the results of [BeT91], extends to the more general case considered
here.
Proposition 7.1: Assume that all policies are proper. Then, there exists a vector $v = \big(v(1), \ldots, v(n)\big)$ with positive components such that the Contraction Assumption 2.1 holds, and the modulus of contraction is given by
$$
\alpha = \max_{i=1,\ldots,n} \frac{v(i) - 1}{v(i)}.
$$
There is a generalization of the preceding proposition to SSP problems with a destination 0 and a countable number of other states, denoted $1, 2, \ldots$. Let $v(i)$ be the maximum (over all policies) expected number of stages up to termination, starting from state $i$. Then if $v(i)$ is finite and bounded over $i$, the mappings $T$ and $T_\mu$ are contraction mappings with respect to the weighted sup-norm with weight vector $v = \big(v(1), v(2), \ldots\big)$. The proof is similar to the proof of Prop. 7.1, and is given in [Ber12], Section 3.5 and Exercise 2.11.
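One way to obtain a pair $(v, \alpha)$ as in Prop. 7.1 (a sketch under the assumptions of this section, not taken verbatim from the cited proofs) is to let $v(i)$ be the maximal expected number of stages to termination, i.e., the solution of $v(i) = 1 + \max_{u \in U(i)} \sum_j p_{ij}(u) v(j)$. Since then $\sum_j p_{ij}(u) v(j) \le v(i) - 1 \le \alpha\, v(i)$ for every $u$, with $\alpha = \max_i \big(v(i)-1\big)/v(i)$, the weighted sup-norm contraction property follows.

```python
# Compute the weight vector v and modulus alpha of Prop. 7.1 for a finite SSP
# in which all policies are proper. v is found by value iteration on
# v(i) = 1 + max_u sum_j p_ij(u) v(j); convergence relies on all policies being proper.
import numpy as np

def weight_vector(P, tol=1e-10, max_iter=10**6):
    # P[u][i][j] ~ p_{ij}(u) among nondestination states (destination column dropped).
    n = P.shape[1]
    v = np.ones(n)
    for _ in range(max_iter):
        v_new = 1.0 + (P @ v).max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    alpha = np.max((v - 1.0) / v)
    return v, alpha
```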
We finally note that the weighted sup-norm contraction property of Prop. 7.1 is algorithmically significant, because it brings to bear the preceding algorithms and analysis. In particular, the error bounds of Sections 4 and 5 for approximate VI and PI are valid when specialized to SSP problems, and the optimistic PI method of Section 6.1 is convergent. We provide the corresponding analysis in the next two subsections.
7.1 Approximate Policy Iteration
Consider an approximate PI algorithm that generates a sequence of stationary policies $\{\mu^k\}$ and a corresponding sequence of approximate cost vectors $\{J_k\}$ satisfying for all $k$
$$
\|J_k - J_{\mu^k}\| \le \delta, \qquad \|T_{\mu^{k+1}} J_k - T J_k\| \le \epsilon, \tag{7.2}
$$
where $\delta$ and $\epsilon$ are some positive scalars, and $\|\cdot\|$ is the weighted sup-norm
$$
\|J\| = \max_{i=1,\ldots,n} \frac{|J(i)|}{v(i)}.
$$
The following proposition provides error bounds that are special cases of the ones of Props. 5.3 and
5.5.
Proposition 7.2: (Error Bound for Approximate PI) The sequence $\{\mu^k\}$ generated by the approximate PI algorithm (7.2) satisfies
$$
\|J_{\mu^{k+1}} - J^*\| \ \le\ \alpha \|J_{\mu^k} - J^*\| + \frac{\epsilon + 2\alpha\delta}{1-\alpha}, \tag{7.3}
$$
and
$$
\limsup_{k\to\infty} \|J_{\mu^k} - J^*\| \ \le\ \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}, \tag{7.4}
$$
where $\|\cdot\|$ is the weighted sup-norm of Prop. 7.1, and $\alpha$ is the associated contraction modulus. Moreover, when $\{\mu^k\}$ converges to some $\bar{\mu}$, in the sense that
$$
\mu^{\bar{k}+1} = \mu^{\bar{k}} = \bar{\mu} \quad \text{for some } \bar{k},
$$
we have
$$
\|J_{\bar{\mu}} - J^*\| \ \le\ \frac{\epsilon + 2\alpha\delta}{1-\alpha}.
$$
The standard result in the literature for the approximate PI method of this section has the form
$$
\limsup_{k\to\infty}\ \max_{i=1,\ldots,n} \big|J_{\mu^k}(i) - J^*(i)\big| \ \le\ \frac{n(1 - \alpha + n)(\epsilon + 2\delta)}{(1-\rho)^2}, \tag{7.5}
$$
where $\rho$ is defined as the maximal (over all initial states and policies) probability of the Markov chain not having terminated after $n$ transitions (see [BeT96], Prop. 6.3). While the bounds (7.4) and (7.5) involve different norms (one is weighted and the other is unweighted) and different denominators, the error bound (7.4) seems stronger, particularly for large $n$.
7.2 Optimistic Policy Iteration
Consider the optimistic PI method, whereby starting with a vector $J_0 \in \Re^n$, we generate sequences $\{J_k\}$ and $\{\mu^k\}$ with the algorithm
$$
T_{\mu^k} J_k = T J_k, \qquad J_{k+1} = T_{\mu^k}^{m_k} J_k, \qquad k = 0, 1, \ldots, \tag{7.6}
$$
where $\{m_k\}$ is a sequence of positive integers. Then the convergence result and convergence rate estimates of Section 6 hold. In particular, we have the following.
Proposition 7.3: (Convergence of Optimistic PI) Assume that all stationary policies are proper, let $\{(J_k, \mu^k)\}$ be a sequence generated by the optimistic PI algorithm (7.6), and assume that for some $c \ge 0$ we have
$$
\|J_0 - T J_0\| \le c,
$$
where $\|\cdot\|$ is the weighted sup-norm of Prop. 7.1. Then for all $k \ge 0$,
$$
J_k + \frac{\alpha^k}{1-\alpha}\, c\, v \ \ge\ J_k + \frac{\beta_k}{1-\alpha}\, c\, v \ \ge\ J^* \ \ge\ J_k - \frac{(k+1)\alpha^k}{1-\alpha}\, c\, v, \tag{7.7}
$$
where $v$ and $\alpha$ are the weight vector and the contraction modulus of Prop. 7.1, and $\beta_k$ is defined by
$$
\beta_k = \begin{cases} 1 & \text{if } k = 0, \\ \alpha^{m_0 + \cdots + m_{k-1}} & \text{if } k > 0, \end{cases} \tag{7.8}
$$
with $m_j$, $j = 0, 1, \ldots$, being the integers used in the algorithm (7.6). Moreover, we have
$$
\lim_{k\to\infty} \|J_k - J^*\| = 0,
$$
and $J_{\mu^k} = J^*$ for all $k$ greater than some index $\bar{k}$.
To our knowledge, this is the first convergence result for optimistic PI applied to SSP problems (earlier results by Williams and Baird [WiB93] for discounted MDP may be easily extended to SSP, but require restrictive conditions, such as $T J_0 \le J_0$). A similar result can be obtained when the mappings $T$ and $T_\mu$ are replaced in the algorithm (7.6) by any monotone mappings $W$ and $W_\mu$ that are contractions with respect to a common weighted sup-norm, and have $J^*$ and $J_\mu$ as their unique fixed points, respectively. This latter property is true in particular if $W$ is the Gauss-Seidel mapping based on $T$ [this is the mapping where $(W J)(i)$ is computed by the same equation as $(T J)(i)$ except that the previously calculated values $(W J)(1), \ldots, (W J)(i-1)$ are used in place of $J(1), \ldots, J(i-1)$].
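A sketch of such a Gauss-Seidel mapping $W$ based on $T$, in the notation of the SSP sketch given earlier in this section (the arrays P and g are the same assumed inputs), is as follows.

```python
# Gauss-Seidel sweep: (W J)(i) is computed as (T J)(i), but with the already
# updated values (W J)(1),...,(W J)(i-1) used in place of J(1),...,J(i-1).
import numpy as np

def gauss_seidel_T(J, P, g):
    W = J.copy()
    for i in range(len(J)):
        W[i] = (g[:, i] + P[:, i, :] @ W).min()
    return W
```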
7.3 Approximate Optimistic Policy Iteration
Consider an optimistic approximate PI algorithm that generates a sequence of stationary policies $\{\mu^k\}$ and a corresponding sequence of approximate cost vectors $\{J_k\}$ satisfying for all $k$
$$
\|J_k - T_{\mu^k}^{m_k} J_{k-1}\| \le \delta, \qquad \|T_{\mu^{k+1}} J_k - T J_k\| \le \epsilon, \qquad k = 0, 1, \ldots, \tag{7.9}
$$
[cf. Eq. (6.17)]. The following proposition provides an error bound that is a special case of the one of Prop. 6.3 (the Semilinear Monotonic Contraction Assumption 6.1 is clearly satisfied under our assumptions).
Proposition 7.4: (Error Bound for Optimistic Approximate PI) The sequence $\{\mu^k\}$ generated by the approximate PI algorithm (7.9) satisfies
$$
\limsup_{k\to\infty} \|J_{\mu^k} - J^*\| \ \le\ \frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}. \tag{7.10}
$$
8. CONCLUSIONS
We have considered an abstract and broadly applicable DP model based on weighted sup-norm contractions,
and provided a review of classical results and extensions to approximation methods that are the focus
of current research. By virtue of its abstract character, the analysis provides insight into fundamental
convergence properties and error bounds of exact and approximate VI and PI algorithms. Moreover, it
allows simple proofs of results that would be difficult and/or tedious to obtain by alternative methods. The
power of our analysis was illustrated by its application to SSP problems, where substantial improvements of
the existing results on PI algorithms were obtained.
9. REFERENCES
[BeS78] Bertsekas, D. P., and Shreve, S. E., 1978. Stochastic Optimal Control: The Discrete Time Case, Academic
Press, N. Y.; may be downloaded from
http://web.mit.edu/dimitrib/www/home.html
[BeT91] Bertsekas, D. P., and Tsitsiklis, J. N., 1991. “An Analysis of Stochastic Shortest Path Problems,” Math.
Operations Research, Vol. 16, pp. 580-595.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.
[BeY10a] Bertsekas, D. P., and Yu, H., 2010. “Q-Learning and Enhanced Policy Iteration in Discounted Dynamic
Programming,” Lab. for Information and Decision Systems Report LIDS-P-2831, MIT; to appear in Mathematics of
Operations Research.
[BeY10b] Bertsekas, D. P., and Yu, H., 2010. “Asynchronous Distributed Policy Iteration in Dynamic Programming,”
Proc. of Allerton Conf. on Information Sciences and Systems.
[Ber77] Bertsekas, D. P., 1977. “Monotone Mappings with Application in Dynamic Programming,” SIAM J. on
Control and Optimization, Vol. 15, pp. 438-464.
[Ber07] Bertsekas, D. P., 2007. Dynamic Programming and Optimal Control, 3rd Edition, Vol. II, Athena Scientific,
Belmont, MA.
[Ber11] Bertsekas, D. P., 2011. “λ-Policy Iteration: A Review and a New Implementation,” Lab. for Information and Decision Systems Report LIDS-P-2874, MIT; to appear in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, by F. Lewis and D. Liu (eds.), IEEE Press Computational Intelligence Series.
[CaR11] Canbolat, P. G., and Rothblum, U. G., 2011. “(Approximate) Iterated Successive Approximations Algo-
rithm for Sequential Decision Processes,” Technical Report, The Technion - Israel Institute of Technology; Annals of
Operations Research, to appear.
[Den67] Denardo, E. V., 1967. “Contraction Mappings in the Theory Underlying Dynamic Programming,” SIAM
Review, Vol. 9, pp. 165-177.
[Har72] Harrison, J. M., 1972. “Discrete Dynamic Programming with Unbounded Rewards,” Ann. Math. Stat., Vol.
43, pp. 636-644.
[Lip73] Lippman, S. A., 1973. “Semi-Markov Decision Processes with Unbounded Rewards,” Management Sci., Vol.
21, pp. 717-731.
[Lip75] Lippman, S. A., 1975. “On Dynamic Programming with Unbounded Rewards,” Management Sci., Vol. 19,
pp. 1225-1233.
[Put94] Puterman, M. L., 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming, J. Wiley,
N.Y.
[Rot79] Rothblum, U. G., 1979. “Iterated Successive Approximation for Sequential Decision Processes,” in Stochastic
Control and Optimization, by J. W. B. van Overhagen and H. C. Tijms (eds), Vrije University, Amsterdam.
[Sch11] Scherrer, B., 2011. “Performance Bounds for λ-Policy Iteration and Application to the Game of Tetris,”
INRIA Lorraine Report, France.
[Sch12] Scherrer, B., 2012. “On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision
Processes,” INRIA Lorraine Report, France.
[ThS10a] Thiery, C., and Scherrer, B., 2010. “Least-Squares λ-Policy Iteration: Bias-Variance Trade-off in Control
Problems,” in ICML’10: Proc. of the 27th Annual International Conf. on Machine Learning.
[ThS10b] Thiery, C., and Scherrer, B., 2010. “Performance Bound for Approximate Optimistic Policy Iteration,”
Technical Report, INRIA.
[Tse90] Tseng, P., 1990. “Solving H-Horizon, Stationary Markov Decision Problems in Time Proportional to log(H),”
Operations Research Letters, Vol. 9, pp. 287-297.
[Vei69] Veinott, A. F., Jr., 1969. “Discrete Dynamic Programming with Sensitive Discount Optimality Criteria,” Ann.
Math. Statist., Vol. 40, pp. 1635-1660.
[VeP84] Verdú, S., and Poor, H. V., 1984. “Backward, Forward, and Backward-Forward Dynamic Programming Models under Commutativity Conditions,” Proc. 1984 IEEE Decision and Control Conference, Las Vegas, NV, pp. 1081-1086.
[VeP87] Verdú, S., and Poor, H. V., 1987. “Abstract Dynamic Programming Models under Commutativity Conditions,” SIAM J. on Control and Optimization, Vol. 25, pp. 990-1006.
[WiB93] Williams, R. J., and Baird, L. C., 1993. “Analysis of Some Incremental Variants of Policy Iteration: First
Steps Toward Understanding Actor-Critic Learning Systems,” Report NU-CCS-93-11, College of Computer Science,
Northeastern University, Boston, MA.
[YuB11] Yu, H., and Bertsekas, D. P., 2011. “Q-Learning and Policy Iteration Algorithms for Stochastic Shortest
Path Problems,” Lab. for Information and Decision Systems Report LIDS-P-2871, MIT; to appear in Annals of OR.
[YuB12] Yu, H., and Bertsekas, D. P., 2012. “Weighted Bellman Equations and their Applications in Dynamic Programming,” Lab. for Information and Decision Systems Report LIDS-P-2876, MIT.