CONTINUOUS-TIME ZERO-SUM GAMES FOR MARKOV DECISION PROCESSES WITH RISK-SENSITIVE FINITE-HORIZON COST CRITERION ON A GENERAL STATE SPACE
Subrata Golui
(M.Sc. in Mathematics, Indian Institute of Engineering Science and Technology Shibpur). Senior Research Scholar,
Department of Mathematics, Indian Institute of Technology Guwahati. Guwahati (India).
golui@iitg.ac.in
https://orcid.org/0000-0001-7232-562X
Chandan Pal
(PhD in Mathematics, Indian Institute of Technology Bombay). Assistant Professor, Department of Mathematics,
Indian Institute of Technology Guwahati. Guwahati (India).
cpal@iitg.ac.in
https://orcid.org/0000-0002-4684-0481
Reception: 27/08/2022 Acceptance: 11/09/2022 Publication: 29/12/2022
Suggested citation:
Subrata Golui and Chandan Pal (2022). Continuous-time zero-sum games for Markov decision processes with risk-sensitive finite-horizon cost criterion on a general state space. 3C Empresa. Investigación y pensamiento crítico, 11(2), 76-92.
https://doi.org/10.17993/3cemp.2022.110250.76-92
ABSTRACT
In this manuscript, we study continuous-time risk-sensitive finite-horizon time-homogeneous zero-sum
dynamic games for controlled Markov decision processes (MDPs) on a Borel space. Here, the transition
and payoff functions are extended real-valued functions. We prove the existence of the game's value
and the uniqueness of the solution of the Shapley equation under some reasonable assumptions. Moreover,
all possible saddle-point equilibria are completely characterized in the class of all admissible feedback
multi-strategies. We also provide an example to support our assumptions.
KEYWORDS
Zero-sum stochastic game, Borel state space, risk-sensitive utility, finite-horizon cost criterion, optimality
equation, saddle-point
3C Empresa. Investigación y pensamiento crítico. ISSN: 2254-3376
Ed. 50 Vol. 11 N.º 2 August - December 2022
1 INTRODUCTION
In the game-theory literature there are two types of game models: zero-sum and nonzero-sum. In a zero-sum two-person game, one player tries to maximize his/her payoff while the other tries to minimize it, whereas in a nonzero-sum game each player tries to minimize his/her own payoff. Game theory can be studied either in discrete time or in continuous time: in continuous time the players observe the state continuously, whereas in discrete time they observe it only at discrete epochs. There are also two types of game models with respect to the risk measure: risk-neutral games and risk-sensitive games. The risk-sensitive, or "exponential of integral", cost criterion is popular, particularly in finance (see, e.g., Bielecki and Pliska (1999)), since it captures the effects of moments of the cost beyond the first-order moment (the expectation).
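The moment-capturing property mentioned above can be made precise by the standard cumulant expansion; for a total cost C with a finite exponential moment,

```latex
\frac{1}{\lambda}\log \mathbb{E}\!\left[e^{\lambda C}\right]
  \;=\; \mathbb{E}[C] \;+\; \frac{\lambda}{2}\,\operatorname{Var}(C) \;+\; O(\lambda^{2}),
  \qquad \lambda \to 0,
```

so a risk-averse controller (λ > 0) penalizes not only the mean cost but also its variance and higher cumulants.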
There is a large literature on the risk-neutral cost criterion for continuous-time controlled Markov decision processes (CTMDPs) in different setups; see Guo (2007), Guo and Hernandez-Lerma (2009), Guo et al. (2015), Guo et al. (2012), Guo and Piunovskiy (2011), Huang (2018), Piunovskiy and Zhang (2011), Piunovskiy and Zhang (2014) for the single-controller model and Guo and Hernandez-Lerma (2003), Guo and Hernandez-Lerma (2005), Guo and Hernandez-Lerma (2007), Wei and Chen (2016), Zhang and Guo (2012) for game models. The risk-neutral criterion ignores risk because of its additive structure, and when the variance of the cost is high it may be unsatisfactory for optimal control. Moreover, different controllers may have different attitudes toward risk, so decision-makers take risk preferences into account in the performance criterion. Bell (1995) gave a model containing an interpretation of risk-sensitive utility. This paper considers finite-horizon risk-sensitive two-person zero-sum dynamic games for CTMDPs with unbounded transition and payoff rates under admissible feedback strategies. State and action spaces are taken to be Borel spaces. The main goals of this manuscript are to find the solution of the optimality equation (6) (the Shapley equation), to prove the existence of the game's value, and to give a complete characterization of saddle-point equilibria.
The finite-horizon optimality criterion arises naturally in real-life scenarios, where the cost criterion may not be risk-neutral. For finite-horizon risk-neutral CTMDPs, see Guo et al. (2015), Huang (2018); for the corresponding games, see Wei and Chen (2016) and its references. In this context, for risk-sensitive finite-horizon controlled CTMDPs, one can see Ghosh and Saha (2014), Guo et al. (2019), Wei (2016), while research on infinite-horizon risk-sensitive CTMDPs is available in Ghosh and Saha (2014), Golui and Pal (2022), Guo and Zhang (2018), Kumar and Pal (2013), Kumar and Pal (2015), Zhang (2017) and the references therein. The corresponding finite/infinite-horizon dynamic games are studied in Ghosh et al. (2022), Golui and Pal (2021a), Golui and Pal (2021b), Golui et al. (2022), Wei (2019). Studies of risk-sensitive control of CTMDPs on a denumerable state space are widely available, see Ghosh and Saha (2014), Guo and Liao (2019), Guo et al. (2019), but a countable state space is sometimes inadequate for modeling, especially in chemical reaction, water reservoir management, inventory, cash-flow, and insurance problems. By contrast, the literature on controlled CTMDPs on a general state space is rather narrow. Some exceptions for the single-controller case are Golui and Pal (2022), Guo et al. (2012), Guo and Zhang (2019), Pal and Pradhan (2019), Piunovskiy and Zhang (2014), Piunovskiy and Zhang (2020), and for the corresponding stochastic games, Bauerle and Rieder (2017), Golui and Pal (2021b), Guo and Hernandez-Lerma (2007), Wei (2017). It is therefore both interesting and important to consider the game problem on a general state space. In Guo and Zhang (2019), the authors studied the same problem as Guo et al. (2019) but on a general state space, whereas Wei (2017) studied the finite-horizon risk-sensitive zero-sum game for a controlled Markov jump process with bounded costs and unbounded transition rates. In Ghosh et al. (2016), the authors studied infinite-horizon dynamic games for CTMDPs with bounded transition and payoff rates. However, this boundedness condition is restrictive in many real-life scenarios; queueing and population processes, for instance, require unbounded transition and payoff functions. In Golui and Pal (2021a), finite-horizon continuous-time risk-sensitive zero-sum games for
unbounded transition and payoff functions on a countable state space were considered, but the extension of those results to a general Borel state space was unknown to us. We solve this problem in the present paper. Here we deal with finite-horizon risk-sensitive dynamic games with unbounded payoff and transition rates, in the class of all admissible feedback strategies, on a general Borel state space; these results were unknown until now. We find the solution of the risk-sensitive finite-horizon optimality equation and, at the same time, obtain the existence of a saddle-point equilibrium for this jump process. We consider a homogeneous game model. In Theorem 4 we prove our main results, i.e., we show that if the cost rates are real-valued functions, then the Shapley equation (6) has a solution. The existence of saddle-point equilibria is proved using the measurable selection theorem in Nowak (1985). The uniqueness of the solution follows from the well-known Feynman-Kac formula. The value of the game is also established.
The rest of this paper is organized as follows. Section 2 describes the model of our stochastic game, some definitions, and the finite-horizon cost criterion. In Section 3, preliminary results, conditions, and an extension of the Feynman-Kac formula are provided; there we also establish the probabilistic representation of the solution of the finite-horizon optimality equation (6). The uniqueness of this solution as well as the game's value are proved in Section 4, where we also completely characterize the saddle-point equilibria within the class of admissible Markov strategies. In Section 5, we verify our results with an example.
2 THE ZERO-SUM DYNAMIC GAME MODEL
First, we introduce a time-homogeneous continuous-time zero-sum dynamic game model in this section,
which contains the following:
G := {X, U, V, (U(x) ⊂ U, x ∈ X), (V(x) ⊂ V, x ∈ X), q(·|x, u, v), c(x, u, v), g(x)}.   (1)
Here X is our state space, which is a Borel space whose corresponding Borel σ-algebra is B(X). The action spaces are U and V for the first and second players, respectively, and both are Borel spaces with Borel σ-algebras B(U) and B(V). For each x ∈ X, the admissible action spaces are denoted by U(x) ∈ B(U) and V(x) ∈ B(V), respectively, and these spaces are assumed to be compact. Now let us define the Borel subset of X × U × V given by

K := {(x, u, v) | x ∈ X, u ∈ U(x), v ∈ V(x)}.
Next, for any (x, u, v) ∈ K, the transition rate of the CTMDP, denoted by q(·|x, u, v), is a signed kernel on X such that q(D|x, u, v) ≥ 0 for (x, u, v) ∈ K and x ∉ D. Also, q(·|x, u, v) is assumed to be conservative, i.e., q(X|x, u, v) ≡ 0, as well as stable, i.e.,

q*(x) := sup_{u ∈ U(x), v ∈ V(x)} [q_x(u, v)] < ∞  for all x ∈ X,   (2)

where q_x(u, v) := −q({x}|x, u, v) ≥ 0 for all (x, u, v) ∈ K.
Our running cost is c, assumed to be measurable on K, and the terminal cost is g, assumed to be measurable on X. These costs are taken to be real-valued.
The dynamic game is played as follows. The players take actions continuously. At each time t ≥ 0, if the system's state is x ∈ X, the players choose their actions u_t ∈ U(x) and v_t ∈ V(x) independently according to their respective strategies. As a result, the following events occur:

- the first player immediately receives a reward at rate c(x, u_t, v_t) and the second player pays a cost at rate c(x, u_t, v_t); and
- after staying a random time in state x, the system leaves x at a rate given by the quantity q_x(u_t, v_t), and it jumps to a set D (x ∉ D) with probability determined by q(D|x, u_t, v_t) / q_x(u_t, v_t) (see Proposition B.8 in Guo and Hernandez-Lerma (2009), p. 205, for details).

Now suppose the system is at a new state y. Then the above procedure is repeated until the fixed time T̂ > 0. Moreover, at time T̂, if the system occupies a state y_T̂, the second player pays a terminal cost g(y_T̂) to the first player.
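The holding-time/jump mechanism just described can be sketched as a small simulation. The exit rates, the uniform jump distribution, and the pure policies below are hypothetical, chosen only to show the structure (exponential holding time, then a jump drawn from the normalized kernel):

```python
import random

random.seed(0)

states = [0, 1, 2]

def q_x(x, u, v):
    """Exit rate q_x(u,v) from state x (hypothetical)."""
    return 1.0 + u + v

def jump_prob(x, u, v):
    """Jump distribution q(.|x,u,v)/q_x(u,v) over y != x (uniform here)."""
    others = [y for y in states if y != x]
    return {y: 1.0 / len(others) for y in others}

def simulate(x0, policy1, policy2, T_hat):
    """One trajectory of the controlled jump process on [0, T_hat]."""
    t, x, path = 0.0, x0, [(0.0, x0)]
    while True:
        u, v = policy1(x), policy2(x)
        hold = random.expovariate(q_x(x, u, v))   # exponential holding time
        if t + hold >= T_hat:                     # horizon reached: stop
            return path
        t += hold
        p = jump_prob(x, u, v)
        x = random.choices(list(p), weights=list(p.values()))[0]
        path.append((t, x))

path = simulate(0, lambda x: 1, lambda x: 0, T_hat=5.0)
print(len(path), path[0])
```

The returned `path` lists the jump epochs and the states entered, i.e., a realization of the process {ξ_t} up to the horizon.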
Consequently, the first player always tries to maximize his/her payoff, whereas the second player wants to minimize his/her payoff according to the cost criterion H^{ζ1,ζ2}(·,·) presented below in equation (4). Next, the construction of the corresponding CTMDP under any pair of admissible feedback strategies is presented (as in Kitaev (1986), Piunovskiy and Zhang (2011)); we introduce some useful notation: define
X(∆) := X ∪ {∆} (for some ∆ ∉ X), Ω_0 := (X × (0, ∞))^∞, and Ω := Ω_0 ∪ {(x_0, θ_1, x_1, ..., θ_k̂, x_k̂, ∞, ∆, ∞, ∆, ...) | x_0 ∈ X, x_l ∈ X, θ_l ∈ (0, ∞) for each 1 ≤ l ≤ k̂, k̂ ≥ 1}, and let F be the corresponding Borel σ-algebra on Ω. Then we obtain a Borel measurable space (Ω, F). For each k̂ ≥ 0 and ω := (x_0, θ_1, x_1, ..., θ_k̂, x_k̂, ...) ∈ Ω, define T_0(ω) := 0, T_k̂(ω) − T_{k̂−1}(ω) := θ_k̂, and T_∞(ω) := lim_{k̂→∞} T_k̂(ω). Now, in view of the definition of {T_k̂}, we define the state process {ξ_t}_{t≥0} by

ξ_t(ω) := Σ_{k̂≥0} I_{{T_k̂ ≤ t < T_{k̂+1}}} x_k̂ + I_{{t ≥ T_∞}} ∆,   t ≥ 0.   (3)
Here I_E is the standard notation for the indicator function of a set E, and we use the conventions 0 + z =: z and 0 · z =: 0 for any z ∈ X(∆). The process is treated as absorbed in the state ∆ after the time T_∞. Hence, define q(·|∆, u_∆, v_∆) :≡ 0, U(∆) := U ∪ {u_∆}, V(∆) := V ∪ {v_∆}, U_∆ := {u_∆}, V_∆ := {v_∆}, and c(∆, u, v) :≡ 0 for all (u, v) ∈ U(∆) × V(∆), where u_∆, v_∆ are treated as isolated points.
Furthermore, define F_t := σ({T_k̂ ≤ s, ξ_{T_k̂} ∈ D} : D ∈ B(X), 0 ≤ s ≤ t, k̂ ≥ 0) for t ∈ R_+, and F_{s−} =: ∨_{0≤t<s} F_t. Lastly, the σ-algebra of predictable sets on Ω × [0, ∞) corresponding to {F_t}_{t≥0} is denoted by P := σ({U × {0}, U ∈ F_0} ∪ {V × (s, ∞), V ∈ F_{s−}}). Now we introduce the strategies of the players in order to define the risk-sensitive cost criterion.
Definition 1. An admissible feedback strategy for player 1, denoted by ζ1 = {ζ1_t}_{t≥0}, is defined to be a transition probability ζ1(du|ω, t) from (Ω × [0, ∞), P) onto (U, B(U)) for which ζ1(U(ξ_{t−}(ω))|ω, t) = 1.

For more information, one can see [Guo and Song (2011), Definition 2.1, Remark 2.2], Piunovskiy and Zhang (2011), Zhang (2017).
Let Π1_Ad denote the set of all admissible feedback strategies for player 1. A strategy ζ1 ∈ Π1_Ad for player 1 is said to be Markov if, for every ω ∈ Ω and t ≥ 0, the relation ζ1(du|ω, t) = ζ1(du|ξ_{t−}(ω), t) holds, where ξ_{t−}(ω) := lim_{s↑t} ξ_s(ω). We call a Markov strategy {ζ1_t} stationary Markov for player 1 if it does not depend explicitly on time t. The families of all Markov strategies and all stationary strategies for the first player are denoted by Π1_M and Π1_SM, respectively. The sets Π2_Ad, Π2_M, and Π2_SM of all admissible feedback strategies, all Markov strategies, and all stationary strategies, respectively, for the second player are defined similarly. In view of Assumption 1 below, for any initial distribution γ on X and any multi-strategy (ζ1, ζ2) ∈ Π1_Ad × Π2_Ad, by Theorem 4.27 in Kitaev and Rykov (1985) there exists a unique probability measure on (Ω, F), denoted by P^{ζ1,ζ2}_γ (depending on γ and (ζ1, ζ2)), under which ξ_0 has distribution γ. Let E^{ζ1,ζ2}_γ denote the corresponding expectation operator. In particular, when γ is the Dirac measure at a state x ∈ X, P^{ζ1,ζ2}_γ and E^{ζ1,ζ2}_γ are written as P^{ζ1,ζ2}_x and E^{ζ1,ζ2}_x,
respectively. For any compact metric space
Y, the space of probability measures on Y, endowed with the Prohorov topology, is denoted by P(Y). Since U(x) and V(x) are compact for each x ∈ X, P(U(x)) and P(V(x)) are also compact and convex metric spaces. Now, for each fixed x ∈ X, ϑ ∈ P(U(x)), and η ∈ P(V(x)), the corresponding transition and payoff rates are defined as follows:

q(D|x, ϑ, η) := ∫_{V(x)} ∫_{U(x)} q(D|x, u, v) ϑ(du) η(dv),   D ∈ B(X),

c(x, ϑ, η) := ∫_{V(x)} ∫_{U(x)} c(x, u, v) ϑ(du) η(dv).
Note that ζ1 ∈ Π1_SM can be identified with a mapping ζ1 : X → P(U) for which ζ1(·|x) ∈ P(U(x)) for each x ∈ X. So we can write Π1_SM = ∏_{x∈X} P(U(x)) and Π2_SM = ∏_{x∈X} P(V(x)). Hence the sets Π1_SM and Π2_SM are compact metric spaces by Tychonoff's theorem.
Next, take λ ∈ (0, 1] as a fixed risk-sensitivity coefficient and fix a finite time horizon T̂ > 0. Then, for each x ∈ X, t ∈ [0, T̂], and (ζ1, ζ2) ∈ Π1_Ad × Π2_Ad, define the risk-sensitive finite-horizon (T̂-horizon) cost criterion as

H^{ζ1,ζ2}(0, x) := E^{ζ1,ζ2}_x [ exp( λ ∫_0^{T̂} ∫_V ∫_U c(ξ_t, u, v) ζ1(du|ω, t) ζ2(dv|ω, t) dt + λ g(ξ_{T̂}) ) ],   (4)
provided the integral is well defined. For each (ζ1, ζ2) ∈ Π1_M × Π2_M, we know that {ξ_t, t ≥ 0} is a controlled Markov process on (Ω, F, P^{ζ1,ζ2}_γ), and hence, for any γ (initial distribution on X) and each x ∈ X, t ∈ [0, T̂],

H^{ζ1,ζ2}(t, x) := E^{ζ1,ζ2}_γ [ exp( λ ∫_t^{T̂} ∫_V ∫_U c(ξ_s, u, v) ζ1(du|ξ_s, s) ζ2(dv|ξ_s, s) ds + λ g(ξ_{T̂}) ) | ξ_t = x ],   (5)

is well defined.
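For intuition about the criterion (4), it can be estimated by plain Monte Carlo in a toy uncontrolled two-state model; all rates, costs, and the horizon below are hypothetical and serve only to illustrate the exponential-of-integral structure:

```python
import math
import random

random.seed(1)

# Hypothetical two-state chain: exit rate 1 from each state,
# cost rates c(0)=1, c(1)=2, terminal cost g = 0, lambda = 0.5.
c = {0: 1.0, 1: 2.0}
lam, T_hat = 0.5, 1.0

def run_cost(x0):
    """Integral of the cost rate along one trajectory on [0, T_hat]."""
    t, x, total = 0.0, x0, 0.0
    while True:
        hold = random.expovariate(1.0)          # exponential holding time
        if t + hold >= T_hat:
            return total + c[x] * (T_hat - t)   # last partial holding period
        total += c[x] * hold
        t += hold
        x = 1 - x                               # deterministic switch

n = 20000
est = sum(math.exp(lam * run_cost(0)) for _ in range(n)) / n
print(round(est, 2))
```

Since the cost rate lies pathwise between 1 and 2, the estimate of E_x[e^{λ ∫ c dt}] must lie between e^{0.5} and e^{1}, which is a quick sanity check on the simulation.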
We define the lower value of the game on X as

L(x) := sup_{ζ2 ∈ Π2_Ad} inf_{ζ1 ∈ Π1_Ad} H^{ζ1,ζ2}(0, x).

Similarly, define the upper value of the game on X as

U(x) := inf_{ζ1 ∈ Π1_Ad} sup_{ζ2 ∈ Π2_Ad} H^{ζ1,ζ2}(0, x).

It is easy to see that L(x) ≤ U(x) for each x ∈ X.
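The inequality L(x) ≤ U(x) is the dynamic analogue of the elementary "maximin ≤ minimax" fact for static games; a minimal illustration with a hypothetical 2×2 payoff matrix over pure strategies:

```python
# Rows: maximizer's pure actions; columns: minimizer's pure actions.
A = [[3.0, 1.0],
     [0.0, 2.0]]

# Lower value over pure strategies: max_u min_v A[u][v].
lower = max(min(row) for row in A)

# Upper value over pure strategies: min_v max_u A[u][v].
upper = min(max(A[u][v] for u in range(len(A))) for v in range(len(A[0])))

assert lower <= upper
print(lower, upper)   # -> 1.0 2.0
```

Here the two values differ (1.0 versus 2.0), showing that, without further structure (e.g., randomized strategies), the lower and upper values need not coincide.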
If L(x) = U(x) for all x ∈ X, define L(·) ≡ U(·) :≡ H(·); the function H(x) is then called the value of the game. Also, if

sup_{ζ2 ∈ Π2_M} inf_{ζ1 ∈ Π1_M} H^{ζ1,ζ2}(t, x) = inf_{ζ1 ∈ Π1_M} sup_{ζ2 ∈ Π2_M} H^{ζ1,ζ2}(t, x)   for all (t, x) ∈ [0, T̂] × X,

the common function is denoted by H(·, ·).
A strategy ζ1* ∈ Π1_Ad is called optimal for the first player if

H^{ζ1*,ζ2}(0, x) ≥ sup_{ζ2 ∈ Π2_Ad} inf_{ζ1 ∈ Π1_Ad} H^{ζ1,ζ2}(0, x) = L(x)   for all x ∈ X, ζ2 ∈ Π2_Ad.

Similarly, for the second player, a strategy ζ2* ∈ Π2_Ad is optimal if

H^{ζ1,ζ2*}(0, x) ≤ inf_{ζ1 ∈ Π1_Ad} sup_{ζ2 ∈ Π2_Ad} H^{ζ1,ζ2}(0, x) = U(x)   for all x ∈ X, ζ1 ∈ Π1_Ad.

If ζk* ∈ Πk_Ad is optimal for the kth player (k = 1, 2), then (ζ1*, ζ2*) is said to be a pair of optimal strategies. Now, for a pair of strategies (ζ1*, ζ2*), if

H^{ζ1,ζ2*}(0, x) ≤ H^{ζ1*,ζ2*}(0, x) ≤ H^{ζ1*,ζ2}(0, x)   for all ζ1 ∈ Π1_Ad, ζ2 ∈ Π2_Ad,

then (ζ1*, ζ2*) is said to be a saddle-point equilibrium, and the strategies ζ1* and ζ2* are optimal strategies for the first and second player, respectively.
3 PRELIMINARIES

To prove the existence of an optimal pair of strategies, we recall some standard results for risk-sensitive finite-horizon CTMDPs. Due to the unboundedness of the rates q(dy|x, u, v) and c(x, u, v), we impose some conditions to make the process {ξ_t, t ≥ 0} nonexplosive and to make H^{ζ1,ζ2}(0, x) finite; such conditions are widely used in CTMDPs, see Golui and Pal (2021a), Guo and Liao (2019), Guo et al. (2019), Guo and Zhang (2019) and the references therein. For bounded rates, Assumption 1 (ii)-(iii) below are not required; see Ghosh and Saha (2014), Kumar and Pal (2015).

Assumption 1. There exists a function W : X → [1, ∞) for which the following hold:

(i) The relation ∫_X W(y) q(dy|x, u, v) ≤ ρ1 W(x) + b1 holds for each (x, u, v) ∈ K, for some constants ρ1 > 0 and b1 ≥ 0;
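A drift (Lyapunov) inequality of this form can be checked concretely. The birth-death model, the action sets, and the weight function W below are hypothetical, chosen only to illustrate how the inequality in (i) is verified:

```python
# Hypothetical birth-death model on {0,...,N}: birth rate u, death rate v*x.
# Lyapunov function W(x) = x + 1; we check the drift inequality
#   sum_y W(y) q({y}|x,u,v) <= rho1*W(x) + b1.
N = 50
W = lambda x: x + 1.0

def q_row(x, u, v):
    """Conservative transition-rate row q({y}|x,u,v)."""
    row = {y: 0.0 for y in range(N + 1)}
    if x < N:
        row[x + 1] = u        # birth
    if x > 0:
        row[x - 1] = v * x    # death
    row[x] = -sum(row.values())   # conservative diagonal entry
    return row

rho1, b1 = 1.0, 2.0
drifts = [
    sum(W(y) * r for y, r in q_row(x, u, v).items())
    for x in range(N + 1) for u in (0.0, 1.0, 2.0) for v in (1.0, 2.0)
]
max_drift = max(drifts)
assert all(
    d <= rho1 * W(x) + b1
    for d, x in zip(drifts, [x for x in range(N + 1) for _ in range(6)])
)
print(max_drift)
```

For this model the drift reduces to u − v·x, so it is bounded above by the largest birth rate (here 2), and the inequality holds with ρ1 = 1 and b1 = 2.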
https://doi.org/10.17993/3cemp.2022.110250.76-92
3C Empresa. Investigación y pensamiento crítico. ISSN: 2254-3376
Ed. 50 Vol. 11 N.º 2 August - December 2022
80
Consequently, first player always tries to maximize his/her payoff, whereas second player wants
to minimize his/her payoff according to some cost measurement criterion
H·,·
(
·,·
), that is presented
below by equation (4). Next the construction of the CTMDPs will be presented under possibly pair of
admissible feedback strategies. For construction of the corresponding CTMDPs (as in Kitaev (1986),
Piunovskiy and Zhang (2011)), we imopse some usefull notations: define
X
(∆) :=
X∪{
}
(for some
/X),
0
:= (
X×
(0
,
))
, :=
0∪{
(
x0
1,x
1,···
ˆ
k,x
ˆ
k,,
,,
,···
)
|x0X,x
lX
l
(0
,
)
,
for each
1
lˆ
k, ˆ
k
1
}
, and suppose
F
be the corresponding Borel
σ
-algebra on . Then we get
a Borel measurable space (Ω
,F
). For each
ˆ
k
0,
ω
:= (
x0
1,x
1,···
ˆ
k,x
ˆ
k,···
)
,
let us define
T0
(
ω
) := 0,
Tˆ
k
(
ω
)
Tˆ
k1
(
ω
) :=
θˆ
k
,
T
(
ω
) :=
limˆ
k Tˆ
k
(
ω
). Now in view of the definition of
{Tˆ
k}
, we
define the state process {ξt}t0defined by
ξt(ω) :=
ˆ
k0
I{Tˆ
kt<Tˆ
k+1}xˆ
k+I{tT},t0.(3)
Here
IE
is the standard notation for indicator function corresponding to a set
E
, and we use 0+
z
=:
z
and 0
z
=: 0 for any
zX
(∆) as convention. The process after the time
T
is treated for absorbtion in
the state . Hence, let us define
q
(
·|
,u
,v
):
0,
U
(∆) :=
UU
,
V
(∆) :=
VV
,
U
:=
{u}
,
V
:=
{v}
,
c
(∆
, u, v
):
0for all (
u, v
)
U
(∆)
×V
(∆),
u
,
v
are treated as isolated points.
Furthermore, define
Ft
:=
σ
(
{Tˆ
ks, ξTˆ
kD}
:
D∈B
(
X
)
,
0
st, ˆ
k
0)
tR+
, and
Fs
=:
0t<s Ft
. Lastly the
σ
-algebra of predictable sets on
×
[0
,
)corresponding to
{Ft}t0
is
denoted by
P
:=
σ
(
{U×{
0
},U F0}∪{V×
(
s,
)
,V Fs}
). Now we intoduce strategies of players
to define the risk sensitive cost criterion:
Definition 1. An admissible feedback strategy for player 1, denoted by
ζ1
=
{ζ1
t}t0
, is defined to be a
transition probability
ζ1
(
du|ω, t
)from (Ω
×
[0
,
)
,P
)onto (
U,B
(
U
)), for which
ζ1
(
U
(
ξt
(
ω
))
|ω, t
)=
1.
For more informations, one can see [Guo and Song (2011), Definition 2.1, Remark 2.2], Piunovskiy
and Zhang (2011), Zhang (2017).
Let $\Pi^1_{Ad}$ denote the set of all admissible feedback strategies for player 1. A strategy $\zeta^1 \in \Pi^1_{Ad}$ for player 1 is said to be Markov if $\zeta^1(du|\omega,t) = \zeta^1(du|\xi_{t-}(\omega),t)$ for every $\omega \in \Omega$ and $t \ge 0$, where $\xi_{t-}(\omega) := \lim_{s \uparrow t} \xi_s(\omega)$. We call a Markov strategy $\{\zeta^1_t\}$ stationary for player 1 if it does not depend explicitly on the time $t$. The families of all Markov strategies and all stationary strategies for player 1 are denoted by $\Pi^1_M$ and $\Pi^1_{SM}$, respectively. The sets $\Pi^2_{Ad}$, $\Pi^2_M$, $\Pi^2_{SM}$ of all admissible feedback strategies, all Markov strategies, and all stationary strategies for player 2 are defined similarly. In view of Assumption 1 below, for any initial distribution $\gamma$ on $X$ and any multi-strategy $(\zeta^1,\zeta^2) \in \Pi^1_{Ad} \times \Pi^2_{Ad}$, Theorem 4.27 in Kitaev and Rykov (1985) yields a unique probability measure on $(\Omega,\mathcal{F})$, denoted by $P^{\zeta^1,\zeta^2}_{\gamma}$ (depending on $\gamma$ and $(\zeta^1,\zeta^2)$), under which $\xi_0$ has distribution $\gamma$. The corresponding expectation operator is denoted by $E^{\zeta^1,\zeta^2}_{\gamma}$. In particular, when $\gamma$ is the Dirac measure at a state $x \in X$, we write $P^{\zeta^1,\zeta^2}_x$ and $E^{\zeta^1,\zeta^2}_x$ for $P^{\zeta^1,\zeta^2}_{\gamma}$ and $E^{\zeta^1,\zeta^2}_{\gamma}$, respectively. For any compact metric space $Y$, the space of probability measures on $Y$ endowed with the Prohorov topology is denoted by $\mathcal{P}(Y)$. Since $U(x)$ and $V(x)$ are compact for each $x \in X$, $\mathcal{P}(U(x))$ and $\mathcal{P}(V(x))$ are also compact and convex metric spaces. Now, for each fixed $x \in X$, $\vartheta \in \mathcal{P}(U(x))$ and $\eta \in \mathcal{P}(V(x))$, the corresponding transition and payoff rates are defined as follows:
$$q(D|x,\vartheta,\eta) := \int_{V(x)}\int_{U(x)} q(D|x,u,v)\,\vartheta(du)\,\eta(dv), \quad D \in \mathcal{B}(X),$$
$$c(x,\vartheta,\eta) := \int_{V(x)}\int_{U(x)} c(x,u,v)\,\vartheta(du)\,\eta(dv).$$
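For finite action sets, the mixed-extension rates above are just bilinear forms in $(\vartheta,\eta)$. The following sketch is our own illustration, not from the paper: the array values, the function names, and the finite-state setting are all assumptions made for demonstration.

```python
import numpy as np

# Toy illustration (not the paper's example): with finite action sets
# U(x) = {0,...,m-1}, V(x) = {0,...,k-1} and finitely many target states,
# q(.|x, theta, eta) and c(x, theta, eta) are bilinear forms in (theta, eta).

def mixed_rate(Q_x, theta, eta):
    """Q_x[u, v, y] = q({y}|x, u, v); returns q({y}|x, theta, eta) for all y."""
    return np.einsum('u,uvy,v->y', theta, Q_x, eta)

def mixed_cost(C_x, theta, eta):
    """C_x[u, v] = c(x, u, v); returns c(x, theta, eta)."""
    return theta @ C_x @ eta

# 2 actions per player, 2 target states; each row of rates sums to 0
# (conservative transition rates).
Q_x = np.array([[[-1.0, 1.0], [-2.0, 2.0]],
                [[-3.0, 3.0], [-0.5, 0.5]]])
C_x = np.array([[1.0, 4.0], [2.0, 3.0]])
theta = np.array([0.5, 0.5])   # theta in P(U(x))
eta = np.array([0.25, 0.75])   # eta in P(V(x))
print(mixed_rate(Q_x, theta, eta))
print(mixed_cost(C_x, theta, eta))  # 3.0 for these made-up values
```

Since each $q(\cdot|x,u,v)$ is conservative, the mixed rate also sums to zero over states, which the bilinear form preserves automatically.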
Note that each $\zeta^1 \in \Pi^1_{SM}$ can be identified with a mapping $\zeta^1 : X \to \mathcal{P}(U)$ for which $\zeta^1(\cdot|x) \in \mathcal{P}(U(x))$ for each $x \in X$. So we can write $\Pi^1_{SM} = \prod_{x \in X} \mathcal{P}(U(x))$ and $\Pi^2_{SM} = \prod_{x \in X} \mathcal{P}(V(x))$, and hence the sets $\Pi^1_{SM}$ and $\Pi^2_{SM}$ are compact metric spaces by Tychonoff's theorem.
https://doi.org/10.17993/3cemp.2022.110250.76-92
Next, take $\lambda \in (0,1]$ as a fixed risk-sensitivity coefficient and fix a finite time horizon $\hat{T} > 0$. Then, for each $x \in X$, $t \in [0,\hat{T}]$ and $(\zeta^1,\zeta^2) \in \Pi^1_{Ad} \times \Pi^2_{Ad}$, define the risk-sensitive finite-horizon ($\hat{T}$-horizon) cost criterion as
$$H^{\zeta^1,\zeta^2}(0,x) := E^{\zeta^1,\zeta^2}_x\Big[e^{\lambda \int_0^{\hat{T}} \int_V \int_U c(\xi_t,u,v)\,\zeta^1(du|\omega,t)\,\zeta^2(dv|\omega,t)\,dt + \lambda g(\xi_{\hat{T}})}\Big], \qquad (4)$$
provided that the integral is well defined. For each $(\zeta^1,\zeta^2) \in \Pi^1_M \times \Pi^2_M$, we know that $\{\xi_t, t \ge 0\}$ is a controlled Markov process on $(\Omega,\mathcal{F},P^{\zeta^1,\zeta^2}_{\gamma})$; hence, for any initial distribution $\gamma$ on $X$ and each $x \in X$, $t \in [0,\hat{T}]$,
$$H^{\zeta^1,\zeta^2}(t,x) := E^{\zeta^1,\zeta^2}_{\gamma}\Big[e^{\lambda \int_t^{\hat{T}} \int_V \int_U c(\xi_s,u,v)\,\zeta^1(du|\xi_s,s)\,\zeta^2(dv|\xi_s,s)\,ds + \lambda g(\xi_{\hat{T}})}\ \Big|\ \xi_t = x\Big] \qquad (5)$$
is well defined.
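To make the criterion (4) concrete, the following Monte Carlo sketch is a toy of our own: it assumes a two-state chain whose rates `RATE`, running costs `COST`, terminal costs `G`, and fixed strategies (folded into those rates) are all made up, and estimates $H^{\zeta^1,\zeta^2}(0,x) = E_x\big[e^{\lambda\int_0^{\hat T} c\,dt + \lambda g(\xi_{\hat T})}\big]$ by simulating holding times and jumps.

```python
import math, random

# Hedged toy sketch (not the paper's example): a 2-state CTMC under fixed
# stationary randomized strategies, estimating the risk-sensitive cost
#   H(0, x) = E_x[ exp( lam * int_0^T c(xi_t) dt + lam * g(xi_T) ) ]
# by Monte Carlo. All rates and costs below are invented for illustration.

RATE = {0: 1.0, 1: 2.0}   # total jump rate q(x) out of state x
COST = {0: 0.5, 1: 1.0}   # running cost c(x) under the fixed strategies
G = {0: 0.0, 1: 0.2}      # terminal cost g(x)

def sample_cost(x, lam, horizon, rng):
    t, integral = 0.0, 0.0
    while True:
        hold = rng.expovariate(RATE[x])  # exponential holding time in x
        if t + hold >= horizon:
            integral += COST[x] * (horizon - t)
            break                        # no more jumps before the horizon
        integral += COST[x] * hold
        t += hold
        x = 1 - x                        # deterministic jump to the other state
    return math.exp(lam * integral + lam * G[x])

def estimate_H(x, lam=0.5, horizon=1.0, n=20000, seed=0):
    rng = random.Random(seed)
    return sum(sample_cost(x, lam, horizon, rng) for _ in range(n)) / n

print(estimate_H(0))
```

Every sample lies between $e^{\lambda \cdot 0.5}$ and $e^{\lambda(1.0 + 0.2)}$ here, mirroring the $W$-bounds that Lemma 1 below establishes in the general unbounded case.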
We define the lower value of the game on $X$ as $L(x) := \sup_{\zeta^2 \in \Pi^2_{Ad}} \inf_{\zeta^1 \in \Pi^1_{Ad}} H^{\zeta^1,\zeta^2}(0,x)$. Similarly, we define the upper value of the game on $X$ as $U(x) := \inf_{\zeta^1 \in \Pi^1_{Ad}} \sup_{\zeta^2 \in \Pi^2_{Ad}} H^{\zeta^1,\zeta^2}(0,x)$. It is easy to see that $L(x) \le U(x)$ for each $x \in X$.
If $L(x) = U(x)$ for all $x \in X$, we define $L(\cdot) \equiv U(\cdot) :\equiv H(\cdot)$, and then the function $H(x)$ is called the value of the game. Also, if
$$\sup_{\zeta^2 \in \Pi^2_M} \inf_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2}(t,x) = \inf_{\zeta^1 \in \Pi^1_M} \sup_{\zeta^2 \in \Pi^2_M} H^{\zeta^1,\zeta^2}(t,x) \quad \forall (t,x) \in [0,\hat{T}] \times X,$$
the common function is denoted by $H(\cdot,\cdot)$.
A strategy $\zeta^{1*} \in \Pi^1_{Ad}$ is called optimal for player 1 if
$$H^{\zeta^{1*},\zeta^2}(0,x) \le \sup_{\zeta^2 \in \Pi^2_{Ad}} \inf_{\zeta^1 \in \Pi^1_{Ad}} H^{\zeta^1,\zeta^2}(0,x) = L(x) \quad \forall x \in X,\ \zeta^2 \in \Pi^2_{Ad}.$$
Similarly, a strategy $\zeta^{2*} \in \Pi^2_{Ad}$ is optimal for player 2 if
$$H^{\zeta^1,\zeta^{2*}}(0,x) \ge \inf_{\zeta^1 \in \Pi^1_{Ad}} \sup_{\zeta^2 \in \Pi^2_{Ad}} H^{\zeta^1,\zeta^2}(0,x) = U(x) \quad \forall x \in X,\ \zeta^1 \in \Pi^1_{Ad}.$$
If $\zeta^{k*} \in \Pi^k_{Ad}$ is optimal for player $k$ ($k = 1,2$), then $(\zeta^{1*},\zeta^{2*})$ is said to be a pair of optimal strategies. Now, if a pair of strategies $(\zeta^{1*},\zeta^{2*})$ satisfies
$$H^{\zeta^{1*},\zeta^2}(0,x) \le H^{\zeta^{1*},\zeta^{2*}}(0,x) \le H^{\zeta^1,\zeta^{2*}}(0,x) \quad \forall \zeta^1 \in \Pi^1_{Ad},\ \zeta^2 \in \Pi^2_{Ad},$$
then $(\zeta^{1*},\zeta^{2*})$ is said to be a saddle-point equilibrium, and then the strategies $\zeta^{1*}$ and $\zeta^{2*}$ are optimal for player 1 and player 2, respectively.
3 PRELIMINARIES
To prove the existence of an optimal pair of strategies, we recall some standard results for risk-sensitive finite-horizon CTMDPs. Owing to the unboundedness of the rates $q(dy|x,u,v)$ and $c(x,u,v)$, we impose some conditions that make the process $\{\xi_t, t \ge 0\}$ nonexplosive and $H^{\zeta^1,\zeta^2}(0,x)$ finite; such conditions have been used extensively in CTMDPs, see Golui and Pal (2021a), Guo and Liao (2019), Guo et al. (2019), Guo and Zhang (2019) and the references therein. For bounded rates, Assumption 1 (ii)-(iii) below is not required; see Ghosh and Saha (2014), Kumar and Pal (2015).
Assumption 1. There exists a function $W : X \to [1,\infty)$ for which the following hold:
(i) $\int_X W(y)\,q(dy|x,u,v) \le \rho_1 W(x) + b_1$ for each $(x,u,v) \in K$, for some constants $\rho_1 > 0$, $b_1 \ge 0$;
(ii) $q(x) \le M_1 W(x)$ for all $x \in X$, for some constant $M_1 \ge 1$, where $q(x)$ is as in (2.2);
(iii) $e^{2(\hat{T}+1)\lambda|c(x,u,v)|} \le M_2 W(x)$ for any $(x,u,v) \in K$, and $e^{2(\hat{T}+1)\lambda|g(x)|} \le M_2 W(x)$ for each $x \in X$, for some constant $M_2 \ge 1$.
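To see the drift condition of Assumption 1(i) in action, consider the following hedged illustration (ours, not the paper's example): a controlled birth-death process on $X = \{0,1,2,\dots\}$ with birth rate $a(u) \le a_{\max}$, death rate $d(v)$, and Lyapunov function $W(x) = x+1$, with conservative rates $q(\{x+1\}|x,u,v) = a(u)$, $q(\{x-1\}|x,u,v) = d(v)\,\mathbf{1}_{\{x \ge 1\}}$ and diagonal entry $-(a(u) + d(v)\,\mathbf{1}_{\{x \ge 1\}})$:

```latex
\[
\int_X W(y)\, q(dy \mid x,u,v)
  = a(u)\,[W(x{+}1)-W(x)] + d(v)\,\mathbf{1}_{\{x\ge 1\}}\,[W(x{-}1)-W(x)]
  = a(u) - d(v)\,\mathbf{1}_{\{x\ge 1\}}
  \le a_{\max}
  \le \rho_1 W(x) + b_1 .
\]
```

Thus Assumption 1(i) holds with any $\rho_1 > 0$ and $b_1 = a_{\max}$, since $W \ge 1$; conditions (ii)-(iii) then constrain how fast $q(x)$, $c$ and $g$ may grow relative to this $W$.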
The non-explosion of the state process $\{\xi_t, t \ge 0\}$ and the finiteness of $H^{\zeta^1,\zeta^2}(0,x)$ are shown in the following lemma, which also bounds $H^{\zeta^1,\zeta^2}(0,x)$ from above and below in terms of the function $W$.
Lemma 1. Grant Assumption 1. Then, for each $(\zeta^1,\zeta^2) \in \Pi^1_{Ad} \times \Pi^2_{Ad}$, we obtain the following results.
(a) $P^{\zeta^1,\zeta^2}_x(T_{\infty} = \infty) = 1$, $P^{\zeta^1,\zeta^2}_x(\xi_t \in X) = 1$, and $P^{\zeta^1,\zeta^2}_x(\xi_0 = x) = 1$ for each $t \ge 0$ and $x \in X$.
(b) (b1) $e^{-L_1 W(x)} \le H^{\zeta^1,\zeta^2}(0,x) \le L_1 W(x)$ for $x \in X$ and $(\zeta^1,\zeta^2) \in \Pi^1_{Ad} \times \Pi^2_{Ad}$, where $L_1 := M_2 e^{\rho_1 \hat{T}}\big(1 + \frac{b_1}{\rho_1}\big)$.
(b2) $e^{-L_1 W(x)} \le H^{\zeta^1,\zeta^2}(t,x) \le L_1 W(x)$ for $(t,x) \in [0,\hat{T}] \times X$ and $(\zeta^1,\zeta^2) \in \Pi^1_M \times \Pi^2_M$.
Proof. These results can be proved using Guo et al. (2019), Lemma 3.1 and Guo and Zhang (2019), Lemma 3.1.
In order to apply the extended Feynman-Kac formula with unbounded rates, we impose the following assumption; if the rates are bounded, it is not required, see Ghosh and Saha (2014).
Assumption 2. There exists a $[1,\infty)$-valued function $W_1$ on $X$ such that
(i) $\int_X W_1^2(y)\,q(dy|x,u,v) \le \rho_2 W_1^2(x) + b_2$ for each $(x,u,v) \in K$, for some constants $\rho_2 > 0$ and $b_2 > 0$;
(ii) $W^2(x) \le M_3 W_1(x)$ for all $x \in X$, for some constant $M_3 \ge 1$, where the function $W$ is as in Assumption 1.
In addition to Assumptions 1 and 2, we impose the following conditions to guarantee the existence of a pair of optimal strategies.
Assumption 3. (i) The cost and transition rate functions $c(x,u,v)$ and $q(\cdot|x,u,v)$ are continuous on $U(x) \times V(x)$ for each fixed $x \in X$.
(ii) The integrals $\int_X f(y)\,q(dy|x,u,v)$ and $\int_X W(y)\,q(dy|x,u,v)$ are continuous on $U(x) \times V(x)$ for each fixed $x \in X$, for all bounded measurable functions $f$ on $X$ and for $W$ as in Assumption 1.
We next introduce some useful notation. Let $A_c(\Omega \times [0,\hat{T}] \times X)$ denote the space of all real-valued, $\mathcal{P} \times \mathcal{B}(X)$-measurable functions $\varphi(\omega,t,x)$ that are differentiable in $t \in [0,\hat{T}]$ a.e.; that is, $A_c(\Omega \times [0,\hat{T}] \times X)$ consists of those measurable functions $\varphi$ with the following property: given any $x \in X$, $(\zeta^1,\zeta^2) \in \Pi^1_{Ad} \times \Pi^2_{Ad}$, and a.s. $\omega$, there exists a Borel subset $E_{(\varphi,\omega,x,\zeta^1,\zeta^2)}$ of $[0,\hat{T}]$ (depending on $\varphi, x, \zeta^1, \zeta^2$) such that $\frac{\partial \varphi}{\partial t}$ (the partial derivative with respect to the time $t \in [0,\hat{T}]$) exists for every $t \in E_{(\varphi,\omega,x,\zeta^1,\zeta^2)}$ and $m_L(E^c_{(\varphi,\omega,x,\zeta^1,\zeta^2)}) = 0$, where $m_L$ is the Lebesgue measure on $\mathbb{R}$. If $\frac{\partial \varphi}{\partial t}(\omega,t,x)$ does not exist for some $(\omega,t,x) \in \Omega \times [0,\hat{T}] \times X$, we assign it an arbitrary real value, so that $\frac{\partial \varphi}{\partial t}(\cdot,\cdot,\cdot)$ is defined on all of $\Omega \times [0,\hat{T}] \times X$. For any given function $W' \ge 1$ on $X$, a real-valued function $f$ on $\Omega \times [0,\hat{T}] \times X$ is said to be $W'$-bounded if $\|f\|_{W'} := \sup_{(\omega,t,x) \in \Omega \times [0,\hat{T}] \times X} \frac{|f(\omega,t,x)|}{W'(x)} < \infty$. The Banach space of all $W'$-bounded functions is denoted by $B_{W'}(\Omega \times [0,\hat{T}] \times X)$. Note that if $W' \equiv 1$, then $B_1(\Omega \times [0,\hat{T}] \times X)$ is the space of all bounded functions on $\Omega \times [0,\hat{T}] \times X$.
Now define $C^1_{W_0,W_1}(\Omega \times [0,\hat{T}] \times X) := \{\psi \in B_{W_0}(\Omega \times [0,\hat{T}] \times X) \cap A_c(\Omega \times [0,\hat{T}] \times X) : \frac{\partial \psi}{\partial t} \in B_{W_1}(\Omega \times [0,\hat{T}] \times X)\}$. If a function $\psi(\omega,t,x) \in C^1_{W_0,W_1}(\Omega \times [0,\hat{T}] \times X)$ does not depend on $\omega$, we write it as $\psi(t,x)$, and the corresponding space is $C^1_{W_0,W_1}([0,\hat{T}] \times X)$.
In the next theorem, we state the extended Feynman-Kac formula, which will be very useful for us.
Theorem 1. Grant Assumptions 1 and 2.
(a) For each $x \in X$, $(\zeta^1,\zeta^2) \in \Pi^1_{Ad} \times \Pi^2_{Ad}$ and $\psi \in C^1_{W,W_1}(\Omega \times [0,\hat{T}] \times X)$,
$$E^{\zeta^1,\zeta^2}_x\Big[\int_0^{\hat{T}}\Big(\frac{\partial \psi}{\partial t}(\omega,t,\xi_t) + \int_X \psi(\omega,t,y)\int_V\int_U q(dy|\xi_t,u,v)\,\zeta^1(du|\omega,t)\,\zeta^2(dv|\omega,t)\Big)dt\Big] = E^{\zeta^1,\zeta^2}_x[\psi(\omega,\hat{T},\xi_{\hat{T}})] - E^{\zeta^1,\zeta^2}_x[\psi(\omega,0,x)].$$
Note that since $(\zeta^1,\zeta^2) \in \Pi^1_{Ad} \times \Pi^2_{Ad}$ may depend on histories, $\{\xi_t, t \ge 0\}$ may fail to be Markovian.
(b) For each $x \in X$, $(\zeta^1,\zeta^2) \in \Pi^1_M \times \Pi^2_M$ and $\psi \in C^1_{W,W_1}([0,\hat{T}] \times X)$,
$$E^{\zeta^1,\zeta^2}_{\gamma}\Big[\int_s^{\hat{T}} e^{\int_s^t \lambda c(\xi_\beta,\zeta^1_\beta,\zeta^2_\beta)\,d\beta}\Big(\frac{\partial \psi}{\partial t}(t,\xi_t) + \lambda c(\xi_t,\zeta^1_t,\zeta^2_t)\,\psi(t,\xi_t) + \int_X \psi(t,y)\,q(dy|\xi_t,\zeta^1_t,\zeta^2_t)\Big)dt\ \Big|\ \xi_s = x\Big]$$
$$= E^{\zeta^1,\zeta^2}_{\gamma}\Big[e^{\int_s^{\hat{T}} \lambda c(\xi_\beta,\zeta^1_\beta,\zeta^2_\beta)\,d\beta}\,\psi(\hat{T},\xi_{\hat{T}})\ \Big|\ \xi_s = x\Big] - \psi(s,x).$$
Proof. See Guo and Zhang (2019), Theorem 3.1.
Next, we present a theorem showing that the solutions of the optimality (Shapley) equations admit unique probabilistic representations. In Section 4, we also illustrate how this verification theorem can be used to determine the game's value.
Theorem 2. Assume that Assumptions 1 and 2 are true. If there exist a function $\psi \in C^1_{W,W_1}([0,\hat{T}] \times X)$ and a pair of stationary strategies $(\zeta^{1*},\zeta^{2*}) \in \Pi^1_{SM} \times \Pi^2_{SM}$ for which
$$\begin{aligned}
\psi(s,x) - e^{\lambda g(x)} &= E_1 := \int_s^{\hat{T}} \sup_{\vartheta \in \mathcal{P}(U(x))} \inf_{\eta \in \mathcal{P}(V(x))} \Big[\lambda c(x,\vartheta,\eta)\,\psi(t,x) + \int_X \psi(t,y)\,q(dy|x,\vartheta,\eta)\Big]\,dt\\
&= E_2 := \int_s^{\hat{T}} \inf_{\eta \in \mathcal{P}(V(x))} \sup_{\vartheta \in \mathcal{P}(U(x))} \Big[\lambda c(x,\vartheta,\eta)\,\psi(t,x) + \int_X \psi(t,y)\,q(dy|x,\vartheta,\eta)\Big]\,dt\\
&= \int_s^{\hat{T}} \inf_{\eta \in \mathcal{P}(V(x))} \Big[\lambda c(x,\zeta^{1*}(\cdot|x,t),\eta)\,\psi(t,x) + \int_X \psi(t,y)\,q(dy|x,\zeta^{1*}(\cdot|x,t),\eta)\Big]\,dt\\
&= \int_s^{\hat{T}} \sup_{\vartheta \in \mathcal{P}(U(x))} \Big[\lambda c(x,\vartheta,\zeta^{2*}(\cdot|x,t))\,\psi(t,x) + \int_X \psi(t,y)\,q(dy|x,\vartheta,\zeta^{2*}(\cdot|x,t))\Big]\,dt
\end{aligned} \quad \forall s \in [0,\hat{T}],\ x \in X, \qquad (6)$$
then
(a)
$$\psi(0,x) = \sup_{\zeta^1 \in \Pi^1_{Ad}} \inf_{\zeta^2 \in \Pi^2_{Ad}} H^{\zeta^1,\zeta^2}(0,x) = \inf_{\zeta^2 \in \Pi^2_{Ad}} \sup_{\zeta^1 \in \Pi^1_{Ad}} H^{\zeta^1,\zeta^2}(0,x) = \inf_{\zeta^2 \in \Pi^2_{Ad}} H^{\zeta^{1*},\zeta^2}(0,x) = \sup_{\zeta^1 \in \Pi^1_{Ad}} H^{\zeta^1,\zeta^{2*}}(0,x), \quad \forall x \in X \qquad (7)$$
and
(b)
$$\psi(t,x) = \sup_{\zeta^1 \in \Pi^1_M} \inf_{\zeta^2 \in \Pi^2_M} H^{\zeta^1,\zeta^2}(t,x) = \inf_{\zeta^2 \in \Pi^2_M} \sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2}(t,x) = \inf_{\zeta^2 \in \Pi^2_M} H^{\zeta^{1*},\zeta^2}(t,x) = \sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^{2*}}(t,x) = H(t,x), \quad \forall t \in [0,\hat{T}],\ x \in X. \qquad (8)$$
Proof. (a) See Golui and Pal (2021a), Corollary 3.1.
(b) The proof follows from part (a).
4 THE EXISTENCE OF AN OPTIMAL SOLUTION AND A SADDLE-POINT EQUILIBRIUM
This section proves that the optimality equation (6) has a solution in the space $C^1_{W,W_1}([0,\hat{T}] \times X)$. Furthermore, we use the optimality equation (6) to prove the existence of a saddle-point equilibrium.
The next proposition shows that the optimality equation (6) has a solution when the rates are bounded.
Proposition 1. Suppose Assumption 3 holds. Also assume that $\|q\| < \infty$, $\|c\| < \infty$, $\|g\| < \infty$, $c(x,u,v) \ge 0$ and $g(x) \ge 0$ for all $(x,u,v) \in K$. Then the following results are true.
(a) There exists a bounded function $\psi \in B_1([0,\hat{T}] \times X)$ satisfying the first two equations ($E_1$ and $E_2$) of (6).
(b) There exists a pair of strategies $(\zeta^{1*},\zeta^{2*}) \in \Pi^1_{SM} \times \Pi^2_{SM}$ satisfying equations (6), (7) and (8), and hence this pair forms a saddle-point equilibrium.
(c) $H(t,x)$ (and so $\psi(t,x)$) is non-increasing in $t \in [0,\hat{T}]$ for each fixed $x \in X$.
Proof. (a) By Wei (2017), Theorem 4.1, there exists $\psi \in B_1([0,\hat{T}] \times X)$ satisfying the first two equations ($E_1$ and $E_2$) of (6).
(b) In view of the measurable selection theorem in Nowak (1985), we get the existence of $(\zeta^{1*},\zeta^{2*}) \in \Pi^1_{SM} \times \Pi^2_{SM}$ for which (6) holds. So, by Theorem 2, we get
$$\sup_{\zeta^1 \in \Pi^1_{Ad}} \inf_{\zeta^2 \in \Pi^2_{Ad}} H^{\zeta^1,\zeta^2}(0,x) = \inf_{\zeta^2 \in \Pi^2_{Ad}} \sup_{\zeta^1 \in \Pi^1_{Ad}} H^{\zeta^1,\zeta^2}(0,x) = \sup_{\zeta^1 \in \Pi^1_{Ad}} H^{\zeta^1,\zeta^{2*}}(0,x) = \inf_{\zeta^2 \in \Pi^2_{Ad}} H^{\zeta^{1*},\zeta^2}(0,x) = \psi(0,x) \qquad (9)$$
and
$$\sup_{\zeta^1 \in \Pi^1_M} \inf_{\zeta^2 \in \Pi^2_M} H^{\zeta^1,\zeta^2}(t,x) = \inf_{\zeta^2 \in \Pi^2_M} \sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2}(t,x) = \sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^{2*}}(t,x) = \inf_{\zeta^2 \in \Pi^2_M} H^{\zeta^{1*},\zeta^2}(t,x) = H(t,x) = \psi(t,x). \qquad (10)$$
Thus the game's value exists, and $(\zeta^{1*},\zeta^{2*}) \in \Pi^1_{SM} \times \Pi^2_{SM}$ forms a saddle-point equilibrium.
(c) First fix any $s,t \in [0,\hat{T}]$ with $s < t$, and fix any $(\zeta^1,\zeta^2) \in \Pi^1_M \times \Pi^2_M$. Now, for each $x \in X$, define a Markov strategy corresponding to $\zeta^1 \in \Pi^1_M$ as
$$\zeta^1_{s,t}(du|x,\beta) := \begin{cases} \zeta^1(du|x,\beta + t - s) & \text{if } \beta \ge s,\\ \zeta^1(du|x,\beta) & \text{otherwise.} \end{cases} \qquad (11)$$
Similarly, for each $\zeta^2 \in \Pi^2_M$, we define $\zeta^2_{s,t}$.
Then, for each $\beta \in [s, s + \hat{T} - t]$ and $x \in X$,
$$q(dy|x,\zeta^1_{s,t}(du|x,\beta),\zeta^2_{s,t}(dv|x,\beta)) = q(dy|x,\zeta^1(du|x,\beta+t-s),\zeta^2(dv|x,\beta+t-s)),$$
$$c(x,\zeta^1_{s,t}(du|x,\beta),\zeta^2_{s,t}(dv|x,\beta)) = c(x,\zeta^1(du|x,\beta+t-s),\zeta^2(dv|x,\beta+t-s)).$$
Next define
$$H^{\zeta^1,\zeta^2}(s \to t, x) := E^{\zeta^1,\zeta^2}_{\gamma}\Big[e^{\lambda \int_s^t c(\xi_\beta,\zeta^1(du|\xi_\beta,\beta),\zeta^2(dv|\xi_\beta,\beta))\,d\beta + \lambda g(\xi_t)}\ \Big|\ \xi_s = x\Big], \qquad (12)$$
$$H(s \to t, x) := \inf_{\zeta^2 \in \Pi^2_M} \sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2}(s \to t, x). \qquad (13)$$
Now, in view of the Markov property of $\{\xi_t, t \ge 0\}$ under any $(\zeta^1,\zeta^2) \in \Pi^1_M \times \Pi^2_M$ and (11)-(13), we have $H^{\zeta^1,\zeta^2}(t \to \hat{T}, x) = H^{\zeta^1_{s,t},\zeta^2_{s,t}}(s \to \hat{T} + s - t, x)$.
It can easily be shown that $\sup_{\zeta^1_{s,t} \in \Pi^1_M} H^{\zeta^1_{s,t},\zeta^2_{s,t}}(s \to \hat{T}+s-t, x) \le \sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2}(t \to \hat{T}, x)$ and $\sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2}(t \to \hat{T}, x) \le \sup_{\zeta^1_{s,t} \in \Pi^1_M} H^{\zeta^1_{s,t},\zeta^2_{s,t}}(s \to \hat{T}+s-t, x)$ for all $\zeta^2 \in \Pi^2_M$. Hence $\sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2}(t \to \hat{T}, x) = \sup_{\zeta^1_{s,t} \in \Pi^1_M} H^{\zeta^1_{s,t},\zeta^2_{s,t}}(s \to \hat{T}+s-t, x)$ for all $\zeta^2 \in \Pi^2_M$. Similarly, we can show that $H(t \to \hat{T}, x) = H(s \to \hat{T}+s-t, x)$. Now, since $c(x,u,v) \ge 0$ on $K$, by (13) and $t > s$ we have $H(t \to \hat{T}, x) = H(s \to \hat{T}+s-t, x) \le H(s \to \hat{T}, x)$. But by (10), (12) and (13), we have $H(t \to \hat{T}, x) = H(t,x)$. Hence we obtain $H(s,x) \ge H(t,x)$, i.e., $H(t,x)$ is non-increasing in $t$. Now, from part (b), we have $H(t,x) = \psi(t,x)$; hence $\psi(t,x)$ is also non-increasing in $t$.
Theorem 3. Suppose Assumptions 1, 2 and 3 hold, and suppose in addition that $c(x,u,v) \ge 0$ and $g(x) \ge 0$ for all $(x,u,v) \in K$. Then there exist a unique $\psi \in C^1_{W,W_1}([0,\hat{T}] \times X)$ and some pair of strategies $(\zeta^{1*},\zeta^{2*}) \in \Pi^1_{SM} \times \Pi^2_{SM}$ satisfying equations (6), (7) and (8), and hence this pair is a saddle-point equilibrium.
Proof. First observe that $1 \le e^{2(\hat{T}+1)\lambda c(x,u,v)} \le M_2 W(x)$ and $1 \le e^{2(\hat{T}+1)\lambda g(x)} \le M_2 W(x)$. For each integer $n \ge 1$, define $X_n := \{x \in X \mid W(x) \le n\}$, $U_n(x) := U(x)$ and $V_n(x) := V(x)$ for $x \in X$. Also, for each $(x,u,v) \in K_n := \{(x,u,v) : x \in X,\ u \in U_n(x),\ v \in V_n(x)\}$, define
$$q_n(dy|x,u,v) := \begin{cases} q(dy|x,u,v) & \text{if } x \in X_n,\\ 0 & \text{if } x \notin X_n, \end{cases} \qquad (14)$$
$$c^+_n(x,u,v) := \begin{cases} c(x,u,v) \wedge \min\Big\{n,\ \frac{1}{\lambda(\hat{T}+1)} \ln\big(M_2 W(x)\big)\Big\} & \text{if } x \in X_n,\\ 0 & \text{if } x \notin X_n, \end{cases} \qquad (15)$$
and
$$g^+_n(x) := \begin{cases} g(x) \wedge \min\Big\{n,\ \frac{1}{\lambda(\hat{T}+1)} \ln\big(M_2 W(x)\big)\Big\} & \text{if } x \in X_n,\\ 0 & \text{if } x \notin X_n. \end{cases} \qquad (16)$$
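The truncations (15)-(16) can be sketched in code. The following is our own hedged illustration (the function and parameter names are ours; the paper defines the operators only abstractly): it clips the cost at the level $\min\{n,\ \ln(M_2 W(x))/(\lambda(\hat{T}+1))\}$ on the sublevel set $X_n = \{W \le n\}$ and zeroes it outside.

```python
import math

# Hedged sketch (ours) of the truncations (15)-(16): given the Lyapunov
# function value W(x), constants M2 >= 1, lam in (0,1], horizon T, clip the
# running cost c and terminal cost g on the sublevel set X_n = {W <= n}.

def c_plus_n(c_val, W_x, n, M2, lam, T):
    """c_n^+(x,u,v): clip c at min{n, ln(M2 W(x)) / (lam (T+1))} on X_n, else 0."""
    if W_x > n:          # x outside X_n
        return 0.0
    cap = min(n, math.log(M2 * W_x) / (lam * (T + 1)))
    return min(c_val, cap)

def g_plus_n(g_val, W_x, n, M2, lam, T):
    """g_n^+(x): same clipping applied to the terminal cost."""
    if W_x > n:
        return 0.0
    cap = min(n, math.log(M2 * W_x) / (lam * (T + 1)))
    return min(g_val, cap)

# A large cost gets capped at (1 + ln 2)/2 ~ 0.847 here; points outside
# X_5 are zeroed.
print(c_plus_n(10.0, W_x=2.0, n=5, M2=math.e, lam=1.0, T=1.0))
print(c_plus_n(10.0, W_x=6.0, n=5, M2=math.e, lam=1.0, T=1.0))
```

The cap is exactly the bound that Assumption 1(iii) places on $|c|$ up to a factor of 2, which is what keeps $e^{2\lambda(\hat{T}+1)c^+_n} \le M_2 W$ in the estimates that follow.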
By (14), $q_n(dy|x,u,v)$ is obviously a transition rate on $X$ satisfying the conservative and stable conditions. Now consider the sequence of CTMDP models with bounded rates $G^+_n := \{X, U, V, (U_n(x), V_n(x), x \in X), c^+_n, g^+_n, q_n\}$. Fix an $n$. Corresponding to a pair of Markov strategies $(\zeta^1,\zeta^2) \in \Pi^1_M \times \Pi^2_M$, let $H^{\zeta^1,\zeta^2}_n(t,x)$ denote the risk-sensitive cost criterion for this model, with value function
$$H_n(t,x) := \sup_{\zeta^1 \in \Pi^1_M} \inf_{\zeta^2 \in \Pi^2_M} H^{\zeta^1,\zeta^2}_n(t,x).$$
Then, by Proposition 1, for each $n \ge 1$ we get a unique $\psi_n$ in $C^1_{1,1}([0,\hat{T}] \times X)$ and $(\zeta^1_n,\zeta^2_n) \in \Pi^1_{SM} \times \Pi^2_{SM}$ satisfying
$$\begin{aligned}
\psi_n(s,x) - e^{\lambda g^+_n(x)} &= \int_s^{\hat{T}} \inf_{\eta \in \mathcal{P}(V(x))} \Big[\lambda c^+_n(x,\zeta^1_n(\cdot|x,t),\eta)\,\psi_n(t,x) + \int_X \psi_n(t,y)\,q_n(dy|x,\zeta^1_n(\cdot|x,t),\eta)\Big]\,dt\\
&= \int_s^{\hat{T}} \sup_{\vartheta \in \mathcal{P}(U(x))} \Big[\lambda c^+_n(x,\vartheta,\zeta^2_n(\cdot|x,t))\,\psi_n(t,x) + \int_X \psi_n(t,y)\,q_n(dy|x,\vartheta,\zeta^2_n(\cdot|x,t))\Big]\,dt
\end{aligned} \quad \forall s \in [0,\hat{T}],\ x \in X. \qquad (17)$$
Now, $e^{2\lambda(\hat{T}+1)c^+_n(x,u,v)} \le M_2 W(x)$, $e^{2\lambda(\hat{T}+1)g^+_n(x)} \le M_2 W(x)$ and $\psi_n(\hat{T},x) = e^{\lambda g^+_n(x)}$. Hence, by Lemma 1, Theorem 2 and (17), we have
$$e^{-L_1 W(x)} \le \psi_n(t,x) = \sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2_n}_n(t,x) \le L_1 W(x) \quad \forall n \ge 1. \qquad (18)$$
Moreover, since $\psi_n(t,x) \ge 0$, $c^+_{n-1}(x,u,v) \le c^+_n(x,u,v)$, and $g^+_{n-1}(x) \le g^+_n(x)$ for all $(x,u,v) \in K$, using (14), (15), (17) and Proposition 1, for all $x \in X$ and a.e. $t$ we obtain
$$\frac{\partial \psi_n}{\partial t}(t,x) + \lambda c^+_{n-1}(x,\vartheta,\zeta^2_n(\cdot|x,t))\,\psi_n(t,x) + \int_X \psi_n(t,y)\,q_{n-1}(dy|x,\vartheta,\zeta^2_n(\cdot|x,t)) \le 0 \quad \text{if } x \in X_{n-1} \qquad (19)$$
and
$$\frac{\partial \psi_n}{\partial t}(t,x) + \lambda c^+_{n-1}(x,\vartheta,\zeta^2_n(\cdot|x,t))\,\psi_n(t,x) + \int_X \psi_n(t,y)\,q_{n-1}(dy|x,\vartheta,\zeta^2_n(\cdot|x,t)) = \frac{\partial \psi_n}{\partial t}(t,x) \le 0 \quad \text{if } x \notin X_{n-1} \qquad (20)$$
(for details, see Golui and Pal (2021b), Theorem 4.1, p. 24). So, for any $\zeta^1 \in \Pi^1_M$, by the Feynman-Kac formula (proved similarly to Theorem 2), we get
$$H^{\zeta^1,\zeta^2_n}_{n-1}(t,x) \le \psi_n(t,x).$$
Since $\zeta^1 \in \Pi^1_M$ is arbitrary,
$$\inf_{\zeta^2 \in \Pi^2_M} \sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2}_{n-1}(t,x) \le \sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2_n}_{n-1}(t,x) \le \psi_n(t,x). \qquad (21)$$
Also, using (17) and the Feynman-Kac formula (proved similarly to Theorem 2), we have
$$\sup_{\zeta^1 \in \Pi^1_M} \inf_{\zeta^2 \in \Pi^2_M} H^{\zeta^1,\zeta^2}_{n-1}(t,x) = \inf_{\zeta^2 \in \Pi^2_M} \sup_{\zeta^1 \in \Pi^1_M} H^{\zeta^1,\zeta^2}_{n-1}(t,x) = \psi_{n-1}(t,x). \qquad (22)$$
From (21) and (22), we obtain $\psi_{n-1}(t,x) \le \psi_n(t,x)$. Also, since $\psi_n$ has an upper bound, $\lim_{n\to\infty} \psi_n$ exists. Let
$$\lim_{n\to\infty} \psi_n(t,x) := \psi(t,x) \quad \forall t \in [0,\hat{T}],\ x \in X. \qquad (23)$$
Next, by Lemma 1, we get
$$|\psi(t,x)| \le L_1 W(x) \quad \forall t \in [0,\hat{T}]. \qquad (24)$$
Let
$$I_n(t,x) := \sup_{\vartheta \in \mathcal{P}(U_n(x))} \inf_{\eta \in \mathcal{P}(V_n(x))} \Big[\lambda c^+_n(x,\vartheta,\eta)\,\psi_n(t,x) + \int_X \psi_n(t,y)\,q_n(dy|x,\vartheta,\eta)\Big], \quad t \in [0,\hat{T}],\ x \in X.$$
Then, applying Fan's minimax theorem (Fan, 1953), we obtain
$$I_n(t,x) = \inf_{\eta \in \mathcal{P}(V_n(x))} \sup_{\vartheta \in \mathcal{P}(U_n(x))} \Big[\lambda c^+_n(x,\vartheta,\eta)\,\psi_n(t,x) + \int_X \psi_n(t,y)\,q_n(dy|x,\vartheta,\eta)\Big], \quad t \in [0,\hat{T}],\ x \in X.$$
Then, by Assumptions 1 and 2 and the fact that $\lambda \le 1$, we get
$$|I_n(t,x)| \le L_1 M_2 W^2(x) + (b_1+\rho_1)W^2(x) + 2M_1 W^2(x) \le L_1 M_3 W_1(x)\,(M_2 + b_1 + \rho_1 + 2M_1) =: R(x), \quad (t,x) \in [0,\hat{T}] \times X. \qquad (25)$$
Let
$$I(t,x) := \sup_{\vartheta \in \mathcal{P}(U(x))} \inf_{\eta \in \mathcal{P}(V(x))} \Big[\lambda c(x,\vartheta,\eta)\,\psi(t,x) + \int_X \psi(t,y)\,q(dy|x,\vartheta,\eta)\Big], \quad t \in [0,\hat{T}],\ x \in X.$$
Hence, in view of Fan's minimax theorem (Fan, 1953), we obtain
$$I(t,x) = \inf_{\eta \in \mathcal{P}(V(x))} \sup_{\vartheta \in \mathcal{P}(U(x))} \Big[\lambda c(x,\vartheta,\eta)\,\psi(t,x) + \int_X \psi(t,y)\,q(dy|x,\vartheta,\eta)\Big], \quad t \in [0,\hat{T}],\ x \in X.$$
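When $U(x)$ and $V(x)$ are finite, the inner sup-inf defining $I(t,x)$ is the value of a finite matrix game, and Fan's minimax theorem reduces to the classical LP duality for matrix games. The following sketch is our own illustration, not from the paper: it assumes `scipy` is available and uses a made-up payoff matrix.

```python
import numpy as np
from scipy.optimize import linprog

# Hedged sketch (ours): for finite action sets, the bracketed expression in
# I(t, x) is a bilinear payoff M[u, v], and the sup-inf equals the inf-sup,
# both given by the classical LP for the value of a matrix game:
#   max t  subject to  M^T theta >= t * 1,  theta in the simplex.

def matrix_game_value(M):
    m, k = M.shape
    # variables: theta_1..theta_m, t ; linprog minimizes, so minimize -t
    c = np.zeros(m + 1); c[-1] = -1.0
    A_ub = np.hstack([-M.T, np.ones((k, 1))])   # t - theta^T M[:, v] <= 0
    b_ub = np.zeros(k)
    A_eq = np.zeros((1, m + 1)); A_eq[0, :m] = 1.0
    b_eq = np.array([1.0])                       # probabilities sum to 1
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Matching pennies: value 0, unique optimal mixed strategy (1/2, 1/2).
value, theta = matrix_game_value(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(value, theta)
```

In the general Borel setting of the paper, this finite LP is replaced by the compactness of $\mathcal{P}(U(x))$ and $\mathcal{P}(V(x))$ together with Fan's theorem, but the equality of the two iterated optimizations is the same phenomenon.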
We next prove that, for each fixed $x \in X$ and $t \in [0,\hat{T}]$, along some suitable subsequence of $\{n\}$ (if necessary), $\lim_{n\to\infty} I_n(t,x) = I(t,x)$. Now, using Assumption 3, the functions $c(x,\vartheta,\eta)$ and $\int_X q(dy|x,\vartheta,\eta)\,\psi_n(t,y)$ are continuous on $\mathcal{P}(U(x)) \times \mathcal{P}(V(x))$ for each $x \in X$. So we find a sequence of pairs of measurable functions $(\vartheta^*_n,\eta^*_n) \in \mathcal{P}(U(x)) \times \mathcal{P}(V(x))$ such that
$$\begin{aligned}
I_n(t,x) &= \inf_{\eta \in \mathcal{P}(V(x))} \Big[\lambda c^+_n(x,\vartheta^*_n,\eta)\,\psi_n(t,x) + \int_X \psi_n(t,y)\,q_n(dy|x,\vartheta^*_n,\eta)\Big]\\
&= \sup_{\vartheta \in \mathcal{P}(U(x))} \Big[\lambda c^+_n(x,\vartheta,\eta^*_n)\,\psi_n(t,x) + \int_X \psi_n(t,y)\,q_n(dy|x,\vartheta,\eta^*_n)\Big]. \qquad (26)
\end{aligned}$$
Now, $\mathcal{P}(U(x))$ and $\mathcal{P}(V(x))$ are compact. So there exist subsequences (here, we take the same sequence for simplicity) such that $\vartheta^*_n \to \vartheta^*$ and $\eta^*_n \to \eta^*$ as $n \to \infty$ for some $(\vartheta^*,\eta^*) \in \mathcal{P}(U(x)) \times \mathcal{P}(V(x))$. Taking $n \to \infty$ in (26), by the generalized version of Fatou's lemma (Feinberg et al. (2014); Hernandez-Lerma and Lasserre (1999), Lemma 8.3.7), for arbitrarily fixed $\vartheta \in \mathcal{P}(U(x))$ we have
$$\liminf_{n\to\infty} I_n(t,x) \ge \lambda c(x,\vartheta,\eta^*)\,\psi(t,x) + \int_X \psi(t,y)\,q(dy|x,\vartheta,\eta^*).$$
Since $\vartheta \in \mathcal{P}(U(x))$ is arbitrary,
$$\liminf_{n\to\infty} I_n(t,x) \ge \sup_{\vartheta \in \mathcal{P}(U(x))} \Big[\lambda c(x,\vartheta,\eta^*)\,\psi(t,x) + \int_X \psi(t,y)\,q(dy|x,\vartheta,\eta^*)\Big] \ge \inf_{\eta \in \mathcal{P}(V(x))} \sup_{\vartheta \in \mathcal{P}(U(x))} \Big[\lambda c(x,\vartheta,\eta)\,\psi(t,x) + \int_X \psi(t,y)\,q(dy|x,\vartheta,\eta)\Big]. \qquad (27)$$
Using analogous arguments from (26), by the generalized version of Fatou's lemma (Feinberg et al. (2014); Hernandez-Lerma and Lasserre (1999), Lemma 8.3.7), we have
$$\limsup_{n\to\infty} I_n(t,x) \le \sup_{\vartheta \in \mathcal{P}(U(x))} \inf_{\eta \in \mathcal{P}(V(x))} \Big[\lambda c(x,\vartheta,\eta)\,\psi(t,x) + \int_X \psi(t,y)\,q(dy|x,\vartheta,\eta)\Big]. \qquad (28)$$
So, by (27) and (28), we get
$$\lim_{n\to\infty} I_n(t,x) = I(t,x) \quad \forall t \in [0,\hat{T}],\ x \in X. \qquad (29)$$
Since $\lim_{n\to\infty} \psi_n(t,x) = \psi(t,x)$ for all $t \in [0,\hat{T}]$, $x \in X$, in view of (29) and the dominated convergence theorem (since $|I_n(t,x)| \le R(x)$), letting $n \to \infty$ in (17) we see that $\psi$ satisfies the first two equations ($E_1$ and $E_2$) of (6); hence $\psi(\cdot,x)$ is differentiable almost everywhere on $[0,\hat{T}]$, see Athreya (2006), Theorem 4.4.1. Again, by arguments analogous to (25), we obtain
$$\Big|\frac{\partial \psi(t,x)}{\partial t}\Big| = |I(t,x)| \le R(x), \quad \forall t \in [0,\hat{T}],\ x \in X.$$
Therefore, $\psi \in C^1_{W,W_1}([0,\hat{T}] \times X)$. Furthermore, using arguments analogous to Proposition 1(b), $\psi$ is the unique solution of (6) satisfying (7) and (8), and hence a saddle-point equilibrium exists.
Next, we state the main optimality results, which establish the existence of a saddle-point equilibrium and of the game's value when the payoff rates are extended real-valued functions.
Theorem 4. Grant Assumptions 1, 2 and 3. Then the following claims are true.
(a) There exists a unique function $\psi \in C^1_{W,W_1}([0,\hat{T}] \times X)$ that satisfies the first two equations ($E_1$ and $E_2$) of (6).
(b) There exists a pair of strategies $(\zeta^{1*},\zeta^{2*}) \in \Pi^1_{SM} \times \Pi^2_{SM}$ that satisfies equations (6), (7) and (8), and hence this pair of strategies is a saddle-point equilibrium.
Proof. We only need to prove part (a), since part (b) follows from Proposition 1 (b). Now, for each n ≥ 1, define c_n and g_n on K as:

c_n(x, u, v) := max{−n, c(x, u, v)},  g_n(x) := max{−n, g(x)}

for each (x, u, v) ∈ K. Then lim_{n→∞} c_n(x, u, v) = c(x, u, v) and lim_{n→∞} g_n(x) = g(x). Define r_n(x, u, v) := c_n(x, u, v) + n and g̃_n(x) := g_n(x) + n. So, r_n(x, u, v) ≥ 0 and g̃_n(x) ≥ 0 for each n ≥ 1 and (x, u, v) ∈ K.
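The truncated rates converge monotonically; a short supplementary check (not part of the original proof), for fixed (x, u, v):

```latex
c_{n+1}(x,u,v) = \max\{-(n+1),\, c(x,u,v)\} \;\le\; \max\{-n,\, c(x,u,v)\} = c_n(x,u,v),
```

and, whenever c(x, u, v) > −∞, c_n(x, u, v) = c(x, u, v) for every n ≥ −c(x, u, v); if c(x, u, v) = −∞, then c_n(x, u, v) = −n → −∞. In either case c_n ↓ c pointwise on K, and the same argument gives g_n ↓ g on X.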
Now by Assumption 1, we have

−ln(M2W(x))/(2λ(T̂+1)) ≤ max{−n, −ln(M2W(x))/(2λ(T̂+1))} ≤ c_n(x, u, v) ≤ ln(M2W(x))/(2λ(T̂+1))  (30)

and

−ln(M2W(x))/(2λ(T̂+1)) ≤ max{−n, −ln(M2W(x))/(2λ(T̂+1))} ≤ g_n(x) ≤ ln(M2W(x))/(2λ(T̂+1)).  (31)

So, since r_n = c_n + n and g̃_n = g_n + n, we have

e^{2λ(T̂+1) r_n(x,u,v)} ≤ e^{2λ(T̂+1)n} M2W(x) and e^{2λ(T̂+1) g̃_n(x)} ≤ e^{2λ(T̂+1)n} M2W(x), ∀ n ≥ 1 and (x, u, v) ∈ K.

Define a new model R_n := {X, U, V, (U_n(x), V_n(x), x ∈ X), r_n, g̃_n, q}. Now, for any real-valued measurable functions ψ̃ and ϕ defined on K and [0, T̂] × X, respectively, define
H(s, x, ψ̃, ϕ) := sup_{ζ^1∈Π^1_M} inf_{ζ^2∈Π^2_M} E^{ζ^1,ζ^2}_γ [ exp( λ ∫_s^{T̂} ψ̃(ξ_t, ζ^1_t, ζ^2_t) dt + λ ϕ(ξ_{T̂}) ) | ξ_s = x ],  (32)
assuming that the integral exists. Now, since r_n ≥ 0, g̃_n ≥ 0 and all the Assumptions hold for the model R_n, by Theorem 3, we have

−∂H(s, x, r_n, g̃_n)/∂s = sup_{ϑ∈P(U(x))} inf_{η∈P(V(x))} [ λ r_n(x, ϑ, η) H(s, x, r_n, g̃_n) + ∫_X H(s, y, r_n, g̃_n) q(dy|x, ϑ, η) ]
= inf_{η∈P(V(x))} sup_{ϑ∈P(U(x))} [ λ r_n(x, ϑ, η) H(s, x, r_n, g̃_n) + ∫_X H(s, y, r_n, g̃_n) q(dy|x, ϑ, η) ]  (33)
for almost all s ∈ [0, T̂]. Now

H(s, x, r_n, g̃_n) = H(s, x, c_n + n, g_n + n) = H(s, x, c_n, g_n) e^{λ(T̂−s+1)n}.
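This identity can be verified directly from the definition (32), since r_n = c_n + n and g̃_n = g_n + n (a sketch, with the arguments of c_n and g_n suppressed):

```latex
\exp\Big(\lambda \int_s^{\hat{T}} (c_n + n)\, dt + \lambda (g_n + n)\Big)
= \exp\Big(\lambda \int_s^{\hat{T}} c_n\, dt + \lambda g_n\Big)\, e^{\lambda n(\hat{T}-s) + \lambda n}
= \exp\Big(\lambda \int_s^{\hat{T}} c_n\, dt + \lambda g_n\Big)\, e^{\lambda(\hat{T}-s+1)n}.
```

Since the factor e^{λ(T̂−s+1)n} does not depend on the strategies, it passes through the sup and inf in (32), which yields the displayed identity.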
So, by (33), we can write, for a.e. s ∈ [0, T̂],

−∂H(s, x, c_n, g_n)/∂s = sup_{ϑ∈P(U(x))} inf_{η∈P(V(x))} [ λ c_n(x, ϑ, η) H(s, x, c_n, g_n) + ∫_X H(s, y, c_n, g_n) q(dy|x, ϑ, η) ]
= inf_{η∈P(V(x))} sup_{ϑ∈P(U(x))} [ λ c_n(x, ϑ, η) H(s, x, c_n, g_n) + ∫_X H(s, y, c_n, g_n) q(dy|x, ϑ, η) ].
Hence

H(s, x, c_n, g_n) − e^{λ g_n(x)} = ∫_s^{T̂} sup_{ϑ∈P(U(x))} inf_{η∈P(V(x))} [ λ c_n(x, ϑ, η) H(t, x, c_n, g_n) + ∫_X H(t, y, c_n, g_n) q(dy|x, ϑ, η) ] dt
= ∫_s^{T̂} inf_{η∈P(V(x))} sup_{ϑ∈P(U(x))} [ λ c_n(x, ϑ, η) H(t, x, c_n, g_n) + ∫_X H(t, y, c_n, g_n) q(dy|x, ϑ, η) ] dt.  (34)
Now by (34) and Lemma 1, we obtain

|H(t, x, c_n, g_n)| ≤ L_1 W(x), ∀ n ≥ 1.  (35)
Now, since c_n(x, u, v) and g_n(x) are non-increasing in n ≥ 1, the corresponding value function H(t, x, c_n, g_n) is also non-increasing in n. Also, by Lemma 1, we know that H(·, ·, c_n, g_n) has a lower bound. So lim_{n→∞} H(t, x, c_n, g_n) exists. Let lim_{n→∞} H(t, x, c_n, g_n) =: ψ(t, x), (t, x) ∈ [0, T̂] × X. Then, using analogous arguments as in Theorem 4.1, with the function H(t, x, c_n, g_n) in place of the function ψ_n(t, x), by (34), (35) and Assumptions 1 and 2, we see that (a) is true.
The converse of Theorem 4 is given below.

Theorem 5. Under Assumptions 1, 2 and 3, suppose (ζ̂^1, ζ̂^2) ∈ Π^1_SM × Π^2_SM is a saddle-point equilibrium. Then (ζ̂^1, ζ̂^2) is a mini-max selector of eq. (6).
Proof. Using the definition of saddle-point equilibrium, we have

H^{ζ̂^1,ζ̂^2}(0, x) = sup_{ζ^2∈Π^2_Ad} inf_{ζ^1∈Π^1_Ad} H^{ζ^1,ζ^2}(0, x) = inf_{ζ^1∈Π^1_Ad} sup_{ζ^2∈Π^2_Ad} H^{ζ^1,ζ^2}(0, x) = sup_{ζ^2∈Π^2_Ad} H^{ζ̂^1,ζ^2}(0, x) = inf_{ζ^1∈Π^1_Ad} H^{ζ^1,ζ̂^2}(0, x).  (36)
Now, arguing as in Theorem 4, it follows that for ζ̂^1 ∈ Π^1_SM there exists a function ψ̃ ∈ C^1_{W,W_1}([0, T̂] × X) such that

ψ̃(s, x) − e^{λ g(x)} = ∫_s^{T̂} inf_{η∈P(V(x))} [ λ c(x, ζ̂^1(·|x, t), η) ψ̃(t, x) + ∫_X ψ̃(t, y) q(dy|x, ζ̂^1(·|x, t), η) ] dt, ∀ s ∈ [0, T̂], x ∈ X,  (37)

satisfying

ψ̃(0, x) = inf_{ζ^2∈Π^2_Ad} H^{ζ̂^1,ζ^2}(0, x)  (38)

and

ψ̃(t, x) = inf_{ζ^2∈Π^2_M} H^{ζ̂^1,ζ^2}(t, x).  (39)
Then, by (6), (36), (37), (38), (39), Theorem 2 and Theorem 4, we conclude that ζ̂^1 is an outer maximizing selector of (6). By analogous arguments, ζ̂^2 is an outer minimizing selector of (6).
5 EXAMPLE

This section is devoted to an example that validates the assumptions of this paper, in which the transition and cost rates are unbounded.
Example 1. Consider a zero-sum game model

G := {X, (U, U(x), x ∈ X), (V, V(x), x ∈ X), c(x, u, v), q(dy|x, u, v)}.

Suppose our state space is X = (−∞, ∞) and the transition rate is given by

q(D̂|x, u, v) = λ̂(x, u, v) [ ∫_{D̂} (1/(√(2π) σ)) e^{−(y−x)²/(2σ²)} dy − δ_x(D̂) ], x ∈ X, D̂ ∈ B(X), (u, v) ∈ U(x) × V(x).  (40)
We impose the following requirements to ensure that our model has a saddle-point equilibrium.

(I) U(x) and V(x) are compact subsets of the Borel spaces U and V, respectively, for each fixed x ∈ X.

(II) The payoff function c(x, u, v) and the rate function λ̂(x, u, v) are continuous on U(x) × V(x) for each x ∈ X. Also, assume that e^{2λ(T̂+1)|c(x,u,v)|} ≤ M2W(x), e^{2λ(T̂+1)|g(x)|} ≤ M2W(x), and 0 < sup_{(u,v)∈U(x)×V(x)} λ̂(x, u, v) ≤ M_0(x² + 1) for each (x, u, v) ∈ K.
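The structure of the kernel (40) can be checked numerically. The following sketch is not code from the paper: it picks a hypothetical rate function rate(x, u, v) = (x² + 1)(1 + sin²(u + v))/2, which is continuous in (u, v) and satisfies the growth bound of condition (II) with M_0 = 1, and verifies that q(·|x, u, v) is a conservative transition rate, i.e. q(X|x, u, v) = 0.

```python
import math

# Hypothetical numerical check of the kernel (40) -- a sketch, not code from
# the paper. The rate function below is an assumed example satisfying
# 0 < rate(x,u,v) <= M0*(x^2 + 1) with M0 = 1, as in condition (II).

SIGMA = 1.0  # standard deviation of the Gaussian jump distribution in (40)

def rate(x, u, v):
    """Assumed jump-rate function; any continuous rate with this bound works."""
    return (x * x + 1.0) * (1.0 + math.sin(u + v) ** 2) / 2.0

def normal_cdf(z):
    """CDF of the standard normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def q_interval(a, b, x, u, v):
    """q([a,b] | x,u,v) from (40): rate times (Gaussian mass of [a,b]
    centered at x, minus the Dirac measure of [a,b] at x)."""
    mass = normal_cdf((b - x) / SIGMA) - normal_cdf((a - x) / SIGMA)
    dirac = 1.0 if a <= x <= b else 0.0
    return rate(x, u, v) * (mass - dirac)

# Conservativeness: q(X|x,u,v) = rate * (1 - 1) = 0; [a,b] ~ the whole line.
print(q_interval(-1e6, 1e6, 0.3, 0.1, 0.2))  # -> 0.0
```

Sets away from the current state x receive positive rate, while a set containing x gets its Gaussian mass minus the full Dirac charge, so q([a, b]|x, u, v) can be negative only when x ∈ [a, b], exactly as a conservative transition rate requires.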
Proposition 2. Under conditions (I)-(II), Assumptions 1, 2 and 3 are satisfied by the above controlled system. Therefore, the existence of a saddle-point equilibrium follows from Theorem 4.

Proof. See Guo and Zhang (2019), Proposition 5.1.
6 CONCLUSIONS

A finite-horizon dynamic zero-sum game with a risk-sensitive cost criterion on a Borel state space has been studied. Here, for each state x, the admissible action spaces U(x) and V(x) are compact metric spaces, and the cost and transition rate functions are unbounded. Under certain assumptions, we have solved the Shapley equation and established a saddle-point equilibrium.

A risk-sensitive non-zero-sum game with unbounded rates (costs and transition rates) over a countable state space was investigated in Wei (2019). It would be a challenging problem to study the same problem on a Borel state space.
ACKNOWLEDGMENT
The second named author’s research is supported partially by SERB, India, grant MTR/2021/000307.
REFERENCES
[1] Athreya, K.B., & Lahiri, S.N. (2006). Measure Theory and Probability Theory, Springer, New York.

[2] Bäuerle, N., & Rieder, U. (2017). Zero-sum risk-sensitive stochastic games, Stoch. Process. Appl., 127, 622-642.

[3] Bell, D.E. (1995). Risk, return and utility, Manage. Sci., 41, 23-30.
[4] Bielecki, T.R., & Pliska, S.R. (1999). Risk sensitive dynamic asset management, Appl. Math. Optim., 39, 337-360.
[5] Fan, K. (1953). Minimax theorems, Proc. Natl. Acad. Sci. USA, 39, 42-47.

[6] Feinberg, E.A., Kasyanov, P.O., & Zadoianchuk, N.V. (2014). Fatou's lemma for weakly converging probabilities, Theory Probab. Appl., 58, 683-689.

[7] Ghosh, M.K., Kumar, K.S., & Pal, C. (2016). Zero-sum risk-sensitive stochastic games for continuous-time Markov chains, Stoch. Anal. Appl., 34, 835-851.

[8] Ghosh, M.K., Golui, S., Pal, C., & Pradhan, S. (2022). Nonzero-sum risk-sensitive continuous-time stochastic games with ergodic costs, Appl. Math. Optim., https://doi.org/10.1007/s00245-022-09878-9.

[9] Ghosh, M.K., & Saha, S. (2014). Risk-sensitive control of continuous-time Markov chains, Stochastics, 86, 655-675.

[10] Golui, S., & Pal, C. (2021a). Continuous-time zero-sum games for Markov chains with risk-sensitive finite-horizon cost criterion, Stoch. Anal. Appl., 40, 78-95.

[11] Golui, S., & Pal, C. (2021b). Continuous-time zero-sum games for Markov decision processes with discounted risk-sensitive cost criterion on a general state space, Stoch. Anal. Appl., https://doi.org/10.1080/07362994.2021.2013889.

[12] Golui, S., & Pal, C. (2022). Risk-sensitive discounted cost criterion for continuous-time Markov decision processes on a general state space, Math. Meth. Oper. Res., 95, 219-247.

[13] Golui, S., Pal, C., & Saha, S. (2022). Continuous-time zero-sum games for Markov decision processes with discounted risk-sensitive cost criterion, Dyn. Games Appl., 12, 485-512.

[14] Guo, X.P., & Hernandez-Lerma, O. (2003). Zero-sum games for continuous-time Markov chains with unbounded transition and average payoff rates, J. Appl. Probab., 40, 327-345.

[15] Guo, X.P., & Hernandez-Lerma, O. (2005). Nonzero-sum games for continuous-time Markov chains with unbounded discounted payoffs, J. Appl. Probab., 303-320.

[16] Guo, X.P., & Hernandez-Lerma, O. (2007). Zero-sum games for continuous-time jump Markov processes in Polish spaces: discounted payoffs, Adv. Appl. Probab., 645-668.

[17] Guo, X.P. (2007). Continuous-time Markov decision processes with discounted rewards: the case of Polish spaces, Math. Oper. Res., 32, 73-87.

[18] Guo, X.P., & Hernandez-Lerma, O. (2009). Continuous-Time Markov Decision Processes: Theory and Applications, Springer, Berlin.

[19] Guo, X.P., Huang, X., & Huang, Y. (2015). Finite-horizon optimality for continuous-time Markov decision processes with unbounded transition rates, Adv. Appl. Probab., 47, 1064-1087.

[20] Guo, X.P., Huang, Y., & Song, X. (2012). Linear programming and constrained average optimality for general continuous-time Markov decision processes in history-dependent policies, SIAM J. Control Optim., 50, 23-47.

[21] Guo, X.P., & Liao, Z.W. (2019). Risk-sensitive discounted continuous-time Markov decision processes with unbounded rates, SIAM J. Control Optim., 57, 3857-3883.

[22] Guo, X., Liu, Q., & Zhang, Y. (2019). Finite-horizon risk-sensitive continuous-time Markov decision processes with unbounded transition and cost rates, 4OR, 427-442.

[23] Guo, X., & Piunovskiy, A. (2011). Discounted continuous-time Markov decision processes with constraints: Unbounded transition and loss rates, Math. Oper. Res., 36, 105-132.

[24] Guo, X.P., & Song, X. (2011). Discounted continuous-time constrained Markov decision processes in Polish spaces, Ann. Appl. Probab., 21, 2016-2049.
[25] Guo, X.P., & Zhang, J. (2018). On risk-sensitive piecewise deterministic Markov decision processes, Appl. Math. Optim., 81, 685-710.

[26] Guo, X.P., & Zhang, J. (2019). Risk-sensitive continuous-time Markov decision processes with unbounded rates and Borel spaces, Discrete Event Dyn. Syst., 29, 445-471.

[27] Hernandez-Lerma, O., & Lasserre, J. (1999). Further Topics on Discrete-Time Markov Control Processes, Springer, New York.

[28] Huang, Y. (2018). Finite horizon continuous-time Markov decision processes with mean and variance criteria, Discrete Event Dyn. Syst., 28, 539-564.

[29] Kitaev, M.Y. (1986). Semi-Markov and jump Markov controlled models: Average cost criterion, Theory Probab. Appl., 30, 272-288.

[30] Kitaev, M.Y., & Rykov, V.V. (1995). Controlled Queueing Systems, CRC Press, Boca Raton.

[31] Kumar, K.S., & Pal, C. (2013). Risk-sensitive control of jump process on denumerable state space with near monotone cost, Appl. Math. Optim., 68, 311-331.

[32] Kumar, K.S., & Pal, C. (2015). Risk-sensitive control of continuous-time Markov processes with denumerable state space, Stoch. Anal. Appl., 33, 863-881.

[33] Nowak, A.S. (1985). Measurable selection theorems for minimax stochastic optimization problems, SIAM J. Control Optim., 23, 466-476.

[34] Pal, C., & Pradhan, S. (2019). Risk-sensitive control of pure jump processes on a general state space, Stochastics, 91, 155-174.

[35] Piunovskiy, A., & Zhang, Y. (2011). Discounted continuous-time Markov decision processes with unbounded rates: The convex analytic approach, SIAM J. Control Optim., 49, 2032-2061.

[36] Piunovskiy, A., & Zhang, Y. (2014). Discounted continuous-time Markov decision processes with unbounded rates and randomized history-dependent policies: the dynamic programming approach, 4OR-Q. J. Oper. Res., 12, 49-75.

[37] Piunovskiy, A., & Zhang, Y. (2020). Continuous-Time Markov Decision Processes. In: Probability Theory and Stochastic Modelling, Springer, https://doi.org/10.1007/978-3-030-54987-9.

[38] Wei, Q. (2016). Continuous-time Markov decision processes with risk-sensitive finite-horizon cost criterion, Math. Meth. Oper. Res., 84, 461-487.

[39] Wei, Q. (2017). Zero-sum games for continuous-time Markov jump processes with risk-sensitive finite-horizon cost criterion, Oper. Res. Lett., 46, 69-75.

[40] Wei, Q. (2019). Nonzero-sum risk-sensitive finite-horizon continuous-time stochastic games, Stat. Probab. Lett., 147, 96-104.

[41] Wei, Q., & Chen, X. (2016). Stochastic games for continuous-time jump processes under finite-horizon payoff criterion, Appl. Math. Optim., 74, 273-301.

[42] Zhang, W.Z., & Guo, X.P. (2012). Nonzero-sum games for continuous-time Markov chains with unbounded transition and average payoff rates, Sci. China Math., 55, 2405-2416.

[43] Zhang, Y. (2017). Continuous-time Markov decision processes with exponential utility, SIAM J. Control Optim., 55, 2636-2660.