1 INTRODUCTION
In the literature of game theory there are two types of game models: zero-sum and nonzero-sum. In a two-person zero-sum game, one player tries to maximize his/her payoff while the other tries to minimize that same payoff, whereas in a nonzero-sum game each player tries to optimize his/her own payoff. Game theory can be studied either in discrete time or in continuous time: in continuous time the players observe the state space continuously, whereas in discrete time they observe it only at discrete epochs. There are also two types of game models with respect to the risk measure: risk-neutral games and risk-sensitive games. The risk-sensitive, or "exponential of integral", utility cost criterion is popular, particularly in finance (see, e.g., Bielecki and Pliska (1999)), since it captures the effects of moments of the cost beyond the first-order moment (the expectation).
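To see how the exponential-of-integral criterion captures more than the first moment, a small Monte Carlo comparison can help. The following Python sketch (an illustration, not from the paper; the uniform cost distribution and the value of θ are assumptions) compares the risk-neutral mean E[C], the risk-sensitive value (1/θ) log E[e^{θC}], and the small-θ expansion E[C] + (θ/2)Var(C) for a hypothetical random total cost C.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5  # risk-sensitivity parameter (assumed value, for illustration)

# Hypothetical bounded random total cost C over a finite horizon.
C = rng.uniform(0.0, 4.0, size=200_000)

risk_neutral = C.mean()                                      # E[C]
risk_sensitive = np.log(np.mean(np.exp(theta * C))) / theta  # (1/theta) log E[exp(theta C)]
second_order = C.mean() + 0.5 * theta * C.var()              # E[C] + (theta/2) Var(C)

print(risk_neutral, risk_sensitive, second_order)
```

For θ > 0 the risk-sensitive value exceeds the plain expectation, and to first order in θ the gap is governed by the variance of the cost, which is precisely the sensitivity to risk that the risk-neutral criterion ignores.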
There is a large literature on the risk-neutral utility cost criterion for continuous-time controlled Markov decision processes (CTCMDPs) under different setups; see Guo (2007), Guo and Hernandez-Lerma (2009), Guo et al. (2015), Guo et al. (2012), Guo and Piunovskiy (2011), Huang (2018), Piunovskiy and Zhang (2011), Piunovskiy and Zhang (2014) for single-controller models and Guo and Hernandez-Lerma (2003), Guo and Hernandez-Lerma (2005), Guo and Hernandez-Lerma (2007), Wei and Chen (2016), Zhang and Guo (2012) for game models. Players ignore risk in risk-neutral stochastic games because of the additive feature of this criterion. When the variance is high, however, the risk-neutral criterion can be inadequate, since the resulting optimal control may perform poorly in practice. Moreover, different controllers may exhibit different risk preferences, so decision-makers take risk preferences into account in the performance criterion. Bell (1995) gave a model containing an interpretation of risk-sensitive utility. This paper considers finite-horizon risk-sensitive two-person
zero-sum dynamic games for controlled CTMDPs with unbounded rates (transition and payoff rates)
under admissible feedback strategies. State and action spaces are considered to be Borel spaces. The
main target of this manuscript is to find a solution of the optimality equation (6) (the Shapley equation), to prove the existence of the game's value, and to give a complete characterization of saddle-point equilibria.
The finite-horizon optimality criterion arises frequently in real-life scenarios where the cost criterion may not be risk-neutral. For finite-horizon risk-neutral CTMDPs, see Guo et al. (2015), Huang (2018); for the corresponding games, see Wei and Chen (2016) and its references. In this context, for risk-sensitive finite-horizon controlled CTMDPs, one can see Ghosh and Saha (2014), Guo et al. (2019), Wei (2016), while results for infinite-horizon risk-sensitive CTMDPs are available in Ghosh and Saha (2014), Golui and Pal (2022), Guo and Zhang (2018), Kumar and Pal (2013), Kumar and Pal (2015), Zhang (2017) and the references therein. The corresponding finite/infinite-horizon dynamic games are studied in Ghosh et al. (2022), Golui and Pal (2021a), Golui and Pal (2021b), Golui et al. (2022), Wei (2019). Studies of risk-sensitive control of CTMDPs on a denumerable state space are widely available, see Ghosh and Saha (2014), Guo and Liao (2019), Guo et al. (2019), but a countable state space is sometimes inadequate for certain models, especially chemical-reaction, water-reservoir-management, inventory, cash-flow, and insurance problems. By contrast, the literature on controlled CTMDPs with a general state space is rather sparse. Some exceptions for a single controller are Golui and Pal (2022), Guo et al. (2012), Guo and Zhang (2019), Pal and Pradhan (2019), Piunovskiy and Zhang (2014), Piunovskiy and Zhang (2020), and for the corresponding stochastic games, Bauerle and Rieder (2017), Golui and Pal (2021b), Guo and Hernandez-Lerma (2007), Wei (2017). It is therefore interesting and important to consider the game problem on a general state space. In Guo and Zhang (2019), the
authors studied the same problem as in Guo et al. (2019) but on a general state space, whereas in Wei (2017) the finite-horizon risk-sensitive zero-sum game for a controlled Markov jump process with bounded costs and unbounded transition rates was studied. In Ghosh et al. (2016), the authors studied infinite-horizon dynamic games for controlled CTMDPs with bounded transition and payoff rates. However, this boundedness condition is restrictive in many real-life scenarios; queueing and population processes, for instance, require unbounded transition and payoff functions. In Golui and Pal (2021a), finite-horizon continuous-time risk-sensitive zero-sum games for
unbounded transition and payoff rates on a countable state space are considered. But the extension of the same results to a general Borel state space was unknown to us. We solve this problem in this
paper. Here we deal with finite-horizon risk-sensitive dynamic games with unbounded payoff and transition rates, in the class of all admissible feedback strategies, on a general Borel state space; these results were unknown until now. In this paper, we find a solution to the risk-sensitive finite-horizon optimality equation and, at the same time, obtain the existence of a saddle-point equilibrium for this jump process. We consider a homogeneous game model. In Theorem 4, we prove our final results, i.e., we show that if the cost rates are real-valued functions, then the Shapley equation (6) has a solution. The existence of saddle-point equilibria is proved using the measurable selection theorem in Nowak (1985). The uniqueness of the solution follows from the well-known Feynman-Kac formula. The value of the game has also been established.
The remaining portions of this work are organized as follows. Section 2 describes the model of our stochastic game, some definitions, and the finite-horizon cost criterion. In Section 3, preliminary results, conditions, and an extension of the Feynman-Kac formula are provided; we also establish there the probabilistic representation of the solution of the finite-horizon optimality equation (6). The uniqueness of this solution as well as the game's value are proved in Section 4, where we also completely characterize the Nash equilibrium within the class of admissible Markov strategies for this game model. In Section 5, we verify our results with an example.
2 THE ZERO-SUM DYNAMIC GAME MODEL
First, in this section we introduce a time-homogeneous continuous-time zero-sum dynamic game model, which consists of the following elements:
G := {X, U, V, (U(x) ⊂ U, x ∈ X), (V(x) ⊂ V, x ∈ X), q(·|x, u, v), c(x, u, v), g(x)}.  (1)

Here X is our state space, which is a Borel space, and the corresponding Borel σ-algebra is B(X). The action spaces are U and V for the first and second players, respectively, and are also considered to be Borel spaces, with corresponding Borel σ-algebras B(U) and B(V), respectively. For each x ∈ X, the admissible action spaces are denoted by U(x) ∈ B(U) and V(x) ∈ B(V), respectively, and these spaces are assumed to be compact. Now let us define a Borel subset of X × U × V denoted by

K := {(x, u, v) | x ∈ X, u ∈ U(x), v ∈ V(x)}.
Next, for any (x, u, v) ∈ K, the transition rate of the CTMDP, denoted by q(·|x, u, v), is a signed kernel on X such that q(D|x, u, v) ≥ 0 where (x, u, v) ∈ K and x ∉ D. Also, q(·|x, u, v) is assumed to be conservative, i.e., q(X|x, u, v) ≡ 0, as well as stable, i.e.,

q*(x) := sup_{u ∈ U(x), v ∈ V(x)} [q_x(u, v)] < ∞  ∀ x ∈ X,  (2)

where q_x(u, v) := −q({x}|x, u, v) ≥ 0 for all (x, u, v) ∈ K.
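For intuition, the conservativeness and stability conditions can be checked on a toy example. The following Python sketch (all names, state counts, and rate forms are assumptions for illustration, not from the paper) builds a hypothetical finite-state transition-rate matrix for a fixed action pair and verifies q(X|x, u, v) = 0, q(D|x, u, v) ≥ 0 for x ∉ D, and q*(x) < ∞.

```python
import numpy as np

# Hypothetical three-state illustration of the kernel conditions.
# Q[x, y] plays the role of q({y} | x, u, v) for a fixed action pair (u, v).
def q_matrix(u, v):
    # Off-diagonal rates (assumed forms) scale with the action pair (u, v);
    # each diagonal entry is minus the total exit rate, so rows sum to zero.
    return np.array([[-(u + v),     u,          v],
                     [ u,          -(u + 2*v),  2*v],
                     [ v,           2*u,       -(v + 2*u)]], dtype=float)

Q = q_matrix(1.0, 0.5)
assert np.allclose(Q.sum(axis=1), 0.0)       # conservative: q(X | x, u, v) = 0
assert (Q - np.diag(np.diag(Q)) >= 0).all()  # q(D | x, u, v) >= 0 when x not in D
q_star = float(max(-Q[x, x] for x in range(3)))  # finite sup of exit rates: stable
print(q_star)
```

On a finite state and action space stability is automatic; condition (2) only has bite when the state space or the rates are unbounded, as in the models treated in this paper.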
Our running cost is c, assumed to be measurable on K, and the terminal cost is g, assumed to be measurable on X. These costs are taken to be real-valued.
The dynamic game is played as follows. The players take actions continuously. At any time moment t ≥ 0, if the system's state is x ∈ X, the players independently choose actions u_t ∈ U(x) and v_t ∈ V(x) according to their respective strategies. As a result, the following events occur:

• the first player immediately receives a reward at rate c(x, u_t, v_t) and the second player pays a cost at the same rate c(x, u_t, v_t); and

• after staying in state x for a random time, the system leaves x at the rate q_x(u_t, v_t) and jumps to a set D (x ∉ D) with probability determined by q(D|x, u_t, v_t)/q_x(u_t, v_t) (for details, see Proposition B.8, p. 205, in Guo and Hernandez-Lerma (2009)).
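The jump mechanism described in the two bullets above can be sketched numerically. The following Python snippet (a minimal illustration, not the paper's construction) simulates one trajectory of a jump process on a hypothetical three-state space, with a fixed generator Q standing in for q(·|x, u_t, v_t) under frozen strategies; all names and rates are assumptions for illustration.

```python
import numpy as np

def simulate(Q, x0, T_hat, rng):
    """Simulate one trajectory of the jump process up to the horizon T_hat."""
    t, x = 0.0, x0
    path = [(0.0, x0)]
    while True:
        rate = -Q[x, x]                    # q_x(u_t, v_t): total exit rate
        if rate <= 0.0:                    # absorbing state: no further jumps
            break
        t += rng.exponential(1.0 / rate)   # holding time ~ Exp(q_x)
        if t >= T_hat:                     # horizon reached before the next jump
            break
        probs = Q[x].clip(min=0.0) / rate  # jump law q(.|x, u, v) / q_x(u, v)
        x = int(rng.choice(len(probs), p=probs))
        path.append((t, x))
    return path

rng = np.random.default_rng(1)
Q = np.array([[-1.0,  0.6,  0.4],
              [ 0.5, -1.5,  1.0],
              [ 0.2,  0.3, -0.5]])
path = simulate(Q, 0, T_hat=5.0, rng=rng)
print(path[-1])  # last recorded (jump time, state) before the horizon
```

The holding time and the jump distribution used here are exactly the two quantities q_x(u_t, v_t) and q(D|x, u_t, v_t)/q_x(u_t, v_t) named in the bullets above.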
Now suppose the system is at a new state y. Then the above procedure is repeated until the fixed time T̂ > 0. Moreover, if at time T̂ the system occupies a state y_{T̂}, the second player pays a terminal cost g(y_{T̂}) to the first player.
https://doi.org/10.17993/3cemp.2022.110250.76-92
3C Empresa. Investigación y pensamiento crítico. ISSN: 2254-3376
Ed. 50 Vol. 11 N.º 2 August - December 2022