• A finite set of states St summarizing the information the agent senses from the environment at every time step t ∈ {1, ..., T}.

• A set of actions At which the agent can perform at each time step t ∈ {1, ..., T} to interact with the environment.

• A set of transition probabilities between subsequent states, which render the environment stochastic. Note that these probabilities are usually not modeled explicitly but are the result of the stochastic nature of the financial asset’s price process.

• A reward (or return) function Rt which provides a numerical feedback value rt to the agent in response to its action At−1 = at−1 in state St−1 = st−1.

• A policy π which maps states to concrete actions to be carried out by the agent. The policy can hence be understood as the agent’s rules for how to choose actions.

• A value function V which maps states to the total (discounted) reward the agent can expect from a given state until the end of the episode (trading period) under policy π.

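The components above can be made concrete with a minimal sketch. The following Python snippet is purely illustrative and not taken from the cited works: the choice of discretized one-period returns as states, the short/flat/long action set, and all names (observe_state, reward, random_policy) are assumptions introduced here for exposition.

    import numpy as np

    N_STATE_BINS = 10                 # finite state set: discretized recent return
    ACTIONS = (-1, 0, 1)              # action set: short, flat, long
    BIN_EDGES = np.linspace(-0.02, 0.02, N_STATE_BINS - 1)

    def observe_state(price_return):
        """State S_t: map the most recent price return to one of the finite states."""
        return int(np.digitize(price_return, BIN_EDGES))

    def reward(position, next_return):
        """Reward r_t: profit or loss of holding `position` over the next period.
        Transition probabilities are never modeled explicitly; they are implicit
        in the observed price process."""
        return position * next_return

    def random_policy(state, rng=np.random.default_rng(0)):
        """A (non-optimal) policy pi: maps a state to an action."""
        return int(rng.choice(ACTIONS))
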
Given the above framework, the decision problem is formalized as finding the optimal policy π = π*, i.e., the mapping from states to actions corresponding to the optimal value function V* - see also Dempster et al. (2001); Dempster and Romahi (2002):

  V*(s_t) = max_{a_t} E[R_{t+1} + γ V*(S_{t+1}) | S_t = s_t]    (1)

Here, E denotes the expectation operator, γ the discount factor, and Rt+1 the expected immediate reward for carrying out action At = at in state St = st. Further, St+1 denotes the next state of the agent. The value function can hence be understood as a mapping from states to the discounted future rewards which the agent seeks to maximize through its actions.
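
To make the role of the discount factor γ tangible, the quantity inside the expectation is the discounted sum of future rewards. A minimal helper (illustrative only, with the function name discounted_return introduced here) reads:

    def discounted_return(rewards, gamma=0.99):
        """Discounted sum r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...,
        i.e. the quantity whose expectation the value function measures."""
        return sum(gamma ** k * r for k, r in enumerate(rewards))

    # Example: discounted_return([1.0, 1.0, 1.0], gamma=0.5) == 1.0 + 0.5 + 0.25 == 1.75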

To solve this optimization problem, the Q-Learning algorithm (Watkins, 1989) can be applied, extending the above equation to the level of state-action tuples:

  Q*(s_t, a_t) = E[R_{t+1} + γ max_{a_{t+1}} Q*(S_{t+1}, a_{t+1}) | S_t = s_t, A_t = a_t]    (2)

Here, the Q-value Q*(st, at) equals the immediate reward for carrying out action At = at in state St = st plus the discounted future reward from continuing in the best possible way.

The optimal policy π* (the mapping from states to actions) then simply becomes:

  π*(s_t) = argmax_{a_t} Q*(s_t, a_t),    (3)

i.e., in every state St = st, choose the action At = at that yields the highest Q-value. To approximate the Q-function during (online) learning, an iterative optimization is carried out, with α denoting the learning rate - see also Sutton and Barto (1998) for further details:

  Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}))    (4)
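
The update rule in equation (4), together with the greedy policy of equation (3), translates directly into a tabular implementation. The sketch below is again illustrative: the table sizes match the toy state and action sets from the earlier snippet, and the function names (q_update, greedy_action) are introduced here rather than taken from the literature.

    import numpy as np

    N_STATES, N_ACTIONS = 10, 3                # sizes of the illustrative finite sets
    Q = np.zeros((N_STATES, N_ACTIONS))        # Q-table over state-action tuples

    def q_update(Q, s_t, a_t, r_next, s_next, alpha=0.1, gamma=0.99):
        """One Q-Learning step, equation (4):
        Q(s_t,a_t) <- (1-alpha)*Q(s_t,a_t) + alpha*(r_{t+1} + gamma*max_a' Q(s_{t+1},a'))."""
        target = r_next + gamma * np.max(Q[s_next])
        Q[s_t, a_t] = (1 - alpha) * Q[s_t, a_t] + alpha * target
        return Q

    def greedy_action(Q, s_t):
        """Greedy policy of equation (3): the action index with the highest Q-value in s_t."""
        return int(np.argmax(Q[s_t]))

    # Within a trading loop: a = greedy_action(Q, s); execute a; observe r and s_next;
    # then Q = q_update(Q, s, a, r, s_next). In practice, some exploration (e.g.
    # epsilon-greedy action selection) is added during online learning, which the
    # text above does not cover explicitly.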
