Code: https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On

Chapter 1 What is Reinforcement Learning

Learning - supervised, unsupervised, and reinforcement

RL is not completely blind, as an unsupervised learning setup is -- we have a reward system.

(1) Observations depend on the agent's own behavior: an agent that keeps acting badly may conclude that "life is suffering", which could be totally wrong. In machine learning terms, this can be rephrased as having non-i.i.d. data.

(2) The exploration/exploitation dilemma is one of the open fundamental questions in RL.

(3) The third complicating factor lies in the fact that reward can be seriously delayed from the actions that caused it.

RL formalisms and relations

RL entities and their communications

  • Agent and Environment are the two nodes of the graph
  • Actions are edges directed from the Agent to the Environment
  • Rewards and Observations are edges directed from the Environment to the Agent (see the loop sketched below)
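To make the graph concrete, here is a minimal sketch of this communication loop in plain Python, in the spirit of the book's toy example; the ToyEnvironment, RandomAgent, and 10-step episode length are illustrative assumptions.

    import random

    class ToyEnvironment:
        # Illustrative environment: gives a random reward for 10 steps, then ends.
        def __init__(self):
            self.steps_left = 10

        def get_observation(self):
            return [0.0, 0.0, 0.0]  # dummy fixed observation

        def get_actions(self):
            return [0, 1]  # two discrete actions

        def is_done(self):
            return self.steps_left == 0

        def action(self, action):
            if self.is_done():
                raise Exception("Episode is over")
            self.steps_left -= 1
            return random.random()  # reward for this step

    class RandomAgent:
        # Illustrative agent: ignores observations and picks actions at random.
        def __init__(self):
            self.total_reward = 0.0

        def step(self, env):
            obs = env.get_observation()                # Environment -> Agent: observation (unused here)
            action = random.choice(env.get_actions())  # Agent -> Environment: action
            self.total_reward += env.action(action)    # Environment -> Agent: reward

    env = ToyEnvironment()
    agent = RandomAgent()
    while not env.is_done():
        agent.step(env)
    print("Total reward: %.4f" % agent.total_reward)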

Reward

We don't define how frequently the agent receives this reward. In the case of once-in-a-lifetime reward systems, all rewards except the last one will be zero.

The agent

The environment

Action

Two types of actions: discrete or continuous (illustrated below).
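In OpenAI Gym terms (Chapter 2), the two kinds of action spaces look roughly like this; the sizes and bounds are made-up examples, not from the book:

    import numpy as np
    import gym

    discrete = gym.spaces.Discrete(4)  # one of 4 mutually exclusive actions: 0..3
    continuous = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)  # 2D real-valued action

    print(discrete.sample())    # e.g. 2
    print(continuous.sample())  # e.g. [ 0.13 -0.87]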

Observations

Markov decision process

MDPs are the theoretical foundation of RL; they make it possible to start moving toward the methods used to solve the RL problem.

We start from the simplest case, a Markov process (also known as a Markov chain), then extend it with rewards, which turns it into a Markov reward process. Then we wrap this idea in one more layer by adding actions, which leads us to Markov decision processes.

Markov process

You can always make your model more complex by extending your state space, which allows you to capture more dependencies in the model at the cost of a larger state space.

You can capture transition probabilities with a transition matrix, which is a square matrix of size N×N, where N is the number of states in your model.

The transition matrix can be estimated from observed episodes, as sketched below.
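A sketch of that estimation, using made-up three-state weather episodes: count the observed transitions, then normalize each row of the count matrix.

    import numpy as np

    states = ["sunny", "cloudy", "rainy"]
    idx = {s: i for i, s in enumerate(states)}

    # Hypothetical observed episodes (sequences of visited states)
    episodes = [
        ["sunny", "sunny", "cloudy", "rainy"],
        ["rainy", "rainy", "cloudy", "sunny", "sunny"],
    ]

    counts = np.zeros((len(states), len(states)))
    for episode in episodes:
        for src, dst in zip(episode, episode[1:]):
            counts[idx[src], idx[dst]] += 1

    # Each row of the estimated transition matrix sums to 1
    # (a state never observed as a source would need special handling)
    transition_matrix = counts / counts.sum(axis=1, keepdims=True)
    print(transition_matrix)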

Markov reward process

The first extension is to add rewards to the Markov process model.

Representation: either a full reward transition matrix, or a more compact representation, which is applicable only if the reward value depends solely on the target state; that is not always the case.

The second extension is to add a discount factor gamma (from 0 to 1).
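With gamma in place, the return of a reward sequence is the discounted sum G = r_0 + gamma*r_1 + gamma^2*r_2 + ...; a small illustrative computation (the reward values are made up):

    def discounted_return(rewards, gamma):
        # Computes sum_k gamma**k * rewards[k], accumulating backwards from the end
        result = 0.0
        for r in reversed(rewards):
            result = r + gamma * result
        return result

    print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62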

Markov decision process

Add an 'action' dimension to the transition matrix: the N×N matrix becomes an N×N×M cube, where M is the number of actions.
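A sketch of the resulting shape, with made-up sizes (3 states, 2 actions); here the tensor is indexed as (action, source state, target state), with each (action, source) row a probability distribution:

    import numpy as np

    N, M = 3, 2  # number of states and actions (illustrative)
    rng = np.random.default_rng(0)

    # Random stochastic transition tensor: P[a, s, s2] = Prob(next=s2 | state=s, action=a)
    P = rng.random((M, N, N))
    P /= P.sum(axis=2, keepdims=True)

    # Sampling one step: take action a=1 in state s=0
    s, a = 0, 1
    next_s = rng.choice(N, p=P[a, s])
    print("next state:", next_s)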

Chapter 2 OpenAI Gym

Chapter 3 Deep Learning with PyTorch

Chapter 4 The Cross-Entropy Method

Taxonomy of RL methods

  • Model-free or model-based
  • Value-based or policy-based
  • On-policy or off-policy

Practical cross-entropy
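A minimal sketch of the practical cross-entropy loop on CartPole, assuming the standard recipe this chapter builds up: play a batch of episodes with the current policy network, keep the "elite" episodes above a reward percentile, and train the network on their (observation, action) pairs with a cross-entropy loss. The hyperparameters (16 episodes per batch, 70th percentile, hidden size 128) and the classic Gym API are assumptions for illustration.

    import gym
    import numpy as np
    import torch
    import torch.nn as nn

    env = gym.make("CartPole-v0")
    net = nn.Sequential(
        nn.Linear(env.observation_space.shape[0], 128),
        nn.ReLU(),
        nn.Linear(128, env.action_space.n),  # logits over the discrete actions
    )
    optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    def play_episode():
        # Play one episode with the current policy; return total reward and (obs, action) steps
        obs, total, steps = env.reset(), 0.0, []
        while True:
            with torch.no_grad():
                probs = torch.softmax(net(torch.as_tensor(obs, dtype=torch.float32)), dim=0)
            action = int(torch.multinomial(probs, 1).item())
            next_obs, reward, done, _ = env.step(action)
            steps.append((obs, action))
            total += reward
            if done:
                return total, steps
            obs = next_obs

    for epoch in range(50):
        episodes = [play_episode() for _ in range(16)]
        rewards = [r for r, _ in episodes]
        bound = np.percentile(rewards, 70)  # elite cutoff
        train_obs, train_act = [], []
        for r, steps in episodes:
            if r >= bound:  # train only on elite episodes
                train_obs += [o for o, _ in steps]
                train_act += [a for _, a in steps]
        optimizer.zero_grad()
        logits = net(torch.as_tensor(np.array(train_obs), dtype=torch.float32))
        loss = loss_fn(logits, torch.as_tensor(train_act))
        loss.backward()
        optimizer.step()
        print("epoch %d: mean reward %.1f" % (epoch, np.mean(rewards)))

The only difference from ordinary supervised classification is how the training set is produced: it is generated by the policy itself and filtered by reward.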
