Reinforcement Learning

Key points from Chapter 8 of the book Reinforcement Learning 2nd Edition, organized as a mind map.

Edited 2021-09-19 15:35:17

SeptemberHX

Reinforcement Learning, Chapter 8

Models and Planning

model: simulate the environment and produce simulated experience

model-based

distribution models

sample models

Distribution models are stronger than sample models because they can always be used to produce samples; in many applications, however, a sample model is much easier to obtain
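
The relationship between the two model types can be illustrated with a minimal sketch. The table `DISTRIBUTION_MODEL`, its states, and its probabilities are hypothetical values invented for illustration; the point is only that drawing one outcome from a distribution model yields a sample model.

```python
import random

# Hypothetical distribution model for one state-action pair:
# maps (next_state, reward) -> probability. Values are illustrative only.
DISTRIBUTION_MODEL = {
    ("s0", "right"): {("s1", 0.0): 0.8, ("s0", -1.0): 0.2},
}

def sample_from_distribution_model(model, state, action, rng):
    """Use a distribution model as a sample model by drawing one outcome."""
    outcomes = model[(state, action)]
    pairs = list(outcomes.keys())
    weights = list(outcomes.values())
    return rng.choices(pairs, weights=weights, k=1)[0]

rng = random.Random(0)
next_state, reward = sample_from_distribution_model(
    DISTRIBUTION_MODEL, "s0", "right", rng
)
print(next_state, reward)
```

The reverse direction is not available: a sample model only returns individual outcomes, so the probabilities would have to be estimated from many draws.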

model-free

planning

Types

State-space planning

Plan-space planning

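Unified view of state-space planning methods

1. Computing value functions is the key intermediate step toward improving the policy

2. Value functions are computed by update (backup) operations applied to simulated experience

This view fits many state-space planning methods; they differ mainly in the kind of update they perform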

planning uses simulated experience generated by a model, learning methods use real experience generated by the environment

Dyna-Q

a simple architecture integrating the major functions needed in an online planning agent

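Two roles of real experience

model-learning: improves the model so that it more closely matches the real environment. This indirect route makes fuller use of a limited amount of experience and reaches a better policy with fewer environment interactions

direct reinforcement learning: improves the value function and policy directly. It is simpler and is not affected by bias in the model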

General Dyna Architecture

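1. search control: the process that selects the starting states and actions for the simulated experiences generated by the model

2. In Dyna-Q, learning from real experience and planning from simulated experience use the same reinforcement learning method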
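
The pieces above (direct RL, model learning, search control, planning) can be put together in a minimal tabular Dyna-Q sketch. The corridor environment, the function names, and all hyperparameter values are assumptions made for illustration; the model is assumed deterministic, so it stores a single outcome per state-action pair.

```python
import random
from collections import defaultdict

GOAL = 5
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(state, action):
    """Hypothetical deterministic corridor: reward 1 on reaching GOAL."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def greedy(Q, state, rng):
    """Greedy action with random tie-breaking."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def dyna_q(episodes=50, n_planning=10, alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # terminal state keeps value 0
    model = {}               # (s, a) -> (s', r), learned from real experience
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = rng.choice(ACTIONS) if rng.random() < epsilon else greedy(Q, s, rng)
            s2, r = step(s, a)  # real experience
            # Direct RL: one-step Q-learning update on the real transition.
            target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Model learning: remember what this pair led to.
            model[(s, a)] = (s2, r)
            # Planning: the same update, applied to simulated transitions.
            for _ in range(n_planning):
                # Search control: uniform choice among previously seen pairs.
                ps, pa = rng.choice(list(model))
                ps2, pr = model[(ps, pa)]
                ptarget = pr + gamma * max(Q[(ps2, b)] for b in ACTIONS)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q

Q = dyna_q()
print([greedy(Q, s, random.Random(0)) for s in range(GOAL)])
```

Setting `n_planning=0` reduces the sketch to plain one-step Q-learning, which makes the contribution of planning easy to compare.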

Dyna-Q+

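When the model is incorrect, planning is likely to compute a suboptimal policy

exploration: trying actions that improve the model (breadth)

exploitation: behaving optimally with respect to the current model (depth)

For example, suppose the maze is changed partway through so that the best path moves from the left side to the right side. Once the agent has settled on the left-side optimum, it struggles to find the new one: even with an ε-greedy policy, it is unlikely to take enough exploratory actions to reach the new optimum in a short time

Dyna-Q+ records, for each state-action pair, how many time steps have elapsed since the pair was last tried in a real interaction with the environment

The more time steps have elapsed, the greater the chance that the dynamics of this pair have changed and that the model of it is wrong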
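
In the book's description of Dyna-Q+, this elapsed time enters planning as a bonus reward: a simulated transition with modeled reward r whose pair has not been tried for τ time steps is updated as if the reward were r + κ√τ, for some small κ. A minimal sketch follows; the function name, the `last_tried` table, and the value of `kappa` are illustrative assumptions.

```python
import math

def bonus_reward(model_reward, steps_since_tried, kappa=0.001):
    """Reward used in Dyna-Q+ planning updates: r + kappa * sqrt(tau)."""
    return model_reward + kappa * math.sqrt(steps_since_tried)

# Hypothetical bookkeeping: time step at which each pair was last tried for real.
last_tried = {("s0", "left"): 990, ("s0", "right"): 10}
now = 1000

for pair, t in last_tried.items():
    tau = now - t
    # The long-untried pair receives the larger bonus.
    print(pair, tau, bonus_reward(0.0, tau, kappa=0.01))
```

Because the bonus only appears in planning updates, it makes long-untried pairs look more valuable and so encourages the agent to test them again.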

Prioritized Sweeping

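In step (f) of Dyna-Q, simulated transitions start from state-action pairs chosen uniformly at random from those already experienced. Many of these pairs are irrelevant to the optimal solution, so this unfocused search wastes updates

backward focusing: work backward from states whose values have changed, in a manner reminiscent of backpropagation

The predecessors of states whose values have changed a lot are more likely to change a lot themselves, so updates are prioritized by the size of the change

Compared with Dyna-Q without prioritized sweeping, it has a decisive advantage

Expected updates can be used in place of sample updates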
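
A minimal sketch of the planning loop of prioritized sweeping for a deterministic model is given below. The corridor environment and the names `priority` and `sweep` are hypothetical, the model is assumed to be fully known in advance, and duplicate queue entries are tolerated for brevity (a stale entry just causes a harmless no-op update).

```python
import heapq
from collections import defaultdict

GOAL = 5
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(state, action):
    """Hypothetical deterministic corridor: reward 1 on reaching GOAL."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def priority(Q, model, pair, gamma):
    """Size of the one-step update error for a state-action pair."""
    s2, r = model[pair]
    return abs(r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[pair])

def sweep(model, Q, seed_pair, alpha=1.0, gamma=0.95, theta=1e-6, max_updates=1000):
    # Index the model backward: which pairs are predicted to lead to each state?
    predecessors = defaultdict(set)
    for (s, a), (s2, _) in model.items():
        predecessors[s2].add((s, a))
    queue = []
    p = priority(Q, model, seed_pair, gamma)
    if p > theta:
        heapq.heappush(queue, (-p, seed_pair))  # negate: heapq is a min-heap
    updates = 0
    while queue and updates < max_updates:
        _, (s, a) = heapq.heappop(queue)
        s2, r = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        updates += 1
        # Backward focusing: re-prioritize every pair that leads into s.
        for pred in predecessors[s]:
            p = priority(Q, model, pred, gamma)
            if p > theta:
                heapq.heappush(queue, (-p, pred))
    return updates

model = {(s, a): step(s, a) for s in range(GOAL) for a in ACTIONS}
Q = defaultdict(float)
n_updates = sweep(model, Q, seed_pair=(GOAL - 1, 1))
print(n_updates, Q[(0, 1)])
```

Starting from the single pair that reaches the goal, value information spreads backward along predecessors, so no update is spent on a pair whose target has not changed.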

Expected vs. Sample Updates

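expected: considers every possible next state and reward, weighted by its probability. For example, for Q: $Q(s,a) \leftarrow \sum_{s',r} \hat{p}(s',r \mid s,a)\,\bigl[r + \gamma \max_{a'} Q(s',a')\bigr]$

sample: considers a single sampled outcome. For example, for Q: $Q(s,a) \leftarrow Q(s,a) + \alpha\,\bigl[R + \gamma \max_{a'} Q(S',a') - Q(s,a)\bigr]$

Ignoring computation time, expected updates are better than sample updates because they are free of sampling error. In problems with large stochastic branching factors and too many states to solve exactly, sample updates are likely to do better than expected updates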
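
The difference can be seen on a single state-action pair. In the sketch below the model `P_HAT`, the successor values `V_NEXT`, and the discount `GAMMA` are hypothetical numbers chosen for illustration; the sample version uses a step size of 1/t, so it computes a running average of sampled targets.

```python
import random

GAMMA = 0.9
# Hypothetical distribution model for one pair (s, a):
# (next_state, reward) -> probability.
P_HAT = {("s1", 1.0): 0.5, ("s2", 0.0): 0.3, ("s3", -1.0): 0.2}
# Hypothetical current estimates of max_a' Q(s', a') at each successor.
V_NEXT = {"s1": 2.0, "s2": 0.0, "s3": 1.0}

def expected_update():
    """One expected update: sum over all outcomes, weighted by probability."""
    return sum(p * (r + GAMMA * V_NEXT[s2]) for (s2, r), p in P_HAT.items())

def sample_updates(n, rng):
    """n sample updates with step size 1/t, each using one sampled outcome."""
    q = 0.0
    outcomes, probs = list(P_HAT), list(P_HAT.values())
    for t in range(1, n + 1):
        s2, r = rng.choices(outcomes, weights=probs, k=1)[0]
        q += (1.0 / t) * (r + GAMMA * V_NEXT[s2] - q)
    return q

rng = random.Random(0)
print("expected:", expected_update())
for n in (1, 10, 100, 1000):
    print("sample, n =", n, ":", sample_updates(n, rng))
```

One expected update costs about as much as b sample updates, where b is the number of possible outcomes (3 here). With a large b, many cheap sample updates spread over many pairs can be the better use of a fixed computation budget.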

Trajectory Sampling

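Exhaustive sweep: devotes equal time to every part of the state space

Trajectory Sampling: simulates explicit individual trajectories and performs updates at the states or state-action pairs encountered along the way

Sampling trajectories on-policy can be a great advantage in large problems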

Real-time Dynamic Programming (RTDP)

Updates the values of states visited in real or simulated trajectories by means of expected tabular value-iteration updates

The result is an optimal partial policy

RTDP is guaranteed to find a policy that is optimal on the relevant states without visiting every state infinitely often, or even without visiting some states at all

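Conditions that guarantee convergence to an optimal policy

The initial value of every goal state is zero

There is at least one policy that guarantees a goal state will be reached with probability one from any start state

All rewards for transitions from non-goal states are strictly negative

All initial values are equal to, or greater than, their optimal values

Tasks with these properties are usually stated as cost minimization rather than reward maximization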
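
A minimal RTDP sketch on a tiny task that meets the conditions above. The `MODEL` table, state names, probabilities, and rewards are hypothetical; the task is undiscounted, all values start at zero (no lower than the optimal values), and each trial follows the greedy policy from the start state `"A"`.

```python
import random

# Hypothetical model: state -> action -> list of (probability, next_state, reward).
# "G" is the goal state; every reward from a non-goal state is negative.
MODEL = {
    "A": {"go": [(0.8, "B", -1.0), (0.2, "A", -1.0)],
          "detour": [(1.0, "X", -5.0)]},
    "B": {"go": [(0.8, "G", -1.0), (0.2, "B", -1.0)]},
    "X": {"go": [(1.0, "G", -5.0)]},
}

def q_value(V, state, action):
    """Expected one-step return of an action under the model (undiscounted)."""
    return sum(p * (r + V[s2]) for p, s2, r in MODEL[state][action])

def rtdp(n_trials=200, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in ("A", "B", "X", "G")}
    for _ in range(n_trials):
        s = "A"
        while s != "G":
            # Expected value-iteration update, applied only to the visited state.
            qs = {a: q_value(V, s, a) for a in MODEL[s]}
            a = max(qs, key=qs.get)
            V[s] = qs[a]
            # Follow the greedy action and sample the next state from the model.
            outcomes = MODEL[s][a]
            _, s, _ = rng.choices(outcomes, weights=[o[0] for o in outcomes], k=1)[0]
    return V

V = rtdp()
print(V)
```

In this sketch the state `"X"` is never visited, so its value is never updated, yet the values of `"A"` and `"B"` still converge. That is the sense in which the resulting policy is optimal only on the relevant states.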

Planning

background planning

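Gradually improves a policy or value function from simulated experience

Makes up an important part of the updates in tabular methods

Suited to settings where little computation time is available at decision time: planning runs in the background beforehand and yields a policy that then guides action selection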

decision-time planning

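Focuses on the decision for the current state: computes an action for the current state

Suited to settings where ample computation time is available per decision, such as chess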

Heuristic Search

For each state encountered, a tree of possible continuations is considered

Rollout Algorithms

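Decision-time planning based on Monte Carlo control applied to simulated trajectories that all begin at the current state

The goal is not to estimate a complete optimal action-value function, nor a complete action-value function for a given policy

The given policy is called the rollout policy

Monte Carlo estimates of the action values for the current state are computed under the rollout policy, used immediately, and then discarded rather than stored

The aim is to improve upon the rollout policy, not to find an optimal policy

Intuition suggests that the better the rollout policy and the more accurate the value estimates, the better the policy produced by a rollout algorithm is likely to be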
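
A minimal sketch of a rollout algorithm. The counting game, the function names, and the uniformly random rollout policy are all hypothetical choices for illustration: the player adds 1 or 2 to a running total and earns reward 1 only by landing exactly on `TARGET`.

```python
import random

TARGET = 10
ACTIONS = (1, 2)

def step(total, action):
    """Hypothetical counting game: reward 1 for reaching TARGET exactly."""
    total += action
    done = total >= TARGET
    reward = 1.0 if total == TARGET else 0.0
    return total, reward, done

def rollout_policy(total, rng):
    """Assumed rollout policy: choose uniformly at random."""
    return rng.choice(ACTIONS)

def simulate(total, first_action, rng):
    """Return obtained by taking first_action, then following the rollout policy."""
    total, reward, done = step(total, first_action)
    ret = reward
    while not done:
        total, reward, done = step(total, rollout_policy(total, rng))
        ret += reward
    return ret

def rollout_action(total, n_rollouts=100, seed=0):
    """Monte Carlo estimate of each action's value at the current state only."""
    rng = random.Random(seed)
    estimates = {}
    for a in ACTIONS:
        returns = [simulate(total, a, rng) for _ in range(n_rollouts)]
        estimates[a] = sum(returns) / n_rollouts
    # The estimates are used for this one decision and then thrown away.
    return max(estimates, key=estimates.get), estimates

action, estimates = rollout_action(8)
print(action, estimates)
```

Acting greedily on these estimates is one step of policy improvement over the rollout policy at the current state; nothing is stored for later decisions.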

Monte Carlo Tree Search

A kind of rollout algorithm

MCTS has proved to be effective in a wide variety of competitive settings, including general game playing

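Core idea: extending the initial portions of trajectories that have received high evaluations from earlier simulations

Steps

1. Selection: the simulated trajectories form a tree; starting at the root node, a tree policy traverses the tree to select a leaf node

2. Expansion: the selected node is expanded by adding child nodes reached through unexplored actions

3. Simulation: starting from the node chosen in steps 1 and 2, a complete episode is generated with the rollout policy. The result is a Monte Carlo trial with actions selected first by the tree policy and beyond the tree by the rollout policy

4. Backup: the return of the simulated episode is backed up along the traversed path to update the nodes in the tree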
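
The four steps can be sketched as follows. This is a minimal illustration under several assumptions that are not in the notes: the tree policy is UCB1 with constant `c`, the rollout policy is uniformly random, the hypothetical task (choose three bits, reward 1 only if all are 1) has only a terminal reward, and the final move is the most-visited root action.

```python
import math
import random

DEPTH = 3
ACTIONS = (0, 1)

def step(state, action):
    """Hypothetical task: state is the tuple of bits chosen so far."""
    nxt = state + (action,)
    done = len(nxt) == DEPTH
    reward = 1.0 if done and all(nxt) else 0.0
    return nxt, reward, done

class Node:
    def __init__(self, state, done=False, reward=0.0, parent=None):
        self.state, self.done, self.reward, self.parent = state, done, reward, parent
        self.children = {}  # action -> Node
        self.visits = 0
        self.total = 0.0    # sum of returns backed up through this node

    def untried(self):
        return [a for a in ACTIONS if a not in self.children]

def ucb_child(node, c):
    """Tree policy (assumed UCB1): mean return plus an exploration bonus."""
    def score(child):
        bonus = c * math.sqrt(math.log(node.visits) / child.visits)
        return child.total / child.visits + bonus
    return max(node.children.values(), key=score)

def mcts(root_state, iterations=500, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while not node.done and not node.untried():
            node = ucb_child(node, c)
        # 2. Expansion: add one child via an unexplored action.
        if not node.done:
            a = rng.choice(node.untried())
            s2, r, done = step(node.state, a)
            node.children[a] = Node(s2, done, r, parent=node)
            node = node.children[a]
        # 3. Simulation: finish the episode with the rollout policy.
        state, done, ret = node.state, node.done, node.reward
        while not done:
            state, r, done = step(state, rng.choice(ACTIONS))
            ret += r
        # 4. Backup: propagate the return along the path back to the root.
        while node is not None:
            node.visits += 1
            node.total += ret
            node = node.parent
    best = max(root.children, key=lambda a: root.children[a].visits)
    return best, root

best_action, root = mcts(())
print(best_action, {a: ch.visits for a, ch in root.children.items()})
```

Only the nodes inside the tree keep statistics; states reached during the simulation step are not stored, which matches the rollout idea of discarding estimates beyond the tree.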

Summary