SAC Algorithm

Bilibili video walkthroughs of the original SAC paper:

  • https://www.bilibili.com/video/BV1YK4y1T7b6/?spm_id_from=333.788&vd_source=54a0edf8490a6a72a5e82c4e543fc3e2
  • https://www.bilibili.com/video/BV13V411e7Qb/?spm_id_from=333.788&vd_source=54a0edf8490a6a72a5e82c4e543fc3e2

Zhihu link:

  • https://zhuanlan.zhihu.com/p/385658411

SAC Convergence Proof

First, a brief explanation of the KL divergence \(D_{KL}\).

The KL divergence is a statistical measure of how much two probability distributions differ; in information theory it is also known as relative entropy.

For a discrete probability distribution, the KL divergence is computed as \[ D_{KL}(P\|Q) = \sum_{x\in X} P(x)\ln\frac{P(x)}{Q(x)} \] and for a continuous probability distribution as \[ D_{KL}(P\|Q) = \int_{-\infty}^{+\infty} P(x)\ln\frac{P(x)}{Q(x)}\,dx \] Note in particular that the KL divergence from P to Q is in general not equal to the KL divergence from Q to P, i.e. the KL divergence is not symmetric.
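A minimal numerical sketch of the discrete case (assuming NumPy; the two distributions `p` and `q` are made-up examples), which also illustrates the asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) = sum_x P(x) * ln(P(x) / Q(x))."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    mask = p > 0          # terms with P(x) = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))  # D_KL(P || Q) ≈ 0.18
print(kl_divergence(q, p))  # D_KL(Q || P) ≈ 0.19 -- a different value
```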

Starting from the current policy, an entropy term is added and the new policy is chosen to minimize the KL divergence between a candidate policy \(π'\) and the distribution induced by the old soft Q-values:

\[ π_{new} = \arg\min_{π'} D_{KL}\left(π'(\cdot|s_{t})\,\middle\|\,\frac{\exp(Q^{π_{old}}(s_{t},\cdot))}{Z^{π_{old}}(s_{t})}\right) \]

Proof that \(π_{new}\) is always at least as good as \(π_{old}\):

\[
\begin{aligned}
D_{KL}\left(π'(\cdot|s_{t})\,\middle\|\,\frac{\exp(Q^{π_{old}}(s_{t},\cdot))}{Z^{π_{old}}(s_{t})}\right)
&= -\int π'(a_{t}|s_{t})\log\frac{\exp\!\left(Q^{π_{old}}(s_{t},a_{t}) - \log Z^{π_{old}}(s_{t})\right)}{π'(a_{t}|s_{t})}\,da_{t} \\
&= -\int π'(a_{t}|s_{t})\left(Q^{π_{old}}(s_{t},a_{t}) - \log Z^{π_{old}}(s_{t}) - \log π'(a_{t}|s_{t})\right)da_{t} \\
&= \int π'(a_{t}|s_{t})\left(-Q^{π_{old}}(s_{t},a_{t}) + \log Z^{π_{old}}(s_{t}) + \log π'(a_{t}|s_{t})\right)da_{t} \\
&= E_{a_{t}\sim π'}\left[\log Z^{π_{old}}(s_{t}) + \log π'(a_{t}|s_{t}) - Q^{π_{old}}(s_{t},a_{t})\right]
\end{aligned}
\]

Since \(π_{new}\) minimizes this KL divergence and \(π_{old}\) is itself a feasible candidate, we obtain \[ E_{a_{t}\sim π_{new}}\left[Q^{π_{old}}(s_{t},a_{t}) - \log π_{new}(a_{t}|s_{t})\right] \ge V^{π_{old}}(s_{t}) \] Substituting this inequality into the soft Bellman equation then gives \[ Q^{π_{new}} \ge Q^{π_{old}} \] That is, when the policy is iterated according to the update above, the new Q-value is never lower than the old Q-value.
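To make the final "substitute into the Bellman equation" step explicit, the inequality can be pushed through the soft Bellman backup repeatedly; this is a sketch of the argument, following the structure of the original paper's policy improvement lemma:

\[
\begin{aligned}
Q^{π_{old}}(s_{t},a_{t}) &= r(s_{t},a_{t}) + γ\,E_{s_{t+1}}\left[V^{π_{old}}(s_{t+1})\right] \\
&\le r(s_{t},a_{t}) + γ\,E_{s_{t+1}}\left[E_{a_{t+1}\sim π_{new}}\left[Q^{π_{old}}(s_{t+1},a_{t+1}) - \log π_{new}(a_{t+1}|s_{t+1})\right]\right] \\
&\;\;\vdots \\
&\le Q^{π_{new}}(s_{t},a_{t})
\end{aligned}
\]

where the first inequality applies \(V^{π_{old}}(s_{t+1}) \le E_{a\sim π_{new}}[Q^{π_{old}}(s_{t+1},a) - \log π_{new}(a|s_{t+1})]\) from above, and the expansion is repeated until only \(Q^{π_{new}}\) remains.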

Objective Functions

state value function \(V_{φ}(s_{t})\)

soft Q-function \(Q_{θ}(s_{t},a_{t})\)

tractable policy \(π_{Φ}(a_{t}|s_{t})\)
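A minimal PyTorch sketch of these three parameterized functions (the class names, hidden sizes, and the diagonal Gaussian parameterization of the policy are illustrative assumptions, not taken verbatim from the paper):

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    """Small two-hidden-layer MLP shared by all three networks below."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class ValueNet(nn.Module):            # state value function V_phi(s_t)
    def __init__(self, obs_dim):
        super().__init__()
        self.v = mlp(obs_dim, 1)
    def forward(self, obs):
        return self.v(obs)

class QNet(nn.Module):                # soft Q-function Q_theta(s_t, a_t)
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.q = mlp(obs_dim + act_dim, 1)
    def forward(self, obs, act):
        return self.q(torch.cat([obs, act], dim=-1))

class GaussianPolicy(nn.Module):      # tractable policy pi_Phi(a_t | s_t)
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim, 2 * act_dim)
    def forward(self, obs):
        mu, log_std = self.net(obs).chunk(2, dim=-1)
        return mu, log_std.clamp(-20, 2)   # clamp keeps the std numerically sane
```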

Update procedure (a combined code sketch follows the list below)

  1. Optimizing the state value network parameters

    The objective is optimized with an MSE loss: \[ J_{V}(φ) = E_{s_{t}\sim D}[\frac{1}{2}(V_{φ}(s_{t}) - E_{a_{t}\sim π_{Φ}}[Q_{θ}(s_{t},a_{t}) - \log π_{Φ}(a_{t}|s_{t})])^{2}] \]

  2. Optimizing the soft Q-function network parameters

    The objective is likewise an MSE loss: \[ J_{Q}(θ) = E[\frac{1}{2}(Q_{θ}(s_{t},a_{t}) - \hat{Q}(s_{t},a_{t}))^{2}] \\ \hat{Q}(s_{t},a_{t}) = r + γE_{s_{t+1}}[V_{\bar{φ}}(s_{t+1})] \] where \(V_{\bar{φ}}\) denotes a slowly updated target copy of the value network.

  3. Updating the policy network parameters \[ J_{π}(Φ) = E[D_{KL}(π_{Φ}(\cdot|s_{t})\|\frac{\exp(Q_{θ}(s_{t},\cdot))}{Z_{θ}(s_{t})})] \]

    Expanding the KL divergence (and dropping the partition function, which does not depend on \(Φ\)) simplifies the policy loss to \[ J_{π}(Φ) = E[α\log π_{Φ}(a_{t}|s_{t}) - Q_{θ}(s_{t},a_{t})] \]

    1. If this loss were optimized with a plain policy-gradient (PG) style estimator, the algorithm would effectively only work in the on-policy setting; it transfers poorly to off-policy data, even with importance sampling.

    2. SAC therefore updates the policy network with the reparameterization trick, introducing an auxiliary noise variable: \[ a_{t} = \tanh(μ_{Φ}(s_{t}) + ε\,σ_{Φ}(s_{t})) \\ ε \sim N(0,1) \]
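A hedged sketch of one full update step combining the three losses above, reusing the networks sketched earlier. The replay-buffer batch layout, optimizer objects, fixed temperature `alpha`, and the target value network `target_value_net` (the \(V_{\bar{φ}}\) mentioned in step 2) are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

alpha, gamma = 0.2, 0.99   # assumed entropy temperature and discount factor

def sample_action(policy, obs):
    """Reparameterized sample a = tanh(mu + eps * sigma), eps ~ N(0, 1)."""
    mu, log_std = policy(obs)
    std = log_std.exp()
    eps = torch.randn_like(mu)                    # noise, not a network output
    pre_tanh = mu + eps * std
    action = torch.tanh(pre_tanh)
    # log pi(a|s) with the tanh change-of-variables correction
    log_prob = Normal(mu, std).log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
    return action, log_prob.sum(dim=-1, keepdim=True)

def sac_update(batch, value_net, target_value_net, q_net, policy,
               v_opt, q_opt, pi_opt):
    obs, act, rew, next_obs = batch   # tensors sampled from a replay buffer

    # 1. Value loss: J_V = E[ 1/2 (V(s) - E[Q(s,a) - alpha * log pi(a|s)])^2 ]
    new_act, log_pi = sample_action(policy, obs)
    v_target = (q_net(obs, new_act) - alpha * log_pi).detach()
    v_loss = F.mse_loss(value_net(obs), v_target)
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # 2. Soft Q loss: J_Q = E[ 1/2 (Q(s,a) - (r + gamma * V_target(s')))^2 ]
    q_hat = (rew + gamma * target_value_net(next_obs)).detach()
    q_loss = F.mse_loss(q_net(obs, act), q_hat)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # 3. Policy loss via the reparameterization trick:
    #    J_pi = E[ alpha * log pi(a|s) - Q(s, a) ],  a = tanh(mu + eps * sigma)
    #    (implementations often freeze Q-network gradients here; omitted for brevity)
    new_act, log_pi = sample_action(policy, obs)
    pi_loss = (alpha * log_pi - q_net(obs, new_act)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

In practice `target_value_net` is refreshed after each step as an exponential moving average of `value_net`'s parameters.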

SAC Interpretability

  • To be added

  • Choosing alpha (see the code sketch after the quoted answer below)

    I can't answer on behalf of the authors, but it makes sense to me that they would choose a default value \(\propto \dim(\mathcal{A})\). If they instead chose some fixed constant (i.e. not dependent on action dimensionality), then problems with larger action dimensionalities would have to spread the same budget of randomness over the different action dimensions. For very large action spaces this might result in effectively deterministic behaviour. By choosing \(\bar{\mathcal{H}} \propto \dim(\mathcal{A})\), they are making sure that if you double the action dimensionality then you also double the amount of "total" randomness allowed by the policy, where "total" randomness is differential entropy (if you are confused why they chose a negative constant of proportionality, remember that differential entropy can be negative, and that differential entropies add across dimensions for diagonal stochastic policies). I think it just happens to be the case that for most mujoco tasks it's effective to have relatively low target entropies; this might not hold for non-mujoco tasks.
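A hedged sketch of the automatic temperature adjustment commonly paired with this heuristic, with the target entropy set to \(-\dim(\mathcal{A})\); the `log_alpha` parameterization, learning rate, and optimizer choice are assumptions, not something stated in the quoted answer:

```python
import torch

act_dim = 6                                      # assumed action dimensionality
target_entropy = -float(act_dim)                 # default heuristic: H_bar = -dim(A)

log_alpha = torch.zeros(1, requires_grad=True)   # optimize log(alpha) so alpha stays positive
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_pi):
    """One gradient step on J(alpha) = E[-alpha * (log pi(a|s) + H_bar)].

    log_pi: log-probabilities of actions freshly sampled from the current policy.
    """
    alpha_loss = -(log_alpha.exp() * (log_pi + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()                # current alpha for the value / policy losses
```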