
Reinforcement learning series – getting the basics – part 5

Last time we talked about learning from partial episodes with the TD methodology. Another issue that arises when solving RL problems: what should we do with problems that are too big, like Go (10^170 states), or with problems that have continuous states?


The methods that help represent and estimate the value functions in such problems are called value function approximation. In this article we will only review value function approximation via a deep neural network; moreover, we will see how to implement it in a popular methodology called Deep Q-Network (DQN). DQN was developed by DeepMind in 2013 and was used to beat human experts in multiple Atari games. DQN solves problems that use picture frames as states, but it can be modified for continuous problems. I will use the hyper parameters of the DQN paper, but remember that they might need some tweaking to best suit your own problem.


Figure 1: Atari's Space Invaders – one of the games in which DQN beat human experts

As we learned in the first article, there are differences between RL and supervised learning. A DNN is a supervised learning method that requires the input to be i.i.d.; otherwise, the model might overfit some samples and the solution won't generalize. Moreover, in Q-learning we want to predict the action value function, but we keep updating it every few time steps, so we get labels that change over time for the same input. A DNN needs this stability in its inputs and outputs to perform well. In RL, however, both the input and the target change constantly during learning, which makes the training process unstable. This unstable learning process is basically like a dog chasing its own tail.


We've already seen that the input and output can converge. So we might have a chance to model the action value function while allowing it to evolve, if we slow down the changes in the input and output. In DQN we accomplish this in two ways:


  • Experience replay – we use a buffer that holds 1,000,000 transitions and sample a mini batch of 32 transitions from it to train the DNN (the buffer and mini batch sizes are hyper parameters). This stabilizes the input, because random samples from the replay buffer are more independent of each other, making the mini batch closer to i.i.d. As new transitions are created, they replace the oldest transitions in the buffer.


  • Target network – we create two DNNs, θ and θ⁻. We use θ⁻ to retrieve the action value function (the target), while θ receives all the updates during training. Every 100,000 updates we synchronize θ⁻ with θ. By doing so we fix the action value function temporarily, so we don't chase a moving target. Moreover, because θ⁻ is fixed, changes to θ don't impact the targets immediately, so even if the input is not i.i.d. its effect isn't magnified.
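The replay buffer from the first point is simple to implement. Below is a minimal Python sketch; the 1,000,000 capacity and the 32-transition mini batch follow the hyper parameters mentioned above, while the class and method names are just illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer: when full, the oldest transition is evicted."""

    def __init__(self, capacity=1_000_000):
        # deque with maxlen drops the oldest item automatically on append
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, pushing the mini batch closer to i.i.d.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The target network, in the same spirit, is just a second copy of the weights that you overwrite with the online weights every C updates.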

DQN uses Huber loss instead of the quadratic loss we are used to in supervised learning. Huber loss is quadratic within a range of a certain δ and linear for larger absolute values. With Huber loss we limit the sharp gradient changes that might damage the learning process. Huber loss:


L_δ(a) = ½·a²               for |a| ≤ δ
L_δ(a) = δ·(|a| − ½·δ)      otherwise
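In code, the piecewise definition above is a one-liner per branch. A minimal sketch, assuming δ = 1 (the clipping scale used in the DQN paper):

```python
def huber_loss(a, delta=1.0):
    """Quadratic for |a| <= delta, linear (slope delta) beyond it."""
    if abs(a) <= delta:
        return 0.5 * a * a
    # The linear branch is shifted so the two pieces meet at |a| = delta
    return delta * (abs(a) - 0.5 * delta)
```

Note that at |a| = δ both branches give ½·δ², so the loss and its gradient are continuous – that continuity is exactly what keeps the updates smooth.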


The DNN architecture takes 4 sequential picture frames and feeds them through convolution layers that, at the end, compute the action value of each action for the input state (4 sequential frames):
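As a sanity check on such an architecture, the spatial dimensions can be traced with the standard valid-convolution formula (W − K)/S + 1. The kernel sizes and filter counts below follow the later Nature version of DQN (32, 64, 64 filters on 84×84 frames) and are my assumption, since the exact layer sizes aren't listed above:

```python
def conv_out(size, kernel, stride):
    # Output width of a valid (no padding) convolution.
    return (size - kernel) // stride + 1

size = 84                     # each preprocessed frame is 84x84
size = conv_out(size, 8, 4)   # conv1: 8x8 kernel, stride 4
size = conv_out(size, 4, 2)   # conv2: 4x4 kernel, stride 2
size = conv_out(size, 3, 1)   # conv3: 3x3 kernel, stride 1
flat = size * size * 64       # 64 filters in the last conv layer
```

The flattened output then goes through a fully connected layer and finally into one output unit per action.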




For your convenience I’ve added the pseudo code from DeepMind’s article:


Deep Q-learning with experience replay

Initialize replay memory D to capacity N

Initialize action-value function Q with random weights θ

Initialize target action-value function Q̂ with weights θ⁻ = θ

For episode = 1 to M do:

    Initialize sequence s_1 = {x_1} and preprocessed sequence ϕ_1 = ϕ(s_1)

    For t = 1 to T do:

        With probability ε select a random action a_t

        Otherwise select a_t = argmax_a Q(ϕ(s_t), a; θ)

        Execute action a_t in the emulator and observe reward r_t and image x_{t+1}

        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess ϕ_{t+1} = ϕ(s_{t+1})

        Store transition (ϕ_t, a_t, r_t, ϕ_{t+1}) in D

        Sample a random minibatch of transitions (ϕ_j, a_j, r_j, ϕ_{j+1}) from D

        Set y_j = r_j if the episode terminates at step j+1, otherwise y_j = r_j + γ·max_{a'} Q̂(ϕ_{j+1}, a'; θ⁻)

        Perform a gradient descent step on (y_j − Q(ϕ_j, a_j; θ))² with respect to the network parameters θ

        Every C steps reset Q̂ = Q

    End for

End for


*ϕ is a preprocessing of the last 4 image frames that represents the state. In the paper they used 4 frames to capture motion.
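The y_j target from the pseudo code is easy to express over a whole mini batch with numpy. This is an illustrative sketch, where q_next stands for the target network's output Q̂(ϕ_{j+1}, ·; θ⁻) on the batch, and γ = 0.99 follows the paper:

```python
import numpy as np

def td_targets(rewards, q_next, dones, gamma=0.99):
    """y_j = r_j for terminal transitions, else r_j + gamma * max_a' Q̂."""
    # dones (1.0 if the episode ended at step j+1) masks out the bootstrap term.
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

# Example: batch of 2 transitions, 3 actions
rewards = np.array([1.0, 2.0])
q_next = np.array([[0.0, 5.0, 1.0],
                   [3.0, 0.0, 0.0]])
dones = np.array([0.0, 1.0])  # second transition is terminal
targets = td_targets(rewards, q_next, dones)  # -> [1 + 0.99*5, 2.0]
```

Note that q_next comes from the frozen weights θ⁻, not from the network being updated – that is exactly the target-network trick from above.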


  • Source: Playing Atari with Deep Reinforcement Learning



This is the last part of the series. I hope you've learned a lot and that I've ignited a desire within you to use and study RL even more. If you want to dig deeper, I recommend reading Sutton and Barto's "Reinforcement Learning: An Introduction" or watching David Silver's lectures on YouTube.

Yours truly,​​ Don Shaked


