## Reinforcement learning series – getting the basics – part 5

Last time​​ we’ve talked about​​ learning​​ from partial episodes​​ with​​ the TD methodology:​​ https://g-stat.com/reinforcement-learning-series-getting-the-basics-part-4/. Another issue that rises while​​ solving​​ RL problems​​ – what should we do with problems that are too big like Go (10170​​ states) or​​ with​​ problems​​ that have​​ continuous​​ states?​​

The methods​​ that help representing and estimating the functions in​​ these problems are called value function approximation.​​ In this article we​​ would​​ only review value function approximation via deep neural network;​​ moreover,​​ we would see how to implement it in a​​ popular​​ methodology called​​ Deep Q-Network​​ (DQN). DQN is​​ a methodology developed by DeepMind in 2013. It was used to beat human experts in multiple Atari games.​​ DQN​​ solves​​ problems​​ that use picture frames as​​ states but​​ can​​ be modified for continuous problems.​​ I would use the hyper parameters of the DQN paper but remember​​ that​​ they might need some tweaking to best suit your​​ own​​ problem.

Figure​​ 1​​ Atari's Space Invaders – one of the games DQN beat human experts

As we’ve​​ learned in the first article there​​ are​​ differences​​ between RL and supervised learning. As we all know​​ DNN is​​ a supervised learning method that​​ requires the input to be i.i.d. Otherwise,​​ the model might be overfitted for some samples and the solution wouldn’t be generalized.​​ Moreover,​​ in Q learning we want to predict the action value function,​​ but after certain time steps we are updating it. This way we’ve got labels that change over time​​ for the same input.​​ ​​ This stability condition for the output and input is needed for the DNN to perform well.​​ However,​​ in RL both input and the target change constantly during the learning process making the training process to be unstable.​​ This unstable learning process is basically like a dog chasing his own tail.

We’ve​​ already seen the input and output can converge. So we might have a chance to model the action value function while allowing it to evolve,​​ if we slow down the changes in the input and output.​​ In DQN we​​ accomplish​​ this​​ in​​ two​​ ways:

• Experience replay –​​ Using a buffer in which we would​​ put​​ 1,000,000​​ transitions and sample a mini batch​​ of 32 transitions​​ from this buffer to train the DNN​​ (the buffer and the mini batch sizes are hyper parameters). This way we stabilize​​ the input,​​ because​​ random samplings from the replay buffer​​ are​​ more independent, thus making the mini batch closer to I.I.D.​​ As we create new​​ transition,​​ we would replace old transitions with new ones.

• Target network –​​ Creating two DNN​​ θ​​ and​​ θ-.​​ We would use​​ θ– to retrieve the action value function while​​ θ​​ would include all the updates​​ in the training process. After​​ 100,000 updates, we synchronize​​ θ​​ and​​ θ-. By doing so we’re fixing the action value function​​ temporarily,​​ so we don’t need to chase a moving target.​​ Moreover, by fixing​​ θ– parameter changes don’t impact​​ θ– immediately, thus even if the input is not I.I.D it won’t magnify its effect.

DQN uses Huber loss instead​​ of​​ quadratic loss we are used​​ to have​​ in supervised learning.​​ Huber loss​​ has​​ quadratic values​​ while​​ in range of​​ a certain​​ δ​​ and linear for larger absolute values. With Huber loss we allow​​ fewer​​ sharp changes​​ that​​ might damage the learning process. Huber loss:

Lδa=12a2          for aδδa12δ,    else

The DNN architecture takes 4​​ sequential​​ pictures​​ frames and feed it to​​ convolution layer​​ that​​ at the end​​ computes the action values for each action for the input state (4 sequential pictures):

 Deep Q-learning with experience replay Initialize replay memory D to capacity NInitialize action-value function Q with random weights​​ θInitialize target action-value function​​ Q^​​ with weights​​ θ–=θFor episode = 1 to M do: ​​ ​​ ​​​​ Initialize sequence​​ s1={x1}​​ and preprocessed sequence​​ ϕ1=ϕ(s1) ​​ ​​ ​​​​ For t= 1 to T do: ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ With probability​​ ε​​ select a random​​ at ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ Otherwise select​​ at=argmaxaA(ϕ(st),a;θ) ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ Execute action​​ at​​ in emulator and observe reward​​ rt​​ and image​​ xt+1 ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ Set​​ st+1=st,at,xt+1​​ and preprocess​​ ϕt+1=ϕ(st+1) ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ Store transition (ϕt,at,rt,ϕt+1) in D ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ Sample random minibatch of transitions (ϕj,aj,rj,ϕj+1)​​  ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ Set​​ yj=rjrj+γmaxa'⁡Q^ϕj,aj;θ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ Perform a gradient decent step on​​ yj–Qϕj,aj;θ2with respect to the network ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ Parameters​​ θ ​​ ​​ ​​ ​​ ​​ ​​ ​​​​ Every C steps rest​​ Q^=Q ​​ ​​ ​​​​ End forEnd for *ϕ​​ is preprocess of last 4 image frames to represent the state. In the paper they used 4 frames to capture motion. Source: Playing Atari with Deep Reinforcement Learninghttps://arxiv.org/pdf/1312.5602.pdf

This is the last part of the series I hope you’ve learned a lot and that I’ve ignited a desire within you to use and study RL even more.​​ If you want to dig​​ deeper,​​ I recommend reading Sutton’s​​ and Barto’s​​ “Reinforcement​​ Learning an Introduction” or watch David Silver’s lectures on​​ YouTube.

Yours truly,​​ Don Shaked

​​

הגיע הזמן לתפקיד הבא בקריירה? יש לנו עשרות משרות אטרקטיביות בתחום ה Data scientist תוכל/י לצפות במשרות כאן
או להשאיר פרטים (דיסקרטיות מלאה מובטחת!!)