Let us consider the task of swinging up a pendulum using reinforcement learning.
The dynamics of a pendulum with limited torque are written as
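The defining equation is not reproduced here, but a minimal simulation sketch can still be given. It assumes the standard frictional pendulum model ml²·dω/dt = −μω + mgl·sin θ + u with θ measured from the upright position and the torque clipped to a maximum value; all parameter values below are illustrative assumptions, not taken from the text.

```python
import math

# Hypothetical parameters (illustrative, not from the text):
# mass M, length L, friction MU, gravity G, torque limit U_MAX, time step DT.
M, L, MU, G = 1.0, 1.0, 0.01, 9.8
U_MAX, DT = 5.0, 0.02

def step(theta, omega, u):
    """One Euler step of the assumed dynamics
    M*L^2 * domega/dt = -MU*omega + M*G*L*sin(theta) + u,
    with theta measured from the upright position."""
    u = max(-U_MAX, min(U_MAX, u))  # limited torque
    domega = (-MU * omega + M * G * L * math.sin(theta) + u) / (M * L ** 2)
    omega = omega + DT * domega
    theta = theta + DT * omega
    return theta, omega
```

Because sin θ > 0 just below the upright position, the pendulum accelerates away from it under zero torque, which is why an active controller is needed.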
Our goal is to construct a controller which observes the state (θ, ω) of the pendulum and applies an appropriate torque u to it.
Within the framework of reinforcement learning, the controller can be constructed as shown in the following figure.
The cumulative future reward obtained by following the policy u is defined as
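The defining equation is likewise not reproduced here; a commonly used form is an exponentially discounted sum (or, in continuous time, integral) of future rewards. A minimal discrete-time sketch of that assumed definition:

```python
def discounted_return(rewards, gamma=0.99):
    """Discrete-time counterpart of the cumulative future reward:
    V = sum_k gamma^k * r_k (an assumed standard form, not from the text).
    Computed backwards for numerical simplicity."""
    v = 0.0
    for r in reversed(rewards):
        v = r + gamma * v
    return v
```

With a constant reward of 1 at every step, this quantity approaches 1/(1 − γ) as the horizon grows, so large values of V indicate that high rewards keep arriving in the future.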
By steering toward states with large values of V(θ, ω), we can bring the pendulum to the upright state.
However, V(θ, ω) is an unknown function, so we must obtain it by function approximation.
It is known that V(θ, ω) can be learned correctly by minimizing the temporal difference (TD) error δ(t).
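The TD error is not written out above, but its standard discrete-time form is δ = r + γV(s′) − V(s); driving this error toward zero makes V consistent with the cumulative future reward. A minimal sketch under that assumed form:

```python
def td_error(r, v_s, v_next, gamma=0.99):
    """Assumed standard discrete-time TD error:
    delta = r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

def critic_update(v_s, delta, alpha=0.1):
    """Gradient step on the squared TD error with respect to V(s):
    V(s) <- V(s) + alpha * delta."""
    return v_s + alpha * delta
```

When the reward arrives but V(s) has not yet predicted it, δ is positive and the update raises V(s); at convergence δ averages to zero.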
The unit which approximates V(θ, ω) is called the "critic".
On the other hand, the unit which determines the input u from (θ, ω) is called the "actor" (n is a noise term).
Since μ(θ, ω) is also an unknown function, we must approximate it as well.
In this case as well, μ(θ, ω) can be learned correctly by using the temporal difference error δ(t) as the reinforcement signal.
In this applet, the critic is trained with the TD(λ) method for continuous time and space, and the actor is trained with a policy gradient method.