|
Let us consider the reinforcement learning of CPG-controlled
biped walking.
The model of biped walking (Taga, 1991) is described by equations of motion in 14 variables.
Adding their time derivatives (vx, vy) and (ωR1, ωR2, ωL1, ωL2) gives the full state of this dynamical system. The torque T applied to the model is determined by a PD control scheme, and the target angles θd are generated by a central pattern generator (CPG), a mechanism that biological systems use for motion control. In this work, you can choose as the CPG either the ensemble-averaged firing rates of the pulse neural network proposed by Kanamaru (2006) or the outputs of coupled van der Pol oscillators. Following Matsubara (2005), let us consider the reinforcement learning of CPG-controlled biped walking with the following scheme.
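The CPG-plus-PD arrangement described above can be sketched as follows. This is a minimal illustration, not the applet's implementation: the explicit Euler integration, the anti-phase coupling term, and the parameters `mu`, `omega`, `k`, `amplitude`, `kp`, and `kd` are all assumptions chosen for readability.

```python
import math

def vdp_step(x, v, coupling_input, dt, mu=1.0, omega=2.0 * math.pi):
    """One explicit-Euler step of a van der Pol oscillator
    x'' = mu * (1 - x^2) * x' - omega^2 * x + coupling_input."""
    a = mu * (1.0 - x * x) * v - omega * omega * x + coupling_input
    return x + dt * v, v + dt * a

def pd_torque(theta_d, theta, omega_joint, kp=50.0, kd=5.0):
    """PD control: pull the joint angle toward the target and damp its velocity.
    The gains kp and kd are hypothetical values."""
    return kp * (theta_d - theta) - kd * omega_joint

def cpg_targets(steps=2000, dt=0.001, k=1.0, amplitude=0.3):
    """Two coupled oscillators (right and left), scaled to target hip angles.
    The coupling -k * (x_r + x_l) vanishes when the two are in anti-phase."""
    xr, vr, xl, vl = 0.1, 0.0, -0.1, 0.0
    out = []
    for _ in range(steps):
        # Compute both updates from the old state before overwriting it.
        new_r = vdp_step(xr, vr, -k * (xr + xl), dt)
        new_l = vdp_step(xl, vl, -k * (xl + xr), dt)
        (xr, vr), (xl, vl) = new_r, new_l
        out.append((amplitude * xr, amplitude * xl))
    return out
```

At each simulation step, a target angle from `cpg_targets` would be passed to `pd_torque` together with the measured joint angle and angular velocity to obtain the torque T.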
The reward r is defined so that the system receives the maximal reward when both the height of the hip and the horizontal velocity are constant. The cumulative future reward obtained by following the policy π defines the value function V(θR1, θL1, ωR1, ωL1).
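As a hedged sketch, a common way to write such a cumulative reward in continuous time is the exponentially discounted integral below. The discount time constant τ, the control input u, and the shorthand x for the state are assumptions introduced for illustration, not taken from the original model.

```latex
V^{\pi}\bigl(x(t)\bigr) = \int_{t}^{\infty} e^{-(s-t)/\tau}\, r\bigl(x(s), u(s)\bigr)\, ds,
\qquad x = (\theta_{R1}, \theta_{L1}, \omega_{R1}, \omega_{L1})
```

Under this form, rewards received further in the future contribute less, and τ sets how far ahead the learner effectively looks.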
By seeking large values of V(θR1, θL1, ωR1, ωL1), the model acquires bipedal walking. However, V(θR1, θL1, ωR1, ωL1) is an unknown function, so it must be obtained by function approximation. It is known that V(θR1, θL1, ωR1, ωL1) can be approximated accurately by minimizing the temporal difference (TD) error. The unit that performs this approximation is called the "critic".
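The idea of fitting V by reducing the TD error can be sketched as follows. This is a minimal example under stated assumptions: a linear approximator over a hypothetical feature vector, an Euler-discretized continuous-time TD error of the form r - V/τ + dV/dt, and illustrative values for `tau`, `dt`, and `lr`.

```python
def features(state):
    """Hypothetical feature vector: a bias term plus the raw state components."""
    return [1.0] + list(state)

def value(w, state):
    """Linear value estimate V(state) = w . features(state)."""
    return sum(wi * fi for wi, fi in zip(w, features(state)))

def td_error(w, state, next_state, reward, dt, tau):
    """Euler-discretized continuous-time TD error: r - V/tau + dV/dt."""
    v, v_next = value(w, state), value(w, next_state)
    return reward - v / tau + (v_next - v) / dt

def critic_update(w, state, next_state, reward, dt, tau, lr):
    """Semi-gradient step: move w along delta * dV/dw evaluated at `state`."""
    delta = td_error(w, state, next_state, reward, dt, tau)
    new_w = [wi + lr * delta * fi for wi, fi in zip(w, features(state))]
    return new_w, delta
```

For a constant reward of 1 in an unchanging state, the discounted integral of the reward equals τ, so repeated calls to `critic_update` should drive the bias weight toward `tau`.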
On the other hand, the unit that determines the control input is called the "actor", and it is also obtained by function approximation. In this applet, the critic is trained with the TD(λ) method for continuous time and space, and the actor is trained with a policy gradient method. To apply this method to the control of a real robot, various simplifications were made to the algorithm (not yet published). |
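The two learning rules named above can be sketched as follows. This is a minimal example, not the applet's algorithm: the eligibility trace with time constant `kappa`, the Gaussian policy with a linear mean, and the learning rates are all assumptions introduced for illustration.

```python
import random

def trace_update(trace, grad_v, dt, kappa):
    """Eligibility trace for the critic: Euler step of de/dt = -e/kappa + dV/dw,
    an exponentially decaying memory of past value gradients."""
    return [e + dt * (-e / kappa + g) for e, g in zip(trace, grad_v)]

def critic_step(w_critic, trace, delta, lr):
    """Critic update: credit the TD error to recently active weights."""
    return [w + lr * delta * e for w, e in zip(w_critic, trace)]

def gaussian_action(w_actor, phi, sigma, rng):
    """Sample a control input from a Gaussian policy with a linear mean."""
    mean = sum(w * f for w, f in zip(w_actor, phi))
    return mean + sigma * rng.gauss(0.0, 1.0), mean

def actor_step(w_actor, phi, action, mean, sigma, delta, lr):
    """Policy-gradient step for the Gaussian policy:
    grad log pi = (action - mean) / sigma^2 * phi, weighted by the TD error."""
    scale = lr * delta * (action - mean) / (sigma * sigma)
    return [w + scale * f for w, f in zip(w_actor, phi)]
```

With a positive TD error, `actor_step` shifts the policy mean toward the action that was actually taken; with a negative TD error, it shifts the mean away from it.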