After downloading bc_rl_app.jar, run it by double-clicking it or by typing "java -jar bc_rl_app.jar" on the command line.

From left: Critic / Actor

In both the critic and actor plots, the horizontal axis spans -1 < θ < 1 and the vertical axis spans -8.0 < ω < 8.0.

The colors of the approximated functions denote the following values.

blue: negative, green: 0, red: positive.
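As a rough illustration of such a color scale, the following Python sketch maps a scalar to an RGB triple that is blue for negative values, green at 0, and red for positive values. The function name, the linear blending, and the `scale` parameter are assumptions made for illustration; the application's actual color map may differ.

```python
def value_to_rgb(value, scale=1.0):
    """Map a scalar to an (r, g, b) tuple: blue if negative, green at 0, red if positive."""
    # Normalize and clip to [-1, 1]; `scale` is a hypothetical normalization constant.
    x = max(-1.0, min(1.0, value / scale))
    if x < 0:
        # Blend linearly from green (x = 0) toward blue (x = -1).
        return (0.0, 1.0 + x, -x)
    # Blend linearly from green (x = 0) toward red (x = 1).
    return (x, 1.0 - x, 0.0)
```

Values beyond `scale` in magnitude saturate at pure blue or pure red.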

If the above application does not start, please install OpenJDK from adoptopenjdk.net.

Let us consider the reinforcement learning of CPG-controlled biped walking.

The model of biped walking (Taga, 1991) is described by an equation of motion in 14 variables

Moreover, by considering their time-derivatives (v

The torque T given to this model is determined by PD control scheme, and the destination angles θ
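As a generic illustration of a PD control scheme, the sketch below computes a joint torque from the error between a target angle and the current angle, with damping on the angular velocity. The function name, the gains `kp` and `kd`, and the assumption that the target angular velocity is zero are illustrative and not taken from the model on this page.

```python
def pd_torque(theta, omega, theta_target, kp=100.0, kd=10.0):
    """Torque from a PD law: proportional to the angle error, damped by angular velocity.

    kp and kd are hypothetical gains; the target angular velocity is assumed to be zero.
    """
    return kp * (theta_target - theta) - kd * omega
```

For example, `pd_torque(0.0, 0.0, 0.5, kp=10.0, kd=1.0)` pushes the joint toward the target, while a joint already at the target but still moving receives a purely braking torque.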

Based on Matsubara (2005), let us consider the reinforcement learning of CPG-controlled biped walking with the following scheme.

The reward r is defined so that the system receives a maximal reward when both the height of the hip and the horizontal velocity are constant.
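One possible way to write a reward with this property, interpreting "constant" as staying at fixed reference values, is a negative quadratic penalty on the deviations. The sketch below is only an assumption for illustration: the function name, the reference values `h_ref` and `v_ref`, and the weights `k_h` and `k_v` are hypothetical and are not the reward actually used in the application or in Matsubara et al.

```python
def reward(hip_height, velocity, h_ref=1.0, v_ref=0.5, k_h=1.0, k_v=1.0):
    """Hypothetical reward: 0 (the maximum) when hip height and velocity match references.

    h_ref, v_ref, k_h, and k_v are illustrative values, not taken from the source.
    """
    return -k_h * (hip_height - h_ref) ** 2 - k_v * (velocity - v_ref) ** 2
```

With this form, any deviation of the hip height or the horizontal velocity from its reference lowers the reward below zero.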

The cumulative future reward obtained by following the policy π is defined as
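One common definition, from the continuous-time formulation of Doya (2000) cited on this page, is the exponentially discounted integral of future reward, V(t) = ∫ from t to ∞ of exp(-(s-t)/τ) r(s) ds. The sketch below approximates it with a left Riemann sum over sampled rewards, assuming that form; the function name, the time constant `tau`, and the sampling step `dt` are illustrative and not taken from the application.

```python
import math


def discounted_return(rewards, dt, tau):
    """Approximate the integral of exp(-s / tau) * r(s) ds with a left Riemann sum.

    rewards[k] is the reward sampled at time k * dt after the current time.
    """
    total = 0.0
    for k, r in enumerate(rewards):
        total += math.exp(-k * dt / tau) * r * dt
    return total
```

For a constant reward of 1 over a long horizon, the result approaches `tau`, which shows how the time constant sets the effective look-ahead.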

By searching large values of V(θ

However, V(θ

It is known that V(θ

On the other hand, the unit that determines the control input is called the "actor", and it is also represented by a function approximator.

In this application, the critic is trained with the TD(λ) method for continuous time and space, and the actor is trained with a policy gradient method.
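To make the two update rules concrete, here is a minimal sketch assuming a linear critic V(x) = w · φ(x), the continuous-time TD error δ = r - V/τ + dV/dt and eligibility trace from Doya (2000), and a Gaussian policy whose mean is linear in the same features. The function names, the feature vectors, and the constants `tau`, `kappa`, `eta`, and `eta_actor` are assumptions for illustration; the application's actual function approximators and update rules may differ.

```python
def td_lambda_step(w, e, phi_prev, phi, r, dt, tau=1.0, kappa=0.1, eta=0.1):
    """One Euler step of continuous-time TD(lambda) for a linear critic V(x) = w . phi(x)."""
    v_prev = sum(wi * pi for wi, pi in zip(w, phi_prev))
    v = sum(wi * pi for wi, pi in zip(w, phi))
    # Continuous-time TD error: delta = r - V / tau + dV/dt (backward difference for dV/dt).
    delta = r - v / tau + (v - v_prev) / dt
    # Eligibility trace: de/dt = -e / kappa + dV/dw, where dV/dw = phi for a linear critic.
    e = [ei + dt * (-ei / kappa + pi) for ei, pi in zip(e, phi)]
    # Critic update: dw/dt = eta * delta * e.
    w = [wi + dt * eta * delta * ei for wi, ei in zip(w, e)]
    return w, e, delta


def actor_step(w_actor, phi, u, sigma, delta, dt, eta_actor=0.1):
    """Policy gradient step for a Gaussian policy, using the TD error as the learning signal."""
    mu = sum(wi * pi for wi, pi in zip(w_actor, phi))
    # The gradient of log N(u; mu, sigma^2) with respect to w_actor is (u - mu) / sigma^2 * phi.
    scale = eta_actor * delta * (u - mu) / sigma ** 2
    return [wi + dt * scale * pi for wi, pi in zip(w_actor, phi)]
```

In a training loop, `td_lambda_step` would be called once per simulation step, and the returned `delta` passed to `actor_step` together with the action `u` that was actually applied. An action that did better than predicted (positive `delta`) shifts the policy mean toward that action.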

To apply the above method to the control of a real robot, various simplifications were made to the algorithm (not yet published).

This page is based on the following papers.

- [**As for the CPG used**]

  __Takashi Kanamaru__, "Analysis of synchronization between two modules of pulse neural networks with excitatory and inhibitory connections," Neural Computation, vol.18, no.5, pp.1111-1131 (2006).

  __Takashi Kanamaru__, "van der Pol oscillator," Scholarpedia, 2(1):2202.

- [**As for the reinforcement learning for CPG-controlled bipedal walking**]

  T. Matsubara, J. Morimoto, J. Nakanishi, M. Sato, and K. Doya, "Learning CPG-based Biped Locomotion with a Policy Gradient Method," Robotics and Autonomous Systems, vol.54, issue 11, pp.911-920 (2006).

- [**As for the model of bipedal walking**]

  G. Taga, Y. Yamaguchi, and H. Shimizu, "Self-organized control of bipedal locomotion by neural oscillators in unpredictable environment," Biological Cybernetics, vol.65, pp.147-159 (1991).

- [**As for reinforcement learning**]

  Kenji Doya, "Reinforcement Learning in Continuous Time and Space," Neural Computation, vol.12, pp.219-245 (2000).
