NAO learns body control from scratch
How can a real complex robot, such as NAO, learn new skills from scratch, such as rolling around or sitting up, simply by interacting with its environment? Without access to some simulation such learning would require at a minimum weeks of consistent training using previous stateoftheart methods. The training would in addition require a significant amount of assistance. Someone would need to reset the robot between different trials, as well as provide repairs to ensure consistent functionality. One of the PhD projects in the AI Lab of SBRE investigated this question, and we are now able to present an original framework that allows this type of control very quickly.
Before going into the details, let us view a video of what the results might look like.
What you see here is a robot learning the skill to roll around 360 degrees, after only 25 minutes of training and starting from scratch. How is it possible for such a complex robot to learn such a skill so fast? This is the question we will investigate in this post, which is based on the PhD thesis: Fast Online Model Learning for Controlling Complex RealWorld Robots.
What is a skill?¶
As a first step in developing a way for robots to autonomously learn new skills, we need to understand what we mean by “skills”. Intuitively we know that skills include things like “walking”, “grasping objects”, and perhaps also “telling jokes”. But what makes these skills? Is throwing one’s arms and legs around randomly on the floor a skill? From the description of the behaviour alone, it’s not always possible to tell. If this behaviour reliably produces some desirable outcome (for example getting attention from a parent) and the behaviour was also performed to create this outcome, it could be fair to view this behaviour as a “skill” as well. A skill can from this point of view be seen as a behavioural tool that can be used to create a predictable (and desirable) change to the world. “Walking” can be seen as a tool to change one’s position in an environment, “grasping” can be seen as a tool for changing an object’s position in a room, “telling a joke” can be seen as a tool for changing a friend's mood, etc. From this point of view, to teach robots skills from scratch, we first need to define objectives we want the robots to achieve. Classically this is made through the use of a handcrafted reward $r_t$ that associates the actions of a robot to how well those actions create a desired outcome. The skill of walking can then be expressed as optimizing how far a robot can move its body in a given time, see image below. The behavior is then slowly altered over time, in order to optimize this received reward. This idea is at the core of Reinforcement Learning (RL). Although this approach is very prevalent in the state of the art, we will ultimately avoid it. To understand why, it is necessary to first understand the setup of the system so that we can see why optimising such a reward is so timeconsuming.
The setup¶
We are here considering robots with many degrees of freedom (DoF), where each degree is a joint angle $\alpha_i$ the robot can control using some motor.
In the case of Nao (to the left) there are 25 DoF. In some experiments, we will also consider planar arms (to the right) with up to 1000 DoF. In the case of the planar arm, knowing the value of each joint angle gives us an exact description of the state of the arm (at rest), and we can thus define its state $s_t$ by these angles. In the case of the humanoid robot, we also need to add the angles of the robot in relation to the floor (often referred to as the floating base), in order to define its state $s_t$. These robots are then controlled by sending signals to the robot’s controller, specifying how the motor of each joint angle should behave, until a new command is given. This control signal $a_t$ is the action. The given action then leads to some kind of change in the robot’s state, changing it from $s_t$ to $s_{t+\Delta t}$ after a time period $\Delta t$.
This process of doing an action and observing the outcome can be used to learn the dynamics of the system. For example, attempting action $a_t$ in state $s_t$ informed us that the action signal $a_t$, sent to a robot in state $s_t$, can result in new state $s_{t+\Delta t}$ after a time $\Delta t$. For simplicity we will assume $\Delta t = 1$, so that every action gives us a data sample $(s_t, a_t, s_{t+1})$. Note that for a robot learning from scratch, the states the robot can be in and the actions to move between them are the only type of data we have access to. No a priori model of the world, nor of the robot’s body is available.
Challenges
Considering this setup, we face the following 5 challenges.
1. Sparsity¶
The only data we have access to, is the data the robot itself collects by executing different actions in different states. Since each of these data points needs to be collected in the real world, they will naturally require some time to produce, often in the order of a second per sample. A robot able to run nonstop throughout an entire day would then be able to collect around $24\times 60\times 60 = 86,400$ samples, a tiny amount in the context of big data and deep learning.
2. No reset¶
The robot can only control what action $a_t$ to attempt, and not the state $s_t$ from which the action is attempted. A robot has to act from whatever state it finds itself in. This also means that a robot that wants to learn the best action in a given state $s_t$ has no natural way of going back to $s_t$ after a first action $a_t$, to explore the outcome of some other action $\tilde a_t$. This makes optimization with respect to some reward function difficult. How could you for example fairly compare two behaviours for walking if the first behavior is tried on a robot standing up, and the second on a robot that is lying down (after the first behavior lead the robot to fall over)?
3. Dimensionality¶
As we consider robots with relatively high numbers of DoFs, it follows that both the state space and the action space must have at least as many dimensions as the number of DoF (since each DoP can be changed independently). To learn how to act to achieve desired outcomes, it is generally necessary to have examples of earlier actions, from similar states, so that we can approximate the outcomes of actions from our current state. In principle this means that the robot needs to have earlier experiences of almost any type of action, from any starting state it can find itself in. Such an approach however quickly becomes infeasible as we consider robots of more than a few DoF. Consider for example a robotic arm with 1 joint, where the state of the arm is the joint angle and the action is the torque of the joint. This gives us 5 different torques for each of 10 possible joint angles, i.e. $5 \times 10$ “stateaction” combinations to explore (ignoring how the robot would reach each state to begin with). Now consider a robot with 2 DoF. Using the same resolution we now have $5^2 \times 10^2 = 2500$ different stateactions to consider. For a robot such as Nao, with 25 DoF, this kind of approach is simply not possible. Even if we only consider 2 settings for each joint, both in state and possible actions, we end up with $2^{25} \times 2^{25} \approx 10^{15}$ possible stateactions to attempt. This is often referred to as the “curse of dimensionality”, and is one of the main reasons why most machine learning approaches are so time consuming for robots, even with relatively few DoF.
4. Planning¶
The fourth challenge to consider is the issue of planning. Many objectives can only be achieved through a series of actions with intermediary states. This means that given a particular state $s_t$, it’s not only necessary to consider all possible actions $a_t$ from this state, but also all possible actions it could attempt from each of the states it might reach, and so on. Such a search becomes intractable almost immediately with continuous action and state spaces, even for robots with very few DoF.
5. Risk of damage¶
A final challenge of learning anything from scratch in the real world, is that each action poses a risk of damaging the robot in some way. Without previous knowledge of the world it is also impossible for the robot to know in advance what actions might be dangerous. Having the robot break is not just bad from a monetary point of view, but also because damage during learning changes the dynamics of the robot. This means that previously collected data points might not represent how the robot functions anymore. As a side note a broken robot by itself is not a problem for a robot learning skills from scratch. The robot will learn whatever it can do with whatever body it is given and see no difference in principle between a whole and a broken body (although some damages might make some skills harder to learn). We will not go too deep into how to handle risks of damage in this blog. In short: damage can be minimized by minimizing training time, and by using lower motor torque for motions that are more “exploratory”.
Goal Babbling and task space control
Considering the challenges, it seems clear that one of the main difficulties of the project is how to tackle the many DoF. One approach able to handle some types of robots with many DoFs is Goal Babbling (GB). In contrast to most approaches for skill learning, GB does not use a reward function in order to optimize behaviours to achieve a specific outcome. Instead GB focuses on learning general control over a given Task Space $X$.
For walking, such a task space could for example be the displacement $\Delta x_t$ over some period of time. This is different from the reward function formulation for one important reason. With rewards the goal would be to optimize the distance $\Delta x_t$. In GB the goal is to learn to create any desired displacement at will. This means that given an arbitrary goal displacement $\Delta x^ *$, the robot attempts to get as close to this goal as possible using an action $a_t$. Here’s the kicker: It doesn’t matter if it succeeds or not. Whatever the result of the action is, there will always be some outcome $\Delta x_t$. As we want to learn to create any displacement, the knowledge for how to reach $\Delta x_t$ is potentially equally valuable as the knowledge for how to reach the desired $\Delta x^ *$, and we just learned exactly how to displace the robot by $\Delta x_t$. Observe that this applies to robots of any complexity. Every action will inform us of how to create some specific outcome.
GB has one big drawback. It only works if the state of the agent can be ignored when considering the outcome of a given action. In other words, we need a situation where we can assume that if an action $a_t$ created an effect $x_t$ once, we can recreate this effect by doing action $a_t$ again, without considering the particular state the robot is in. This can either be an intrinsic property of the system, or accomplished by always resetting the robot to a given starting state after every action, referred to as a home state.
Example¶
Perhaps the best way to understand how Goal Babbling can give rapid control of a task space is to see it in action. For this we will consider one of the most simple versions of GB, where actions $a_t$ for reaching goals $x^ *$ are simply generated by taking the previous action $a_i$ with outcome $x_i$ that ended up the closest to $x^ *$, and then add some exploratory noise proportional to how far $x_i$ is from $x^ *$.
Consider a robotic arm with any number of DoF, where the joint angles uniquely defines the state $s_t$ of the robot. In addition, we also consider the action $a_t$ to be the specification of what we want each joint angle to change into. This means that with perfect control we would always get $s_{t+1} = a_t$. Assuming such a perfect controller for now, this means that $s_{t+1}$ only depends on the action $a_t$, and not the state $s_t$ from which the action was made.
The goal in this scenario is to learn to control the 2D position $x_t$ of the endeffector, with $x_t \in X \subset \mathbb R^2$, where $X$ is our task space. We can now learn to control this task space as follows:











Limitations of Goal Babbling (GB)¶
In this example we saw how endeffector control can be efficiently learned using Goal Babbling. The approach is as already mentioned limited. It requires a system where outcomes of actions are independent of the states in which they are made, or alternatively where the agent is reset to a given home state after every action (so every action is made from the same state). This second approach with home states is more flexible in that it applies to robots where the state actually influences the outcome of an action. It comes with some drawbacks however. First of all it requires someone to define a starting state the robot is always able to return to. A humanoid robot can’t have a starting state standing up for example, unless some experimenter is there to manually reset it after every action, since many actions might make it fall over. Secondly the approach limits the agent to only be able to reach goals that can be reached in a single action from this starting state.
Consider the robots bellow, were $s$ is the state of the robot, $\hat s$ is the state we tell the robot to attempt to reach (i.e. the action), and $s’$ is the state the robot actually ends up in (when trying to move from $s$ to $\hat s$):
To the left is a planar arm similar to the one before, but now with the addition of impenetrable walls. In this case the outcome of attempting to reach a goal $\hat s$ depends on the state from which the arm starts. As we can see, trying to reach $\hat s$ results in the state $s’$ if starting from $s$. Had we instead tried to reach $\hat s$ from some other state, the outcome would likely be different. To the right is a humanoid robot trying to control its orientation with respect to the floor, which it measures as the direction of gravity relative its torso. Also in this case are outcomes of actions dependent on the previous state of the robot. As an example, assume we tell the robot in the image to keep all joint angles in their current values, then it will stay in that state. Assume then that we tell a robot lying on the belly to reach the same joint angles, then that robot would continue lying on its belly, and not roll over to its back. The same action thus had different outcomes, depending on the state from which it was made.
The fact that the outcome of actions depends on the states in which they are made, means that we cannot simply map an action to a given outcome. But could we perhaps get around this by always resetting the robot to a given home state after every action, so that we only consider actions from that home state? For the specific setups seen above, this could work. In both these cases there is really no single action the robots can do that will take them so far away from their starting positions so that they cannot return there immediately but setting that state as a goal. (This is not generally true for the humanoid. Had the humanoid started out standing and the robot would fall over it would not be able to get up again simply by attempting to reset its joint angles.)
Using the home state approach, the robots are able to reach the regions seen above. In the case of the arm the bright regions are reachable, and in the case of the humanoid robot, each dot shows a reached position in the body orientation space, where the xaxis is rotation with respect to the floor, and the yaxis measures how upright the robot is.
The consequence of starting every action from a single state implies that:
 In the case of the arm (left figure) it is unable to reach goals behind walls (from the home state)
 In the case of the humanoid (right figure) it can only access orientations on its back and up to its sides. It’s in other words unable to for example roll over to its belly. In that case, also note that even if it would find a way to do this, reaching the belly would make it unable to rest itself to the home state (lying on the back).
Going beyond Goal Babbling (GB)¶
To extend GB to new domains, consider the simple agent bellow:
In this case the state $s_t$ of the agent is its 2 dimensional position, the action $a_t$ is the position it attempts to reach next, and $x^ *$ is the goal position we want it to reach. The rule is that a goal state $a_t$ is always reached if it can move to it in a straight line without crossing a wall. The question is now, given a starting state $s_t$ and a final goal $x^ *$, how can we find a sequence of actions ${a_{t+1}, a_{t+2}, \dots}$ that allows us to reach $x^ *$ without crossing any walls? One possible way to do this is to consider a set of home states ${s_1, s_2, \dots}$.
The left image shows 4 possible such home states, with a pink area exemplifying positions the agent can reach directly from $s_2$. $s_2$ thus exemplifies a home state to a local GB problem , to which we can learn an action model, similar to regular GB (but with the difference that we now have multiple such models in parallel). Besides learning a GB model to each home state $s_n$, we can then also learn the probability to move directly between these different home states. $P(n, \hat n)$ could thus express the likelihood that the agent will end up in $s_{\hat n}$, if doing action $a = s_{\hat n}$ from state $s_n$. This gives us a graph with 4 discrete nodes, with $4^2 = 16$ different possible transition probabilities to learn. This structure then allows us to reach any goal $x^ *$ from any state $s_t$ using the following strategy:
 If the robot is in a state $s_t$ that is not a home state $s_n$, the agent can attempt to reach different home states until one is successfully reached. Since every position is reachable from at least some home state, this is always possible.
 If the agent is in a home state $s_n$ from which the goal $x^ $ is not reachable, look at probabilities $P(n,\hat n)$, and plan a sequence of home states leading to a home state $s_{n^}$ from which $x^ *$ can be reached.
 Once in the home state $s_1$ from which $x^ *$ can be reached, use the goal babbling model created for that particular home state. In the simplest case this would be the action $a_i$ previously made from $s_1$, with the outcome $x_i$ closest to $x^ *$, and then add noise proportional to the distance between $x^ *$ and $x_i$ (as in the normal GB case).
Fundamentally, we have split the problem into two different levels of abstraction, one lower level in the continuous motor space, and one higher, where we consider transitions between our discrete set of home states:



State 
$s_t$  is the continuous state of the robot, including all joint angles etc. 
$n$  is the index of the current home state $s_{n}$ the robot is in (and if $s_t \neq s_n$ it will reset to such a state) 
Action 
$a_t$  is the command sent to all motors of the robot, specifying how they should behave until the next action. 
$\hat n$  is the index of the home state $s_{\hat n}$ the robot should try to reach next. The probability that this will succeed is $P(n, \hat n)$. 
Goal 
$x^ *$  is the task space outcome the agent should attempt to produce. 
$n^$ is the index of a home state $s_{n^}$ from which $x^ *$ can be reached in a single action $a_t$. 
The higher level can thus be used to plan transitions between home states to reach one from which a goal $x^ *$ can be reached. The lower level motor control of the local GB model can then be used to reach out for $x^ *$ more precisely.
Implementation on real systems¶
In the context of the planar arm, this leads to a model similar to the one bellow:
Each home state $s_n$ is here a unique joint angle configuration of the arm, where the red dot shows where the end effector ends up if that state is reached. The higher level graph is then created by observing the arms ability to move between these postures without being stopped by a wall or selfcollision. The circle around each home state is the part of the task space that is assumed to be reachable from each home state, and new home states are added whenever the end effector reaches a position outside of these. Specifically in this image we see that if the agent starts in a state $s_t=s_3$, with a goal $x^ *$ on the opposite side of the wall, it uses home state to homestate planning to transition $s_3 \to s_4 \to s_5$, from which it is using GB to reach $x^ *$ more exactly (where $\pi_n$ is the GB model for action selection from $s_n$).
For the humanoid robot we can use the exact same approach. In this case a home state $s_n$ is a given posture together with a given body orientation. Movements between home states $n \to \hat n$ is done by attempting to reach the posture of that home state, and observe if reaching this posture also results in the same body orientation. If so $P(n, \hat n) = 1$ and otherwise $0$ (in practice this is approximated over time). This way we create a discrete graph between different home states. As for the arm, each home state is assumed to reach some part of the body orientation space, where new home states are added when the agent reaches an orientation outside of these reachable areas.
Results
The single framework that was used for both the planar arm and the humanoid robot can be found here. To evaluate the performance we consider the “reaching error”. It corresponds to how close the robot will get on average to a randomly generated goal $x^ *$ within a limited number of actions, starting from whatever state it happens to be in when the goal is generated. Starting with the planar arm we observe the following outcome.
To the left we see the reach error over time, depending on the number of DoF of the arm. Each iteration is a single action. Given a task space area of $1 \times 1$, this means that the arm is able to reach any goal with an error less than $0.01$ after only $10,000$ iterations. This learning speed also seems fairly independent from the number of DoF, which is a fundamental property of GB. To the right we see the precision with which different parts of the space can be reached, depending on how big the region is assumed to be reachable around each home state. Small regions means a lot of home states to cover the whole task space, big regions means fewer home states. In the extreme only a single region is considered and it is assumed (incorrectly) that all goals can be reached from it. This is identical to GB where the agent always returns to one given state, and we see in this case that many parts of the task space become unreachable because of this incorrect assumption.
For the humanoid robot we can return to the initial video, where we saw the robot learning to roll around $360$ degrees after only $25$ minutes of training by creating the model below. What we see here is a ring for the position in orientation space of every home state (so rings around $\theta \approx 0$ corresponds to states on the back, $\theta \approx 0$ would be sitting or standing up, etc), and the edges between the home states indicates observed probability to transition directly between those home states.
What is important to observe here is that the robot did not have “rolling around” as an objective to learn. It tried to learn general body orientation control, and the skill to roll around was generated on the fly, as a consequence of in this case being asked by a user to first move to its belly, and then back to its back.
Besides observing the exact model the robot built, we can also evaluate what parts of its orientation space it managed to reach during training.
In a separate experiment we added a structure on the back of the robot, to see how this modified body would impact what parts of the task space it would be able to learn to reach.
The main idea was actually to make it harder for the robot to stay flat on it’s back. However, the robot rapidly discovered that the structure could be used to sit up, as can be seen in the sequence 15 below. Again, the robot was not told to learn the “skill” of sitting up, but the behavior was developed as a way to reach body orientations in which the head points straight upwards.
Note how the change in body changed the reachable areas of the orientation space, and we can see this after only 4050 minutes of real world interaction. This means that we could potentially also use this type of learning to get rapid feedback on different robot designs in terms of flexibility.
Conclusions
In conclusion, this work introduces a novel approach for skill learning with the following properties:
 The method successfully extends GB to cases where the current state of the robot has to be taken into account, w.r.t to the outcome of the actions.
 The method is rapid and can be used on complex robots with only a short time for interactions with the environment (a largely unsolved domain)
 The learned models allow many complex behaviours to be generated, such as sitting up, rolling around, or reaching around obstacles.
 Because the method is so fast, it allows for rapid evaluation of a given body design.
To learn more about this method, see the thesis: Fast Online Model Learning for Controlling Complex RealWorld Robots.