Controlling Robots with Human Poses

A Research project about Pepper and NAO immersive teleoperation

A new immersive teleoperation solution based on the Extreme Learning Machine (ELM), a machine learning technique, is introduced for the Pepper and NAO robots. Immersive teleoperation is a form of robot remote control that gives the operator the sensation of being inside the teleoperated environment, of being the robot itself. The solution is independent of other mapping approaches (e.g., inverse kinematics), and the user's whole body is used to control the robot. Even with scarce training data, the solution returns satisfactory results in both precision and computational speed.

The project goals

The whole solution tested using the simulated Pepper

Our project aims to develop a new solution for robot remote control (i.e., teleoperation). Basic teleoperation sends simple commands to a distant robot through an external controller such as a keyboard or a joystick. Such systems restrict complex control and lower accessibility, since a learning phase is necessary. An immersive teleoperation system, in contrast, allows complex motions and is intuitive to use with precision.

A demonstration of different setups: simulated Pepper, Pepper robot, simulated NAO and NAO robot

Immersive teleoperation systems control a robot (e.g., a humanoid) by generalizing and adapting human movements to the robot configuration (i.e., human-robot correspondence mapping). The goal of immersive teleoperation, and thus our expectation, is a robot that can perfectly imitate human motions, including complex ones, through teleoperation. A possible extension of our solution would be to gather demonstrations to teach new behaviors to the robot, or to ease the animators' work in designing Human-Robot Interaction (HRI).

The 3 main challenges of immersive teleoperation

Teleoperation faces three main challenges:

  1. Human-robot correspondence mapping
  2. Fast computation
  3. Precision

1. Human-robot correspondence mapping

The most critical problem of teleoperation is the human-robot correspondence mapping. Humans and robots, even humanoid ones, have different body forms. These differences in geometric structure, such as segment lengths and degrees of freedom (DOF), must be taken into account for the matching. Finding the transformation between the two structures is therefore an additional task, and one that is neither obvious nor quick.

In addition, the human-robot correspondence task that enables the robot to imitate complex motions is time-consuming and tied to a single geometric model. When imitating such motions, the robot's restrictions (e.g., singularities and workspace limits) must also be considered.

2. Fast computation

Immersive teleoperation requires a fast computational speed, such as a high number of frames per second (fps), to render motions in real time. The time complexity of the teleoperation must therefore be kept low, which calls for a simple and rapid algorithm at the core of the system.

3. Precision

High precision is desirable for teleoperation, especially for precise tasks such as manipulation and grasping; however, when designing the solution, its trade-off with speed must be taken into account.

Immersive teleoperation based on ELM

The main issue an immersive teleoperation needs to address is the human-robot correspondence mapping [1]. For this, an Extreme Learning Machine (ELM) based kernel technique was chosen for our solution. Unlike the traditional inverse kinematics (IK) based teleoperation approach, previously used to control SoftBank Robotics (SBR) robots [2] [3], ELM has several advantages that allow the robot to avoid problems inherent to IK.

Our ELM trains on angular configurations instead of Cartesian positions, which prevents it from being constrained by robot singularities and workspace limits. Moreover, the angular configuration is predicted directly at time t without relying on past instances. ELM can also generalize the problem to all types of robots (i.e., different degrees of freedom (DOF) and segment lengths). Furthermore, ELM is known for its fast computational speed, and it produces satisfying results even with scarce training data.

For the setup of our immersive teleoperation, Nuitrack [4] and virtual reality (VR) sets were chosen. Nuitrack tracks the user's movements (i.e., extracts human skeletal information via a depth camera) and returns key joint information, so the robot can be controlled solely from human motions. The integration of the VR system adds visual feedback and extra motion control options (e.g., head rotation, hand control and locomotion) to our solution. These features let the user embody the robot through whole-body control and a first-person point of view, with head movements controlled via the VR headset, which increases the immersive aspect of our solution.


The aim of the ELM mapping is to establish a link between the human space $\mathscr{H}$ and the robot space $\mathscr{R}$. For each human pose in $\mathscr{H}$, its corresponding robot pose in $\mathscr{R}$ is found.

Data collection

The first step is data collection. Since ELM is our chosen teleoperation technique, there are no restrictions on the type of training dataset: the choice of input and output data is entirely ours. To obtain a direct correspondence mapping and to train our teleoperation system with the most relevant information, human skeletal landmark positions are paired with robot angular configurations (i.e., data pairs in $\mathscr{H}$ and $\mathscr{R}$).

We propose to collect the dataset from video stream captures of a human imitating the Pepper robot's motions. This makes it easy to include complex poses.

For the human motions of the dataset, key joint information was extracted thanks to Nuitrack. The data was then processed as unit vectors expressed in their respective parent link frames, as shown in Figure 1, so that the data is normalized. The benefit of using such unit vectors is that the data collection generalizes to all humans regardless of body form (i.e., segment lengths).

Figure 1: Illustration of human key joint unit vectors in their respective parental link frames
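As a sketch of this normalization step, the per-joint unit vectors can be computed from raw 3D joint positions roughly as follows (the joint names and dictionary layout are illustrative assumptions, and the rotation into each parent link frame is omitted for brevity):

```python
import numpy as np

def to_unit_vectors(joints, parents):
    """Express each key joint as a unit vector pointing away from its
    parent joint. Normalizing to unit length removes segment-length
    differences between users (illustrative sketch)."""
    vectors = {}
    for child, parent in parents.items():
        v = joints[child] - joints[parent]       # parent -> child segment
        vectors[child] = v / np.linalg.norm(v)   # discard segment length
    return vectors
```

Because only directions survive the normalization, a tall and a short operator performing the same pose produce the same training input.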

Concerning the robot dataset, Pepper’s angular configuration is obtained by using the NAOqi API.

Two people are required for our method: one manipulates the robot while the other imitates the robot's motion in front of the camera. The person controlling the robot stands next to the camera, facing the other person. With this method, the data collection is synchronized, so the two components of each pair are fetched in the same time frame. In addition, it only takes around 20 minutes of data collection to get a good coverage of the workspace, shown in Figure 2.

Figure 2: PCA visualization of the workspace coverage of our data

Two other ways of collecting the data were explored. In the first, the data can be gathered by a single person in two steps: the human motions are defined first, and the same person then reproduces them with the robot. The problem with this technique is that there is a time delay between the collected data pairs which cannot be easily adjusted. The second method is to compute the robot configuration directly from the collected human motion using another mapping approach (e.g., IK). This keeps the data pairs synchronized, but the dataset then depends on the precision of the technique used. As our solution aims to improve on traditional teleoperation, this method is not suited to our case.

The Pepper workspace can be defined as the set of points reachable by its end-effectors, namely its left and right hands. It is retrieved by randomly sampling Pepper articular configurations (20,002 samples) on a simulated Pepper using qiBullet [5], and is visualized in Figure 3.

Figure 3: 3D visualization of Pepper workspace

To check whether our collected data (i.e., trajectories) covers the total workspace, the data is overlaid on the workspace and visualized in 2D in Figure 2 with the help of principal component analysis (PCA) [6]. PCA is a technique that projects given data onto a lower-dimensional space by performing a linear dimensionality reduction.
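Such a 2D projection can be reproduced in a few lines of plain NumPy; this sketch is equivalent to a library PCA with two components, up to the sign of each axis:

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their two directions of largest
    variance: center the data, take an SVD, and keep the first two
    right-singular vectors (illustrative sketch)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # coordinates on PC1, PC2
```

Projecting both the sampled workspace points and the collected trajectories with the same function makes the coverage comparison of Figure 2 possible.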

The partial coverage of our data can be explained by two factors:
• Angular configurations of unwanted poses, namely those that cannot be detected with our motion capture technique (e.g., poses with the arms behind the back) and those that are unnatural, are not collected.
• Our data contains angular configurations that are out of the workspace's range, as it includes positions where the robot self-collides, whereas the workspace does not.

Extreme Learning Machine

Extreme Learning Machine (ELM) is a Single hidden Layer Feedforward Network (SLFN) with tuning-free neurons and a rapid training speed. Three different ELM kernel structures were tested to find the best solution for immersive teleoperation:
• Global ELM kernel
• Local forward ELM kernel
• Local forward backward ELM kernel
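All three structures rely on the same underlying ELM. A minimal NumPy sketch of how an ELM trains, with illustrative hyper-parameters: the hidden layer is random and never tuned, and only the output weights are solved for in closed form, which is what makes training so fast.

```python
import numpy as np

class ELM:
    """Minimal Extreme Learning Machine sketch (hyper-parameters are
    illustrative): a single hidden layer with random, untrained weights;
    the output weights are a least-squares solution."""

    def __init__(self, n_hidden=64, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random nonlinear features: tanh of a fixed random projection.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, Y):
        # Hidden-layer weights are drawn once and never tuned.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # Output weights: closed-form least squares via the pseudo-inverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ Y
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta
```

In our setting, X would hold the normalized human key joint vectors and Y the paired robot angular configurations.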

Global ELM kernel

The global ELM kernel feeds the whole training data (i.e., all pairs in $\mathscr{H}$ and $\mathscr{R}$) to a single kernel as its input. This kernel can thus be pretrained and its weights saved to be reloaded for each computation, as shown in Figure 4.

Figure 4: Illustration of global ELM kernel
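The pretrain-once-and-save workflow of the global kernel might look as follows (the data, dimensions and file name are illustrative assumptions, and the ELM is inlined for brevity):

```python
import os
import tempfile
import numpy as np

# Toy stand-ins for the full dataset: 200 human poses (9-D) paired with
# robot configurations (5-D). Real data would come from the collection step.
rng = np.random.default_rng(0)
H_data = rng.normal(size=(200, 9))           # all human poses in H
R_data = H_data @ rng.normal(size=(9, 5))    # paired robot configs in R

# Train once on the whole dataset: random hidden layer, closed-form output.
W = rng.normal(size=(9, 64))
b = rng.normal(size=64)
hidden = lambda X: np.tanh(X @ W + b)
beta = np.linalg.pinv(hidden(H_data)) @ R_data

# Save the pretrained weights so teleoperation only has to reload them.
path = os.path.join(tempfile.mkdtemp(), "global_elm.npz")
np.savez(path, W=W, b=b, beta=beta)

# At teleoperation time: reload and predict, no retraining needed.
weights = np.load(path)
predict = lambda h: np.tanh(h @ weights["W"] + weights["b"]) @ weights["beta"]
```

The trade-off, as the qualitative tests below show, is that a single kernel stretched over the whole space can produce outlier predictions.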

Local Forward ELM kernel

The local forward ELM kernel is trained only with the k nearest neighbors of the inputted human configuration (i.e., the k neighboring human poses of the input human pose in $\mathscr{H}$ and their matching robot poses in $\mathscr{R}$), as illustrated in Figure 5. A new training is done for each new prediction.

Figure 5: Illustration of local forward ELM kernel with k = 5
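A rough sketch of this kernel, with illustrative parameters and a tiny inlined ELM, retrained from scratch on the k nearest neighbors at every query:

```python
import numpy as np

def local_forward_predict(h_star, H, R, k=5, n_hidden=32, seed=0):
    """Local forward ELM kernel sketch: fit a small ELM on the k
    training pairs whose human pose is closest to the query h_star,
    then predict the robot configuration for h_star."""
    # Indices of the k nearest human poses (Euclidean distance).
    idx = np.argsort(np.linalg.norm(H - h_star, axis=1))[:k]
    # Tiny ELM trained only on that local cluster.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(H.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    hidden = lambda X: np.tanh(X @ W + b)
    beta = np.linalg.pinv(hidden(H[idx])) @ R[idx]
    return (hidden(h_star[None, :]) @ beta)[0]
```

Retraining per query keeps each fit local and accurate, at the cost of extra computation for every frame.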

Local Forward Backward ELM kernel

For our solution, the local forward backward ELM kernel introduced by S. Jin et al. [7] was used.

It starts by finding the $k$ nearest neighbors of the inputted human pose $h^*$ and trains a forward ELM kernel $K_j^h$ for each neighbor (i.e., with human configurations in $\mathscr{H}$ as input data labeled with their corresponding robot configurations in $\mathscr{R}$). $k$ robot configuration prediction candidates $r_j^c$ are obtained through the forward projections of $h^*$.

The use of a single forward kernel alone cannot guarantee a valid bijective mapping (i.e., one-to-one correspondence). To refine the bijection, the optimal robot configuration, the one that minimizes the deviation of the backward projection, is chosen as the projection of $h^*$. The best fitting $r_j^c$ (i.e., the optimal robot configuration candidate $\bar{r}_j^c$) is selected by first finding the $m$ nearest neighbors $r_{j,k}^c$ of each $r_j^c$ and then training a backward ELM kernel $K_{j,k}^r$ (i.e., with robot configurations in $\mathscr{R}$ as input data labeled with their corresponding human configurations in $\mathscr{H}$) for each $r_{j,k}^c$. For each cluster of $m$ backward kernels, a set of weights $w_{j,k}$ that minimizes the distance between the back-projected human configuration $K_j^r(\bar{r}_j^c)$ and $h^*$ is deduced. The weights are determined with a perceptron weight update [8]. $\bar{r}_j^c$ is defined as:

$\bar{r}_j^c = \sum_{k} w_{j,k}\, r_{j,k}^c$

$\min_{w_{j,k}} \left\{ \left\| K_j^r(\bar{r}_j^c) - h^* \right\| \right\}$

$\text{s.t.} \quad \sum_{k=1}^{m} w_{j,k} = 1, \quad w_{j,k} \geq 0$

Among the $k$ candidates $\bar{r}_j^c$, the one whose back-projected human pose $K_j^r(\bar{r}_j^c)$ is closest to the inputted human pose $h^*$ becomes the final prediction of the robot configuration:
$\min_{j}\left\{ \left\| K_j^r(\bar{r}_j^c) - h^* \right\| \right\}$.
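The final selection step can be sketched as follows (a simplification: the per-cluster weight optimization over $w_{j,k}$ is omitted, and the backward kernels are passed in as plain callables):

```python
import numpy as np

def select_candidate(h_star, candidates, backward_kernels):
    """Back-project each robot-configuration candidate into the human
    space with its backward kernel and keep the candidate whose
    back-projection lands closest to the queried pose h_star."""
    errors = [np.linalg.norm(K(r) - h_star)
              for r, K in zip(candidates, backward_kernels)]
    return candidates[int(np.argmin(errors))]
```

This back-projection check is what enforces the one-to-one correspondence: a forward candidate that cannot be mapped back near $h^*$ is discarded.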

Figure 6 illustrates the whole local forward backward ELM kernel.

Figure 6: Illustration of forward backward ELM kernel with k = 5 and m = 5


The performance of the above-mentioned ELM techniques was evaluated through qualitative tests based on observation. Our solution was tested with a simulated Pepper using qiBullet [5] and with a real Pepper robot.

Teleoperation based on the global ELM kernel returns numerous wrong predictions that critically deteriorate the quality of the solution. An example of an erroneous prediction is shown in Figure 7.

Figure 7: Example of an erroneous prediction by global ELM kernel (a) Inputted human configuration (b) Expected robot configuration (c) Wrongly predicted robot configuration

The quality of the local forward ELM kernel is fairly acceptable: it makes fewer prediction errors but shows a lot of vibrations in its motion. These vibrations occur when the predicted robot configuration $r_t$ is distant from the preceding prediction $r_{t-1}$ and the succeeding one $r_{t+1}$.

Thanks to the bijective property of the local forward backward ELM kernel, the quality of the teleoperated robot becomes noticeably better, though small vibrations and some prediction errors remain.

Comparing the three ELM techniques, the local forward backward ELM kernel is the most precise, followed by the local forward ELM kernel and then the global ELM kernel.

Our solution was initially designed to control the Pepper robot. As the ELM technique predicts robot angular configurations directly from human key joint information, the solution was expected to work immediately on NAO with the same training dataset, since NAO has a similar geometric structure. For robots with a different geometric model, our solution can be used by simply recollecting the training data, without further modification.

This assumption was verified by applying our solution directly to a simulated NAO and a real NAO robot. A demonstration of the four ELM setups (i.e., simulated Pepper, Pepper robot, simulated NAO and NAO robot) can be seen in Figure 8.

Figure 8: Demonstration of ELM setups

The whole solution, local forward backward ELM combined with VR sets, was tested using the simulated Pepper, shown in Figure 9.

Figure 9: Demonstration of our full solution


A new solution based on the ELM approach is proposed for novel immersive teleoperation of Pepper and NAO. The ELM is not constrained by robot singularities or workspace limits. It predicts independently of past estimations and generalizes to all types of robots. Most importantly, even with scarce training data, it returns satisfactory results in both precision and computational speed.

Our solution can be further extended to both industrial and research applications. On the industrial side, it can facilitate the animators' task and add telepresence features for Human-Robot Interaction (HRI). On the research side, it can be used to gather demonstrations to teach new behaviors to the robot.


References

  1. F. Biocca, “The cyborg’s dilemma: embodiment in virtual environments,” IEEE Int. Conf. on Cognitive Technology: Humanizing the Information Age, pp. 12-26, 1997.

  2. C. Nehaniv and K. Dautenhahn, “Mapping between dissimilar bodies: Affordances and the algebraic foundations of imitation,” 1998.

  3. “Pepper remote arm control using leap motion.” [Online]. Available:

  4. “[sombrero project] romeo immersive teleoperation.” [Online]. Available:

  5. “Nuitrack documentation.” [Online]. Available:

  6. M. Busy and M. Caniot, “qiBullet, a Bullet-based simulator for the Pepper and NAO robots,” arXiv preprint arXiv:1909.00779, 2019.

  7. I. T. Jolliffe, “Principal component analysis,” Springer Series in Statistics, 2002.

  8. S. Jin, C. Dai, Y. Liu, and C. Wang, “Motion imitation based on sparsely sampled correspondence,” Journal of Computing and Information Science in Engineering, vol. 17, 2016.

  9. S. I. Gallant, “Perceptron-based learning algorithms,” IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179–191, 1990.

The author of this post, Jieyeon Woo, built up experience worldwide in the fields of Machine Learning and Robotics before joining the AI Research Center (AIRC) at SoftBank Robotics as a Research Software Engineering Intern. Jieyeon graduated in September 2020 with a Master of Science in Intelligent Systems Engineering from Sorbonne Université, Paris.