Is it possible to get a robotic hand?
The Future of Hand Generation & Control
In recent years, gesture interaction has shown great commercial value and development potential in areas such as human-computer interaction (HCI) [9-12] and virtual/augmented reality (VR/AR) [6-8]. As a branch of gesture interaction, 3D hand pose estimation is a long-standing problem that has drawn much attention in computer vision [1-5] and has achieved great success after many years of research. However, many existing works require depth cameras [13-15] or multi-view strategies [16, 17], which are undesirable due to their high cost and inapplicability in some situations, and progress in monocular RGB-based 3D hand pose estimation remains limited.
The major problem in tracking from monocular RGB images is depth ambiguity, and the performance of recent methods is limited by the difficulty of obtaining comprehensive and effective 3D-labeled training data.
Some works propose methods that overcome depth ambiguity and collision constraints in a multi-camera setup, but multiple cameras limit the application scenarios, so attention has turned to monocular cameras. To obtain 3D hand-joint data, some works focus on the synthetic generation of training data but still struggle with occlusions: for instance, both introduce synthetic hand models and generate series of 3D hand-joint data for training their networks. However, the lack of annotated datasets of real-world images limits their performance.
To overcome the problems in previous works, we propose a new approach to estimate 3D hand pose configuration in a monocular RGB image. In order to preserve real-world features and fetch effective 3D hand joint training data, we combine the advantages of synthetic hand model and real-world hand pose dataset. We divide the hand pose annotation process into three independent stages:
locating the hand area, labeling 2D hand joint positions, and mapping 2D coordinates to 3D ones.
First, we collect a large number of real-world images and feed them into an object detection framework, e.g., YOLOv3, Faster R-CNN, or SSD, to locate the hand area. Second, we build a dataset labeling 2D hand joint positions in real hand images. Because synthetic hand models may not generalize well to real hands and thus limit estimation performance, we use real-world images to avoid the problems caused by the style difference between synthetic hand models and reality. These two steps properly preserve ground-truth features.
Finally, we map 2D coordinates to 3D coordinates. Occlusions at this step, caused by depth ambiguity and camera-viewpoint variation, hurt recognition accuracy; one solution is to build an effective 3D hand pose dataset. We therefore propose a detailed, high-precision hand generation model to build a dataset of 3D coordinates of the skeleton hand pose. The model is composed of two parts, palm planes and finger planes: the palm is divided into four sub-planes, and the joints of each finger lie in a corresponding finger plane. By setting rotation angles and some hidden parameters, the 3D skeleton hand model can cover the vast majority of hand poses while taking the pose limitations of real hands into account.
In summary, our main contributions are:
● A hand tracking model that tracks 3D hand joint positions from a monocular RGB image.
● A 3D skeleton hand model that generates 3D and 2D hand joint coordinate data.
● An FCN that learns the 2D-to-3D coordinate mapping function.
3D hand pose estimation is a challenging problem, mainly due to the restricted availability of 3D coordinate data and the gap between synthetic hand models and real ones in lighting, camera views, and skin texture. There are two mainstream approaches to the problem: the multi-view method and the synthetic hand model.
The multi-view approach alleviates the difficulty of the embedded high-dimensional, non-linear regression problem, improving the performance of hand pose estimation.
To better extract 3D information from one depth image and generate accurate estimates of 3D locations, Ge et al. project the query depth image onto several orthogonal planes and use these multi-view projections to regress 2D heat-maps that estimate the joint positions on each plane. Wang et al. use a wide-baseline camera setup to build a data-driven pose-estimation system that tracks hand pose. Gomez-Donoso et al. contribute a novel multi-view hand pose dataset; however, their real-time hand pose estimation is still limited to 2D. Spurr et al. propose learning a statistical hand model by shaping a joint cross-modal latent space representation, but physical plausibility still leaves room for improvement.
Recent studies of hand-joint estimation usually rely on depth cameras [13, 14, 15, 34] or multi-view strategies [16, 17, 18]. Because multiple cameras limit the application scenarios, and depth images generated by a depth camera only work properly indoors, researchers have started to work on hand pose estimation under a single camera view.
To gain a comprehensive and effective 3D hand-joint dataset, different modalities have been proposed for building synthetic hand models that deal with depth ambiguity [32, 33]. For example, Qian et al. model a hand with spheres based on a 26-degrees-of-freedom hand motion model and define a fast cost function, achieving fast convergence and good accuracy. Likewise, reference uses cones, spheres, and ellipsoids to build a joint hand-object model. Numerous follow-up works propose more holistic 3D hand models; for instance, reference employs a detailed skinned hand mesh of triangles and vertices to describe the observed data more accurately. Others focus on generative component optimization [35, 37] and improved discriminators [19, 38, 39].
Although previous works in monocular RGB hand pose estimation achieve good performance, they may be limited by the style difference between synthetic hand models and real ones.
Zimmermann et al. introduce a large-scale 3D hand pose dataset based on synthetic hand models, yet its performance is limited by the lack of an annotated large-scale dataset with real-world images and diverse pose statistics. Similarly, Muller et al. enhance a synthetic dataset using CycleGANs, but a gap between real and synthetic hand models remains, so their 3D hand pose estimation architecture is not robust enough. In this work, we address this problem by utilizing real images to extract 2D hand-coordinate features, eliminating the image-feature difference.
Description of hand model
In order to establish the mapping from 2D coordinates to 3D coordinates, we should prepare 2D and 3D hand joint data under different postures. As 2D coordinates are projected from 3D coordinates, we first build a 3D gesture model to generate 3D hand coordinate data.
The model consists of one root joint and 20 finger joints, including the fingertips, which have no degrees of freedom. We denote by P ∈ ℝ^{21×3} the 3D positions of all 21 hand joints, and by θ ∈ ℝ^{20} the rotation angles of the 15 finger joints, each with one or two degrees of freedom relative to its base knuckle.
We denote the metacarpophalangeal joints as 0-joints, the proximal interphalangeal joints (PIJ) as 1-joints, the distal interphalangeal joints (DIJ) as 2-joints, and the fingertips as 3-joints. To better match reality, we divide the palm into four planes.
There are four kinds of input parameters: finger length proportions, finger rotation angles, angles between finger root nodes and wrist, and angles between palm planes.
Finger length proportions (20 parameters in total) comprise the lengths of the four parts of the five fingers, e.g., the four segment lengths l₀, l₁, l₂, l₃ of the index finger as shown in Fig. 2. These parameters are collected in a finger-length scale matrix, as shown in Formula (1).
In this matrix, rows from top to bottom denote the thumb to the little finger (T, I, M, R, L), and columns from left to right give the lengths of each finger's parts from bottom to top, in centimeters. The values in the matrix of Formula (1) are for reference and can be adjusted in practice.
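As a concrete sketch, the finger-length scale matrix can be stored as a 5×4 array. The values below are hypothetical placeholders only, not the reference values of Formula (1):

```python
import numpy as np

# Hypothetical finger-length scale matrix in the layout of Formula (1):
# rows run from thumb to little finger (T, I, M, R, L); columns give the
# four segment lengths of each finger from bottom to top, in centimeters.
# The numbers below are illustrative placeholders, not the paper's values.
finger_lengths = np.array([
    [4.0, 3.0, 2.5, 2.5],  # T
    [4.5, 3.5, 2.5, 2.0],  # I
    [4.5, 4.0, 3.0, 2.0],  # M
    [4.5, 3.5, 2.5, 2.0],  # R
    [4.0, 2.5, 2.0, 1.5],  # L
])

assert finger_lengths.shape == (5, 4)  # 20 length parameters in total
```

In practice these 20 entries would be scaled per subject, since the matrix is explicitly meant to be adjustable.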
There are 20 finger rotation angle parameters, comprising the up-down rotation angle of each finger joint and the left-right rotation angle of each finger root joint. Three up-down rotation angles per finger are used to simulate finger-joint bending; downward bending is positive and upward bending is negative (upward bending is less pronounced), as shown by the three bend angles in Fig. 2. Equation (2) is the parameter matrix of the up-down rotation angles. Each finger also has one left-right rotation angle, used to simulate the finger swinging sideways; swinging left is negative and swinging right is positive, as shown by the swing angles in Fig. 3. Equation (3) is the parameter matrix of the left-right rotation angles.
(a) Bend downwards; (b) bend upwards.
In Equation (2), the items with subscript 1 represent the angle between the root-0 node segment and the 0-1 node segment, the items with subscript 2 the angle between the 0-1 and 1-2 node segments, and the items with subscript 3 the angle between the 1-2 and 2-3 node segments.
The angles between the sub-palm planes (3 parameters in total) are used to simulate the rotation between the four sub-palm planes, as shown by the angles γ₁ and γ₂ in Fig. 4. Equation (5) is the angle parameter matrix of the palm planes.
According to the input parameters of the hand model, the computer program calculates the 3D coordinates of 21 hand joints.
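To make this calculation concrete, the sketch below (our assumption about one possible implementation, not the authors' code) places the joints of a single finger inside its finger plane by accumulating the bend angles of Equation (2) along the four segments:

```python
import numpy as np

def finger_joints_2d(lengths, bends):
    """Joint positions of one finger inside its finger plane.

    lengths: four segment lengths (root->0, 0->1, 1->2, 2->3)
    bends:   three bend angles in radians (downward bending positive),
             i.e. the angles between consecutive segments as in Equation (2)
    Returns a (5, 2) array: root joint followed by the 0-, 1-, 2-, 3-joints.
    """
    pts = [np.zeros(2)]
    heading = 0.0                  # direction of the current segment
    angles = [0.0] + list(bends)   # the first segment lies along the plane axis
    for length, bend in zip(lengths, angles):
        heading += bend            # each bend rotates all following segments
        step = length * np.array([np.cos(heading), -np.sin(heading)])
        pts.append(pts[-1] + step)
    return np.array(pts)
```

With all bend angles zero the finger is straight, and the fingertip lies at the sum of the segment lengths along the plane axis; the full model would repeat this per finger and then rotate each finger plane into the palm frame.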
Establishment and expansion of 2D-3D mapping data
After building the hand model, we obtain all 3D joint coordinates. Each joint has coordinates (x, y, z), i.e., 3 parameters, so the 21 joints give 63 parameters. To obtain stereoscopic gestures and predict 3D coordinates from the 2D coordinates of a gesture, we first project the 3D coordinates of several fixed gestures onto 2D planes to obtain the 2D coordinates of the 21 joints.
The projection from 3D to 2D amounts to establishing a mapping from 3D coordinates to 2D coordinates. After obtaining the 2D coordinates corresponding to the 3D coordinates, we can realize 2D-to-3D prediction based on the inverse mapping, from the 42 coordinate parameters to the 63 coordinate parameters.
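Under the z-axis projection used later in this chapter, the 3D-to-2D mapping simply drops the depth coordinate; a minimal sketch:

```python
import numpy as np

def project_z(joints_3d):
    """Orthographic projection along the z axis.

    joints_3d: (21, 3) array, i.e. the 63 3D parameters.
    Returns a (21, 2) array, i.e. the 42 2D parameters.
    """
    joints_3d = np.asarray(joints_3d)
    return joints_3d[:, :2]
```

Because the projected x and y are kept unchanged, inverting this mapping only requires recovering the 21 dropped z values.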
Neural network learning relies on a large amount of relevant data, so we need a sufficiently large 2D-3D dataset for training the 2D-3D network described in the following chapter. The most direct method is to build a large number of different 3D gesture models and project each gesture from a fixed direction to obtain one set of 2D-3D data. However, because creating 3D models is computationally expensive and time-consuming, it is more efficient to project one 3D gesture model from different directions to obtain multiple sets of 2D-3D data, i.e., to apply data augmentation.
For image data, data augmentation is a technique that generates similar derived images to enlarge the dataset, using random transformations such as mirroring, cropping, and rotating. In this chapter, we first apply data augmentation by rotating an existing 3D gesture model in the coordinate system several times and then projecting it along the z-axis direction, obtaining multiple sets of 2D coordinates for one 3D gesture. Finally, a large amount of 2D-3D data can be obtained by projecting multiple 3D gesture models in this way.
The implementation process is described as below.
The rotation of a spatial coordinate system can be equivalently decomposed into a sequence of three planar rotations: fixing the x, y, and z axes in turn and rotating the other two coordinate axes.
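A minimal sketch of this decomposition (the z-y-x composition order below is our assumption; the text does not fix an order):

```python
import numpy as np

def rot_x(a):
    """Rotation about the x axis: the x axis is fixed, y and z rotate."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    """Rotation about the y axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    """Rotation about the z axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotate(points, ax, ay, az):
    """Rotate (N, 3) points by the three planar rotations in sequence."""
    R = rot_z(az) @ rot_y(ay) @ rot_x(ax)
    return points @ R.T
```

Sampling several (ax, ay, az) triples for one 3D gesture and projecting each rotated copy along the z axis yields multiple 2D coordinate sets from a single 3D model, which is exactly the augmentation described above.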
Fig. 8 shows gestures generated according to the method described in this chapter.
Gesture recognition in 2D-3D
Neural networks are widely used in machine learning: by fitting a large number of known input-output pairs to an optimal function, we can predict unknown outputs from input parameters. In the process of 2D-3D gesture recognition, we use neural networks to perform the function fitting and coordinate prediction.
2D-3D gesture recognition means that the 3D coordinate matrix of a gesture (63 parameters in total) can be predicted by the trained neural network from the known 2D coordinate matrix (42 parameters). Because those 42 known parameters are included among the 63 parameters of the 3D coordinate matrix, we only have to predict the 21 unknown parameters. Therefore, when constructing the neural network, the input layer is a 1×42 matrix and the output layer is a 1×21 matrix. The weights and thresholds are randomly preset, and the 42 parameters of the 2D coordinates are fed as input.
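As an illustrative sketch (the hidden width, learning rate, ReLU activation, and use of NumPy are our assumptions, not the paper's specification), a fully connected network from the 42 known parameters to the 21 unknown ones, trained with a mean-squared-error loss and gradient descent, can look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_net(n_in=42, n_hidden=128, n_out=21):
    """Weights and thresholds (biases) are randomly preset; 128 hidden
    units is an illustrative choice, not the paper's architecture."""
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, np.sqrt(2.0 / n_hidden), (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(p, x):
    h = np.maximum(0.0, x @ p["W1"] + p["b1"])  # ReLU hidden layer
    return h @ p["W2"] + p["b2"]

def sgd_step(p, x, y, lr=1e-2):
    """One gradient step on the mean-squared-error loss, with the
    gradient propagated backwards layer by layer."""
    h_pre = x @ p["W1"] + p["b1"]
    h = np.maximum(0.0, h_pre)
    pred = h @ p["W2"] + p["b2"]
    err = pred - y
    n = len(x)
    # gradients of the MSE loss, output layer first
    gW2 = h.T @ err / n
    gb2 = err.mean(axis=0)
    dh = (err @ p["W2"].T) * (h_pre > 0)   # back through the ReLU
    gW1 = x.T @ dh / n
    gb1 = dh.mean(axis=0)
    for k, g in (("W1", gW1), ("b1", gb1), ("W2", gW2), ("b2", gb2)):
        p[k] -= lr * g                     # weights updated layer by layer
    return float((err ** 2).mean())
```

Each training pair would be a projected 2D matrix (flattened to 42 values) and the 21 depth values it came from; repeated `sgd_step` calls drive the loss down and fit the 2D-to-3D prediction function.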
After training the neural network multiple times, we fit the prediction function from 2D to 3D gestures. In the gesture recognition network we use the mean-squared-error loss and stochastic gradient descent; during backpropagation the weights are updated layer by layer, and the 2D-3D mapping relationship is trained. Using this mapping, the restoration of 3D gestures can be realized.
In practice, a major problem in gesture recognition is how to obtain accurate 3D coordinates of gestures. It is hard to obtain 3D coordinates directly from images, while contact or non-contact acquisition equipment (such as data gloves or infrared sensors) introduces errors.
This obstacle hinders the development of gesture recognition. However, once a real gesture is transformed into a virtual 3D model composed of multiple 3D coordinates, we can solve this problem by training a specific neural network with 2D-3D mapping data. We first obtain 2D coordinate data from gesture images and then feed it into the neural network to predict the 3D coordinates of the specific gesture. This process realizes the 3D restoration of gestures, which lays the foundation for 3D gesture recognition.