Learning synergies based in‐hand manipulation with reward shaping

In-hand manipulation is a fundamental ability for multi-fingered robotic hands that interact with their environments. Owing to the high dimensionality of robotic hands and intermittent contact dynamics, effectively programming a robotic hand for in-hand manipulation remains a challenging problem. To address this challenge, this work employs a deep reinforcement learning (DRL) algorithm to learn in-hand manipulation for multi-fingered robotic hands. A reward-shaping method is proposed to assist the learning of in-hand manipulation. The synergies of robotic hand postures are analysed to build a low-dimensional hand posture space. Two additional rewards are designed based on the analysis of hand synergies and the learning history, and are combined with an extrinsic reward to assist in-hand manipulation learning. Three value functions are trained jointly, each with respect to its own reward function, and then cooperate to optimise a control policy for in-hand manipulation. The reward shaping not only improves the exploration efficiency of the DRL algorithm but also provides a way to incorporate domain knowledge. The performance of the proposed learning method is evaluated with object rotation tasks. Experimental results demonstrate that the proposed learning method enables multi-fingered robotic hands to learn in-hand manipulation effectively.


Introduction
In-hand manipulation is an important ability of multi-fingered robotic hands [1]. To perform it, a robotic hand uses its fingers to hold and manipulate an object within the hand. Owing to the high dimensionality of a robotic hand and intermittent contact dynamics, effectively programming in-hand manipulation for a robotic hand is still a challenging problem [2]. Previous methods, such as trajectory optimisation [3] and learning from demonstration [4][5][6], have been proposed to address this problem. Their performance heavily relies on accurate dynamic models or high-quality human demonstrations, which are difficult to obtain. Recently, deep reinforcement learning (DRL) algorithms have achieved state-of-the-art performance on a set of continuous control tasks [7,8]. DRL has the potential to learn manipulation directly from interactions with the environment. However, sample complexity has become a key issue when learning complex manipulations with DRL [5], such as in-hand manipulation with a multi-fingered robotic hand.
In-hand manipulation can be understood as a process in which a robotic hand takes a sequence of finger actions to move an object from an initial pose to a goal pose [1]. Palli et al. [9] proposed understanding in-hand manipulation as a derivation from a reference grasping posture. Traditionally, grasping postures are divided into two grasp types, i.e. power and precision grasps, first introduced by Napier [10]. A grasp type is a way of representing the manner in which a hand handles objects. Power grasps use the fingers and palm to hold an object firmly, whereas precision grasps only use the fingertips to stabilise an object. Recently, Cini et al. extended Napier's grasp taxonomy and introduced a novel grasp taxonomy that presents different human grasp types [11]. Humans are naturally capable of manipulating objects by choosing a feasible grasp type from multiple possible grasp types. Based on analyses of human grasping behaviour [9,12], some works have used the precision grasp posture as guidance to achieve in-hand manipulation [13,14]. In these works, a precision grasp is first sought in the configuration space of a robotic hand and is then used as guidance for the planning and control of in-hand manipulation. Using the precision grasp as guidance helps to reduce the sample complexity of manipulation planning. In this work, we exploit grasp types as guidance to assist the learning of in-hand manipulation with DRL. Different from previous works that require high-quality human demonstrations [4,15], this work only requires a set of human grasping postures. The grasp type is exploited as guidance for in-hand manipulation learning: the robotic hand is encouraged to explore a specific posture subspace instead of the whole configuration space, which reduces the sample complexity of DRL.
DRL algorithms learn manipulation from rewards received through interactions with the environment. Reward functions play a central role in specifying how an agent should act. However, learning complex manipulations with DRL algorithms is usually slow and unstable when only an extrinsic reward is used. Reward-shaping methods [16] have been proposed to assist manipulation learning by introducing additional reward functions that augment the extrinsic reward. Various additional reward functions have been introduced for different learning objectives, such as exploiting human demonstrations [17,18], human advice [19] or intrinsic motivation [20]. In this work, the learning agent is expected to achieve two sub-objectives: (i) exploring a grasping posture subspace under a specific grasp type (such as the precision grasp) instead of the whole configuration space and (ii) fully exploring this posture subspace to avoid local convergence. Accordingly, two additional reward functions are designed. The additional rewards not only help to improve the exploration efficiency of DRL but also provide a way to integrate domain-specific concepts (i.e. the grasp type) into DRL.
The objective of this work is to enable a multi-fingered robotic hand to learn in-hand manipulation effectively. To this end, we propose a learning method that exploits grasp types as guidance for in-hand manipulation learning. Hand synergies are first analysed to build a low-dimensional posture subspace. Then, a reward-shaping method is used to encourage the hand to fully explore a specific posture subspace under a specific grasp type. Three different rewards are designed: an extrinsic reward $r_{ext}$, a hand-based reward $r_{hand}$ and an uncertainty-based reward $r_{unc}$. The $r_{hand}$ is defined based on the analysis of hand synergies, and the $r_{unc}$ is computed based on the uncertainty of the state prediction. The three reward functions allow the agent to learn more about its environment and task. Meanwhile, three value functions are trained jointly, each with respect to its own reward function, and then cooperate to optimise a control policy. Training each value function with respect to its own reward helps to improve the optimisation efficiency of the control policy. The experimental results demonstrate that using grasp types as guidance improves the exploration efficiency of DRL and increases the success rate of task execution.
The rest of this paper is organised as follows: Section 2 presents related work. The necessary background on DRL algorithms is introduced in Section 3. Section 4 describes the construction of the low-dimensional hand posture space. The proposed learning method is presented in Section 5. The experimental evaluation and results are shown in Section 6. Conclusions and future work are presented in Section 7.

Related work
In-hand manipulation with a multi-fingered robotic hand requires finding a feasible action sequence that changes the object pose. Programming a multi-fingered robotic hand to achieve in-hand manipulation is still a challenging problem. Traditionally, trajectory-optimisation-based methods have been proposed, which formulate the manipulation planning problem as a constrained optimisation problem [3,21]. Mordatch et al. [3] proposed a contact-invariant optimisation method for the synthesis of dexterous hand manipulation. Kumar et al. [21] employed a model predictive control algorithm to perform online trajectory optimisation for dexterous manipulation. Owing to the high dimensionality of the robotic hand and the non-linearity of the constraints, the optimisation formulation of in-hand manipulation is difficult to construct accurately.
Other works have employed imitation-learning-based methods to learn in-hand manipulation from human demonstrations. Jakel et al. [22] used multiple human demonstrations to learn a planning model for dexterous manipulation. Gupta et al. [4] took human-demonstrated motions as desired motions and used a policy search method to optimise a policy for in-hand manipulation. These methods rely heavily on high-quality human demonstrations, which are difficult to obtain. Moreover, the kinematic structures of the human hand and the robotic hand differ, which complicates the transfer of demonstrations.
More recent works focus on using DRL algorithms to learn in-hand manipulation. DRL learns in-hand manipulation directly from interactions with the environment, and the control policy is optimised by maximising a user-defined reward function. Zhu et al. [5] used a DRL algorithm to learn valve rotation, box flipping and door opening with a multi-fingered robotic hand. The OpenAI team demonstrated that a real physical Shadow Hand could learn dexterous manipulation using a DRL algorithm [23]. Compared with trajectory-optimisation-based and imitation-learning-based methods, DRL algorithms enable agents to learn in-hand manipulation directly from interactions with their environment. However, one shortcoming of DRL algorithms is their sample complexity, which limits their practical application to learning complex manipulations. Some works have exploited domain knowledge to reduce the sample complexity of DRL algorithms. Peng et al. [24] incorporated human demonstrations into a model-free DRL algorithm; with a number of human demonstrations, the sample complexity of DRL could be reduced. Exploration is a key process in manipulation learning with DRL, especially for complex in-hand manipulations, and some works have designed heuristic exploration strategies to reduce the sample complexity. Achiam and Sastry [25] designed surprise-based intrinsic motivation for DRL, formulating surprise as the Kullback-Leibler divergence of the state transition probability distribution to guide exploration. Incorporating domain knowledge learned from human demonstrations or from the learning history is crucial to improving the learning efficiency of DRL.
Understanding hand manipulation behaviour is important for programming in-hand manipulation. In-hand manipulation is a complex process that typically involves different control patterns, such as stable grasping, finger gaiting and finger coordination [26]. Previous works [9,27] suggested that complex in-hand manipulation can be generated from simpler behaviours (such as grasping). Palli et al. [9] proposed understanding in-hand manipulation as a derivation from a reference grasp configuration. Odhner and Dollar [14] used the precision grasp configuration as a reference for planning dexterous manipulation with an under-actuated hand. Saut et al. [27] built a probabilistic roadmap in grasp subspaces and searched for a trajectory in this roadmap for dexterous manipulation. These works demonstrated that using a specific grasp configuration as a prior helps to reduce the complexity of in-hand manipulation planning. Recently, Cini et al. introduced a novel grasp taxonomy that covers different human grasp types [11]. Human grasp types are classified into five groups: prismatic power, circular power, intermediate, prismatic precision and circular precision; grasp types in the same group share the same characteristics. A grasp type is a way of representing how a hand handles objects. However, few works use the grasp type as a prior for in-hand manipulation learning. In this work, this domain-specific concept, i.e. the grasp type, is incorporated into a DRL algorithm to assist the learning of in-hand manipulation.

Background
In this section, we briefly introduce some preliminaries of reinforcement learning (RL). The RL setting is typically modelled as a Markov decision process given by a tuple $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, R, P, \rho_0\}$ [28]. $\mathcal{S} \subseteq \mathbb{R}^n$ and $\mathcal{A} \subseteq \mathbb{R}^m$ denote the state and action spaces, respectively. $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is a reward function, where $r$ is the reward received after taking action $a$ in state $s$. $P: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ is a transition dynamics model that outputs the probability distribution of the next state $s'$ given the current state $s$ and action $a$. $\rho_0$ is the initial state distribution. At time step $t$, the agent executes an action $a_t$ given the current state $s_t$; it then observes the next state $s_{t+1}$ and receives a reward $r_t$. The objective of RL is to optimise a control policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ by maximising the expected discounted return

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t r_t\right], \qquad (1)$$

where $\tau = \{s_0, a_0, \ldots, s_T, a_T\}$ denotes a rollout obtained from interactions with the environment and $\gamma$ is the discount factor. RL algorithms are built on value functions. Typically, RL defines a state value $V(s)$, a Q-value $Q(s, a)$ and an advantage $A(s, a)$ as follows:

$$V(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s\right], \quad Q(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s, a_t = a\right], \quad A(s, a) = Q(s, a) - V(s).$$

In this work, we use the proximal policy optimisation (PPO) algorithm [29], a state-of-the-art policy gradient method, as the baseline to learn in-hand manipulation. The PPO algorithm maintains two functions: a value function $V_{\theta_v}(s)$ and a policy function $\pi_{\theta_p}(s)$. Policy gradient methods typically suffer from catastrophically large updates. To update the control policy stably, PPO limits the magnitude of updates to the policy weights $\theta_p$ by imposing a constraint on the difference between the new policy $\pi_{\theta}$ and the old policy $\pi_{\theta_{old}}$.
The policy function is optimised by maximising the clipped surrogate objective defined in (2). The objective constructs a trust region around the old policy $\pi_{\theta_{old}}$ by imposing a lower bound on the improvement induced by an update:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad (2)$$

where $\hat{\mathbb{E}}_t$ denotes the empirical expectation over time steps, $\hat{A}_t$ is the estimated advantage at time step $t$, $\epsilon$ is a clip parameter that bounds the probability ratio, and $r_t(\theta) = \pi_{\theta}(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t)$ is the ratio between the probability of the action under the current policy $\pi_{\theta}$ and under the previous policy $\pi_{\theta_{old}}$.
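As a point of reference, the clipped surrogate objective in (2) can be sketched in a few lines of PyTorch; the function and tensor names are illustrative and not part of the original implementation:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO, negated so it can be minimised.

    log_probs_new: log pi_theta(a_t | s_t) under the current policy
    log_probs_old: log pi_theta_old(a_t | s_t), detached from the graph
    advantages:    estimated advantages A_hat_t (e.g. from GAE)
    """
    ratio = torch.exp(log_probs_new - log_probs_old)                 # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum of the two terms bounds the improvement and keeps
    # the update close to the old policy (a trust region).
    return -torch.min(unclipped, clipped).mean()
```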

Construction of low-dimensional hand posture space
Humans are capable of manipulating objects by choosing a feasible grasp type from multiple possible grasp types. A grasp type is a way to represent how a hand handles objects. Once a feasible grasp type is selected, a robotic hand can determine how to use its fingers to manipulate the object. Cini et al. [11] introduced a grasp taxonomy in which different human grasp types are presented. Considering the kinematic limitations of the robotic hand as well as Cini's taxonomy, this work considers six commonly used grasp types (i.e. large wrap, small wrap, power, pinch, precision and tripod), as shown in Fig. 1. The precision grasp is defined as the manipulation of an object using fingertip contacts alone. This work exploits the precision grasp type as guidance for in-hand manipulation learning: the robotic hand is encouraged to explore a specific hand posture space under the precision grasp type instead of the whole configuration space of the robotic hand.
To construct the hand posture space, we need to collect a set of grasping postures covering the six grasp types. In this work, a tracking system is used to record human grasping postures, which are further mapped to a multi-fingered robotic hand. The hand posture space is then constructed from the mapped grasping postures of the robotic hand. During data collection, a user performs object grasping tasks using the six grasp types. Fig. 2a shows the collection of human grasping postures with the tracking system. The tracking system comprises a data glove equipped with active markers whose three-dimensional (3D) positions are recorded; these marker positions are used to compute the grasping posture of the human hand. In total, 12 household objects with different shape attributes were selected for the collection of human grasping postures, as shown in Fig. 2b.
After data collection, the human grasping postures are mapped to a robotic hand. In this work, we use a Shadow Dexterous Hand, i.e. a five-fingered robotic hand, for the in-hand manipulation experiments. The mapping is achieved using the inverse kinematic method from [30]. The Shadow Dexterous Hand has a total of 24 joints, which include 20 actuated degrees of freedom and four under-actuated movements; hence, its whole configuration space is high-dimensional. Studies from neuroscience suggest that the central nervous system adopts a simplified strategy to coordinate a large number of degrees of freedom in motor control [31]. Synergies have been introduced as a way to understand the control of movements involving many degrees of freedom. In the past decade, synergies have also been widely used to analyse human grasping behaviour [32] and to design advanced grasp planning methods [33]. Previous works [9,33] demonstrated that the first two principal components capture more than 80% of the variance of hand postures. This means that only two or three principal components suffice to form the most commonly used grasp postures. Hence, a low-dimensional hand posture space for a robotic hand can be constructed based on the analysis of hand synergies.
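As an illustration of this synergy-based construction, the following sketch uses scikit-learn's PCA to build the low-dimensional posture space described in the next paragraph; the function and variable names are assumptions, not the authors' code:

```python
import numpy as np
from sklearn.decomposition import PCA

def build_posture_space(postures, n_components=2):
    """postures: (N, 24) array of mapped Shadow Hand joint-angle vectors."""
    pca = PCA(n_components=n_components)
    amplitudes = pca.fit_transform(postures)   # amplitude vectors A = [a_1, ..., a_n]
    c_m = pca.mean_                            # nominal hand posture c_m
    synergies = pca.components_                # principal components e_i
    return pca, amplitudes, c_m, synergies

def project(pca, posture):
    """Map a 24-D joint configuration c to its low-dimensional amplitude vector A."""
    return pca.transform(np.asarray(posture).reshape(1, -1))[0]

def reconstruct(pca, amplitude):
    """Recover an approximate joint configuration c_m + sum_i a_i * e_i."""
    return pca.inverse_transform(np.asarray(amplitude).reshape(1, -1))[0]
```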
This work also employs the principal component analysis (PCA) algorithm to analyse the synergies of hand postures and then constructs a low-dimensional hand posture space for the Shadow Hand. The PCA algorithm computes the principal components of all the mapped hand postures. A hand posture of the robotic hand is defined as a vector of joint angles, i.e. $c = \{\theta_i\}_{i=1:24}$. After the analysis of hand synergies, a hand posture $c$ is expressed as a linear combination of uncorrelated components $\{e_i\}_{i=1:n}$, as defined in (3):

$$c \approx c_m + \sum_{i=1}^{n} a_i e_i, \qquad (3)$$

where $c_m$ denotes a nominal hand posture and $a_i$ is the amplitude along the $i$th principal component. After processing the grasping postures of the Shadow Hand, the first two principal components were found to account for 91.4% of the variance. Hence, we use the first two principal components to construct the low-dimensional space, i.e. $n = 2$, and the posture of the Shadow Hand can be represented compactly by the amplitude vector $A = [a_1, a_2]$.

Proposed learning method

This section presents the proposed learning method, which exploits the precision grasp type as guidance for in-hand manipulation learning with a multi-fingered robotic hand. Two additional reward functions are first introduced based on the analysis of hand synergies and the learning history. In-hand manipulation is then learned using the proposed learning method. Fig. 4 shows the schematic diagram of the proposed learning method.

Reward shaping for efficient exploration of DRL
This work takes advantage of the information extracted from both the analysis of hand posture synergies and the learning history to assist manipulation learning. Two additional reward functions are designed to encode the extracted information. In total, the proposed learning method uses three different reward functions: (i) an extrinsic reward $r_{ext}$ that specifies the task goal; (ii) a hand-based reward $r_{hand}$ that encourages the agent to explore the state subspace under a specific grasp type (such as the precision grasp); and (iii) an uncertainty-based reward $r_{unc}$ that balances the trade-off between exploration and exploitation:

$$r := \{r_{ext}, r_{hand}, r_{unc}\}. \qquad (4)$$

The extrinsic reward is generated according to the task requirement and the environment, and is defined in Section 6. The following subsections introduce the definitions of the hand-based reward $r_{hand}$ and the uncertainty-based reward $r_{unc}$.

Hand-based reward based on the analysis of hand synergies:
This work exploits the precision grasp type as guidance for in-hand manipulation learning. Hand postures required for in-hand manipulation are understood as derivations from the precision grasp. As shown in Fig. 3, the hand posture space is divided into different regions according to the six grasp types; the posture subspace under the precision grasp type is a partial region of the whole posture space. To reduce the sample complexity of DRL, the robot is encouraged to explore the specific posture subspace under the precision grasp type and to ignore infeasible states in other posture subspaces (such as the subspace under the pinch grasp type).
To this end, this work uses a similarity measure to define the hand-based reward $r_{hand}$, which drives the robot to explore the posture subspace under the precision grasp. The $r_{hand}$ is defined based on the Euclidean distance $d_{hand} = \| A - A_{precision}^{center} \|_2$ between the amplitude vector $A$ of the explored configuration $c$ and the amplitude vector $A_{precision}^{center}$ of the reference grasp configuration $c_{precision}^{center}$, as defined in (5), where $A_{precision}^{center}$ denotes the clustering centre obtained by clustering in the hand posture space, as shown in Fig. 3. Equation (5) shows the computation of $r_{hand}$: the explored hand posture is taken as a precision grasp if $d_{hand} < 0.4$, in which case $r_{hand}$ is set to a high value to encourage the agent to execute precision grasps.
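A minimal sketch of this hand-based reward is given below; the 0.4 threshold comes from the text, while the bonus value and the penalty outside the precision region are illustrative assumptions (the paper only states that a 'high value' is assigned):

```python
import numpy as np

def hand_based_reward(amplitude, precision_centre, threshold=0.4,
                      bonus=1.0, penalty_scale=1.0):
    """Hand-based reward r_hand driving exploration towards the precision subspace.

    amplitude:        amplitude vector A of the explored hand posture
    precision_centre: clustering centre A_precision^centre of the precision grasps
    """
    d_hand = np.linalg.norm(np.asarray(amplitude) - np.asarray(precision_centre))
    if d_hand < threshold:
        return bonus                     # posture counts as a precision grasp
    return -penalty_scale * d_hand       # assumed shaping outside the subspace
```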

Uncertainty-based reward based on uncertainty measure:
The hand-based reward $r_{hand}$ encourages the robot to explore the specific hand posture subspace under the precision grasp. Meanwhile, the agent should also explore this posture subspace fully to avoid local convergence; inadequate exploration could result in failing to find an effective control policy. Although this work adds random Gaussian noise to action selections, random exploration is not efficient, particularly for learning complex manipulations. Different heuristic exploration approaches have been proposed to improve exploration efficiency. Intrinsically motivated RL [20,25] is a popular approach that defines intrinsic motivation rewards to improve the exploration efficiency of DRL. This work proposes to measure the uncertainty of the explored state as an intrinsic motivation for the agent, so that the robotic hand is encouraged to explore new states. The uncertainty of explored states is approximated by the prediction error of a transition dynamics model. This work approximates the transition dynamics model $f_{\theta_f}(s_t, a_t)$ by training a neural network, where $\theta_f$ denotes the weights of the network.
The dynamics model $f_{\theta_f}(s_t, a_t)$ takes the current state $s_t$ and action $a_t$ as inputs and predicts the next state, i.e. $f_{\theta_f}: s_t \times a_t \rightarrow \hat{s}_{t+1} \approx s_{t+1}$. The model $f_{\theta_f}$ is trained by minimising a mean squared error loss

$$L(\theta_f) = \frac{1}{|D|} \sum_{(s_t, a_t, s_{t+1}) \in D} \left\| f_{\theta_f}(s_t, a_t) - s_{t+1} \right\|_2^2, \qquad (6)$$

where $\|\cdot\|_2$ is the L2 norm and $D$ is a minibatch of data collected from interactions with the environment. In this work, the uncertainty of a state is measured by the prediction error of the dynamics model. At each time step $t$, given the collected state $s_t$ and action $a_t$, the dynamics model $f_{\theta_f}$ predicts the next state $\hat{s}_{t+1}$. We measure the Euclidean distance $d_{unc} = \| s_{t+1} - \hat{s}_{t+1} \|_2$ between the explored next state $s_{t+1}$ and the predicted next state $\hat{s}_{t+1}$ to estimate the state uncertainty. Equation (7) shows the computation of the uncertainty-based reward $r_{unc}$. The learning history, captured by the transition dynamics model, is thus used to design $r_{unc}$, which encourages the robot to try unexplored states for which the prediction error of the dynamics model is high.
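A sketch of the transition dynamics model and the resulting uncertainty-based reward, assuming the 68-D state and 24-D action used later in the experiments (network sizes are assumptions):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Forward model f_theta_f(s_t, a_t) -> s_hat_{t+1}."""
    def __init__(self, state_dim=68, action_dim=24, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def uncertainty_reward(model, state, action, next_state):
    """r_unc: Euclidean prediction error d_unc of the dynamics model."""
    with torch.no_grad():
        pred_next = model(state, action)
    d_unc = torch.norm(next_state - pred_next, dim=-1)   # higher error, higher reward
    return d_unc
```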

Learning in-hand manipulation with multiple rewards
Different rewards drive the robot towards different learning behaviours. Manipulation learning with multiple rewards is usually formulated as a multi-objective optimisation problem. Different from previous methods that train the robot directly on a composite reward, this work maintains three value functions, each with its own reward. The three value functions are trained jointly with respect to their rewards and then cooperate to optimise a control policy for in-hand manipulation. In this way, each value function captures a different learning behaviour with respect to its reward. For example, the hand-based value function learns the formation of a precision grasp for in-hand manipulation, while the uncertainty-based value function aims to try unexplored states to avoid local convergence. According to the three rewards $r := \{r_{ext}, r_{hand}, r_{unc}\}$, this work jointly trains three value functions $V_{\theta_v} := \{V_{ext}, V_{hand}, V_{unc}\}$. The value functions are approximated by training a deep neural network, where $\theta_v$ denotes the weights of the network. The network takes the current state $s_t$ and predicts the state values $\hat{v}_t = \{\hat{v}_{ext}, \hat{v}_{hand}, \hat{v}_{unc}\}$, i.e. $V_{\theta_v}: s_t \rightarrow \hat{v}_t \approx v_t$. In this work, the value functions share the low-level layers of the neural network and each owns a separate high-level layer. Fig. 5 shows the network architecture of the value functions. The first two fully connected layers extract the state features and the last, independent fully connected layers are trained to achieve the different learning behaviours. The parameters $\theta_v := \{\theta_{v,0}, \theta_{v,ext}, \theta_{v,hand}, \theta_{v,unc}\}$ of the value functions are optimised by minimising the following objective, a linear combination of the three per-reward losses:

$$L(\theta_v) = L_{ext} + L_{hand} + L_{unc}, \qquad (8)$$

where $L_{ext}$, $L_{hand}$ and $L_{unc}$ denote the loss functions of the three value functions, respectively. Each loss is computed between $y_o$, the value target estimated with the temporal-difference TD($\lambda$) method [34], and $\hat{v}_o$, the predicted value of the corresponding value function.
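The shared-trunk value network described above could look roughly as follows; the layer widths and the exact loss form are assumptions, as the text only specifies the shared low-level layers and the three separate heads:

```python
import torch
import torch.nn as nn

class MultiHeadValue(nn.Module):
    """Value network V_theta_v with shared feature layers and one head per reward."""
    def __init__(self, state_dim=68, hidden=256):
        super().__init__()
        self.shared = nn.Sequential(              # theta_{v,0}: shared layers
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh())
        self.head_ext = nn.Linear(hidden, 1)      # theta_{v,ext}
        self.head_hand = nn.Linear(hidden, 1)     # theta_{v,hand}
        self.head_unc = nn.Linear(hidden, 1)      # theta_{v,unc}

    def forward(self, state):
        h = self.shared(state)
        return self.head_ext(h), self.head_hand(h), self.head_unc(h)

def value_loss(v_pred, v_target):
    """L(theta_v) = L_ext + L_hand + L_unc, each a regression to its TD(lambda) target."""
    return sum(nn.functional.mse_loss(v, y) for v, y in zip(v_pred, v_target))
```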
Then, the three value functions cooperate to compute the advantage $\hat{A}$ used to optimise the control policy. A composition value $V_{comp}$ is first computed as a linear combination of the predicted values of the three value functions,

$$V_{comp} = \beta_1 V_{ext} + \beta_2 V_{hand} + \beta_3 V_{unc}, \qquad (9)$$

where $\beta_1$, $\beta_2$ and $\beta_3$ denote the weights that determine the importance of the value functions during training. We can also show that policy optimisation with respect to $V_{comp}$ is consistent with optimisation with respect to the composition reward

$$r_{comp} = \beta_1 r_{ext} + \beta_2 r_{hand} + \beta_3 r_{unc}, \qquad (10)$$

so the multi-objective optimisation of the value functions is scalarised. Next, this work uses the generalised advantage estimation (GAE($\lambda$)) method [34] to estimate the advantage $\hat{A}$ from the composition value $V_{comp}$ and the composition reward $r_{comp}$. The control policy $\pi_{\theta_p}$, the value functions $V_{\theta_v}$ and the transition dynamics model $f_{\theta_f}$ are jointly optimised by minimising the loss defined in (11), where $L^{CLIP}(\theta_p)$, $L(\theta_v)$ and $L(\theta_f)$ denote the loss functions of the policy function, the value functions and the dynamics model, respectively. The pseudocode of the proposed learning method with reward shaping is shown in Algorithm 1. Initially, the robotic hand interacts with its environment to collect training data by running the current policy $\pi_{\theta_p}$. The two additional rewards (i.e. $r_{hand}$ and $r_{unc}$) together with the extrinsic reward $r_{ext}$ are computed at each time step. After collecting the training data, the policy function, the value functions and the dynamics model are jointly updated with the sampled data.

Experiments

Experimental setup: Object rotation tasks were used for the evaluation, as shown in Fig. 6. In this experiment, two different objects (i.e. a block and an egg) were used. A Shadow Dexterous Hand [https://www.shadowrobot.com/products/dexterous-hand/] was used to manipulate an object from an initial pose to a target pose using its fingers. The Shadow Hand is a 24-DoF manipulator with five fingers comprising 22 joints and a wrist with two joints.
The state of the robotic hand is a 68-dimensional variable that includes the hand joint angles, the joint velocities, the object pose, the object velocity and the target object pose. The action is a 24-dimensional variable giving the relative angle of the hand joints with respect to the current angles. The initial state of the agent and the target object pose were randomly chosen at the beginning of each episode. The objective is to manipulate the object to reach the desired orientation. Hence, the extrinsic reward $r_{ext}$ was computed based on the difference $d_{ext}$ between the current object orientation and the target object orientation, as defined in (12), where the object orientation is represented as a quaternion $q$ and $q_{curr} \ominus q_{target}$ denotes the difference between the current quaternion $q_{curr}$ and the target quaternion $q_{target}$. The two rotation tasks used the same $r_{ext}$, which encourages the agent to manipulate the object to reach the target pose. We assume that the object has reached the goal pose if $d_{ext} < 0.1$ and set $r_{ext}$ to a high value to encourage the robot to complete the object rotation task successfully.
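A hedged sketch of how such a quaternion-based extrinsic reward could be computed; the rotation-angle formula is standard, but the success bonus and the exact shaping are assumptions (the paper only specifies a distance-based reward with a 'high value' on success):

```python
import numpy as np

def quat_angle_diff(q_curr, q_target):
    """Rotation angle d_ext (radians) between two unit quaternions q_curr and q_target."""
    dot = abs(np.clip(np.dot(q_curr, q_target), -1.0, 1.0))
    return 2.0 * np.arccos(dot)

def extrinsic_reward(q_curr, q_target, threshold=0.1, bonus=5.0):
    """r_ext: negative orientation error plus an assumed bonus when the goal is reached."""
    d_ext = quat_angle_diff(q_curr, q_target)
    reward = -d_ext
    if d_ext < threshold:
        reward += bonus        # object orientation considered to have reached the goal
    return reward
```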

Training details:
The control policy $\pi_{\theta_p}$ was also approximated by training a neural network, where $\theta_p$ denotes the weights of the network. The policy $\pi_{\theta_p}$ takes the current state $s_t$ and outputs the predicted action $a_t$, i.e. $\pi_{\theta_p}: s_t \rightarrow a_t$. The policy network used three fully connected layers, each with 256 hidden units and the tanh activation function.
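Based on that description, the policy network could be sketched as a Gaussian policy with three 256-unit tanh layers; the output head and the learned log-standard-deviation are assumptions:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Policy pi_theta_p(s_t) with three fully connected tanh layers of 256 units."""
    def __init__(self, state_dim=68, action_dim=24, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # exploration noise (assumed)

    def forward(self, state):
        mean = self.net(state)
        return torch.distributions.Normal(mean, self.log_std.exp())
```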
The training parameters of the proposed learning method were set as follows. The number of epochs $N$ was 2500. The number of optimisation steps $N_{opt}$ was 10. The rollout length $N_{rollout}$ was 2048. The maximum length of each episode $N_{epi}$ was 100. During policy training, the batch size was set to 256. The initial learning rate was set to 0.0001 and was linearly decreased over the learning process. The discount factor $\gamma$ was 0.99 and the clip parameter $\epsilon$ was 0.2. The two parameters of the GAE method were set to $\lambda = 0.95$ and $\gamma = 0.99$. The stochastic gradient descent method with a momentum of 0.9 was employed to optimise the weights of the neural networks. The three weights were set to $\beta_1 = 1$, $\beta_2 = 0.5$ and $\beta_3 = 0.1$, weighting the importance of $r_{ext}$, $r_{hand}$ and $r_{unc}$, respectively.
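For readability, the hyperparameters listed above can be gathered in one place; the key names are illustrative:

```python
# Hyperparameters as stated in the text.
config = {
    "epochs": 2500,              # N
    "opt_steps": 10,             # N_opt, optimisation steps per update
    "rollout_length": 2048,      # N_rollout
    "max_episode_length": 100,   # N_epi
    "batch_size": 256,
    "learning_rate": 1e-4,       # linearly decayed during training
    "gamma": 0.99,               # discount factor
    "clip_eps": 0.2,             # PPO clip parameter
    "gae_lambda": 0.95,
    "sgd_momentum": 0.9,
    "betas": (1.0, 0.5, 0.1),    # weights of r_ext, r_hand, r_unc
}
```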

Experimental results
First, the overall performance of the proposed learning method was evaluated on the object rotation tasks. The proposed method was compared with three baselines: the PPO algorithm, PPO + hand (PPO with the hand-based reward) and PPO + unc (PPO with the uncertainty-based reward). The four algorithms were evaluated on the two object rotation tasks described above. Fig. 7 shows the performance curves of the four algorithms. The learning performance was evaluated by the number of successes in each rollout; a high number of successes means that the robot has a high chance of achieving the object rotation task. First, the experimental results show that the PPO algorithm failed to obtain a high success rate in the block rotation task, and manipulation learning with PPO was relatively slow. Second, the PPO + hand and PPO + unc algorithms performed better than PPO, obtaining higher success rates; both additional reward functions helped to improve the exploration efficiency of the PPO algorithm. We also noticed that the learning speed of the PPO + hand algorithm was faster than that of the PPO + unc algorithm, because the hand-based reward $r_{hand}$ drives the robot to use feasible hand configurations to rotate the objects, reducing the learning time required. Finally, compared with the three other algorithms, the proposed method obtained the best learning performance, achieving the highest success rate. With the guidance of $r_{hand}$, the robotic hand had a high chance of trying states with a high probability of success. Incorporating domain knowledge and information extracted from the learning history helped to reduce sample complexity and improve exploration efficiency. Fig. 8 illustrates the in-hand manipulation process.
Learning in-hand manipulation with DRL algorithms is usually a time-consuming process, and DRL with only an extrinsic reward is usually slow and unstable, especially when the extrinsic reward is sparse. Previous work has demonstrated that the use of prior domain knowledge helps to improve the performance of manipulation learning with DRL algorithms [5,15]. However, it is still challenging to represent and use domain knowledge in DRL algorithms. In this work, we exploited an abstraction of hand grasping postures (i.e. the grasp type) as guidance to assist in-hand manipulation learning. Grasp types are a form of domain knowledge that conveys how a hand handles objects.
A reward-shaping method is used to fuse this domain knowledge into a DRL algorithm; the domain knowledge then encourages the agent to attempt reasonable actions. The comparison results demonstrated that the proposed learning method enables the Shadow Hand to learn in-hand manipulation effectively.

Next, the contribution of the hand-based reward was evaluated within the proposed learning method. The hand-based reward $r_{hand}$ is designed to encode the grasp-type information, with the precision grasp configuration taken as guidance for in-hand manipulation learning. The average episode return $R_{hand}$ with respect to $r_{hand}$ can be used to evaluate the manipulation ability of the robotic hand: a high $R_{hand}$ means that the agent is able to form the precision hand configuration. Fig. 9 shows the average episode return $R_{hand}$ of the two algorithms (i.e. PPO + hand and the proposed algorithm). The return $R_{hand}$ increased as learning progressed, which means that the robot was able to form and use the precision hand configuration that produced a high success rate of object rotation. The proposed algorithm learned faster than the PPO + hand algorithm, showing that $r_{hand}$ encouraged the agent to explore regions of the state space with high hand-based reward. This also suggests that the use of domain knowledge helped the agent to quickly learn how to control its fingers to achieve the object rotation tasks. Hence, given the hand-based reward $r_{hand}$ as guidance, the Shadow Hand had a high chance of finding feasible hand configurations for in-hand manipulation.
Finally, previous methods mainly used a composite reward computed from multiple rewards to learn a control policy [35]. In the proposed learning method, three value functions were trained jointly, each with respect to its own reward function, and then cooperated to optimise the control policy. To demonstrate the effectiveness of learning independent value functions, the proposed learning method was compared with a variant that uses a composition reward, named PPO + composition. The PPO + composition method is also based on the PPO algorithm, but it uses a single composition reward that is a linear combination of the three rewards defined in Section 5. Fig. 10 shows the performance curves of the two algorithms on the two object rotation tasks, evaluated by the number of successes in each rollout. It can be seen that the proposed algorithm obtained a clear performance boost over the PPO + composition method. Optimisation with respect to a composition reward is a multi-objective optimisation problem; training three independent value functions jointly helps to scalarise it. Each reward function typically represents an individual desirable learning behaviour of the agent, and the experimental results showed that training the agent (i.e. optimising the value functions) with respect to individual rewards can be faster.

Conclusion and future work
This work presents a learning method for a multi-fingered robotic hand to learn in-hand manipulation. Information extracted from both the analysis of hand posture synergies and the learning history is used to guide in-hand manipulation learning. A reward-shaping method is introduced to take advantage of the extracted information, and two additional reward functions (i.e. a hand-based reward and an uncertainty-based reward) are designed. The hand-based reward drives the robotic hand to explore a specific hand posture subspace with a high probability of success, while the uncertainty-based reward encourages the robotic hand to try unexplored states so as to balance the trade-off between exploration and exploitation. Meanwhile, three value functions are trained jointly with respect to their reward functions and then cooperate to optimise a control policy. The performance of the proposed learning method was evaluated with object rotation tasks.
The experimental results showed that the proposed learning method allows the robotic hand to achieve the object rotation tasks effectively. The reward shaping not only improves the exploration efficiency of the DRL algorithm but also provides a way to incorporate domain knowledge. Moreover, this work demonstrated that exploiting domain knowledge (such as grasp type) is effective for learning in-hand manipulations with a multi-fingered robotic hand.
The problem of learning in-hand manipulation effectively is far from solved. In this work, the robotic hand is allowed to explore all states in its configuration space. This is impractical for a real robotic system, because unreasonable states may damage the hardware. Hence, safety constraints that restrict the learning agent to reasonable states should be considered, and developing a safe exploration algorithm for DRL is an essential direction for future work.