Abstract
We present a novel vision-based, 6-DoF grasping framework based on Deep Reinforcement Learning (DRL) that is capable of directly synthesizing continuous 6-DoF actions in cartesian space. Our proposed approach uses visual observations from an eye-in-hand RGB-D camera, and we mitigate the sim-to-real gap with a combination of domain randomization, image augmentation, and segmentation tools. Our method consists of an off-policy, maximum-entropy, Actor-Critic algorithm that learns a policy from a binary reward and a few simulated example grasps. It does not need any real-world grasping examples, is trained completely in simulation, and is deployed directly to the real world without any fine-tuning. The efficacy of our approach is demonstrated in simulation and experimentally validated in the real world on 6-DoF grasping tasks, achieving state-of-the-art results of an 86% mean zero-shot success rate on previously unseen objects, an 85% mean zero-shot success rate on a class of previously unseen adversarial objects, and a 74.3% mean zero-shot success rate on a class of previously unseen, challenging "6-DoF" objects.Raw footage of real-world validation can be found at https://youtu.be/bwPf8Imvoo