Quadrotors are among the most agile and dynamic machines ever created. In the hands of a skilled human pilot, they can do some astonishing series of maneuvers. And while autonomous flying robots have been getting better at flying dynamically in real-world environments, they still haven’t demonstrated the same level of agility of manually piloted ones.
Now researchers from the Robotics and Perception Group at the University of Zurich and ETH Zurich, in collaboration with Intel, have developed a neural network training method that “enables an autonomous quadrotor to fly extreme acrobatic maneuvers with only onboard sensing and computation.” Extreme.
There are two notable things here: First, the quadrotor can do these extreme acrobatics outdoors without any kind of external camera or motion-tracking system to help it out (all sensing and computing is onboard). Second, all of the AI training is done in simulation, without the need for an additional simulation-to-real-world (what researchers call “sim-to-real”) transfer step. Usually, a sim-to-real transfer step means putting your quadrotor into one of those aforementioned external tracking systems, so that it doesn’t completely bork itself while trying to reconcile the differences between the simulated world and the real world, where, as the researchers wrote in a paper describing their system, “even tiny mistakes can result in catastrophic outcomes.”
To enable “zero-shot” sim-to-real transfer, the neural net training in simulation uses an expert controller that knows exactly what’s going on to teach a “student controller” that has much less perfect knowledge. That is, the simulated sensory input that the student ends up using as it learns to follow the expert has been abstracted to present the kind of imperfect, imprecise data it’s going to encounter in the real world. This can involve things like abstracting away the image part of the simulation until you’d have no way of telling the difference between abstracted simulation and abstracted reality, which is what allows the system to make that sim-to-real leap.
The simulation environment that the researchers used was Gazebo, slightly modified to better simulate quadrotor physics. Meanwhile, over in reality, a custom 1.5-kilogram quadrotor with a 4:1 thrust to weight ratio performed the physical experiments, using only a Nvidia Jetson TX2 computing board and an Intel RealSense T265, a dual fisheye camera module optimized for V-SLAM. To challenge the learning system, it was trained to perform three acrobatic maneuvers plus a combo of all of them:
All of these maneuvers require high accelerations of up to 3 g’s and careful control, and the Matty Flip is particularly challenging, at least for humans, because the whole thing is done while the drone is flying backwards. Still, after just a few hours of training in simulation, the drone was totally real-world competent at these tricks, and could even extrapolate a little bit to perform maneuvers that it was not explicitly trained on, like doing multiple loops in a row. Where humans still have the advantage over drones is (as you might expect since we’re talking about robots) is quickly reacting to novel or unexpected situations. And when you’re doing this sort of thing outdoors, novel and unexpected situations are everywhere, from a gust of wind to a jealous bird.
For more details, we spoke with Antonio Loquercio from the University of Zurich’s Robotics and Perception Group.
IEEE Spectrum: Can you explain how the abstraction layer interfaces with the simulated sensors to enable effective sim-to-real transfer?
Antonio Loquercio: The abstraction layer applies a specific function to the raw sensor information. Exactly the same function is applied to the real and simulated sensors. The result of the function, which is “abstracted sensor measurements,” makes simulated and real observation of the same scene similar. For example, suppose we have a sequence of simulated and real images. We can very easily tell apart the real from the simulated ones given the difference in rendering. But if we apply the abstraction function of “feature tracks,” which are point correspondences in time, it becomes very difficult to tell which are the simulated and real feature tracks, since point correspondences are independent of the rendering. This applies for humans as well as for neural networks: Training policies on raw images gives low sim-to-real transfer (since images are too different between domains), while training on the abstracted images has high transfer abilities.
How useful is visual input from a camera like the Intel RealSense T265 for state estimation during such aggressive maneuvers? Would using an event camera substantially improve state estimation?
Our end-to-end controller does not require a state estimation module. It shares however some components with traditional state estimation pipelines, specifically the feature extractor and the inertial measurement unit (IMU) pre-processing and integration function. The input of the neural networks are feature tracks and integrated IMU measurements. When looking at images with low features (for example when the camera points to the sky), the neural net will mainly rely on IMU. When more features are available, the network uses to correct the accumulated drift from IMU. Overall, we noticed that for very short maneuvers IMU measurements were sufficient for the task. However, for longer ones, visual information was necessary to successfully address the IMU drift and complete the maneuver. Indeed, visual information reduces the odds of a crash by up to 30 percent in the longest maneuvers. We definitely think that event camera can improve even more the current approach since they could provide valuable visual information during high speed.
You describe being able to train on “maneuvers that stretch the abilities of even expert human pilots.” What are some examples of acrobatics that your drones might be able to do that most human pilots would not be capable of?
The Matty Flip is probably one of the maneuvers that our approach can do very well, but human pilots find very challenging. It basically entails doing a high speed power loop by always looking backward. It is super challenging for humans, since they don’t see where they’re going and have problems in estimating their speed. For our approach the maneuver is no problem at all, since we can estimate forward velocities as well as backward velocities.
What are the limits to the performance of this system?
At the moment the main limitation is the maneuver duration. We never trained a controller that could perform maneuvers longer than 20 seconds. In the future, we plan to address this limitation and train general controllers which can fly in that agile way for significantly longer with relatively small drift. In this way, we could start being competitive against human pilots in drone racing competitions.
Can you talk about how the techniques developed here could be applied beyond drone acrobatics?
The current approach allows us to do acrobatics and agile flight in free space. We are now working to perform agile flight in cluttered environments, which requires a higher degree of understanding of the surrounding with respect to this project. Drone acrobatics is of course only an example application. We selected it because it makes a stress test of the controller performance. However, several other applications which require fast and agile flight can benefit from our approach. Examples are delivery (we want our Amazon packets always faster, don’t we?), search and rescue, or inspection. Going faster allows us to cover more space in less time, saving battery costs. Indeed, agile flight has very similar battery consumption of slow hovering for an autonomous drone.
“Deep Drone Acrobatics,” by Elia Kaufmann, Antonio Loquercio, René Ranftl, Matthias Müller, Vladlen Koltun, and Davide Scaramuzza from the Robotics and Perception Group at the University of Zurich and ETH Zurich, and Intel’s Intelligent Systems Lab, was presented at RSS 2020.