Learning Deep Sensorimotor Policies for Vision-Based Autonomous Drone Racing

The development of effective vision-based algorithms has been a significant challenge in achieving autonomous drones, which promise to offer immense potential for many real-world applications. This paper investigates learning deep sensorimotor policies for vision-based drone racing, which is a particularly demanding setting for testing the limits of an algorithm. Our method combines feature representation learning to extract task-relevant feature representations from high-dimensional image inputs with a learning-by-cheating framework to train a deep sensorimotor policy for vision-based drone racing. This approach eliminates the need for globally-consistent state estimation, trajectory planning, and handcrafted control design, allowing the policy to directly infer control commands from raw images, similar to human pilots. We conduct experiments using a realistic simulator and show that our vision-based policy can achieve state-of-the-art racing performance while being robust against unseen visual disturbances. Our study suggests that consistent feature embeddings are essential for achieving robust control performance in the presence of visual disturbances. The key to acquiring consistent feature embeddings is utilizing contrastive learning along with data augmentation. Video: https://youtu.be/AX_fcnW9yqE


I. INTRODUCTION
Autonomous drones can travel through complex and dynamic environments at very high speed, holding great potential for a wide range of applications, such as industrial inspection, search and rescue, and reconnaissance. Robust vision-based autonomous flight is key to this goal. Autonomous vision-based drones have made significant progress in recent years, continuously pushing the vehicle to higher speeds and better robustness. Several competitions have been organized to push the limit, such as the IROS 2016-19 Autonomous Drone Racing series [1], the NeurIPS 2019 Game of Drones [2], the 2019 AlphaPilot Challenge [3], [4], and the ICRA 2022 DodgeDrone Challenge.
Vision-based autonomous drone racing requires operating the vehicle on the edge of its physical limits, thereby coping with the motion blur and the rapid illumination changes induced by the high speeds and quick rotations of the camera. The tolerance of the system for mistakes is extremely low: any small error can lead to a crash.
While existing works on vision-based autonomous drone racing rely on globally-consistent state estimation, planning, and control [1]-[5], human pilots race drones by relying solely on a video stream from the drone's onboard camera, that is, by directly mapping visual input to control commands. While human pilots build a mental model of the drone state, they do not perform any explicit state estimation or trajectory planning [6]. In this paper, we make a small step toward emulating human pilots by learning a deep sensorimotor policy for vision-based autonomous drone racing.
Recent progress in the robot learning community demonstrates that learning deep sensorimotor policies for robotic tasks is feasible. Methods of this kind usually predict control commands directly from information extracted from high-dimensional sensory inputs. Deep sensorimotor policies have been heavily investigated in many robotic domains, such as object manipulation using robot arms [7], [8] or benchmark control of simulated robots [9]-[11]. This line of work has the advantage that the policy relaxes the need for globally-consistent state information and broadens the applicability of the system. However, learning deep sensorimotor policies for vision-based navigation still faces several challenges, including high sample complexity and poor generalization.
An overview of our system is given in Fig. 2. Our main contribution is a deep sensorimotor policy that can jointly solve perception, planning, and control for autonomous drone racing, without relying on a globally-consistent state of the drone or on trajectory planning. The inputs to our policy are a sequence of images and part of the drone state (orientation, velocity, acceleration) but no globally-consistent position information.
Our method consists of two key components: privileged policy training and robust feature learning. First, we leverage a two-stage learning-by-cheating framework for policy training. Second, we use contrastive learning and data augmentation to extract robust image embeddings from RGB images.
Furthermore, we compare our vision-based deep sensorimotor policy against a neural control policy that utilizes the full globally-consistent state information [12]. Our experiments, conducted in a realistic simulator [13], show that our vision-based deep sensorimotor policy achieves the same level of racing performance while being robust against different visual disturbances and distractors. Finally, we benchmark the performance of our vision-based policy against the time-optimal trajectory generation algorithm [14], which offers a theoretical minimum time. Our policy achieves lap times close to the time-optimal solution.

II. RELATED WORK
Different approaches have been studied to tackle autonomous drone racing. State-based methods that rely on globally accurate position information have been used extensively. Foehn et al. [14] presented a time-optimal trajectory generation method that jointly optimizes the time allocation and the trajectory. The algorithm enabled them to outperform human experts in drone racing. In [12], [15], [16], the authors used reinforcement learning to train a neural network as the policy. For example, Song et al. [12] utilized the relative positions of the next gates to achieve near-time-optimal performance. Nagami et al. [16] initialized a network by mimicking a simplified controller and further trained it with reinforcement learning. The hierarchy allowed the policy to outperform a trajectory planning policy. Although state-based methods can produce promising results, the assumption of exact position information limits their applicability.
Prior work on vision-based drone racing decouples the perception, planning, and control modules. In the work of Foehn et al. [3], visual-inertial odometry (VIO) was fused with a CNN-based gate corner detection for robust state estimation. A receding horizon path planner generates a time-optimal trajectory using motion primitives based on a point-mass model of the drone platform. However, the point-mass assumption cannot represent the true actuation limits of the drone and may lead to dynamically infeasible trajectories. In [17]-[19], the authors first use data-driven methods to train neural networks that predict the waypoint and the desired speed. Afterward, a minimum-jerk trajectory is planned to pass through the waypoint and is then tracked by a low-level controller. Muller et al. [20] propose to train a neural network for local trajectory planning, in which a downstream control policy is used to track the trajectory and generate low-level commands for vehicle control. The trajectory labeling requires additional engineering effort and can result in ambiguity, as each image can be labeled with different trajectories. The decoupling of the perception, planning, and control modules inevitably involves simplifying assumptions or manual design of parameters, leading to sub-optimality during high-speed flight.
Recent advances in data-driven control [21]-[24] indicate the potential of developing autonomous systems using sensorimotor control, in which a neural network policy can map high-dimensional sensory inputs directly to control commands. However, with naive training, the deep sensorimotor policy might suffer from poor generalization when facing unseen disturbances. Different approaches have been employed to alleviate the overfitting issue, such as data augmentation [11], [25], injecting known biases [26], and extracting invariant information [27]. Most of them are applied to video games [10], [26], robot arm control [25], [28], or autonomous driving [29]. The generalization capability of a neural network policy for minimum-time flight has drawn much less attention due to several challenges, such as the short reaction time available and the rapid change of the image observations.

III. METHODOLOGY
An overview of our method is visualized in Fig. 2. Our approach consists of two key components: policy training and feature learning. The policy training is done using privileged reinforcement learning and imitation learning, where a student policy mimics the action of a teacher policy. To process high-dimensional image data and allow efficient policy training, we use YOLO [30], [31] to extract low-dimensional image embeddings.

A. Policy Training
Teacher Policy Training: The first step is to obtain a state-based teacher policy that can push the vehicle to its maximum performance. We use reinforcement learning to train a multilayer perceptron (MLP) policy $\pi_{\text{teacher}}$ for passing through a sequence of gates. At every time step $t$, the agent is at state $s_t$ and receives information about the gate state $g_t$. Our goal is to find the optimal policy $\pi^*_{\text{teacher}}$ that maximizes the expected discounted return: $\pi^*_{\text{teacher}} = \arg\max_{\pi} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\gamma$ is the discount factor and $r_t$ is the reward at time step $t$. In privileged learning, the teacher policy has access to all ground truth information, including the vehicle's state $s_t$ and the gate state $g_t$. Hence, the teacher policy generates an action $\bar{a}_t \sim \pi_{\text{teacher}}(s_t, g_t)$ given both states. The policy outputs control commands in the form of mass-normalized collective thrust and angular velocity.
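As a small illustration of the quantity the teacher policy maximizes, the sketch below computes the discounted return of one finite episode. The function name `discounted_return` and the example rewards are hypothetical and are not taken from the training code.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite episode.

    The sum is accumulated backwards so that each step costs one
    multiplication and one addition.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Example: three steps of reward 1 with discount factor 0.5.
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

In reinforcement learning the expectation of this quantity over rollouts is what the policy update tries to increase; the sketch only evaluates it for a single recorded episode.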
Fig. 2: Overview of our policy training method. We first train a teacher policy with access to privileged state information using model-free reinforcement learning. This teacher policy is then distilled into a student policy, which is trained to do perception, planning, and control jointly.
The main objective is to minimize the lap time, which is equivalent to maximizing the path progress along the center line connecting two consecutive gates [12]. In addition, we add a perception-aware reward to maximize the visibility of the next gate. The perception-aware reward incentivizes the policy to face the camera toward the next gate to be passed, which is crucial for vision-based flight since our environment is only partially observable when using a camera. We denote the position and velocity of the center of the next gate on the image plane by $p_c$ and $\dot{p}_c$, respectively. The perception-aware reward is formulated in terms of $p_c$ and $\dot{p}_c$ to keep the gate in the image center and reduce the motion blur [32].
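The path-progress objective can be sketched as follows. This is a minimal illustration, assuming that progress is measured by projecting the drone position onto the straight segment between the previous and the next gate center; the helper names and the exact form of the reward are assumptions, and the formulation in [12] may differ in its details.

```python
import math


def progress_along_segment(p, g_prev, g_next):
    """Signed distance (same unit as the inputs) of the projection of p
    onto the line from g_prev to g_next, measured from g_prev."""
    seg = [b - a for a, b in zip(g_prev, g_next)]
    length = math.sqrt(sum(c * c for c in seg))
    rel = [x - a for a, x in zip(g_prev, p)]
    return sum(r * c for r, c in zip(rel, seg)) / length


def progress_reward(p_prev, p_curr, g_prev, g_next):
    """Progress made along the gate-to-gate center line in one time step."""
    return (progress_along_segment(p_curr, g_prev, g_next)
            - progress_along_segment(p_prev, g_prev, g_next))


# Moving from x = 1 to x = 3 between gates on the x-axis yields progress 2,
# regardless of the lateral offset.
print(progress_reward((1.0, 2.0, 0.0), (3.0, 5.0, 0.0),
                      (0.0, 0.0, 0.0), (10.0, 0.0, 0.0)))
```

Summing this per-step reward over a lap rewards covering the track quickly, which is why maximizing it serves as a proxy for minimizing lap time.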

Student Policy Training:
After we obtain a teacher policy $\pi^*_{\text{teacher}}(s_t, g_t)$ that can race the drone optimally, we distill the teacher's knowledge to a student policy $\pi_{\text{student}}(s_t, o_t)$ that does not have access to the privileged information about the environment. Specifically, the student policy can only observe part of the drone state $s_t$, which does not contain the vehicle's global position, and needs to infer the gate information from the camera observation, $o_t \rightarrow g_t$. There are three key components of our student policy: a feature extractor, a memory-based neural network, and a policy network.
We use YOLO [30], [31] as the feature extractor and train it to detect all gates in a given image. We use average pooling to downsample the output of the three convolutional layers in its detection head and concatenate them as the embedding of the image, $z_t$. Since the detection head is the last module of YOLO, this embedding contains all the information for detecting the gates and hence is sufficient for representing the image. Note that we additionally normalize the embedding with l2-normalization, which is empirically found to be beneficial for the convergence of policy training.
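The pooling, concatenation, and normalization steps can be sketched in plain Python as below. This is an illustration only: it assumes global average pooling per channel (the pooling window is not specified here), uses tiny hand-made feature maps in place of real detection-head outputs, and all function names are hypothetical.

```python
import math


def global_avg_pool(fmap):
    """fmap: list of channels, each a list of rows. Returns one mean per channel."""
    pooled = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def image_embedding(feature_maps):
    """Pool each feature map, concatenate the results, and l2-normalize."""
    z = []
    for fmap in feature_maps:
        z.extend(global_avg_pool(fmap))
    return l2_normalize(z)


# Three toy "detection head" outputs with one channel each.
maps = [[[[3.0, 3.0]]], [[[4.0], [4.0]]], [[[0.0]]]]
print(image_embedding(maps))
```

After normalization every embedding has unit length, so the downstream policy sees inputs on a fixed scale regardless of the raw activation magnitudes.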
When using a single camera, the environment becomes partially observable. To address this, we use a temporal convolutional network (TCN) [33] for the policy representation. The embedding from the image is concatenated with the truncated vehicle state $s_t$. The sequence of concatenated embeddings is then fed into the TCN to extract temporal information from the history of observations. Finally, we use an MLP to regress the control command. The MLP takes the output of the TCN as input and produces the student policy's action, which has the same format as that of the teacher policy.
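TCNs are commonly built from causal, dilated 1-D convolutions, so that the output at time t depends only on current and past inputs. The sketch below shows one such layer on a scalar sequence; it is a simplified illustration with hypothetical names and hand-picked weights, not the architecture used for the student policy.

```python
def causal_conv1d(seq, weights, dilation=1):
    """Causal dilated convolution on a scalar sequence.

    y[t] = sum_k weights[k] * seq[t - k * dilation],
    where samples before the start of the sequence count as zero.
    """
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * seq[idx]
        out.append(acc)
    return out


# With dilation 2, each output mixes the current sample with the one two steps back.
print(causal_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], dilation=2))
```

Stacking such layers with growing dilation lets the network cover a long observation history with few parameters, which is the property that makes it suitable for a partially observable setting.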
We use imitation learning to train the student policy. We define an action loss $L_A$ that is the mean squared error between the outputs of the teacher policy and the student policy.
In addition, to better enable the knowledge transfer between the teacher policy and the student policy, we also add a latent loss $L_E$ that supervises the output of the TCN, $e_t$, with the intermediate embedding of the teacher, $\bar{e}_t$. We therefore minimize the total loss for the imitation learning, $L = L_A + \lambda L_E$, where $\lambda$ is a coefficient to weight the latent loss.
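A minimal sketch of the imitation objective is given below. The action loss is the mean squared error stated in the text; the latent loss is assumed here to be a mean squared error as well, and the total is assumed to be the weighted sum of the two terms. The function names and the value of the weight are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def imitation_loss(a_student, a_teacher, e_student, e_teacher, lam):
    """Action loss plus lam times latent loss (latent loss form is an assumption)."""
    loss_action = mse(a_student, a_teacher)   # L_A
    loss_latent = mse(e_student, e_teacher)   # L_E, assumed to be an MSE
    return loss_action + lam * loss_latent


# Four-dimensional action (collective thrust + three body rates), two-dimensional latent.
print(imitation_loss([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0],
                     [2.0, 0.0], [0.0, 0.0], lam=0.5))
```

Setting `lam` to zero recovers plain behavior cloning on actions; larger values push the student's temporal features toward the teacher's intermediate representation.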

B. Robust Feature Learning via Data Augmentation
To learn robust image embeddings against disturbance, we train the encoder with contrastive learning. We use the framework introduced in [34]. The framework (shown in Fig. 3) contains an online network and a target network. The online network, defined by parameters $\phi$, contains three components: an encoder $f_\phi$, a projection $g_\phi$, and a predictor $q_\phi$. The target network is an exponential moving average of the online network and is defined by parameters $\xi$. It comprises two components: a target encoder $f_\xi$ and a target projection $g_\xi$.
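The exponential-moving-average relation between the two networks can be sketched as a per-parameter update. The parameter vectors are plain lists here, and the decay rate `tau` and the function name are hypothetical; the value used in training is not stated in this section.

```python
def ema_update(target, online, tau):
    """Move each target parameter a fraction (1 - tau) toward the online parameter."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]


xi = [0.0, 1.0]    # target parameters
phi = [1.0, 1.0]   # online parameters
print(ema_update(xi, phi, tau=0.5))
```

Because the target parameters change slowly when `tau` is close to 1, the target embeddings provide a stable regression goal for the online network.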
An input image $o$ is passed through two augmentations, denoted as $t$ and $t'$, to obtain two augmented views $v = t(o)$ and $v' = t'(o)$, respectively. Then the embedding prediction $z_\phi(v) = q_\phi(g_\phi(f_\phi(v)))$ is extracted by the online network, while the embedding target $z_\xi(v') = g_\xi(f_\xi(v'))$ is extracted by the target network. A cosine similarity loss is applied to align the embeddings, $\mathcal{L} = 2 - 2 \, \frac{\langle z_\phi(v), z_\xi(v') \rangle}{\| z_\phi(v) \|_2 \, \| z_\xi(v') \|_2}$, where $\| \cdot \|_2$ denotes the l2-norm used for normalization.
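A cosine similarity loss between an online prediction and a target embedding can be sketched as follows. The scaling used here, 2 minus twice the cosine similarity, is one common form in BYOL-style methods; whether [34] uses exactly this scaling is an assumption, and the function name is hypothetical.

```python
import math


def cosine_loss(z_online, z_target):
    """2 - 2 * cosine similarity; 0 for aligned vectors, 4 for opposite ones."""
    dot = sum(a * b for a, b in zip(z_online, z_target))
    norm_online = math.sqrt(sum(a * a for a in z_online))
    norm_target = math.sqrt(sum(b * b for b in z_target))
    return 2.0 - 2.0 * dot / (norm_online * norm_target)


print(cosine_loss([3.0, 4.0], [6.0, 8.0]))   # same direction
print(cosine_loss([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```

The loss depends only on the direction of the two embeddings, not on their magnitude, so minimizing it pulls the embeddings of the two augmented views toward the same direction.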

IV. EXPERIMENTS
We design our experiments to answer the following research questions: 1) Our student policy does not depend on the vehicle's global position and can only observe the environment partially; how does such a vision-based control system compare to a state-based system? 2) Our policy is trained with data augmentation; can the data augmentation align image embeddings given different visual disturbances and distractors, and is the policy robust against those disturbances? 3) Our policy still relies on some part of the vehicle state; how well can the policy handle estimation errors in the drone state?

A. Experimental Setup
Simulator Environment: We conduct experiments using the Flightmare [13] simulator, a realistic quadrotor simulator with various racing tracks and realistic racing environments. We set up three different race tracks (Circle, Figure8, and SplitS) in a warehouse environment (see the visualization in Fig. 4 left). For training the teacher policy, we use a customized implementation of the proximal policy optimization algorithm (PPO) [35] based on the code from [36]. For training the student policy, we implement an imitation learning pipeline. For learning robust feature representations from raw images, we use data augmentation with random convolution and random cutout-color (see Fig. 4 middle and right).
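The two augmentations can be sketched in plain Python on a small image stored as nested lists (height x width x 3, values in [0, 1]). This is a simplified illustration under stated assumptions: the random convolution is a single 3x3 layer with Gaussian weights and zero padding, the cutout is one rectangle filled with a random color, and the size limits and function names are hypothetical.

```python
import random


def random_conv(img, rng, k=3):
    """Filter the image with one randomly initialized k x k convolution layer."""
    h, w = len(img), len(img[0])
    pad = k // 2
    scale = 1.0 / (3 * k * k) ** 0.5
    # weights[c_out][c_in][dy][dx], drawn anew for every call
    weights = [[[[rng.gauss(0.0, scale) for _ in range(k)] for _ in range(k)]
                for _ in range(3)] for _ in range(3)]
    out = [[[0.0] * 3 for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for co in range(3):
                acc = 0.0
                for ci in range(3):
                    for dy in range(k):
                        for dx in range(k):
                            yy, xx = y + dy - pad, x + dx - pad
                            if 0 <= yy < h and 0 <= xx < w:
                                acc += weights[co][ci][dy][dx] * img[yy][xx][ci]
                out[y][x][co] = acc
    return out


def cutout_color(img, rng, min_size=1, max_size=3):
    """Paste one randomly placed rectangle of a random color onto a copy of img."""
    h, w = len(img), len(img[0])
    box_h = rng.randint(min_size, min(max_size, h))
    box_w = rng.randint(min_size, min(max_size, w))
    y0 = rng.randint(0, h - box_h)
    x0 = rng.randint(0, w - box_w)
    color = [rng.random() for _ in range(3)]
    out = [[list(px) for px in row] for row in img]
    for y in range(y0, y0 + box_h):
        for x in range(x0, x0 + box_w):
            out[y][x] = list(color)
    return out


rng = random.Random(0)
image = [[[0.5, 0.5, 0.5] for _ in range(5)] for _ in range(4)]
view_a = random_conv(image, rng)
view_b = cutout_color(image, rng)
```

Random convolution changes colors and textures globally, while cutout-color occludes a local region; both return a new image and leave the input untouched, so two different views of the same frame can be generated for contrastive training.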
Evaluation: To evaluate our policy, we roll out 10 episodes, with the quadrotor starting from different positions, which are sampled from a uniform distribution between -0.1 m and 0.1 m along each of the x, y, and z axes. We evaluate the performance of our policy using two different metrics: Lap Time and Success Rate. The lap time indicates the racing performance of our policy, while the success rate indicates the robustness of the policy. We report the lap time as the time required by the policy to finish one complete track, and we calculate the success rate as the fraction of the ten rollouts in which the quadrotor finishes one full lap without crashing.
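The evaluation protocol can be sketched as below. The sketch assumes that a crashed rollout is recorded as `None` and that the reported lap time is averaged over successful rollouts only; both conventions, as well as the function names and the example numbers, are assumptions for illustration.

```python
import random


def sample_start_offset(rng, bound=0.1):
    """Uniform offset in meters along each of the x, y, and z axes."""
    return tuple(rng.uniform(-bound, bound) for _ in range(3))


def summarize_rollouts(lap_times):
    """lap_times: one entry per rollout, None for a crash.

    Returns (success rate, mean lap time over successful rollouts or None).
    """
    finished = [t for t in lap_times if t is not None]
    success_rate = len(finished) / len(lap_times)
    mean_lap = sum(finished) / len(finished) if finished else None
    return success_rate, mean_lap


rng = random.Random(0)
starts = [sample_start_offset(rng) for _ in range(10)]
# Made-up lap times for illustration only; None marks a crash.
print(summarize_rollouts([5.0, 5.0, None, 5.0]))
```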

B. Baseline Comparisons
We compare our vision-based policies against two baselines: a state-based learning-based policy [12] and a time-optimal trajectory [14]. The state-based policy controls the drone using ground truth information about the drone state, including position, velocity, orientation, and acceleration, as well as the pose of the next two gates. The time-optimal trajectory serves as the theoretical minimum bound for our platform. The student policy does not have access to ground truth information about the drone's position and the gate poses. Instead, it uses a camera to capture RGB images and controls the drone directly using the image. Therefore, the student policy can only observe the environment partially, similar to how human pilots control the drone using the first-person-view camera. The result is shown in Table I. A visualization of the trajectories is given in Fig. 5. Both policies achieve strong performance on three different race tracks with high success rates. The student policy learns to cut corners, resulting in lower lap times but riskier behavior.
In reality, the vehicle states are prone to error due to the drift in state estimation and measurement errors. Despite impressive results in visual-(inertial) odometry in recent years, high-speed flight with six-degrees-of-freedom motion remains challenging for existing estimation algorithms [37]. Hence, the state-based control system is subject to failure since the policy relies heavily on position estimation. We investigate this problem using a simulated VIO pipeline, in which we simulate position drift. Fig. 6 shows how position drift affects the performance of the state-based policy. Given perfect state information, the policy achieves a 100% success rate on all tracks. However, as we increase the drift in position, the success rates collapse quickly. The VIO drift can be alleviated by relocalizing with respect to the gates, but this is challenging because the camera suffers from motion blur and a limited field of view. On the other hand, our vision-based policy is not affected by the drift in position since it does not rely on that information.

C. Handling Visual Disturbances and Unseen Distractors
We deploy our vision-based system in various unseen contexts to investigate how it performs against unseen visual disturbances, such as environments with color changes, brightness changes, and environments with many randomly arranged unseen objects. We darken the environment by lowering the brightness value from 1 to 0.8 and 0.5, and we also change environment colors by tuning the image hue value from 0 to both 0.5 and -0.5. Fig. 1 left and middle-left provide examples of environments with a brightness value of 0.5 and a hue value of 0.5, respectively. In addition, we also place some visual distractors randomly around the environment, including blue boxes that are similar in color and shape to the racing gates (see visualization in Fig. 1 middle-right), and some random objects with irregular shapes (see visualization in Fig. 1 right).
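Brightness and hue changes of this kind can be illustrated on a single RGB pixel with the standard-library `colorsys` module. This is only a sketch: how the simulator defines its brightness and hue parameters is not specified here, so the multiplicative brightness factor and the wrap-around hue shift below are assumptions, and `colorsys` expresses hue in [0, 1), which may not match the simulator's convention.

```python
import colorsys


def adjust_brightness(pixel, factor):
    """Scale each RGB channel (values in [0, 1]) by factor, clipping at 1."""
    return tuple(min(1.0, c * factor) for c in pixel)


def shift_hue(pixel, delta):
    """Rotate the hue of an RGB pixel by delta, with hue wrapping around in [0, 1)."""
    h, s, v = colorsys.rgb_to_hsv(*pixel)
    return colorsys.hsv_to_rgb((h + delta) % 1.0, s, v)


print(adjust_brightness((1.0, 0.5, 0.0), 0.5))   # darker orange
print(shift_hue((1.0, 0.0, 0.0), 0.5))           # red becomes cyan
```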
As presented in Table II, our system is robust against various types of visual disturbances while still maintaining a high success rate and comparable lap time on all three racing tracks, which demonstrates the effectiveness of our image feature learning mechanism. In the following section, we further investigate how the image encoder trained using contrastive learning can generalize to these visual disturbances.

D. Aligning Image Embeddings
To ensure robust feature extraction, we use contrastive learning (Section III-B). In the contrastive learning framework, the similarity loss ensures that the encoder learns the invariance between the two augmented views. As a result, the image embeddings of the augmented views are aligned in the embedding space. We choose random convolution as the augmentation for hue changes and brightness changes, and we use random cutout-color against distractors, such as blue boxes and random objects. In Fig. 7, we present the qualitative results of aligning image embeddings between augmentations and disturbances. For each of the three tracks, we collect the images along the flight trajectory of the teacher policy with either augmentations or disturbances. For each of the trajectories, we extract the image embeddings with the YOLO encoder and reduce the dimension to 2 with t-distributed stochastic neighbor embedding (t-SNE). The embeddings of each test-time disturbance are then compared with those of the corresponding augmentation used during training. We can observe that the image embeddings of all the disturbances are well aligned with those of the corresponding augmentations. This alignment ensures that our policy receives matching image embeddings at test time and behaves robustly. Thus, our policy still maintains a high success rate under all the disturbances (Table II).

E. Handling Noisy State
Our policy relies on part of the drone state, including the orientation, linear velocity, and acceleration. The state information can be estimated using measurements from onboard sensors, such as IMUs, which are usually noisy. We further investigate the robustness of our policy against disturbances on the truncated states by adding Gaussian noise $\mathcal{N}(0, \text{std})$ individually to each component of the states, where std is the standard deviation. Table III shows the result. We can observe that the success rate decreases only mildly as the standard deviation increases, which indicates that our policy is robust against noise from the sensor measurements.
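The noise injection can be sketched as follows: independent zero-mean Gaussian noise with a chosen standard deviation is added to every component of the truncated state. The function name, the example state layout, and the numerical values are hypothetical.

```python
import random


def add_state_noise(state, std, rng):
    """Add independent N(0, std) noise to each component of the state vector."""
    return [x + rng.gauss(0.0, std) for x in state]


rng = random.Random(0)
# Hypothetical truncated state: a few orientation, velocity, and acceleration entries.
clean_state = [0.0, 0.1, -0.2, 5.0, 0.0, 9.81]
noisy_state = add_state_noise(clean_state, std=0.1, rng=rng)
print(noisy_state)
```

Sweeping `std` over a range of values and repeating the rollouts for each value yields a success-rate curve of the kind summarized in Table III.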

V. DISCUSSION AND CONCLUSION
This work presented a method to learn deep sensorimotor policies for vision-based autonomous drone racing. We showed that a vision-based control policy can predict control commands from information extracted from images without explicit position estimation, trajectory planning, or tracking. The vision-based policy can achieve the same level of racing performance as the state-based policy while being robust against different visual disturbances and distractors. On the other hand, a state-based control policy is sensitive to position errors in state estimation. The key to achieving robust sensorimotor control is to learn well-aligned image embeddings using contrastive learning and data augmentation. These findings suggest that deep sensorimotor control has the potential for vision-based agile drone flight and merits further investigation.
A major limitation of the presented work is the lack of real-world experiments to demonstrate the effectiveness and robustness of our vision-based policy. The deployment of the student policy on a real drone still requires further research on transfer learning or adaptive learning. Although it relaxes the need for globally-consistent position information about the drone and the gates, the student policy still relies on part of the vehicle's state to predict the control commands. We plan to tackle this in the near future by using memory-based policy representations, such as RNNs, to learn hidden state representations from a history of images alone. We believe our study is a stepping stone towards this goal.

Fig. 7: A time-lapse t-SNE visualization of image embeddings used by our policy. We collect images along the flying trajectory on three different tracks. The blue dots represent image embeddings from augmentations during training time and the orange dots represent image embeddings from test-time disturbances. A: Hue change. B: Brightness change. C: Blue boxes. D: Random Objects.

TABLE III: Success rates of the student policy when adding Gaussian noise to the drone states.

TABLE II: Success rate