AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild

Xiaolou Sun1,2*, Wufei Si2*, Wenhui Ni1,2*, Yuntian Li2*, Dongming Wu3, Fei Xie6, Runwei Guan4†, He-Yang Xu1, Henghui Ding5†, Yuan Wu2, Yutao Yue4,
Yongming Huang1,2†, Hui Xiong4
1Southeast University, 2Purple Mountain Labs, 3MMLab, The Chinese University of Hong Kong, 4The Hong Kong University of Science and Technology (Guangzhou), 5Fudan University, 6Shanghai Jiao Tong University
ICLR 2026
*Equal Contribution   †Corresponding Author
AutoFly Teaser Figure

Analysis of previous methods and our AutoFly. Left: Previous methods rely on detailed, step-by-step instructions that specify predetermined flight paths with explicit waypoints and maneuvers. Right: Our AutoFly performs autonomous navigation with concise natural language instructions and coarse positional or directional information.

Abstract

Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to autonomously navigate through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that effectively aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation, stemming from their emphasis on explicit instruction-following over autonomous decision-making and from insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through: (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows; (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate than state-of-the-art VLA baselines, with consistent performance across simulated and real environments.
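As a rough illustration of the pseudo-depth idea, the sketch below shows one way a depth branch might derive depth-aware tokens from RGB and fuse them with the backbone's visual tokens; the module and parameter names (PseudoDepthEncoder, depth_predictor, feat_dim) are hypothetical and do not reflect AutoFly's actual implementation.

# Minimal sketch (not the paper's code): a pseudo-depth branch that derives
# depth-aware features from RGB and fuses them with the visual tokens.
import torch
import torch.nn as nn

class PseudoDepthEncoder(nn.Module):
    """Hypothetical pseudo-depth encoder: RGB -> estimated depth -> depth-aware tokens."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Stand-in for a frozen monocular depth predictor.
        self.depth_predictor = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, stride=2, padding=1),   # 1-channel pseudo-depth map
        )
        # Lightweight encoder that turns the pseudo-depth map into feature tokens.
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1),
        )
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)    # fuse RGB tokens with depth features

    def forward(self, rgb: torch.Tensor, rgb_feats: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W); rgb_feats: (B, N, feat_dim) visual tokens from the VLM backbone.
        with torch.no_grad():                            # depth prediction kept frozen
            depth = self.depth_predictor(rgb)
        d = self.depth_encoder(depth)                    # (B, feat_dim, h, w)
        d = d.flatten(2).transpose(1, 2)                 # (B, h*w, feat_dim) depth tokens
        d = d.mean(dim=1, keepdim=True).expand_as(rgb_feats)  # broadcast to the visual tokens
        return self.fuse(torch.cat([rgb_feats, d], dim=-1))   # depth-aware visual tokens

The point of the sketch is only the RGB-to-depth-to-token data flow; a real system would typically plug in a stronger pretrained monocular depth model for the prediction step.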

Video Presentation

We have built a complete VLA model-based pipeline for UAV autonomous navigation, covering the entire workflow from dataset collection and supervised fine-tuning of the VLA model to model acceleration and physical UAV deployment, and we have conducted flight tests across a variety of simulated and real-world scenarios.

Autonomous Navigation Dataset


We construct 12 diverse simulated environments using AirSim for training and evaluation, totaling over 13K episodes and 2.5M image-language-action triplets. For object recognition tasks, we strategically position 60 carefully selected object instances at environment boundaries, with each scenario containing 3-5 distractor objects to challenge the model's recognition and reasoning capabilities.
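For concreteness, an episode in such a dataset might be organized as shown below; the field names and action vocabulary are illustrative assumptions rather than the released schema.

# Illustrative schema (field names and labels are assumptions, not the released format)
# for one step of an episode: an RGB observation, the instruction, and the action label.
from dataclasses import dataclass
from typing import List

@dataclass
class NavStep:
    rgb_path: str          # path to the RGB frame for this step
    instruction: str       # concise natural-language instruction for the episode
    action: str            # high-level action label, e.g. "forward", "turn_left", "ascend"

@dataclass
class NavEpisode:
    episode_id: str
    environment: str       # one of the 12 AirSim environments
    target_object: str     # object instance placed at the environment boundary
    steps: List[NavStep]   # 13K episodes and 2.5M triplets imply roughly 190 steps per episode

# Example record (all values are illustrative only):
episode = NavEpisode(
    episode_id="town_0001",
    environment="cluttered_town",
    target_object="red_mailbox",
    steps=[NavStep("frames/000000.png", "Fly northeast and find the red mailbox.", "forward")],
)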

VLA Model for Autonomous Navigation


Framework of AutoFly. AutoFly takes RGB observations and linguistic instructions as inputs and directly outputs high-level actions. These actions, combined with initial actions derived from coarse-grained positional or directional information, form action sequences.
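The closed loop described above can be summarized by the following sketch, where initial_actions_from_goal and model.predict_action are hypothetical interfaces standing in for the coarse-guidance prefix and the VLA policy, respectively.

# Minimal sketch of the closed-loop usage described above; the interfaces are
# hypothetical and do not correspond to AutoFly's actual API.
from typing import List

def initial_actions_from_goal(bearing_deg: float) -> List[str]:
    """Turn a coarse direction into a short prefix of actions (illustrative heuristic)."""
    if bearing_deg > 15:
        return ["turn_right"]
    if bearing_deg < -15:
        return ["turn_left"]
    return ["forward"]

def navigate(model, camera, instruction: str, bearing_deg: float, max_steps: int = 500):
    # Seed the sequence with actions derived from coarse positional/directional information.
    actions = initial_actions_from_goal(bearing_deg)
    for _ in range(max_steps):
        rgb = camera.read()                              # current RGB observation
        action = model.predict_action(rgb, instruction)  # high-level action from the VLA model
        actions.append(action)
        if action == "stop":                             # policy signals that the target is reached
            break
    return actions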

Model Acceleration and Deployment


Our model runs on a remote server and communicates with the UAV over a local area network (LAN); the system employs multi-process parallel inference.
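A simplified version of this server-side pattern is sketched below; the queue-based worker layout is an assumption for illustration, and the LAN transport (frame upload, action download) is omitted.

# Rough sketch of server-side multi-process parallel inference; not the actual system.
import multiprocessing as mp

def inference_worker(model_factory, requests: mp.Queue, results: mp.Queue):
    model = model_factory()      # each worker process loads its own copy of the accelerated model
    while True:
        frame_id, rgb, instruction = requests.get()
        results.put((frame_id, model.predict_action(rgb, instruction)))

def serve(model_factory, num_workers: int = 2):
    ctx = mp.get_context("spawn")
    requests, results = ctx.Queue(), ctx.Queue()
    workers = [ctx.Process(target=inference_worker,
                           args=(model_factory, requests, results), daemon=True)
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    # The main process would receive frames from the UAV over the LAN, enqueue them
    # on `requests`, and send the predicted actions from `results` back to the UAV.
    return requests, results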

Performance Visualization of AutoFly

Real Outdoor Structured Environment.

Real Indoor Unstructured Environment.

Cluttered Cylinder Scene.

Dynamic and Cluttered Cylinder Scene.

Dense Forest Scene.

Dense Stone Scene.

Cluttered Town Scene.

Cluttered Town Scene with Target Distractors.

Obstacle Avoidance Task Only.

Recognition Task Only.

BibTeX

@inproceedings{sun2026autofly,
  title={AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild},
  author={Sun, Xiaolou and Si, Wufei and Ni, Wenhui and Li, Yuntian and Wu, Dongming and Xie, Fei and Guan, Runwei and Xu, He-Yang and Ding, Henghui and Wu, Yuan and Yue, Yutao and Huang, Yongming and Xiong, Hui},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}