Abstract
Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to navigate autonomously through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that effectively aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation, stemming from their emphasis on explicit instruction-following rather than autonomous decision-making and from a lack of real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through: (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows; (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate than state-of-the-art VLA baselines, with consistent performance across simulated and real environments.
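As a rough illustration of the pseudo-depth idea, the PyTorch sketch below predicts a coarse depth map from RGB with a small convolutional head, encodes it, and concatenates the result with standard visual features before the action policy. The module names, layer sizes, and concatenation-based fusion are placeholder assumptions for exposition, not the actual AutoFly architecture.

# Illustrative sketch only (assumed modules and shapes, not the AutoFly release):
# derive a pseudo-depth map from RGB, encode it, and fuse it with RGB features.
import torch
import torch.nn as nn


class PseudoDepthEncoder(nn.Module):
    """Predicts a coarse pseudo-depth map from RGB, then encodes it into a feature vector."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Lightweight RGB -> pseudo-depth head (stand-in for a monocular depth estimator).
        self.depth_head = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),  # single-channel pseudo-depth at 1/4 resolution
        )
        # Encode the pseudo-depth map into a depth-aware feature vector.
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        pseudo_depth = self.depth_head(rgb)      # (B, 1, H/4, W/4)
        return self.depth_encoder(pseudo_depth)  # (B, feat_dim)


class FusedObservationEncoder(nn.Module):
    """Concatenates RGB and depth-aware features before they reach the action policy."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.depth_branch = PseudoDepthEncoder(feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.rgb_encoder(rgb), self.depth_branch(rgb)], dim=-1)
        return self.fuse(fused)


if __name__ == "__main__":
    obs = torch.randn(2, 3, 224, 224)            # a batch of RGB observations
    print(FusedObservationEncoder()(obs).shape)  # torch.Size([2, 256])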
Video Presentation
We have constructed a complete VLA-model-based pipeline for autonomous UAV navigation, covering the entire workflow of dataset collection, supervised fine-tuning of the VLA model, model acceleration, and physical UAV deployment, and have conducted flight tests across a variety of simulated and real-world scenarios.
Autonomous Navigation Dataset
We construct 12 diverse simulated environments using AirSim for training and evaluation, totaling over 13K episodes and 2.5M image-language-action triplets. For object recognition tasks, we strategically position 60 carefully selected object instances at environment boundaries, with each scenario containing 3-5 distractor objects to challenge the model's recognition and reasoning capabilities.
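To make the episode format concrete, the sketch below shows one plausible way to store image-language-action triplets grouped by episode. The field names, action labels, and example values are hypothetical placeholders and do not represent the released dataset schema.

# Illustrative sketch only: a possible schema for image-language-action triplets.
# Field names, the action vocabulary, and the example values are assumptions.
from dataclasses import dataclass, field
from typing import List

# Hypothetical high-level action vocabulary for a UAV agent.
ACTIONS = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "ASCEND", "DESCEND", "STOP"]


@dataclass
class NavigationStep:
    rgb_path: str     # path to the RGB frame captured at this step
    instruction: str  # coarse-grained positional or directional guidance
    action: str       # expert action label drawn from ACTIONS


@dataclass
class NavigationEpisode:
    episode_id: str
    environment: str                                  # e.g. one of the simulated scenes
    steps: List[NavigationStep] = field(default_factory=list)


# Example episode with two steps (hypothetical paths and labels).
episode = NavigationEpisode(
    episode_id="forest_0001",
    environment="dense_forest",
    steps=[
        NavigationStep("frames/0000.png", "Head north toward the clearing.", "MOVE_FORWARD"),
        NavigationStep("frames/0001.png", "Head north toward the clearing.", "TURN_LEFT"),
    ],
)
print(len(episode.steps), "steps in", episode.environment)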
VLA Model for Autonomous Navigation
Framework of AutoFly. AutoFly takes RGB observations and linguistic instructions as inputs and directly outputs high-level actions. These actions, combined with initial actions derived from coarse-grained positional or directional information, form action sequences.
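The sketch below shows how such a sequence could be assembled: yaw actions derived from a coarse target bearing are prepended, then the policy is queried in closed loop until it outputs a stop action. The predict_action callable, the action names, and the bearing-to-turn conversion are illustrative assumptions rather than AutoFly's actual interface.

# Minimal sketch under assumptions: prepend initial actions derived from
# coarse-grained directional guidance, then append the policy's high-level
# actions step by step. `predict_action` is a hypothetical stand-in for AutoFly.
from typing import Callable, List


def initial_actions_from_bearing(bearing_deg: float, turn_step_deg: float = 30.0) -> List[str]:
    """Convert a coarse target bearing into a short sequence of yaw actions."""
    turns = round(bearing_deg / turn_step_deg)
    action = "TURN_RIGHT" if turns > 0 else "TURN_LEFT"
    return [action] * abs(int(turns))


def roll_out(predict_action: Callable[[object, str], str],
             get_observation: Callable[[], object],
             instruction: str,
             bearing_deg: float,
             max_steps: int = 50) -> List[str]:
    """Build the full action sequence: initial heading alignment, then closed-loop policy steps."""
    sequence = initial_actions_from_bearing(bearing_deg)
    for _ in range(max_steps):
        action = predict_action(get_observation(), instruction)  # RGB + instruction -> action
        sequence.append(action)
        if action == "STOP":
            break
    return sequence


if __name__ == "__main__":
    # Toy usage with dummy stand-ins for the policy and the onboard camera.
    dummy_policy = lambda obs, instr: "STOP"
    dummy_camera = lambda: None
    print(roll_out(dummy_policy, dummy_camera, "Fly toward the tower.", bearing_deg=65.0))
    # ['TURN_RIGHT', 'TURN_RIGHT', 'STOP']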
Model Acceleration and Deployment
Our model is deployed on a remote server and communicates with the robot over a local area network (LAN). The system performs inference in parallel across multiple processes.
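A minimal sketch of this pattern is given below: the robot-side client streams JSON observations to the server over a LAN socket while inference runs in a separate process. The host and port, the JSON-lines message format, and the infer stub are assumptions for illustration and do not reflect the deployed system.

# Illustrative sketch only: robot-side client talks to a remote inference
# server over a LAN socket, and inference runs in a separate process.
# HOST/PORT, the JSON-lines message format, and `infer` are assumed placeholders.
import json
import multiprocessing as mp
import socket
import time

HOST, PORT = "127.0.0.1", 50007  # replace with the inference server's LAN address


def infer(observation: dict) -> str:
    """Stand-in for the accelerated VLA model; returns a high-level action."""
    return "MOVE_FORWARD"


def handle_connection(conn: socket.socket) -> None:
    """Server side: read one JSON observation per line, reply with an action."""
    with conn, conn.makefile("rwb") as stream:
        for line in stream:
            action = infer(json.loads(line))
            stream.write((json.dumps({"action": action}) + "\n").encode())
            stream.flush()


def serve() -> None:
    """Accept a robot connection and run inference for it in this worker process."""
    with socket.create_server((HOST, PORT)) as server:
        conn, _ = server.accept()
        handle_connection(conn)


if __name__ == "__main__":
    server_proc = mp.Process(target=serve, daemon=True)
    server_proc.start()
    time.sleep(0.5)  # give the server process a moment to start listening

    # Robot-side client: send one observation, block until the action arrives.
    with socket.create_connection((HOST, PORT)) as sock, sock.makefile("rwb") as stream:
        stream.write((json.dumps({"frame_id": 0, "instruction": "avoid obstacles"}) + "\n").encode())
        stream.flush()
        print(json.loads(stream.readline()))  # {'action': 'MOVE_FORWARD'}
    server_proc.join(timeout=1.0)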
Performance Visualization of AutoFly
Real Outdoor Structured Environment.
Real Indoor Unstructured Environment.
Cluttered Cylinder Scene.
Dynamic and Cluttered Cylinder Scene.
Dense Forest Scene.
Dense Stone Scene.
Cluttered Town Scene.
Cluttered Town Scene with Target Distractors.
Obstacle Avoidance Task Only.
Recognition Task Only.
BibTeX
@article{YourPaperKey2024,
  title   = {Your Paper Title Here},
  author  = {First Author and Second Author and Third Author},
  journal = {Conference/Journal Name},
  year    = {2024},
  url     = {https://your-domain.com/your-project-page}
}