Abstract
Object detection in oceanic scenes from Unmanned Aerial Vehicles (UAVs) is critical for maritime surveillance and Search-and-Rescue (SAR), yet existing benchmarks are predominantly land-centric and lack both category diversity and precise oriented annotations for maritime targets. To bridge this gap, we introduce UAV-Ocean, a large-scale maritime benchmark comprising 11,317 images and over 50,000 Oriented Bounding Box (OBB) annotations, covering 45 fine-grained categories across diverse altitudes, lighting conditions, and sea states. We further propose a Hierarchical Dynamic Aggregation Feature Pyramid Network (HDA-FPN)\footnote{The dataset and code will be available at https://github.com/INDTLab/HDA-FPN upon acceptance of the paper.} to address the severe scale variations and semantic misalignment inherent in multi-scale fusion. Specifically, HDA-FPN integrates a Cross-Scale Dynamic Fusion (CSDF) module, which leverages inter-level feature differences for adaptive spatial competition, and a Mid-level Guided Feature Aggregation (MGFA) module, which redistributes contextual information centered on intermediate semantics. Building on HDA-FPN, we develop a variant of YOLOv11, named YOLO-HDA-FPN. Experiments on the UAV-Ocean dataset show that YOLO-HDA-FPN achieves an $\mathrm{mAP}_{50}$ of 65.7\%, outperforming the YOLOv11 baseline by 3.6\% while maintaining a real-time inference speed of 51.0 FPS. This work provides both a large-scale benchmark and an effective solution for challenging maritime detection tasks.