DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen, An-Zi Yen
College of Artificial Intelligence, National Yang Ming Chiao Tung University
Institute of Population Health Sciences, National Health Research Institutes
Department of Computer Science, National Yang Ming Chiao Tung University
ICASSP 2026
DaMO main result

Qualitative comparison on temporal reasoning in video-grounded QA. Given a temporal question grounded in a video clip, DaMO generates a more precise and temporally aligned response than Video-LLaMA and VTimeLLM, showcasing superior temporal understanding.

Abstract

Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision.

We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details.

We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with LLM-generated temporally grounded QA pairs for tasks requiring temporal supervision.

Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.
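To make the dual-stream idea above concrete, the following is a minimal PyTorch-style sketch of a hierarchical two-stream fusion stack: each level applies per-modality temporal self-attention, exchanges information across the visual and audio streams, and downsamples along time before the next level. All class names, dimensions, and the pooling/fusion choices are illustrative assumptions, not the authors' Temporal-aware Fuseformer implementation.

```python
# Hypothetical sketch of hierarchical dual-stream temporal fusion.
# Module names and design choices are assumptions for illustration only.
import torch
import torch.nn as nn


class DualStreamFusionLayer(nn.Module):
    """One hierarchy level: per-modality temporal self-attention, cross-modal
    attention between visual and audio tokens, then temporal downsampling."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.vis_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.aud_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_from_aud = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.aud_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)  # halve temporal length

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis, aud: (batch, time, dim) token sequences for each modality
        vis = vis + self.vis_self(vis, vis, vis, need_weights=False)[0]
        aud = aud + self.aud_self(aud, aud, aud, need_weights=False)[0]
        # Exchange complementary information across the two streams.
        vis = self.norm_v(vis + self.vis_from_aud(vis, aud, aud, need_weights=False)[0])
        aud = self.norm_a(aud + self.aud_from_vis(aud, vis, vis, need_weights=False)[0])
        # Downsample along time so the next level captures coarser dynamics.
        vis = self.pool(vis.transpose(1, 2)).transpose(1, 2)
        aud = self.pool(aud.transpose(1, 2)).transpose(1, 2)
        return vis, aud


class HierarchicalFuser(nn.Module):
    """Stack of fusion levels; in a full system the fused tokens would be
    projected into the LLM's embedding space."""

    def __init__(self, dim: int = 768, levels: int = 3):
        super().__init__()
        self.levels = nn.ModuleList(DualStreamFusionLayer(dim) for _ in range(levels))

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        for layer in self.levels:
            vis, aud = layer(vis, aud)
        return torch.cat([vis, aud], dim=1)  # fused multimodal tokens


if __name__ == "__main__":
    fuser = HierarchicalFuser(dim=768, levels=3)
    video_tokens = torch.randn(1, 64, 768)  # e.g. 64 sampled frame features
    audio_tokens = torch.randn(1, 64, 768)  # time-aligned audio features
    print(fuser(video_tokens, audio_tokens).shape)  # torch.Size([1, 16, 768])
```

The progressive temporal pooling is one plausible way to realize "progressively captures temporal dynamics": early levels attend over fine-grained frames, while later levels operate on shorter, coarser sequences that summarize longer spans.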

Framework overview

Framework of the proposed method, DaMO.

Architecture of T-Fuseformer

Architecture of T-Fuseformer.

BibTeX

@misc{chiu2025damodataefficientmultimodalorchestrator,
      title={DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs}, 
      author={Bo-Cheng Chiu and Jen-Jee Chen and Yu-Chee Tseng and Feng-Chi Chen and An-Zi Yen},
      year={2025},
      eprint={2506.11558},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.11558}, 
}