Abstract
Multi-modal large language models struggle to jointly understand audio and visual signals in egocentric videos, but a new scalable data engine and dataset significantly improve their performance through targeted fine-tuning.
Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, because text labels with coherent joint-modality information are hard to obtain, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly exposes the limitations of existing MLLMs: they are heavily biased toward visual signals, often neglecting audio cues or failing to match audio with its visual source. Fine-tuning MLLMs on EgoAVU-Instruct effectively addresses this issue, yielding up to a 113% performance improvement on EgoAVU-Bench. These benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, with relative gains of up to 28%. Code will be released to the community.
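The abstract describes the data engine only at a high level, so the Python sketch below is purely illustrative: every name in it (`Clip`, `enrich_with_context`, `generate_av_narration`, `passes_token_filter`, `curate`) is a hypothetical stand-in for the four stages the abstract mentions (multimodal context enrichment, cross-modal narration generation, token-based filtering, and graph-based curation), not the authors' implementation.

```python
# Illustrative sketch only. The paper releases no code here, so all functions and
# data structures below are hypothetical placeholders for the pipeline stages
# named in the abstract.

from dataclasses import dataclass, field


@dataclass
class Clip:
    """A toy egocentric clip record with a human narration and modality context."""
    clip_id: str
    narration: str                                             # original human narration
    visual_context: list[str] = field(default_factory=list)    # e.g. detected objects/actions
    audio_context: list[str] = field(default_factory=list)     # e.g. tagged sound events
    av_narration: str = ""                                     # joint audio-visual narration


def enrich_with_context(clip: Clip) -> Clip:
    """Stage 1 (hypothetical): attach multimodal context to the human narration."""
    # A real engine would call visual and audio taggers or captioners here.
    clip.visual_context = clip.visual_context or ["<visual tags>"]
    clip.audio_context = clip.audio_context or ["<audio tags>"]
    return clip


def generate_av_narration(clip: Clip) -> Clip:
    """Stage 2 (hypothetical): fuse audio and visual context into one narration."""
    # Stand-in for cross-modal correlation modeling with an LLM.
    clip.av_narration = (
        f"{clip.narration} | visual: {', '.join(clip.visual_context)} "
        f"| audio: {', '.join(clip.audio_context)}"
    )
    return clip


def passes_token_filter(clip: Clip, min_tokens: int = 8) -> bool:
    """Stage 3 (hypothetical): drop clips whose narrations are too short to be useful."""
    return len(clip.av_narration.split()) >= min_tokens


def curate(clips: list[Clip]) -> list[Clip]:
    """Stage 4 (hypothetical): simple deduplication standing in for graph-based curation."""
    seen, kept = set(), []
    for clip in clips:
        if clip.av_narration not in seen:
            seen.add(clip.av_narration)
            kept.append(clip)
    return kept


if __name__ == "__main__":
    raw = [Clip("c1", "The person chops vegetables", audio_context=["knife hitting board"])]
    processed = [generate_av_narration(enrich_with_context(c)) for c in raw]
    dataset = curate([c for c in processed if passes_token_filter(c)])
    print(dataset[0].av_narration)
```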
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation (2025)
- FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs (2026)
- OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention (2026)
- PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence (2025)
- Active Perception Agent for Omnimodal Audio-Video Understanding (2025)
- VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning (2026)
- A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos (2025)