SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
Abstract
SocialOmni is a benchmark for evaluating social interactivity in omni-modal large language models across speaker identification, interruption timing, and natural interruption generation, revealing gaps between perceptual accuracy and conversational competence.
Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity: the fundamental capacity to navigate dynamic cues in natural dialogue. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni comprises 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios that probe model robustness. We benchmark 12 leading OLMs and uncover significant variance in their social-interaction capabilities. Our analysis further reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone cannot characterize conversational social competence. Encouragingly, the diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
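The three dimensions can be pictured as per-instance records. The sketch below is purely illustrative: the field names, label types, and constraint format are assumptions for exposition, not the released schema, but they show how a perception sample and an interaction-generation instance with temporal and contextual constraints might be organized.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical record layouts for the three SocialOmni dimensions.
# All field names and types are illustrative assumptions, not the released schema.

@dataclass
class PerceptionSample:
    """Who-is-speaking: speaker separation and identification."""
    clip_id: str
    audio_path: str
    video_path: str
    question: str                  # e.g. "Which visible person is speaking at 00:12?"
    choices: List[str]
    answer_index: int
    av_inconsistent: bool = False  # controlled audio-visual mismatch for robustness tests

@dataclass
class InteractionInstance:
    """When/how-to-interrupt: timing control and natural interruption generation."""
    clip_id: str
    dialogue_context: str                    # transcript preceding the candidate interruption point
    interrupt_window: Tuple[float, float]    # (start_s, end_s) where an interjection is acceptable
    contextual_constraints: List[str]        # e.g. "must acknowledge speaker A's unfinished point"
    reference_interruptions: List[str] = field(default_factory=list)

def within_window(t: float, inst: InteractionInstance) -> bool:
    """Check whether a model's proposed interruption time falls inside the allowed window."""
    start, end = inst.interrupt_window
    return start <= t <= end
```

Separating timing (a scalar decision checked against a window) from phrasing (free-form text judged against constraints and references) mirrors the benchmark's distinction between when to interject and how to phrase the interruption.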
Community
A new omni-model benchmark for social interaction.
🔗Github: github.com/MAC-AutoML/SocialOmni
🔗Dataset: huggingface.co/datasets/alexisty/SocialOmni
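A minimal loading sketch, assuming the dataset is accessible through the Hugging Face `datasets` library; the configuration and split names are not documented on this page, so the snippet loads the default configuration and simply inspects what comes back.

```python
from datasets import load_dataset

# Minimal sketch: pull the benchmark from the Hub and inspect its structure.
# Split/config names are assumptions not stated here; check the dataset card.
ds = load_dataset("alexisty/SocialOmni")
print(ds)                            # available splits
first_split = next(iter(ds))
print(ds[first_split].features)      # column schema of the first split
```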
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MTAVG-Bench: A Comprehensive Benchmark for Evaluating Multi-Talker Dialogue-Centric Audio-Video Generation (2026)
- MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos (2026)
- FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs (2026)
- PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios (2026)
- Hello-Chat: Towards Realistic Social Audio Interactions (2026)
- LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks (2026)
- D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning (2026)