
Alibaba's Qwen 3.5-Omni Pushes Open-Source Multimodal AI to New Heights
Alibaba releases Qwen 3.5-Omni, a native omnimodal model processing 10+ hours of audio and 400 seconds of video, advancing the open-source multimodal frontier.
Alibaba has released Qwen 3.5-Omni, a native omnimodal large language model that pushes the boundaries of what open-source multimodal AI can do. The model can process over ten hours of audio and 400 seconds of 720p video natively — capabilities that were previously limited to closed-source frontier models.
The "omnimodal" designation means Qwen 3.5-Omni processes text, images, audio, and video through a unified architecture rather than separate modules stitched together. This approach generally produces more coherent cross-modal reasoning — understanding a video's visual content in relation to its audio track, for example.
The release intensifies competition in the open-source multimodal space, where Google's Gemma 4 (with native vision and audio) and Meta's Llama models are also vying for developer adoption. Alibaba's approach of offering maximum multimodal capability at zero licensing cost is designed to drive adoption across Asia's developer ecosystem, particularly in China and Southeast Asia where Alibaba Cloud has significant market presence.
Qwen 3.5-Omni arrives as Alibaba continues to expand its enterprise AI offerings, including the Qwen-based agentic AI platform for business customers announced earlier this year.