
Qwen3-Omni: Native Omni AI model for text, image and video

state_less

Interesting, the pacing seemed very slow when conversing in english, but when I spoke to it in spanish, it sounded much faster. It's really impressive that these models are going to be able to do real time translation and much more.

The Chinese are going to end up owning the AI market if the American labs don't start competing on open weights. Americans may end up in a situation where they have some $1000-2000 device at home with an open Chinese model running on it, if they care about privacy or owning their data. What a turn of events!

tedivm

This is exactly what I do. I have two 3090s at home with Qwen3 running on them. This is tied into my Home Assistant install, and I use ESP32 devices as voice satellites. It works shockingly well.
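The general shape of a setup like this is to serve the model behind an OpenAI-compatible API and let the home automation side call it. A rough sketch follows (the serving command, endpoint, and model name are illustrative assumptions, not a description of this exact setup):

```python
# Sketch: query a locally hosted Qwen3 through an OpenAI-compatible API.
# Assumes the model is being served with something like:
#   vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 2
# Home Assistant (or any other front end) can then forward transcribed
# voice commands to this endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask_local_model(command: str) -> str:
    """Send a transcribed voice command to the local model and return its reply."""
    response = client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B",
        messages=[{"role": "user", "content": command}],
    )
    return response.choices[0].message.content

print(ask_local_model("Turn off the living room lights."))
```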

bilbo0s

> Americans may end up in a situation where they have some $1000-2000 device at home with an open Chinese model running on it

Wouldn't worry about that, I'm pretty sure the government is going to ban running Chinese tech in this space sooner or later. And we won't even be able to download it.

Not saying any of the bans will make any kind of sense, but I'm pretty sure they're gonna say this is a "strategic" space. And everything else will follow from there.

Download Chinese models while you can.

simonw

You can try it out on https://chat.qwen.ai/ - sign in with Google or GitHub (signed out users can't use the voice mode) and then click on the voice icon.

It has an entertaining selection of different voices, including:

*Dylan* - A teenager who grew up in Beijing's hutongs

*Peter* - Tianjin crosstalk, professionally supporting others

*Cherry* - A sunny, positive, friendly, and natural young lady

*Ethan* - A sunny, warm, energetic, and vigorous boy

*Eric* - A Sichuan Chengdu man who stands out from the crowd

*Jada* - The fiery older sister from Shanghai

flockonus

The voices are really fun, thanks for the laughs :)

indigodaddy

I only see Omni Flash, is that the one?

simonw

The model weights are 70GB (Hugging Face recently added a file size indicator - see https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/tree... ) so this one is reasonably accessible to run locally.

I wonder if we'll see a macOS port soon - currently it very much needs an NVIDIA GPU as far as I can tell.

a_e_k

That's at BF16, so it should fit fairly well on 24GB GPUs after quantization to Q4, I'd think. (Much like the other 30B-A3B models in the family.)

I'm pretty happy about that - I was worried it'd be another 200B+.
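A rough back-of-envelope check of that claim (the quantization overhead is an assumption, not an official number):

```python
# ~70 GB of BF16 weights at 2 bytes/parameter implies roughly 35B parameters.
bf16_bytes = 70e9
params = bf16_bytes / 2

# Assume ~4.5 bits/parameter for a typical Q4 quantization, including overhead.
q4_gb = params * 4.5 / 8 / 1e9
print(f"~{q4_gb:.0f} GB at Q4")  # ~20 GB, which fits on a 24 GB GPU
```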

growthwtf

A fun project for somebody with more time than I have would be to see if they can get it working with the new Mojo support for Apple announced yesterday. I don't know whether the functionality is baked enough yet to actually pull the port off, but it would be an interesting try.

dcreater

Is there an inference engine for this on macOS?

chisleu

Here is the demo video. The segment with sound input -> sound output, translating the video's audio into another language, was the most impressive display I've seen yet.

https://www.youtube.com/watch?v=_zdOrPju4_g

vunderba

Neat. I threw a couple simple audio clips at it and it was able to at least recognize the instrumentation (piano, drums, etc). I haven't seen a lot of multimodal LLM focus around recognizing audio outside of speech, so I'd love to see a deep dive of what the SOTA is.

edude03

The Qwen thinker/talker architecture is really fascinating and is more in line with how I imagine human multimodality works - i.e., a picture of an apple, the text a p p l e, and the sound all map to the same concept without going through text first.

adastra22

Isn’t that how all LLMs work?

simonw

The existing vision LLMs all work like this, and that covers most of the major models these days.

Multi-modal audio models are a lot less common. GPT-4o was meant to be able to do this natively from the start but they ended up shipping separate custom models based on it for their audio features. As far as I can tell GPT-5 doesn't have audio input/output at all - the OpenAI features for that still use GPT-4o-audio.

I don't know if Gemini 2.5 (which is multi-modal for vision and audio) shares the same embedding space for all three, but I expect it probably does.
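As a toy sketch of what a shared embedding space means mechanically (class names and dimensions are made up for illustration; this is not Qwen3-Omni's or Gemini's actual architecture): each modality gets its own encoder, a small projection maps its features into the LLM's token-embedding space, and the transformer then attends over the combined sequence.

```python
# Toy sketch of a shared multimodal embedding space. All names and sizes
# are illustrative assumptions, not any real model's internals.
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    def __init__(self, vision_dim=1024, audio_dim=768, model_dim=4096):
        super().__init__()
        # One linear projection per modality into the LLM's embedding space.
        self.vision_proj = nn.Linear(vision_dim, model_dim)
        self.audio_proj = nn.Linear(audio_dim, model_dim)

    def forward(self, vision_feats, audio_feats, text_embeds):
        v = self.vision_proj(vision_feats)  # (batch, n_image_tokens, model_dim)
        a = self.audio_proj(audio_feats)    # (batch, n_audio_tokens, model_dim)
        # Concatenate along the sequence axis; the transformer attends over
        # image, audio, and text tokens in the same space.
        return torch.cat([v, a, text_embeds], dim=1)

proj = MultimodalProjector()
fused = proj(torch.randn(1, 16, 1024), torch.randn(1, 32, 768), torch.randn(1, 8, 4096))
print(fused.shape)  # torch.Size([1, 56, 4096])
```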

adastra22

What I mean is that all processing in an LLM occurs in state space. The next-token prediction is the very last step.

hadlock

Speech input + speech output is a big deal. In theory you can talk to it using voice, and it can respond in your language, or translate for someone else, without intermediary technologies. Right now you need a wakeword model, speech-to-text, and then text-to-speech, in addition to your core LLM. A couple of models can input speech, or output speech, but not both. It looks like they have at least 3 variants in the ~32B range.

Depending on the architecture, this is something you could feasibly have in your house in a couple of years, or in an expensive "AI toaster".
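To make the contrast concrete, here's a minimal sketch of the multi-stage pipeline described above, using openai-whisper for speech-to-text and pyttsx3 for text-to-speech. The wakeword step is omitted and the local LLM endpoint is an assumption; a native speech-in/speech-out model would collapse all of this into a single call.

```python
# Sketch of today's multi-stage voice pipeline: STT -> LLM -> TTS.
# (Wakeword detection would normally run before any of this.)
import whisper
import pyttsx3
import requests

stt_model = whisper.load_model("base")  # speech -> text
tts_engine = pyttsx3.init()             # text -> speech

def answer(audio_path: str) -> None:
    text = stt_model.transcribe(audio_path)["text"]
    # Hypothetical local OpenAI-compatible server (e.g. llama.cpp or vLLM).
    reply = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={"model": "qwen3", "messages": [{"role": "user", "content": text}]},
    ).json()["choices"][0]["message"]["content"]
    tts_engine.say(reply)
    tts_engine.runAndWait()

answer("question.wav")
```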

CamperBob2

Seems like a big win for language learning, if nothing else. Also seems possible to run locally, especially once the unsloth guys get their hands on it.

nmitchko

Next steps for AI in general:

- Additional modalities
- Faster FPS (inferences per second)
- Reaction-time tuning (latency vs. quality tradeoff) for visual and audio inputs/outputs
- Built-in planning modules in the architecture (think premotor frontal lobe)
- Time awareness during inference (towards an always-inferring / always-learning architecture)

neilmovva

The multilingual example in the launch graphic has Qwen3 producing the text:

> "Bonjour, pourriez-vous me dire comment se rendreà la place Tian'anmen?"

translation: "Hello, could you tell me how to get to Tiananmen Square?"

a bold choice!

ripped_britches

Westerners only know it from the massacre but it’s actually just like Times Square for them

OJFord

Not really. It's a significant place, which is why the protest (and hence the massacre) was there, so especially for Chinese people (I expect) merely referencing it doesn't immediately evoke the massacre; they have plenty of other connotations for it.

e.g. if something similar happened in Trafalgar Square, I expect it would still be primarily a major square in London to me, not oh my god they must be referring to that awful event. (In fact I think it was targeted in the 7/7 bombings for example.)

Or a better example to go with your translation - you can refer to the Bastille without 'boldly' invoking the history of its storming in the French Revolution.

No doubt the US media has referred to the Capitol without boldness many times since 6 Jan '21.

em500

Not to mention, Tiananmen Square is one of the major tourist destinations in Beijing (similar to National Mall in Washington DC), for both domestic and foreign visitors.