Qwen2.5-VL-32B: Smarter and Lighter

289 comments

March 24, 2025

simonw

Big day for open source Chinese model releases - DeepSeek-v3-0324 came out today too, an updated version of DeepSeek v3 now under an MIT license (previously it was a custom DeepSeek license). https://simonwillison.net/2025/Mar/24/deepseek/

chaosprint

it seems that this free version "may use your prompts and completions to train new models"

https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free

do you think this needs attention?

wgd

That's typical of the free options on OpenRouter. If you don't want your inputs used for training, you use the paid one: https://openrouter.ai/deepseek/deepseek-chat-v3-0324

overfeed

Is OpenRouter planning on distilling models off the prompts and responses from frontier models? That's smart - a little gross - but smart.

huijzer

Since we are on HN here, I can highly recommend open-webui with some OpenAI-compatible provider. I've been running with Deep Infra for more than a year now and am very happy. New models are usually available within one or two days after release. I also have some friends who use the service almost daily.

l72

I too run openweb-ui locally and use deepinfra.com as my backend. It has been working very well, and I am quite happy with deepinfra's pricing and privacy policy.

I have set up the same thing at work for my colleagues, and they find it better than openai for their tasks.

indigodaddy

Can open-webui update code on your local computer, à la Cursor etc.?

unquietwiki

I'm using open-webui at home with a couple of different models. gemma2-9b fits in VRAM on an NV 3060 card and performs nicely.

eurekin

I've tried using it, but its browser tab seems to peg one core to 100% after some time. Has anyone else experienced this?

totetsu

And it's quite easy to set up a Cloudflare tunnel to make your open-webui instance accessible online to just you, too.

wkat4242

Yeah OpenWebUI is great with local models too. I love it. You can even do a combo, send the same prompt to local and cloud and even various providers and compare the results.

TechDebtDevin

That's because it's a third-party API someone is hosting, trying to arb the infra cost or mine training data, or maybe something even more sinister. I stay away from OpenRouter APIs that aren't served by reputable, well-known companies, and even then...

madduci

As always, avoid using sensitive information and you are good to go

behnamoh

good grief! people are okay with it when OpenAI and Google do it, but as soon as open source providers do it, people get defensive about it...

chaosprint

No, it's nothing to do with DeepSeek. It's OpenRouter and the providers there.

londons_explore

I trust big companies far more with my data than small ones.

Big companies have so much data they won't be having a human look at mine specifically. Some small place probably has the engineer looking at my logs as user #4.

Also, big companies have security teams whose job is securing the data, and it won't be going over some unencrypted link to Cloudflare because OP was too lazy to set up HTTPS certs.

echelon

Pretty soon I won't be using any American models. It'll be a 100% Chinese open source stack.

The foundation model companies are screwed. Only shovel makers (Nvidia, infra companies) and product companies are going to win.

jsheard

I still don't get where the money for new open source models is going to come from once setting investor dollars on fire is no longer a viable business model. Does anyone seriously expect companies to keep buying and running thousands of ungodly expensive GPUs, plus whatever they spend on human workers to do labelling/tuning, and then giving away the spoils for free, forever?

Imustaskforhelp

I think it's market leadership, which is just free word-of-mouth advertising, which can then lead to a consulting business - or maybe they can sneak some ads into the LLM directly, oh boy, you don't know.

Also, I have seen that once an open source LLM is released to the public, even though you can access it on any website hosting it, most people would still prefer to use it from the company that created the model.

DeepSeek released its revenue model and it's crazy good.

And no, they didn't have full racks of H100s.

Also, one more thing: open source has always had an issue with funding.

Also, they are not completely open source, they are just open weights. Yes, you can fine-tune them, but from my limited knowledge there are some limitations to fine-tuning, so keeping the training data proprietary also supports my earlier point about a consulting business.

Yes, it's not a hugely profitable venture; imo it's just a decently profitable one, but the current hype around AI is making it lucrative for companies.

Also, I think this might be a winner-takes-all market, which increases competition, but in a healthy way.

What DeepSeek did by releasing the open source model, and then going out of their way to release some other open source projects which themselves could have been worth a few million (bycloud said it), helps innovate AI in general.

pants2

There are lots of open-source projects that took many millions of dollars to create. Kubernetes, React, Postgres, Chromium, etc. etc.

This has clearly been part of a viable business model for a long time. Why should LLM models be any different?

mitthrowaway2

Maybe from NVIDIA? "Commoditize your product's complement".

https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/

zamadatix

Once setting investment dollars on fire is no longer viable, it'll probably be because scaling died anyway, so what's the rush to have a dozen new frontier models each year?

pizzly

One possibility: certain countries will always be able to produce open models cheaper than others. The USA and Europe probably won't be able to. However, due to national security and wanting to promote their models overseas instead of letting their competitors promote theirs, the governments of the USA and Europe will subsidize models, which will lead their competitors to (further?) subsidies. There is a promotional aspect as well: just like Hollywood, governments will use their open source models to promote their ideology.

lumost

Product and infra companies may continue to release these models openly because they need to keep improving their product. An omni chat app is a great product.

finnjohnsen2

ads again. somehow. it's like a law of nature.

theptip

Yeah, this is the obvious objection to the doom. Someone has to pay to train the model that all the small ones distill from.

Companies will have to detect and police distilling if they want to keep their moat. Maybe you have to have an enterprise agreement (and arms control waiver) to get GPT-6-large API access.

piokoch

"The foundation model companies are screwed." Not really, they can either make API access expensive or resign from exposing APIs and offer their custom products. Open Source models are great, but you need powerful hardware to run them, surely it will not be a smartphone, at least in the nearest future.

Imustaskforhelp

Yes, I believe the same, though of the Western models I only believe in Grok, Gemini, or Claude.

Gemini isn't too special; it's actually just comparable to DeepSeek, or less, but it is damn fast, so maybe forget Gemini for serious tasks.

Grok / Gemini can be used as a deep research model, which I think I like? Grok seems to have just taken the DeepSeek approach and scaled it with their hyper-massive GPU cluster; I am not sure, but I think Grok can also be replaced.

What I truly believe in is Claude.

I am not sure, but Claude really feels good for coding especially.

For anything else I might use something like DeepSeek / Chinese models.

I used cerebras.ai and holy moly they are so fast. I used the DeepSeek 70B model, and it is still incredibly fast; my time matters, so I really like the open source way, since it lets companies like Cerebras focus on what they do best.

I am not sure about Nvidia though. Nvidia seems so connected to Western AI that DeepSeek improvements impact Nvidia.

I do hope that Nvidia cheapens the price of GPUs, though I don't think they have much incentive.

AlexCoventry

IMO, people will keep investing in this because whoever accomplishes the first intelligence explosion is going to have the potential for massive influence over all human life.

fsndz

Indeed. Open source will win. Sam Altman was wrong: https://www.lycee.ai/blog/why-sam-altman-is-wrong

buyucu

OpenAI is basically a zombie company at this point. They could not make a profit even when they were the only player in town, and it's now a very competitive landscape.

refulgentis

I've been waiting since November for 1, just 1*, model other than Claude that can reliably do agentic tool call loops. As long as the Chinese open models are chasing reasoning and benchmark-maxxing vs. mid-2024 US private models, I'm very comfortable with somewhat ignoring these models.

(this isn't idle prognostication hinging on my personal hobby horse. I got skin in the game: I'm virtually certain I have the only AI client that is able to reliably do tool calls with open models in an agentic setting. llama.cpp got a massive contribution to make this happen, and the big boys who bother, like ollama, are still using a dated json-schema-forcing method that doesn't comport with recent local model releases that can do tool calls. IMHO we're comfortably past the point where products using these models can afford to focus on conversational chatbots; that's cute, but a commodity to give away per standard 2010s SV thinking)

* OpenAI's can but are a little less...grounded?...situated? i.e. it can't handle "read this file and edit it to do $X". Same-ish for Gemini, though, sometimes I feel like the only person in the world who actually waits for the experimental models to go GA, as per letter of the law, I shouldn't deploy them until then

anon373839

A bit of a tangent, but what're your thoughts on code agents compared to the standard "blobs of JSON" approach? I haven't tried it myself, but it does seem like it would be a better fit for existing LLMs' capabilities.

cess11

You mean like https://manusai.ai/ is supposed to function?

throwawaymaths

Is there some reason you can't train a 1B model to just do agentic stuff?

ilrwbwrkhv

Both of them are better than any American models - for reasoning, agentic use, fine-tuning, etc.

th0ma5

Why do you keep promoting your blog on every LLM post?

rudedogg

I think they didn’t want to rewrite their post. It’s more substantial and researched than any comment here, and all their posts are full of information. I think they should get a pass, and calling it self-promotion is a stretch.

simonw

Because I want people to read it. I only promote it if I think it's useful and relevant.

qingcharles

You're good. Disregard the haters.

th0ma5

I think you need to realize your fans don't have the same intent as you. You should ask your audience what they want; you may be surprised.

jschoe

And it still can't answer this:

Q: "9.11 and 9.9, which one is larger?"

A: "To determine which number is larger between 9.11 and 9.9, let's compare them step by step. Both numbers have the same whole number part: 9. 9.11 has a 1 in the tenths place. 9.9 has a 9 in the tenths place. Since 9 (from 9.9) is greater than 1 (from 9.11), we can conclude that 9.9 is larger than 9.11."

"Final Answer: 9.9"

I don't think anything else needs to be said here.

vbezhenar

But that’s correct. 9.9 = 9.90 > 9.11. Seems that it answered the question absolutely correctly.

javchz

He's using Semantic versioning/s
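
To make the two readings concrete, here is a quick illustrative Python sketch (an assumption for illustration, treating the values as decimals vs. as (major, minor) version tuples):

  print(9.9 > 9.11)        # True  -- as decimal numbers, 9.9 is larger
  print((9, 11) > (9, 9))  # True  -- as version numbers, 9.11 comes later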

AuryGlenz

I suggest we’ve already now passed what shall be dubbed the jschoe test ;)

sebastiennight

I will now refer to this as the jschoe test in my writing and publications as well!

It's interesting to think that maybe one of the most realistic consequences of reaching artificial superintelligence will be when its answers start wildly diverging from human expectations and we think it's being "increasingly wrong".

manaskarekar

jschoe's post is actually a Turing test for us. :)

(just kidding jschoe)

oefrha

I’ve legit seen a heated online debate with hundreds of comments about this question (maybe not the exact numbers), and I don’t think most participants were memeing. People are that bad at math. It’s depressing.

aurareturn

+1 to Deepseek

-1 to humanity

yencabulator

Based on the presented reasoning, that means humanity wins! Yay!

MiiMe19

Sorry, I don't quite see what is wrong here.

manaskarekar

Parent is thinking Semantic Versioning.

dangoodmanUT

9.9 - 9.11 = 0.79

Might want to check your math? Seems right to me

keyle

9.9 is larger than 9.11. This right here is the perfect example of the Dunning-Kruger effect.

Maybe try rephrasing your question as "which version came later, 9.9 or 9.11?".

simonw

This model is available for MLX now, in various different sizes.

I ran https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct... using uv (so no need to install libraries first) and https://github.com/Blaizzy/mlx-vlm like this:

  uv run --with 'numpy<2' --with mlx-vlm \
    python -m mlx_vlm.generate \
      --model mlx-community/Qwen2.5-VL-32B-Instruct-4bit \
      --max-tokens 1000 \
      --temperature 0.0 \
      --prompt "Describe this image." \
      --image Mpaboundrycdfw-1.png
That downloaded an ~18GB model and gave me a VERY impressive result, shown at the bottom here: https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/

john_alan

Does quantised MLX support vision though?

Is UV the best way to run it?

dphnx

uv is just a Python package manager. No idea why they thought it was relevant to mention that

stavros

Because that one-liner will result in the model instantly running on your machine, which is much more useful than trying to figure out all the dependencies, invariably failing, and deciding that technology is horrible and that all you ever wanted was to be a carpenter.

ggregoire

We were using Llama vision 3.2 a few months back and were very frustrated with it (both in terms of speed and result quality). One day we were looking for alternatives on Hugging Face and eventually stumbled upon Qwen. The difference in accuracy and speed absolutely blew our minds. We ask it to find something in an image and we get a response in like half a second with a 4090, and it's correct most of the time. What's even more mind-blowing is that when we ask it to extract any entity name from the image and the entity name is truncated, it gives us the complete name without even having to ask for it (e.g. "Coca-C" is barely visible in the background, and it will return "Coca-Cola" on its own). And it does it with entities not as well known as Coca-Cola, and with entities only known in some very specific regions too. Haven't looked back at Llama or any other vision models since we tried Qwen.

Alifatisk

Ever since I switched to Qwen as my go-to, it's been bliss. They have a model for many (if not all) cases. No more daily quota! And you get to use their massive context window (1M tokens).

Hugsun

How are you using them? Who is enforcing the daily quota?

Alifatisk

I use them through chat.qwenlm.ai. What's nice is that you can run your prompt through 3 different modes in parallel to see which suits that case best.

The daily quota I spoke about is for ChatGPT and Claude; those are very limited on usage (for free users at least, understandably), while on Qwen I have felt like I am abusing it with how much I use it. It's very versatile in the sense that it has capabilities like image generation, video generation, a massive context window, and both visual and textual reasoning all in one place.

Alibaba is really doing something amazing here.

exe34

what do you use to serve it, ollama or llama.cpp or similar?

simonw

32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough you can run them on a single GPU or a reasonably well specced Mac laptop (32GB or more).

faizshah

I just started self-hosting as well on my local machine; I've been using https://lmstudio.ai/ locally for now.

I think the 32b models are actually good enough that I might stop paying for ChatGPT plus and Claude.

I get around 20 tok/second on my M3, and I can get 100 tok/second on smaller or quantized models. 80-100 tok/second is best for interactive usage; if you go above that, you basically can't read as fast as it generates.

I also really like the QwQ reasoning model. I haven't gotten around to trying locally hosted models for agents and RAG; coding agents especially are what I'm interested in. I feel like 20 tok/second is fine if it's just running in the background.

Anyway, I'd love to know others' experiences; that was mine this weekend. The way it's going, I really don't see a point in paying. I think on-device is the near future, and they should just charge a licensing fee, like DB providers do, for enterprise support and updates.

If you were paying $20/mo for ChatGPT 1 year ago, the 32B models are basically at that level - slightly slower and slightly lower quality, but useful enough to consider cancelling your subscriptions at this point.

wetwater

Are there any good sources I can read up on for estimating what hardware specs would be required for 7B, 13B, 32B, etc. sizes if I need to run them locally? I am a grad student on a budget, but I want to host one locally and am trying to build a PC that could run one of these models.

coder543

"B" just means "billion". A 7B model has 7 billion parameters. Most models are trained in fp16, so each parameter takes two bytes at full precision. Therefore, 7B = 14GB of memory. You can easily quantize models to 8 bits per parameter with very little quality loss, so then 7B = 7GB of memory. With more quality loss (making the model dumber), you can quantize to 4 bits per parameter, so 7B = 3.5GB of memory. There are ways to quantize at other levels too, anywhere from under 2 bits per parameter up to 6 bits per parameter are common.

There is additional memory used for context / KV cache. So, if you use a large context window for a model, you will need to factor in several additional gigabytes for that, but it is much harder to provide a rule of thumb for that overhead. Most of the time, the overhead is significantly less than the size of the model, so not 2x or anything. (The size of the context window is related to the amount of text/images that you can have in a conversation before the LLM begins forgetting the earlier parts of the conversation.)

The most important thing for local LLM performance is typically memory bandwidth. This is why GPUs are so much faster for LLM inference than CPUs, since GPU VRAM is many times the speed of CPU RAM. Apple Silicon offers rather decent memory bandwidth, which makes the performance fit somewhere between a typical Intel/AMD CPU and a typical GPU. Apple Silicon is definitely not as fast as a discrete GPU with the same amount of VRAM.

That's about all you need to know to get started. There are obviously nuances and exceptions that apply in certain situations.

A 32B model at 5 bits per parameter will comfortably fit onto a 24GB GPU and provide decent speed, as long as the context window isn't set to a huge value.
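
As a rough sketch of the arithmetic above (weights only, ignoring the KV cache/context overhead mentioned; the helper function is illustrative, not from any library):

  # Rough weights-only estimate: params (billions) x bytes per parameter.
  def weight_memory_gb(params_billions, bits_per_param):
      return params_billions * (bits_per_param / 8)  # 1B params at 8-bit ~= 1 GB

  for bits in (16, 8, 5, 4):
      print(f"32B @ {bits}-bit: ~{weight_memory_gb(32, bits):.0f} GB")
  # -> ~64, ~32, ~20, ~16 GB; context/KV cache adds more on top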

faizshah

Go to r/LocalLLaMA; they have the most info. There are also lots of good YouTube channels that have done benchmarks on Mac Minis for this (another good-value option with a student discount).

Since you’re a student most of the providers/clouds offer student credits and you can also get loads of credits from hackathons.

disgruntledphd2

A MacBook with 64GB RAM will probably be the easiest. As a bonus, you can train PyTorch models on the built-in GPU.

It's really frustrating that I can't just write off Apple as evil monopolists when they put out hardware like this.

p_l

Generally, unquantized: double the number and that's the amount of VRAM in GB you need, plus some extra, because most models use fp16 weights, so it's 2 bytes per parameter -> 32B parameters = 64GB.

Typical quantization to 4-bit will cut a 32B model down to 16GB of weights plus some runtime data, which makes it possibly usable (if slow) on a 16GB GPU. You can sometimes viably use smaller quantizations, which will reduce memory use even more.

randomNumber7

Yes. You multiply the number of parameters by the number of bytes per parameter and compare it with the amount of GPU memory (or CPU RAM) you have.

regularfry

Qwq:32b + qwen2.5-coder:32b is a nice combination for aider, running locally on a 4090. It has to swap models between architect and edit steps so it's not especially fast, but it's capable enough to be useful. qwen2.5-coder does screw up the edit format sometimes though, which is a pain.

pixelHD

what spec is your local mac?

Tepix

32B is also great for two 24GB GPUs if you want a nice context size and/or Q8 quantization which is usually very good.

wetwater

I've only recently started looking into running these models locally on my system. I have limited knowledge of LLMs and even less when it comes to building my own PC.

Are there any good sources I can read up on for estimating what hardware specs would be required for 7B, 13B, 32B, etc. sizes if I need to run them locally?

TechDebtDevin

VRAM Required = Number of Parameters (in billions) × Number of Bytes per Parameter × Overhead[0].

[0]: https://twm.me/posts/calculate-vram-requirements-local-llms/
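
A minimal sketch of applying that formula; the vram_gb helper and the 1.2 overhead factor are assumptions for illustration, not figures taken from the linked post:

  # VRAM ~= params (billions) x bytes per parameter x overhead factor.
  # The 1.2 overhead is an assumed rule of thumb, not an exact figure.
  def vram_gb(params_billions, bytes_per_param, overhead=1.2):
      return params_billions * bytes_per_param * overhead

  print(vram_gb(32, 2.0))  # fp16 (2 bytes/param): ~76.8 GB
  print(vram_gb(32, 0.5))  # 4-bit (~0.5 bytes/param): ~19.2 GB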

manmal

Don’t forget to add a lot of extra space if you want a usable context size.

wetwater

That's neat! Thanks.

int_19h

I don't think there's any local model other than full-sized DeepSeek (not distillations!) that is on the level of the original GPT-4, at least not in reasoning tasks. Scoreboards lie.

That aside, QwQ-32 is amazingly smart for its size.

clear_view

A 32B model doesn't fully fit in 16GB of VRAM. Still fine for higher quality answers; worth the extra wait in some cases.

abraxas

Would a 40GB A6000 fully accommodate a 32B model? I assume an fp16 quantization is still necessary?

manmal

At FP16 you'd need 64GB just for the weights, and it'd be 2x as slow as a Q8 version, likely with little improvement. You'll also need space for attention and context etc., so 80-100GB (or even more) VRAM would be better.

Many people "just" use 4x consumer GPUs like the 3090 (24GB each), which scales well. They'd probably buy a mining rig, an EPYC CPU, a mainboard with sufficient PCIe lanes, PCIe risers, a 1600W PSU (might need to limit the GPUs to 300W), and 128GB RAM. Depending on what you pay for the GPUs, that'll be 3.5-4.5k.

elorant

You don't need 16-bit precision. The difference in accuracy from 8-bit in most models is less than 5%.

osti

Are 5090s able to run 32B models?

regularfry

The 4090 can run 32B models in Q4_K_M, so yes, on that measure. Not unquantised though, nothing bigger than Q8 would fit. On a 32GB card you'll have more choices to trade off quantisation against context.

redrove

Or quantized on a 4090!

buyucu

I prefer 24b because it's the largest model I can run on a 16GB laptop :)

101008

Silly question: how can OpenAI, Claude, and all the rest have such large valuations considering all the open source models? I'm not saying they will disappear or be tiny (closed models), but why so very valuable?

Gathering6678

Valuation can depend on lots of different things, including hype. However, it ultimately comes down to an estimated discounted cash flow from the future, i.e. those who buy their shares (through private equity methods) at the current valuation believe the company will earn such and such money in the future to justify the valuation.

neither_color

ChatGPT's o1 is still really good and the free options are not compelling enough to switch if you've been using it for a while. They've positioned themselves to be a good mainstream default.

rafaelmn

Because what seems like a tiny difference in those benchmark graphs is, in practice, the difference between worth paying for and a complete waste of time.

barbarr

It's user base and brand. Just like with Pepsi and Coca Cola. There's a reason OpenAI ran a Super Bowl ad.

TechDebtDevin

Most "normies" I know only recognize ChatGPT with AI, so for sure, brand recognition is the only thing that matters.

101008

Yeah, but with cheaper alternatives (and open source and local ones) it would be super easy for most customers to migrate to a different provider. I am not saying they don't provide any value, but it's like paid software vs. an open source alternative. The open source alternative ends up winning out, especially among tech people.

csomar

Their valuation is not marked to market. We know their previous round valuation, but at this point it is speculative until they go through another round that will mark them again.

That being said, they have a user base and integrations. As long as they stay close to or a bit ahead of the Chinese models, they'll be fine. If the Chinese models jump significantly ahead of them, well, then they are pretty much dead. Add open source to the mix and they become history.

elorant

The average user won't self-host a model.

hobofan

The competition isn't self-hosting. If you can just pick a capable model from any provider, inference turns into an infrastructure/PaaS game -> the majority of the profits will be captured by the cloud providers.

epolanski

...yet

8n4vidtmkvmk

I'm not sure how it'll ever make sense unless you need a lot of customizations or care a lot about data leaks.

For small guys and everyone else, it'll probably be cost-neutral to keep paying OpenAI, Google, etc. directly rather than paying some cloud provider to host an at-best on-par model at equivalent prices.

Workaccount2

Because they offer extremely powerful models at pretty modest prices.

The hardware for a local model would cost years and years of a $20/mo subscription, would output lower quality work, and would be much slower.

3.7 Thinking is an insane programming model. Maybe it cannot do an SWE's job, but it sure as hell can write functional narrow-scope programs with a GUI.

mirekrusin

For coding and other integrations, people pay per token via an API key, not a subscription. Claude Code costs a few $ per task on your code - it gets expensive quite quickly.

tempoponet

But something comparable to a locally hosted model in the 32-70B range costs pennies on the dollar compared to Claude, will be 50x faster than your GPU, and comes with a much larger context window.

Local hosting on GPU only really makes sense if you're doing many hours of training/inference daily.

seydor

People cannot normally invest in their competitors.

It's not unlikely that Chinese products may be banned / tariffed.

FreakyT

There are non-Chinese open LLMs (Mistral, Llama, etc.), so I don't think that explains it.

Arcuru

Does anyone know how making the models multimodal impacts their text capabilities? The article is claiming this achieves good performance on pure text as well, but I'm curious if there is any analysis on how much impact it usually has.

I've seen some people claim it should make the models better at text, but I find that a little difficult to believe without data.

kmacdough

I am having a hard time finding controlled testing, but the premise is straightforward: different modalities encourage different skills and understandings. Text builds up more formal idea tokenization and strengthens logic/reasoning, while images require the model to learn a more robust geometric intuition. Since these learnings are applied to the same latent space, the strengths can be cross-applied.

The same applies to humans. Imagine a human whose only life involved reading books in a dark room, vs. one who could see images, vs. one who can actually interact.

netdur

My understanding is that in multimodal models, both text and image vectors align to the same semantic space; this alignment seems to be the main difference from text-only models.

lysace

To clarify: Qwen is made by Alibaba Cloud.

(It's not mentioned anywhere in the blog post.)

jauntywundrkind

Wish I knew better how to estimate what size video card one needs. The HuggingFace link says this is bfloat16, so at least 64GB?

I guess the -7B might run on my 16GB AMD card?

zamadatix

https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...

That will help you quickly calculate the model's VRAM usage as well as the VRAM usage of the context length you want to use. You can put "Qwen/Qwen2.5-VL-32B-Instruct" in the "Model (unquantized)" field. Funnily enough, the calculator lacks the option to see the size without quantizing the model, usually because nobody worried about VRAM bothers running >8-bit quants.

azinman2

Except when it comes to deepseek

zamadatix

For others not as familiar, this is pointing out that DeepSeek-v3/DeepSeek-R1 are natively FP8, so selecting "Q8_0" aligns with not quantizing that model (though you'll need ~1 TB of memory to use these models unquantized at full context). Importantly, this does not apply to the "DeepSeek" distills of other models, which natively remain the same precision as the base model they distill from.

I expect more and more worthwhile models to natively have <16 bit weights as time goes on but for the moment it's pretty much "8 bit DeepSeek and some research/testing models of various parameter width".

xiphias2

I wish they would start producing graphs with quantized version performances as well. What matters is RAM/bandwidth vs performance, not number of parameters.

wgd

You can run a 4-bit quantized version at a small (though nonzero) cost to output quality, so you would only need 16GB for that.

Also, it's entirely possible to run a model that doesn't fit in available GPU memory; it will just be slower.

clear_view

deepseek-r1:14b/mistral-small:24b/qwen2.5-coder:14b fit 16GB VRAM with fast generation. 32b versions bleed into RAM and take a serious performance hit, but are still usable.

gatienboquet

So today is Qwen. Tomorrow a new SOTA model from Google apparently, R2 next week.

We haven't hit the wall yet.

zamadatix

Qwen 3 is coming imminently as well https://github.com/huggingface/transformers/pull/36878 and it feels like Llama 4 should be coming in the next month or so.

That said none of the recent string of releases has done much yet to "smash a wall", they've just met the larger proprietary models where they already were. I'm hoping R2 or the like really changes that by showing ChatGPT 3->3.5 or 3.5->4 level generational jumps are still possible beyond the current state of the art, not just beyond current models of a given size.

YetAnotherNick

> met the larger proprietary models where they already were

This is smashing the wall.

Also, if you just care about breaking absolute numbers, OpenAI released 4.5 a month back, which is SOTA as a base model, and plans to release full o3 in maybe a month, and DeepSeek released the new V3, which is again SOTA in many aspects.

OsrsNeedsf2P

> We haven't hit the wall yet.

The models are iterative improvements, but I haven't seen night-and-day differences since GPT-3 and 3.5.

anon373839

Yeah. Scaling up pretraining and huge models appears to be done. But I think we're still advancing the frontier in the other direction -- i.e., how much capability and knowledge can we cram into smaller and smaller models?

Davidzheng

Tbh such a big jump from current capability would be ASI already

YetAnotherNick

Because 3.5 had a new capability: following instructions. Right now we are in the 3.5 range in conversational AI and native image generation, both of which feel magical.

intalentive

Asymptotic improvement will never hit the wall

nwienert

We've slid into the upper S curve though.

tomdekan

Any more info on the new Google model?

behnamoh

Google's announcements are mostly vaporware anyway. Btw, where is Gemini Ultra 1? How about Gemini Ultra 2?

karmasimida

It is already on the LLM arena, right? Codename Nebula? But you are right, they can fuck up their releases royally.

aoeusnth1

I guess they don’t do ultras anymore, but where was the announcement for it? What other announcement was vaporware?

michaelt

Has anyone successfully run a quantized version of any of the Qwen2.5-VL series of models?

I've run the smallest model in non-quantized format, but when I've tried to run an AWQ version of one of the bigger models, I've struggled to find a combination of libraries that works right - even though it should fit on my GPU.

naasking

I found Qwen never finished answering the standard coding task I use to check a model. Claude did great; DeepSeek R1 did well.

cryptocrat7

there should be a way to share these prompts + tools through visuals