
Show HN: I built a full mulimodal LLM by merging multiple models into one

kouteiheika

I clicked expecting a single full multimodal LLM made by merging multiple existing models into one, as the title suggests (which sounds very interesting), and instead I found... a library that acts as an LLM router, calling a bunch of LLM web APIs and exposing them under a unified, easy-to-use interface?

With all due respect, sorry, but this title is very misleading. I'd expect "build an LLM" to mean, well, actually building an LLM, and while it's a very nice library, it's definitely not what the title suggests.

willwade

You know - the word "multimodal" I think is being used badly here. It's Multi-Model - not Multimodal - which certainly suggests a completely different thing.

yoeven

It's a framework that uses the best part of each LLM, e.g. multimodal support from Gemini, tool calling from GPT-4o, and reasoning from o3-mini, by chaining them dynamically. From a user's perspective there is no model selection or routing: just write the prompt or upload a file and it works, so it feels like you're working with a single LLM, but under the hood it does all this work to get you the best output :) Sorry if you felt it's misleading, but I hope you give it a shot!
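
To make that concrete, here's a rough, purely illustrative TypeScript sketch of the capability-based routing idea (the capability table, model names, and helper functions are made-up placeholders for illustration, not the library's actual internals or API):

    // Hypothetical sketch only -- not the library's real API.
    // Idea: inspect the request, pick the provider model best suited
    // to the capability it needs, then hand off to that model.

    type Capability = "vision" | "tools" | "reasoning" | "text";

    interface ModelChoice {
      provider: string;
      model: string;
    }

    // Assumed capability -> model table; the real selection logic is internal.
    const routingTable: Record<Capability, ModelChoice> = {
      vision: { provider: "google", model: "gemini-1.5-pro" },
      tools: { provider: "openai", model: "gpt-4o" },
      reasoning: { provider: "openai", model: "o3-mini" },
      text: { provider: "openai", model: "gpt-4o-mini" },
    };

    function detectCapability(prompt: string, files: string[]): Capability {
      if (files.some((f) => /\.(png|jpe?g|pdf)$/i.test(f))) return "vision";
      if (/\btool\b/i.test(prompt)) return "tools";
      if (/\b(prove|derive|step by step)\b/i.test(prompt)) return "reasoning";
      return "text";
    }

    function route(prompt: string, files: string[] = []): ModelChoice {
      // In the real framework this happens per step of a dynamic chain;
      // here it's a single lookup for illustration.
      return routingTable[detectCapability(prompt, files)];
    }

    console.log(route("Summarise this contract", ["contract.pdf"]));
    // -> { provider: "google", model: "gemini-1.5-pro" }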

danielbln

The problem with that phrasing is that there is actual model merging, where you merge the weights. So people reading the title might (and apparently do) expect that, less so an LLM router.

vikramkr

Makes sense, but the problem is that you're using words that already have specific meanings in the space, all related to creating one model with multiple functionalities. Merging means merging models into one model. Multimodal means one LLM that handles multiple modes. The term you want is probably agent or framework or chain or something. Basically, what you describe is when it feels like you're only working with one model. What your title says is that you actually engineered a single model, which is a distinct technical challenge.

yoeven

I 100% agree. This simulates multimodal input and automatically handles the rest, along with model selection, using a variety of techniques. It doesn't do this natively at the model level.

madduci

And a similar product already exists, Langdock

upghost

I'll jump in before the haterade engine wakes up -- great bit of engineering work here! I can't imagine a better way of abstracting away the unnecessary stuff while still retaining that level of manual control.

The only thing I don't see is setup for local/in-house LLMs, but it's easy enough to spoof OpenAI calls if necessary.

yoeven

Thank you!! Local models are something I'm looking into with GGUF files & llama.cpp. It's still pretty experimental, but you can check out the branch here: https://github.com/JigsawStack/omiai/tree/feat/local-models

upghost

Nice! To be clear, for my use case I didn't mean calling the local LLM directly; rather, simply being able to point to an OpenAI-compatible API would be fine. It seems like wrangling a native model would be a lot of extra complexity, but you seem to have a very good abstraction story here. Actually, I probably will take a peek at the code to see how you are doing these abstraction layers, because the user-facing API is certainly very clean...!
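
For what it's worth, here's a minimal sketch of what I mean, assuming a local llama.cpp server that speaks the OpenAI protocol (the port, paths, and model name are just placeholders):

    // Assumes a local llama.cpp server started with something like:
    //   llama-server -m ./models/llama-3-8b-instruct.Q4_K_M.gguf --port 8080
    // which exposes an OpenAI-compatible endpoint at http://localhost:8080/v1.
    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "http://localhost:8080/v1", // point the SDK at the local server
      apiKey: "not-needed-locally",        // llama.cpp ignores the key by default
    });

    async function main() {
      const completion = await client.chat.completions.create({
        model: "local-model", // placeholder; llama.cpp serves whatever it loaded
        messages: [{ role: "user", content: "Hello from a local model!" }],
      });
      console.log(completion.choices[0].message.content);
    }

    main();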

Peretus

Whoa, great to see Yoeven's work here. I learned about JigsawStack when I applied for a role there and was super impressed with what he's built. We ended up having a call and he was able to tell me a bit more about what he's working on.

He is a friendly and super down-to-earth guy who has made some remarkably good progress on building a platform that just works. For instance, easily connecting a fine-tuned LLM that knows how to scrape content to a translation LLM and wrapping that up in a platform with a really good developer experience.

If you're interested in this kind of thing, he also did a Show HN last year on Dzero, a distributed SQLite database built on Cloudflare D1: https://news.ycombinator.com/item?id=40563729

FilipSivak

You clearly don't understand what multimodal means. Multimodal is, for example, the new Gemini, where you can input a green car and get back the very same car, only with red paint. A multimodal LLM can do the edit in latent space, which is the key.

Very misleading title, and you won't get away with it by using the word "mulimodal" either.

yoeven

Multimodal is the ability to handle different types of inputs, like images, PDFs, text... You can do a quick Google search if you'd like to understand the meaning. Here is an article if it helps: https://www.splunk.com/en_us/blog/learn/multimodal-ai.html