
Tabby: Self-hosted AI coding assistant

153 comments · January 12, 2025

st3fan

The demo on the homepage for the completion of the findMaxElement function is a good example of what is to come. Or maybe where we are at now?

The six lines of Python suggested for that function can also be replaced with a simple “return max(arr)”. The suggested code works but is absolute junior level.
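
For reference, roughly the shape of the completion being described, next to the one-liner (a sketch of the pattern, not the demo's exact output):

    # Guess at the shape of the six-line suggestion (the demo's exact code may differ)
    def findMaxElement(arr):
        max_element = arr[0]
        for element in arr:
            if element > max_element:
                max_element = element
        return max_element

    # What a reviewer would expect instead
    def findMaxElementBuiltin(arr):
        return max(arr)

    assert findMaxElement([3, 1, 4, 1, 5]) == findMaxElementBuiltin([3, 1, 4, 1, 5]) == 5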

I am terrified of what is to come. Not just horrible code but also how people who blindly “autocomplete” this code are going to stall in their skill level progress.

You may score some story points but did you actually get any better at your craft?

tippytippytango

This is self-correcting. Code of this quality won't let you ship things. You are forced to understand the last 20%-30% of details the LLM can't help you with in order to pass all your tests. But, as it turns out, to understand the 20% of details the LLM couldn't handle, you need to understand the 80% it could handle.

I'm just not worried about this, LLMs don't ship.

tyingq

In the case where it writes functionally "good enough" code that performs terribly, it rewards the LLM vendor, since the LLM vendor is often also your IaC vendor. And now you need to buy more infra.

HPsquared

That's one hell of a synergy. Win-win-lose

grahamj

I sense a new position coming up: slop cleanup engineer

grahamj

This needs to be shouted from the rooftops. If you could do it yourself, then LLMs can be a great help: speeding things up, offering suggestions and alternatives, etc.

But if you’re asking for something you don’t know how to do you might end up with junk and not even know it.

cootsnuck

But if that junk doesn't work (which it likely won't for any worthwhile problem), then you have to get it working. And to get it working you almost always have to figure out how the junk code works. And that process, I've found, is where the real magic happens. You learn by fixing, pruning, optimizing.

I think there's a whole meta level of the actual dynamic between human<>LLM interactions that is not being sufficiently talked about. I think there's, potentially, many secondary benefits that can come from using them simply due to the ways you have to react to their outputs (if a person decides to rise to that occasion).

shriek

Wait till they come with auto review/merge agents, or maybe those already exist. gulp

shcheklein

On the other hand, it might become the next level of abstraction.

Machine -> Asm -> C -> Python -> LLM (Human language)

It compiles a human prompt into some intermediate code (in this case Python). The initial version of CPython was probably not perfect either, and engineers were terrified then too. If we are lucky, this new "compiler" will keep getting better and more efficient. Never perfect, but people will pay the same price they already pay for not dealing directly with ASM.

sdesol

> Machine -> Asm -> C -> Python -> LLM (Human language)

Something you neglected to mention is that, with every abstraction layer up to Python, everything is predictable and repeatable. With LLMs, we can give the exact same instructions and not be guaranteed the same code.

theptip

I’m not sure why that matters here. Users want code that solves their business need. In general most don’t care about repeatability if someone else tries to solve their problem.

The question that matters is: can businesses solve their problems cheaper for the same quality, or at lower quality while beating the previous Pareto-optimal cost/quality frontier.

compumetrika

LLMs use pseudo-random numbers. You can set the seed and get exactly the same output with the same model and input.
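
A minimal sketch of that point, assuming Hugging Face transformers and an arbitrary small model (on GPUs, non-deterministic kernels and batching can still break exact reproducibility):

    from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

    model_name = "Qwen/Qwen2.5-Coder-1.5B-Instruct"  # any small causal LM works here
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tok("def find_max_element(arr):", return_tensors="pt")

    completions = []
    for _ in range(2):
        set_seed(42)  # same seed + same model + same input -> same sampled tokens
        out = model.generate(**inputs, do_sample=True, temperature=0.8,
                             max_new_tokens=40, pad_token_id=tok.eos_token_id)
        completions.append(tok.decode(out[0], skip_special_tokens=True))

    assert completions[0] == completions[1]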

zurn

> > Machine -> Asm -> C -> Python -> LLM (Human language)

> Something that you neglected to mention is, with every abstraction layer up to Python, everything is predictable and repeatable.

As long as you consider C and dragons flying out of your nose predictable.

(Insert similar quip about hardware)

zajio1am

There is no reason to assume that, say, a C compiler generates the same machine code for the same source code. AFAIK, a C compiler that chooses randomly between multiple semantically equivalent instruction sequences is still a valid C compiler.

CamperBob2

> With LLMs, we can give the exact same instructions, and not be guaranteed the same code.

That's something we'll have to give up and get over.

See also: understanding how the underlying code actually works. You don't need to know assembly to use a high-level programming language (although it certainly doesn't hurt), and you won't need to know a high-level programming language to write the functional specs in English that the code generator model uses.

I say bring it on. 50+ years was long enough to keep doing things the same way.

SkyBelow

Even compiling code isn't deterministic, given that different compilers and different items installed on a machine can influence the final resulting code, right? Ideally they shouldn't have any noticeable impact, but in edge cases they might, which is why you compile your code once during a build step and then deploy the same compiled artifact to different environments instead of compiling it per environment.

jsjohnst

> With LLMs, we can give the exact same instructions, and not be guaranteed the same code.

Set the temperature appropriately and that problem is then solved, no?

12345hn6789

assuming you have full control over which compiler you're using for each step ;)

What's to say LLMs won't have a "compiler" interface in the future that will rein in their variance?

vages

It may be a “level of abstraction”, but not a good one, because it is imprecise.

When you want to make changes to the code (which is what we spend most of our time on), you’ll have to either (1) modify the prompt and accept the risk of using the new code or (2) modify the original code, which you can’t do unless you know the lower level of abstraction.

Recommended reading: https://ian-cooper.writeas.com/is-ai-a-silver-bullet

MVissers

Yup!

I have no goal of becoming a programmer, but I like to build programs.

I built a rather complex AI-ecosystem simulator with me as the director and GPT-4, now Claude 3.5, as the programmer.

I would never have been able to do this before.

saurik

I think there is a big difference between an abstraction layer that can improve -- one where you maybe write "code" in prompts and then have a compiler build the real code, allowing that compiler to get better over time -- and an interactive tool that locks bad decisions autocompleted today into both your codebase and your brain, leaving you still working at the lower layer but getting low-quality "help" in your editor. I am totally pro-compilers and high-level languages, but I think the idea of writing assembly with the help of a partial compiler, where you kind of write stuff and then copy/paste the result into your assembly file with some munging to fix issues, is dumb.

By all means, though: if someone gets us to the point where the "code" I am checking in is a bunch of English -- for which I will likely need a law degree in addition to an engineering background to not get evil genie with a cursed paw results from it trying to figure out what I must have meant from what I said :/ -- I will think that's pretty cool and will actually be a new layer of abstraction in the same class as compiler... and like, if at that point I don't use it, it will only be because I think it is somehow dangerous to humanity itself (and even then I will admit that it is probably more effective)... but we aren't there yet and "we're on the way there" doesn't count anywhere near as much as people often want it to ;P.

ripped_britches

The most underrated thing I do on nearly every Cursor suggestion is to follow up with "are there any better ways to do this?"

smcnally

A deeper version of the same idea is to ask a second model to check the first model’s answers. aider’s “architect” is an automated version of this approach.

https://aider.chat/docs/usage/modes.html#architect-mode-and-...

avandekleut

I always ask it to "analyze approaches to achieve X and then make a suggestion, no code" in the chat. Then a refinement step where I give feedback on the generated code. I also always try to give it an "out" between making changes and keeping things the same, to stave off the bias toward action.

cootsnuck

Yea, the "analyze and explain but no code yet" approach works well. Lets me audit its approach beforehand.

55555

I used to know things. Then they made Google, and I just looked things up. But at least I could still do things. Now we have AI, and I just ask it to do things for me. Now I don't know anything and I can't do anything.

deltaburnt

I feel like I've seen this comment so many times, but this one is actually genuine. The cult-like dedication is kind of baffling.

nyarlathotep_

Programmers (and adjacent positions) of late strike me as remarkably shortsighted and myopic.

Cheering for remote work leading to loads of new positions being offered overseas as opposed to domestically, and now loudly celebrating LLMs writing "boilerplate" for them.

How folks don't see the consequences of their actions is remarkable to me.

yellowapple

In both cases, you get what you pay for.

shihab

I think that example says more about the company that chose to put that code as a demo on their homepage.

999900000999

LLMs also love to double down on solutions that don't work.

Case in point, I'm working on a game that's essentially a website right now. Since I'm very very bad with web design I'm using an LLM.

It's perfect 75% of the time. The other 25% it just doesn't work. Multiple LLMs will misunderstand basic tasks, add properties that don't exist, and invent functions.

It's like you hired a college junior who insists they're never wrong and keeps pushing non-functional code.

The entire mindset is: whatever, it's close enough, good luck.

God forbid you need to do anything using an uncommon node module or anything like that.

smcnally

> LLMs also love to double down on solutions that don't work.

“Often wrong but never in doubt” is not proprietary to LLMs. It’s off-putting and we want them to be correct and to have humility when they’re wrong. But we should remember LLMs are trained on work created by people, and many of those people have built successful careers being exceedingly confident in solutions that don’t work.

999900000999

The issue is LLMs never say:

"I don't know how to do this".

When it comes to programming, just tell me you don't know so I can do something else. I ended up refactoring my UX to work around it. In this case it's a personal prototype, so it's not a big deal.

deltaburnt

So now you have an overconfident human using an overconfident tool, both of which will end up coding themselves into a corner? Compilers, at least, mostly offer very definitive feedback that acts as a guard rail for those overconfident humans.

Also, let's not forget LLMs are a product of the internet and anonymity. Human interaction on the internet is significantly different from in person interaction, where typically people are more humble and less overconfident. If someone at my office acted like some overconfident SO/reddit/HN users I would probably avoid them like the plague.

generalizations

> people who blindly “autocomplete” this code are going to stall in their skill level progress

AI is just going to widen the skill level bell curve. Enables some people to get away with far more mediocre work than before, but also enables some people to become far more capable. You can't make someone put in more effort, but the ones who do will really shine.

dizhn

Anybody care to comment on whether the quality of the existing code influences how good the AI's assistance is? In other words, would it suggest sloppy code where the existing code is sloppy and better (?) code where the existing code is good?

cootsnuck

What do you think? (I don't mean that in a snarky way.) Based on how LLMs work, I can't see how that would not be the case.

But in my experience there are nuances to this. It's less about "good" vs "bad"/"sloppy" code and more about whether the code is discernible. If it's discernibly sloppy (i.e. the kind of sloppy a beginning programmer might produce, which is familiar to all of us), I would say that's better than opaque "good" code ("good" really only meaning functional).

These things predict tokens. So when you use them, help them increase their chances of predicting the thing you want. Good comments on code, good function names, explain what you don't know, etc. etc. The same things you would ideally do if working with another person on a codebase.

wsxiaoys

Never imagined our project would make it to the HN front page on Sunday!

Tabby has undergone significant development since its launch two years ago [0]. It is now a comprehensive AI developer platform featuring code completion and a codebase chat, with a team [1] / enterprise focus (SSO, Access Control, User Authentication).

Tabby's adopters [2][3] have discovered that Tabby is the only platform providing a fully self-service onboarding experience as an on-prem offering. It also delivers performance that rivals other options in the market. If you're curious, I encourage you to give it a try!

[0]: https://www.tabbyml.com

[1]: https://demo.tabbyml.com/search/how-to-add-an-embedding-api-...

[2]: https://www.reddit.com/r/LocalLLaMA/s/lznmkWJhAZ

[3]: https://www.linkedin.com/posts/kelvinmu_last-week-i-introduc...

tootie

Is it only compatible with Nvidia and Apple? Will this work with an AMD GPU?

thih9

As someone unfamiliar with local AIs and eager to try, how does the “run tabby in 1 minute”[1] compare to e.g. chatgpt’s free 4o-mini? Can I run that docker command on a medium specced macbook pro and have an AI that is comparably fast and capable? Or are we not there (yet)?

Edit: looks like there is a separate page with instructions for macbooks[2] that has more context.

> The compute power of M1/M2 is limited and is likely to be sufficient only for individual usage. If you require a shared instance for a team, we recommend considering Docker hosting with CUDA or ROCm.

[1]: https://github.com/TabbyML/tabby#run-tabby-in-1-minute

    docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model StarCoder-1B --device cuda --chat-model Qwen2-1.5B-Instruct
[2]: https://tabby.tabbyml.com/docs/quick-start/installation/appl...

coder543

gpt-4o-mini might not be the best point of reference for what good LLMs can do with code: https://aider.chat/docs/leaderboards/#aider-polyglot-benchma...

A teeny tiny model such as a 1.5B model is really dumb, and not good at interactively generating code in a conversational way, but models in the 3B or less size can do a good job of suggesting tab completions.

There are larger "open" models (in the 32B - 70B range) that you can run locally that should be much, much better than gpt-4o-mini at just about everything, including writing code. For a few examples, llama3.3-70b-instruct and qwen2.5-coder-32b-instruct are pretty good. If you're really pressed for RAM, qwen2.5-coder-7b-instruct or codegemma-7b-it might be okay for some simple things.

> medium specced macbook pro

medium specced doesn't mean much. How much RAM do you have? Each "B" (billion) of parameters is going to require about 1GB of RAM, as a rule of thumb. (500MB for really heavily quantized models, 2GB for un-quantized models... but, 8-bit quants use 1GB, and that's usually fine.)
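
A back-of-envelope version of that rule of thumb (illustrative only; it ignores the KV cache and runtime overhead):

    def estimate_weight_ram_gb(params_billions: float, bits_per_weight: int = 8) -> float:
        # ~1 GB per billion parameters at 8-bit; halve for 4-bit, double for fp16
        return params_billions * (bits_per_weight / 8)

    for name, size_b in [("qwen2.5-coder-7b", 7), ("qwen2.5-coder-32b", 32), ("llama3.3-70b", 70)]:
        print(f"{name}: ~{estimate_weight_ram_gb(size_b, 8):.0f} GB at 8-bit, "
              f"~{estimate_weight_ram_gb(size_b, 4):.0f} GB at 4-bit")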

eurekin

Also, context size significantly impacts RAM/VRAM usage, and in programming those chats get big quickly.
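
A rough sketch of why: the KV cache grows linearly with context length. The layer/head/dimension numbers below are assumed, generic 8B-class values, not taken from any particular model card.

    def kv_cache_gb(context_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                    head_dim: int = 128, bytes_per_value: int = 2) -> float:
        # 2x for keys and values, per layer, per KV head, per head dimension (fp16)
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
        return per_token * context_tokens / 1e9

    for ctx in (2_000, 32_000, 128_000):
        print(f"{ctx:>7} tokens of context -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")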

Ringz

Thanks for your explanation! Very helpful!

eric-burel

Side question: open-source models tend to be less "smart" than proprietary ones. Do you intend to compensate by providing better context (e.g. querying relevant technology docs to feed the context)?

qwertox

> Toggle IDE / Extensions telemetry

Cannot be turned off in the Community Edition. What does this telemetry data contain?

andypants

    struct HealthState {
        model: String,
        chat_model: Option<String>,
        device: String,
        arch: String,
        cpu_info: String,
        cpu_count: usize,
        cuda_devices: Vec<String>,
        version: Version,
        webserver: Option<bool>,
    }
https://tabby.tabbyml.com/docs/administration/usage-collecti...

KronisLV

For something similar I use Continue.dev with Ollama; it's always nice to see more tools in the space! But as usual, you need pretty formidable hardware to run the actually good models, like the 32B version of Qwen2.5-Coder.

chvid

All the examples are for code that would otherwise be found in a library. Some of the code is of dubious quality.

LLMs - a spam bot for your codebase?

SOLAR_FIELDS

> How to utilize multiple NVIDIA GPUs?

| Tabby only supports the use of a single GPU. To utilize multiple GPUs, you can initiate multiple Tabby instances and set CUDA_VISIBLE_DEVICES (for cuda) or HIP_VISIBLE_DEVICES (for rocm) accordingly.

So using 2 NVLinked GPUs for inference is not supported? Or is that situation different because NVLink presents the two GPUs as a single one?

wsxiaoys

> So using 2 NVLinked GPUs for inference is not supported?

To make better use of multiple GPUs, we suggest employing a dedicated backend for serving the model. Please refer to https://tabby.tabbyml.com/docs/references/models-http-api/vl... for an example

SOLAR_FIELDS

I see. So this is like: I can have Tabby be my LLM server with this limitation, or I can just turn that feature off and point Tabby at my self-hosted LLM as any other OpenAI-compatible endpoint?

wsxiaoys

Yes - however, the FIM model requires careful configuration to properly set the prompt template.

mlepath

Awesome project! I love the idea of not sending my data to a big company and having to trust their TOS.

The effectiveness of a coding assistant is directly proportional to context length, and the open models you can run on your own computer are usually much smaller. I would love to see something more quantified about its usefulness on more complex codebases.

fullstackwife

I hope for a proliferation of 100% local coding assistants, but for now the recommendation of "works best on a $10K+ GPU" is a show stopper, and we are forced to use the "big company". :(

danw1979

It’s not really that bad. You can run some fairly big models on an Apple Silicon machine costing £2k (M4 Pro Mac Mini with 64GB RAM).

mjrpes

What is the recommended hardware? GPU required? Could this run OK on an older Ryzen APU (Zen 3 with Vega 7 graphics)?

coder543

The usual bottleneck for self-hosted LLMs is memory bandwidth. It doesn't really matter if there are integrated graphics or not... the models will run at the same (very slow) speed on CPU-only. Macs are only decent for LLMs because Apple has given Apple Silicon unusually high memory bandwidth, but they're still nowhere near as fast as a high-end GPU with extremely fast VRAM.

For extremely tiny models like you would use for tab completion, even an old AMD CPU is probably going to do okay.
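
A back-of-envelope way to see that ceiling, since each generated token has to stream roughly all of the weights from memory once (bandwidth figures are approximate; real throughput is lower):

    def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    model_gb = 4.7  # e.g. a 7B model at ~5 bits/weight; illustrative
    for name, bw in [("dual-channel DDR4 CPU (~50 GB/s)", 50),
                     ("Apple M4 Pro (~270 GB/s)", 270),
                     ("RTX 3090 (~936 GB/s)", 936)]:
        print(f"{name}: ~{decode_ceiling_tokens_per_sec(bw, model_gb):.0f} tokens/sec ceiling")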

mjrpes

Good to know. It also looks like you can host TabbyML as an on-premise server with docker and serve requests over a private network. Interesting to think that a self-hosted GPU server might become a thing.

wsxiaoys

Check https://www.reddit.com/r/LocalLLaMA/s/lznmkWJhAZ to see a local setup with 3090.

mkl

That thread doesn't seem to mention hardware. It would be really helpful to just put hardware requirements in the GitHub README.

mindcrime

Very cool. I'm especially happy to see that there is an Eclipse client[1]. One note though: I had to dig around a bit to find the info about the Eclipse client. It's not mentioned in the main readme, or in the list of IDE extensions in the docs. Not sure if that's an oversight or because it's not "ready for prime time" yet or what.

[1]: https://github.com/TabbyML/tabby/tree/3bd73a8c59a1c21312e812...

larwent

I've been using something similar called Twinny. It's a VS Code extension that connects to a locally hosted Ollama LLM of your choice and works like Copilot.

It's an extra step to install Ollama, so it's not as plug-and-play as TFA, but the license is MIT, which makes it worthwhile for me.

https://github.com/twinnydotdev/twinny

leke

So does this run on your personal machine, or can you install it on a local company server and have everyone in the company connect to it?

wsxiaoys

Tabby is engineered for team usage, intended to be deployed on a shared server. However, with robust local computing resources, you can also run Tabby on your individual machine. Check https://www.reddit.com/r/LocalLLaMA/s/lznmkWJhAZ to see a local setup with 3090.