
Is the doc bot docs, or not?


78 comments · July 9, 2025

shlomo_z

> so I did my customary dance of order-refund, order-refund, order-refund. My credit card is going to get locked one of these days.

I don't know the first thing about Shopify, but perhaps you can create a free "test" item so you don't actually need to make a credit card transaction.

dworks

We're going to see more and more of these, and at some point it's going to cause a big scandal that pops the current AI bubble. It's really obvious that you can't use non-deterministic systems this way, but companies are hellbent on doing it anyway. This is why I won't take a role implementing "AI" in an existing product.

crystal_revenge

I don’t understand why people seem to be attacking the “non-determinism” of LLMs. First, I think most people are confusing “probabilistic” with “non-deterministic” which have very distinct meanings in CS/ML. Non-deterministic typically entails following multiple paths at once. Consider regex matching with NFAs or even the particular view of a list as a monad. The only case where LLMs are “non-deterministic” is when using sampling algorithms like beam search where multiple paths are considered simultaneously. But most LLM usage being discussed doesn’t involve beam search.

But even if one assumes people mean “probabilistic”, that’s also an odd critique given how probabilistic software has pretty much eaten the world. Most of my career has been building reliable product using probabilistic models.

Finally, there's nothing inherently probabilistic or non-deterministic about LLM generation; these are properties of the sampler applied. I did quite a lot of LLM benchmarking in recent years and almost always used greedy sampling, both for performance (tasks like GSM8K strongly benefit from choosing the maximum-likelihood path) and for reproducibility. You can absolutely set up LLM tools that have perfectly reproducible results. LLMs have many issues, but their probabilistic nature is not one of them.
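
As a concrete illustration, here's a minimal sketch of reproducible greedy decoding with the Hugging Face transformers library (the model name is just an example):

    # Minimal sketch: deterministic greedy decoding with Hugging Face
    # transformers. The model name is illustrative; any causal LM works.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The discount code limit is", return_tensors="pt")
    # do_sample=False takes the argmax token at every step, so repeated
    # runs on the same weights produce identical output.
    output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))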

TZubiri

It's not entirely unrelated: the fact that the system is non-deterministic means it is necessarily probabilistic.

A business can reduce temperature to 0 and choose a specific seed, and that's the correct approach in most cases, but the answers might still change!

On the other hand, it's true that there is some probability independent of determinism. For example, changing the order of some words might yield different answers: the machine may be deterministic, but there are millions of ways to frame a question, and if the answer depends on trivial details of the question's formatting, there's randomness there. Similar to how there is randomness in who will win a chess match between two equally rated players, despite the game being deterministic.
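
For what it's worth, a minimal sketch of the temperature-0-plus-seed setup mentioned above, using the OpenAI Python client (model name illustrative; even then the API only promises best-effort determinism):

    # Sketch: pinning temperature and seed for best-effort reproducibility.
    # Model name is illustrative; expects OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Is the 100-code limit per call or per discount?"}],
        temperature=0,  # always prefer the most likely token
        seed=42,        # best-effort determinism, not a hard guarantee
    )
    print(resp.choices[0].message.content)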


emil_sorensen

Docs bots like these are deceptively hard to get right in production. Retrieval is super sensitive to how you chunk/parse documentation and how you end up structuring documentation in the first place (see frontpage post from a few weeks ago: https://news.ycombinator.com/item?id=44311217).

You want grounded RAG systems like Shopify's to rely strongly on the underlying documents, but still sprinkle in a bit of the magic of latent LLM knowledge too. The only way to get that balance right is evals. Lots of them. It gets even harder when you are dealing with a GraphQL schema like Shopify's, since most models struggle with that syntax more than with REST APIs.
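
As a rough illustration, an eval suite for a docs bot can be as simple as a table of real user questions plus facts a correct answer must contain, re-run on every docs or prompt change (ask_bot here is a hypothetical stand-in for the pipeline under test):

    # Rough sketch of a docs-bot eval harness. ask_bot() is a hypothetical
    # stand-in for whatever RAG pipeline is being tested.
    EVALS = [
        # (question, substrings a grounded answer must contain)
        ("What is the batch limit for discountRedeemCodeBulkAdd?", ["100"]),
        ("How many codes can one discount hold?", ["20,000,000"]),
    ]

    def run_evals(ask_bot):
        failures = []
        for question, must_contain in EVALS:
            answer = ask_bot(question)
            missing = [s for s in must_contain if s not in answer]
            if missing:
                failures.append((question, missing))
        return failures  # empty list means the suite passed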

FYI, I'm biased: founder of kapa.ai here (we build docs AI assistants for 200+ companies, incl. Sentry, Grafana, Docker, the largest Apache projects, etc.).

chrismorgan

Why do you say “deceptively hard” instead of “fundamentally impossible”? You can increase the probability it’ll give good answers, but you can never guarantee it. It’s then a question of what degree of wrongness is acceptable, and how you signal that. In this specific case, what it said sounds to me (as a Shopify non-user) entirely reasonable, it’s just wrong in a subtle but rather crucial way, which is also mildly tricky to test.

whatsgonewrongg

A human answering every question is also not guaranteed to give good answers; anyone that has communicated with customer service knows that. So calling it impossible may be correct, but not useful.

(We tend to have far fewer evals for such humans though.)

girvo

A human will tell you “I am not sure, and will have to ask engineering and get back to you in a few days”. None of these LLMs do that yet, they’re biased towards giving some answer, any answer.

bee_rider

Documentation is the thing we created because humans are forgetful and misunderstand things. If the doc bot is to be held to a standard more like some random Discord channel or community forum, it should be called something without "doc" in the name (which, fwiw, might just be a name the author of the post came up with; I don't know what Shopify calls it).

intended

This is moving the goalposts / raising a different issue. We can engage with the new point, but it concedes that doc bots are not docs.

skrebbel

Why RAG at all?

We concatenated all our docs and tutorials into a text file, piped the whole thing into the AI along with the question, and the answers are pretty great. Cost was, last I checked, roughly 50c per question. It probably scales linearly with how much documentation you have. This feels expensive, but compared to a human writing an answer it's peanuts. Plus (assuming the customer can choose between the AI and a human), it's a great customer experience because the answer arrives that much faster.
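
For the curious, a minimal sketch of that setup, assuming the concatenated docs fit in the model's context window (client and model name are illustrative):

    # Sketch of the "no RAG, just send everything" approach. Assumes the
    # concatenated docs fit within the model's context window.
    from pathlib import Path
    from openai import OpenAI

    docs = Path("all_docs.txt").read_text()  # pre-concatenated docs + tutorials
    client = OpenAI()

    def answer(question: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",  # illustrative; pick a long-context model
            messages=[
                {"role": "system",
                 "content": "Answer strictly from this documentation:\n\n" + docs},
                {"role": "user", "content": question},
            ],
            temperature=0,
        )
        return resp.choices[0].message.content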

I feel like this is a no-brainer. Tbh with the context windows we have these days, I don't completely understand why RAG is a thing anymore for support tools.

cube2222

This works as long as your docs are below the max context size (and even then, as you approach larger context sizes, quality degrades).

Re cost though, you can usually reduce the cost significantly with context caching here.

However, in general, I’ve been positively surprised with how effective Claude Code is at grep’ing through huge codebases.

Thus, I think just putting a Claude Code-like agent in a loop, with a grep tool on your docs, and a system prompt that contains just a brief overview of your product and brief summaries of all the docs pages, would likely be my go to.
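
A rough sketch of that shape, with the model call left abstract (call_llm is a hypothetical placeholder for any chat-completion API):

    # Rough sketch of an agent loop with a single grep tool over the docs.
    # call_llm() is a hypothetical placeholder for your model API of choice.
    import subprocess

    SYSTEM = ("You answer questions about our product. You have one tool: "
              "reply 'GREP <pattern>' to search the docs, or 'ANSWER: <text>' "
              "when you are done.")

    def grep_docs(pattern: str, docs_dir: str = "docs/") -> str:
        out = subprocess.run(["grep", "-rin", pattern, docs_dir],
                             capture_output=True, text=True)
        return out.stdout[:4000] or "(no matches)"

    def run_agent(question: str, call_llm, max_turns: int = 8) -> str:
        transcript = [("system", SYSTEM), ("user", question)]
        for _ in range(max_turns):
            reply = call_llm(transcript)
            transcript.append(("assistant", reply))
            if reply.startswith("ANSWER:"):
                return reply[len("ANSWER:"):].strip()
            if reply.startswith("GREP "):
                transcript.append(("user", grep_docs(reply[5:].strip())))
        return "No answer within the turn budget."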

bee_rider

Oh man, maybe this would cause people to write docs that are easy to grep through. Let’s start up that feedback loop immediately, please.

TZubiri

That is not particularly cheap, especially since it scales linearly with doc size, and therefore over time.

Additionally, answer quality degrades as you load up the context window: just because your model can handle 1M tokens doesn't mean that it WILL remember 1M tokens, just that it CAN.

RAG fixes this. In the simplest configuration, a RAG system can be an index: the only context you give the LLM is the table of contents, and you let it search through the index.

Should it be a surprise that this is cheaper and more efficient? Loading the whole context window is like a library having every book open to every page at the same time instead of using the Dewey Decimal System.
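
A sketch of that minimal configuration: the model first sees only the table of contents, picks a page, and only that page is loaded (call_llm is a hypothetical placeholder; docs are assumed to live in docs/*.md):

    # Sketch of index-as-RAG: step 1 shows the model only the table of
    # contents; step 2 loads just the page it picked. call_llm() is a
    # hypothetical placeholder for any chat-completion call.
    from pathlib import Path

    PAGES = {p.stem: p for p in Path("docs/").glob("*.md")}

    def answer(question: str, call_llm) -> str:
        toc = "\n".join(sorted(PAGES))
        page = call_llm(f"Table of contents:\n{toc}\n\n"
                        f"Which single page best answers: {question}\n"
                        f"Reply with the page name only.").strip()
        body = PAGES[page].read_text() if page in PAGES else "(page not found)"
        return call_llm(f"Documentation page:\n{body}\n\nQuestion: {question}")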

cluckindan

With RAG the cost per question would be low single-digit pennies.

Rygian

What you describe sounds like a poor man's RAG. Or a lazy man's. You're just doing the augmentation at each prompt.

emil_sorensen

Accuracy still drops hard with context length, especially in more technical domains. Plus latency and cost.

llm_nerd

What you described is RAG. Inefficient RAG, but still RAG.

And it's inefficient in two ways:

- You're using extra tokens for every query, which adds up.

- You're making the LLM less precise by overloading it with potentially irrelevant extra info, making it harder for it to find the needle-in-a-haystack answer that's actually relevant.

Filtering (e.g. embedding similarity and BM25) and re-ranking/pruning what you provide to the LLM is an optimization. It optimizes token use, processing time, and, in an ideal world, the answer itself. Most LLMs are far more effective when the retrieved context is limited to what is relevant to the question.
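
A sketch of that filtering step using the rank_bm25 and sentence-transformers packages (chunking and the final LLM prompt are assumed to happen elsewhere):

    # Sketch: BM25 prefilter, then embedding-similarity rerank, so only the
    # most relevant chunks reach the LLM. Chunking is assumed done upstream.
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer

    def top_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
        # Cheap lexical prefilter over all chunks.
        bm25 = BM25Okapi([c.lower().split() for c in chunks])
        scores = bm25.get_scores(query.lower().split())
        shortlist = [chunks[i] for i in np.argsort(scores)[::-1][:k * 4]]

        # Semantic rerank of the shortlist only.
        model = SentenceTransformer("all-MiniLM-L6-v2")
        q_vec = model.encode([query], normalize_embeddings=True)
        c_vecs = model.encode(shortlist, normalize_embeddings=True)
        sims = (c_vecs @ q_vec.T).ravel()
        return [shortlist[i] for i in np.argsort(sims)[::-1][:k]]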

TZubiri

I don't think it's RAG. RAG is specifically separating the search space from the LLM's context window or training set, and giving the LLM tools to search it at inference time.

IceDane

Because LLMs still suck at actually using all that context at once. And surely you can see yourself that your solution doesn't scale: it's great that it works for your specific case, but I'm sure you can come up with a scenario where it's just not feasible.

PeterStuer

Indeed. Dabbling in 'RAG' (which, for better or worse, has become a tag for anything involving context retrieval) for more complex documentation and more intricate questions, you very quickly realize that you need to go far beyond simple 'chunking', and end up with a subsystem that constructs more than one intricate knowledge graph to support the different kinds of questions users might ask. For example, a simple question such as "What exactly is an 'Essential Entity'?" is better handled by Knowledge Representation A, as opposed to "Can you provide a gap and risk analysis on my 2025 draft compliance statement (uploaded) in light of the current GDPR, NIS-2 and the AI Act?"

(My domain is regulatory compliance, so maybe this goes beyond pure documentation, but I'm guessing that, pushed far enough, the same complexities arise.)

dingnuts

This is sort of hilarious: to use an LLM as a good search interface, first build... a search engine.

I guess this is why Kagi Quick Answer has consistently been one of the best AI tools I use. The search is good, so their agent is getting the best context for the summaries. Makes sense.


schnable

Reminds me of when I asked Gemini how to do some stuff in Google Docs with Apps Script, and it just hallucinated the capability and the code to make it work. Turns out what I wanted to do isn't supported at all.

I feel like we aren't properly using AI in products yet.

aDyslecticCrow

I asked about a niche JSON library for C. It apparently wasn't in the training data, so it just invented how it imagined a JSON library would work.

I've also had a lot of issues with CMake, where it just invents syntax and functions. Every new question has to be asked in a fresh chat to clear the context poisoning.

It's the things that lack good docs I want to ask about. But that's where it's most likely to fail.

dingnuts

I think users should get a refund on the tokens when this happens

braebo

Yet Google raised my Workspace subscription cost by 25% last night because our current agreement is suddenly unworthy of all the new "AI value" they've added... value I didn't even know existed until I started paying for it. I don't even want to know what it is supposed to be referencing... I just want to dump it ASAP.

dsmmcken

The tool we use for our docs AI answers lets you mine that data for feature requests. It generates a report of what it didn't have answers for and summarizes them as potential feature gaps. (Or at least what it is aware it didn't have answers for).

People seem more willing to ask an AI about certain things than to be judged by asking the same question of a human, so in that regard it does seem to surface slightly different feature requests than we hear when talking to customers directly.

We use inkeep.com (not affiliated, just a customer).

rapind

> We use inkeep.com (not affiliated, just a customer).

And what do you pay? It's crazy that none of these AI CSRs have public pricing. There should just be monthly subscription tiers, which include some number of queries, and a cost per query beyond that.

hnlmorg

I've found LLMs (or at least every one I've tried this on) will always assume the customer is correct, and thus even if they're flat-out wrong, the LLM will make up some bullshit to confirm the customer is still correct.

It's great when you're looking to do creative stuff. But terrible when you're trying to confirm the correctness of an approach, or asking for support on something you weren't even aware doesn't exist.

dworks

That's because its "answers" are actually "completions". Can't escape that fact: LLMs will always "hallucinate".

xyst

> I feel like we aren't properly using AI in products yet.

Very similar sentiment at the height of the crypto/digital currency mania

Bewelge

To be fair, for me at least, that weird chat bot only appears on https://help.shopify.com/ while the technical documentation is on shopify.dev/.

Every time I land on help.shopify.com I get the feeling it's one of those "doc pages for sales people". Like it's meant to show "we have great documentation and you can do all these things" but never actually explains how to do anything.

I tried that bot a couple of months ago and it was utterly useless:

question: When using discountRedeemCodeBulkAdd there's a limit to add 100 codes to a discount. Is this a limit on the API or on the discount? So can I add 100 codes to the same discount multiple times?

answer: I wasn't able to find any results for that. Can you tell me a little bit more about what you're looking for?

Telling it more did not help. To me that seemed like the bot didn't even have access to the technical documentation. Finding it hard to believe that any search engine can miss a word like discountRedeemCodeBulkAdd if it actually is in the dataset: https://shopify.dev/docs/api/admin-graphql/latest/mutations/...

So it's a bit like asking sales people technical questions.

edit: Okay, I should have tried that before commenting. They seem to have updated it. When I ask the same question now, it answers correctly (weirdly, in German):

The limit of 100 codes when using discountRedeemCodeBulkAdd refers to the number of codes you can add in a single API call, not the total number of codes that can be associated with a discount. A discount can contain up to 20,000,000 unique discount codes. You can therefore add 100 codes at a time to the same discount until you reach the 20,000,000-code limit. Note that third-party apps or custom solutions cannot bypass or increase this limit.

~= It's a limit on the API endpoint, you can add up to 20M to a single discount.

debugnik

> weirdly in German

I keep seeing bots wrongly prompted with both the browser language and the text "reply in the user's language". So I write to a bot in English and I get a Spanish answer.

delusional

> So it's a bit like asking sales people technical questions.

Maybe that's the best anthropomorphic analogy of LLMs. Like good sales people completely disconnected from reality, but finely tuned to give you just the answer you want.

WJW

Well no, the problem was that the bot didn't give them the answer they wanted. It's more like "finely tuned to waffle around pretending to be knowledgeable, but lacking technical substance".

Kind of like a bad salesperson. The best salespeople I've had the pleasure of knowing were not afraid to learn the technical background of their products.

barrell

The best anthropomorphic analogy for LLMs is no anthropomorphic analogy :)

dworks

To be fair?

schaum

There is also https://gurubase.io/, which is sometimes used as a kind of "talk with the documentation"; it claims to validate the response somehow.

ngriffiths

The doc bot goes in the same category as asking a human who has read the docs. In order of helpfulness, you could get:

- "Oh yeah, just write this," except the person is not an expert and it's either wrong or not idiomatic

- An answer that is correct often enough to rely on

- An answer in the form "read this page", or that quotes the docs

The last one is so much better because it directly solves the problem, which is fundamentally a search problem. And it places the responsibility for accuracy where it belongs (on the written docs).

bee_rider

I think the name, doc-bot, is just bad (actually I don’t know what Shopify even calls their thing, so maybe the confusion is on the part of the author of the post, and not some misleading thing from Shopify). A bot like that could fulfill the role of the community forum, which certainly isn’t nothing! But of course it isn’t the documentation.

simonw

This is a great example of the kind of question I'd love to be able to ask these documentation bots but that I don't trust them to be able to get right (yet):

> What’s the syntax, in Liquid, to detect whether an order in an email notification contains items that will be fulfilled through Shopify Collective?

I suspect the best possible implementation of a documentation bot with respect to questions like this one would be an "agent" style bot that has the ability to spin up its own environment and actually test the code it's offering in the answer before confidently stating that it works.

That's really hard to do - Robin in this case could only test the result by placing and then refunding an order! - but the effort involved in providing a simulated environment for the bot to try things out in might make the difference in terms of producing more reliable results.

dworks

Get a second agent to validate the return from the first agent. But it might get it wrong because reasons, so you need a third agent just to make sure. And then a fourth. And so on. This is obviously not a working direction.

simonw

That's why you give them the ability to actually execute the code in a sandbox. Then it's not AI checking AI, you're mixing something deterministic into the loop.
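
A minimal sketch of that deterministic step: run the generated snippet in a subprocess with a timeout before surfacing it (real sandboxing needs far stronger isolation than this):

    # Minimal sketch of a deterministic check in the loop: execute the
    # generated snippet before surfacing it. A production sandbox needs
    # much stronger isolation (containers, gVisor, etc.).
    import subprocess, sys

    def passes_smoke_test(generated_code: str, timeout_s: int = 10) -> bool:
        try:
            result = subprocess.run(
                [sys.executable, "-c", generated_code],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0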

kmoser

That may certainly increase the agent's ability to get it right, but there will always be cases where the code it generates mimics the correct response, i.e. produces the output asked for without actually working as intended, as LLMs tend to want to please as much as to be correct.

dworks

The return may still not reflect the sandbox reality.

bravesoul2

Needing a real CC to test? That right there makes me lose respect for Shopify, if true. Even Stripe lets you test :)

Bewelge

Not sure if I'm missing something, but the way I'd always test orders is to generate a 100% discount. You don't need any payment info then. I only ever needed a CC if I wanted to actually test something relating to payment. And on test stores you can mock a CC.

bravesoul2

That's a good way too for most cases, unless you need there to be a nonzero amount.

domk

Working with Shopify is an example of something where a good mental model of how it works under the hood is often required. This type of mistake (not realising that the tag is added by an app after the order is created, and so won't be available when the confirmation email is sent) is an easy one to make, both for a human and for an LLM just reading the docs. This is where AI that just reads the available docs is going to struggle, and won't replace actual experience with the platform.

trjordan

The core argument here is: LLM docbots are wrong sometimes. Docs are not. That's not acceptable.

But that's not true! Docs are sometimes wrong too, and even more so if you count errors of omission. From a user's perspective, dense or poorly structured docs are wrong, because they lead users to think the docs don't have the answer. If they're confusing enough, they may even mislead users.

There's always an error rate. DocBots are almost certainly wrong more frequently, but they're also almost certainly much, much faster than reading the docs. Given that the standard recommendation is to test your code before jamming it into production, that seems like a reasonable tradeoff.

YMMV!

(One level down: the feedback loop for getting docbots corrected is _far_ worse. You can complain to support that the docs are wrong, and most orgs will at least try to fix it. We, as an industry, are not fully confident in how to fix a wrong LLM response reliably in the same way.)

mananaysiempre

Docs are reliably fixable, so with enough effort they will converge to correctness. Doc bots are not and will not.