
DeepRAG: Thinking to Retrieval Step by Step for Large Language Models

brunohaid

Noice!

Does anyone have a good recommendation for a local dev setup that does something similar with available tools? I.e. one that incorporates a bunch of PDFs (~10,000 pages of datasheets) and other docs, plus a curl-style importer?

Trying to wean myself off the next tech molochs, ideally with local functionality similar to OpenAI's Search + Reason, and gave up on LangChain during my first attempt 6 months ago.

throwup238

Honestly you're better off rolling your own (but avoid LangChain like the plague). The actual implementation is simple but the devil is in the details - specifically how you chunk your documents to generate vector embeddings. Every time I've tried to apply general-purpose RAG tools to specific types of documents like medical records, internal knowledge bases, case law, datasheets, and legislation, it's been a mess.

Best case scenario, you can come up with a chunking strategy specific to your use case that makes it work: stuff like grouping all the paragraphs/tables about a register together, grouping tables of physical properties in a datasheet with the table title, or grouping the paragraphs of a PCB layout guideline into a single unit. You also have to figure out how much overlap to allow between the different types of chunks and how many dimensions you need in the output vectors. Then you have to link chunks together, so that when your RAG matches the register description, it knows to also pull in the chunk with the actual documentation, and the LLM gets the documentation instead of just the description. I've had to train many a classifier to get this part even remotely usable in nontrivial use cases like case law.
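To make the linking idea concrete, here's a toy in-memory sketch (all names are mine, not from any library): a matched "description" chunk carries explicit links, and retrieval follows them so the full documentation chunk reaches the LLM.

```python
# Toy sketch of chunk linking: when retrieval matches a short description
# chunk, follow its links so the detailed documentation chunk is included
# in the LLM context too. Names and schema are illustrative only.

from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    kind: str                                   # e.g. "register_description"
    links: list = field(default_factory=list)   # ids of chunks to pull in together

def expand_matches(matched_ids, store):
    """Return matched chunks plus everything they link to, deduplicated."""
    seen, out = set(), []
    stack = list(matched_ids)
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        chunk = store[cid]
        out.append(chunk)
        stack.extend(chunk.links)
    return out

store = {
    "desc:CTRL": Chunk("desc:CTRL", "CTRL register: enables the peripheral.",
                       "register_description", links=["docs:CTRL"]),
    "docs:CTRL": Chunk("docs:CTRL", "CTRL bit 0: EN ... bit 3: IRQ_EN ...",
                       "register_docs"),
}

# A vector match on the short description alone would starve the LLM of
# detail; expansion brings the full documentation chunk along with it.
context = expand_matches(["desc:CTRL"], store)
```

In practice the `links` would be produced at indexing time (by layout heuristics or the classifiers mentioned above), but the retrieval-side expansion is this simple.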

Worst case scenario, you have to finetune your own embedding model, because the colloquialisms the general-purpose ones are trained on have little overlap with how terms of art and jargon are used in the documents (this is especially bad for legal and highly technical texts IME). This generally requires thousands of examples created by an expert in the field.
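For anyone wondering what those expert-created examples look like: the common shape is (query, positive passage, hard negative) triplets, usually serialized as JSONL for whatever contrastive trainer you end up using. A minimal sketch of the format (field names are mine, adjust to your trainer):

```python
# Sketch of the training data an embedding finetune typically consumes:
# (query, positive passage, hard negative) triplets written by a domain
# expert. Field names are illustrative; real trainers expect something
# structurally equivalent.

import json

def make_triplet(query, positive, negative):
    return {"query": query, "pos": positive, "neg": negative}

triplets = [
    make_triplet(
        "estoppel by deed",  # term of art a colloquial model won't handle
        "A grantor who conveys land by deed is precluded from denying ...",
        "Promissory estoppel requires a clear and definite promise ...",
    ),
]

# One JSON object per line is the usual interchange format for finetuning.
jsonl = "\n".join(json.dumps(t) for t in triplets)
```

The hard negatives matter most: they're passages that look superficially similar (here, a different estoppel doctrine) but must not match, which is exactly where general-purpose embeddings fail on jargon.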

3abiton

I've been looking up chunking techniques, but resources on this are so scarce. What's your recommendation?

crishoj

> but avoid LangChain like the plague

Can you elaborate on this?

I have a proof-of-concept RAG system implemented with LangChain, but would like input before committing to this framework.

byefruit

> This generally requires thousands of examples created by an expert in the field.

Or an AI model pretending to be an expert in the field... (works well in a few niche domains I have used this in)

deoxykev

Don't forget to finetune the reranker too if you end up doing the embedding model. That tends to have outsized effects on performance for out of distribution content.
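For anyone unfamiliar with where the reranker sits: retrieval returns candidates cheaply, then a second model rescores each (query, passage) pair and keeps the best. A toy sketch, with simple token overlap standing in for the cross-encoder you'd actually finetune:

```python
# Toy illustration of the rerank stage: cheap retrieval produces
# candidates, then a scorer (normally a finetuned cross-encoder, here
# just token overlap) reorders them against the query.

def rerank(query, candidates, top_k=2):
    q_tokens = set(query.lower().split())

    def score(passage):
        # Stand-in for a cross-encoder relevance score.
        return len(q_tokens & set(passage.lower().split()))

    return sorted(candidates, key=score, reverse=True)[:top_k]

candidates = [
    "The CTRL register enables the peripheral clock.",
    "Layout guidelines for high-speed differential pairs.",
    "CTRL register bit fields and reset values.",
]
best = rerank("CTRL register bit fields", candidates)
```

The out-of-distribution point above is why finetuning matters here: a stock reranker scores jargon-heavy pairs almost as badly as a stock embedder does, and since it sees the full query-passage pair it has more leverage to fix once finetuned.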

kordlessagain

I’ve been working on something that provides document search for agents to call if they need the documents. Let me know if you are interested. It’s Open Source. For this many documents it will need some bucketing with semantic relationships, which I’ve been noodling on this last year. Still needs some tweaking for what you are doing, probably. Might get you further along if you are considering rolling your own…

heywoods

Could I take a look at the repo? Thanks!

jondwillis

Continue and Cline work with local models (e.g. via Ollama) and have good UX for including different kinds of context. Cursor uses remote models, but provides similar functionality.

amrrs

Sorry, just trying to clarify - why would you use Cline (which is a coding assistant) for RAG?

brunohaid

Appreciated! Didn’t know Cline already does RAG handling, thought I’d have to wire that up beforehand.

jondwillis

The title reads awkwardly to a native English speaker. A search of the PDF for "latency" returns one result, discussing how naive RAG can result in latency. What are the latency impacts and other trade-offs to achieve the claimed "[improved] answer accuracy by 21.99%"? Is there any way that I could replicate these results without having to write my own implementation?