Building AI agents to query your databases
51 comments
· March 14, 2025
ozim
A bit of a joke here:
It solves 100% of the cases where some manager requests a dashboard, never to look at it again after one day.
aargh_aargh
There's no smoke without fire.
mritchie712
Using a semantic layer is the cleanest way to have a human in the loop. A human can validate and create all important metrics (e.g. what does "monthly active users" really mean) then an LLM can use that metric definition whenever asked for MAU.
With a semantic layer, you get the added benefit of writing queries in JSON instead of raw SQL. LLMs are much more consistent at writing a small JSON payload than hundreds of lines of SQL.
We[0] use cube[1] for this. It's the best open source semantic layer, but there are a couple of closed-source options too.
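For a sense of what that looks like, here is roughly the kind of payload an LLM emits against a semantic layer (the shape follows Cube's query format, but the member names are invented):

```python
# A semantic-layer query as a small, flat JSON object instead of raw SQL.
# Shape follows Cube's REST query format; member names like
# "users.monthly_active_users" are hypothetical.
query = {
    "measures": ["users.monthly_active_users"],
    "timeDimensions": [
        {
            "dimension": "users.created_at",
            "granularity": "month",
            "dateRange": "last 6 months",
        }
    ],
    "filters": [
        {"member": "users.plan", "operator": "equals", "values": ["pro"]}
    ],
}
# The semantic layer compiles this into the (possibly join-heavy) SQL,
# so "monthly active users" always means the same thing.
```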
zerop
Currently exploring cube for a "natural language to SQL" solution.
My schema is 90+ tables and 2,500+ columns, well documented.
From your experience, does Cube look like a fit? My use cases will definitely have JOINs.
mritchie712
yes, that shouldn't be a problem.
with that many tables, you might want to use Views: https://cube.dev/docs/reference/data-model/view
efromvt
JSON generation against a semantic layer plus validation loops is definitely the easiest way to get high-9s success going directly from text to a correct query. For the human-in-the-loop cases, going directly to SQL can be fun - I’ve toyed with a SQL-like semantic layer that removes the need for direct table access and joins, eliminating two of the risk points for LLMs going off the rails while still leveraging a lot of the baked-in knowledge of SQL syntax (window functions, transformations, etc.) that can be hard to exhaustively encode in a semantic layer. (It’s annoying when the semantic layer doesn’t quite have what you need.)
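To make the idea concrete, a purely hypothetical sketch (every name invented): the "table" below is a semantic-layer view, so the model writes familiar SQL but never sees base tables or joins:

```python
# Hypothetical SQL-like dialect over a semantic layer: one flat virtual
# relation, so there is nothing to join and no base table to touch.
semantic_sql = """
SELECT month, region, monthly_active_users
FROM metrics
WHERE month >= DATE '2025-01-01'
ORDER BY month
"""
# The layer would expand this into the real multi-table SQL; window
# functions and other stock syntax still work, because to the model it is SQL.
```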
mritchie712
duckdb has a `json_serialize_sql` function that we've been messing with. It could be an interesting middle ground. It lets you write SQL as JSON.
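A minimal sketch of the round trip (the query string is just an example):

```python
import duckdb

con = duckdb.connect()  # in-memory; the JSON functions ship with DuckDB
# Parse a SQL string into a JSON AST
ast = con.execute(
    "SELECT json_serialize_sql('SELECT count(*) AS n FROM users')"
).fetchone()[0]
# ...an LLM (or a validator) can inspect and edit the JSON here...
# Turn the (possibly edited) AST back into SQL text
sql = con.execute(
    "SELECT json_deserialize_sql(CAST(? AS JSON))", [ast]
).fetchone()[0]
print(sql)
```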
tillvz
Completely agree! A semantic layer is essential for scaling analytics to enterprise complexity.
Another alternative here is Veezoo [0], which combines the semantic layer (a Knowledge Graph) and self-service analytics into one integrated solution.
We built it specifically for the analytics use-case for both the "data persona" to manage the semantic layer, as well as for the "business persona" to analyze the data.
If you’re looking for a semantic-layer + (embedded) BI solution right out of the box, this could be a fit.
pbronez
How well is that working for you?
We use a pattern where we ETL things into tables that model the upstream source closely, then use SQL Views to tighten up the model and integrate across data sources where needed. Keeping this all inside one DB allows us to use tools that understand the schema for autocomplete, etc.
I expect the developer experience would be significantly worse if we started writing views in YAML instead of SQL… but you’ve found the opposite?
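For reference, a minimal sketch of the pattern being described (table and column names invented):

```python
import duckdb

con = duckdb.connect()  # stand-in for the single warehouse DB
# The raw table mirrors the upstream source closely
con.execute("""
    CREATE TABLE raw_orders (
        id BIGINT, amount_cents BIGINT,
        created_at TIMESTAMP, deleted_at TIMESTAMP
    )
""")
# The view tightens up the model: names, units, soft-deletes
con.execute("""
    CREATE VIEW orders AS
    SELECT id AS order_id,
           amount_cents / 100.0 AS amount_usd,
           CAST(created_at AS DATE) AS order_date
    FROM raw_orders
    WHERE deleted_at IS NULL
""")
```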
mritchie712
Do you care about self-serve analytics (i.e. allowing people that don't know SQL to explore your data)?
A semantic layer is the best way to enable self-serve analytics, but if you don't care about it, it's probably not worth the hassle.
We also use the semantic layer for other nice things like setting goals, column descriptions and other metadata.
spolu
Hi, you are right that things can go sideways fast. In practice, the data that the typical employee needs is also quite simple. So there is definitely a very nice fit for this kind of product with a large number of use cases that we see provide a lot of value internally, for employees (self-serve access to data) and data scientists (reduced load).
For complex queries/use-cases, we generally push our users to create agents that assist them in shaping SQL directly, instead of going directly from text to result/graphs. That pushes them to think more about correctness while still saving them a ton of time (the agent has access to the table schemas, etc.), but it's not a good fit for non-technical people, of course.
abirch
This works well until it doesn’t, as long as there is someone who is responsible for data correctness. E.g. the cardinality between two joined tables has to be actually enforced, rather than "there’s currently no one in the system with two locations in the employee_to_location table, so it works right now." Once that happens, this query will return the wrong employee count.
fsndz
Some people say agents can help, but it's still difficult to pull off without a robust human in the loop: https://medium.com/thoughts-on-machine-learning/build-a-text...
ritz_labringue
It does require writing good instructions for the LLM to properly use the tables, and it works best if you carefully pick the tables that your agent is allowed to use beforehand. We have many users that use it for everyday work with real data (definitely not toy problems).
iLoveOncall
If only we had a language to accurately describe what we want to retrieve from the database! Alas, one can only dream!
spolu
Yes you are perfectly right. Our product pushes users to be selective on the tables they give access to a given agent for a given use-case :+1:
The tricky part is correctly supporting multiple systems, each with its own specificities. All the way to Salesforce, which is an entirely different beast in terms of query language. We're working on it right now and will likely follow up with a blog post there :+1:
scottbcovert
Salesforce architect here (from partner firm, not the mothership directly)--Salesforce's query language, SOQL, is definitely a different beast as you say. I'd like to learn more about the issues you're having with the integration, specifically the permissions enforcement. I may be misunderstanding what you meant in the blog post, but if you're passing a SOQL query through the REST API then the results will be scoped by default to the permissions of the user that went through the OAuth flow. My email is in my profile if you're open to connecting.
troupo
> It does require writing good instructions for the LLM to properly use the tables
--- start quote ---
prompt engineering is nothing but an attempt to reverse-engineer a non-deterministic black box for which any of the parameters below are unknown:
- training set
- weights
- constraints on the model
- layers between you and the model that transform both your input and the model's output that can change at any time
- availability of compute for your specific query
- and definitely some more details I haven't thought of
https://dmitriid.com/prompting-llms-is-not-engineering
--- end quote ---
aargh_aargh
What else is engineering then if not taming the unknown and the unknowable? How is building a bridge any different? Do you know everything in advance about the composition of terrain, the traffic, the wind and the earthquakes? Or are you making educated guesses about unknown quantities to get something that fits into some parameters that are Good Enough(TM) for the given purpose?
gcanyon
My general method for things like this is to:
1. Get a .dot file of the database. Many tools will export this.
2. Open the .dot in a tool I built for the purpose.
3. Select the tables I'm interested in, and export a subset of the .dot file representing just those tables and relationships.
4. Hand that subset .dot file to the LLM and say, "given this schema, write a query -- here's what I want: <rest of the request here>"
That gets the job done 60% of the time. Sometimes, when there's an incorrect shortcut relationship resulting in the wrong join, I'll have to redirect with something like, "You need to go through <table list> to relate <table X> to <table Y>." That gets my success rate up above 95%. I'm not doing ridiculous queries, but I am doing recursive aggregations successfully.
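For illustration, the hand-off step might look like this (the .dot content and the request are invented):

```python
# A subset .dot file covering just the relevant tables (invented names)
schema_dot = """
digraph erd {
  customers [label="customers(id, name, region)"];
  orders    [label="orders(id, customer_id, total, created_at)"];
  orders -> customers [label="customer_id -> id"];
}
"""

request = "total order value per customer for the last 30 days"
prompt = (
    f"Given this schema, write a query -- here's what I want: {request}\n\n"
    f"{schema_dot}"
)
# `prompt` goes to whatever LLM you use; if it picks a bad join path,
# follow up with "you need to go through <table list>..."
```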
totalhack
You are doing a lot of the work a semantic layer would do for you. I wonder if you would have better luck having the LLM talk to a semantic layer instead of directly to the database.
jt_b
Can you talk more about what an implementation of such a semantic layer would look like?
totalhack
There are a number of semantic layer tools out there these days. Each has their own unique approach, but essentially it's a meta layer on top of your database that can be used to do things like form queries or provide a consolidated API to your data (which may be in multiple databases).
Some comments on this thread mention popular semantic layer tools like cube.dev. I also made an open source one that I use regularly, though it's currently in I-hope-to-put-more-time-into-this-someday mode. Been busy with an acquisition this year.
rishabhparikh
Not OP, but a naive guess is it would mean that you'd have your schema defined in an ORM (for example, Prisma). The advantage here is that the LLM gets context on both the schema and how the schema is used throughout the application.
MattDaEskimo
I use Weaviate and let the model create GraphQL queries to take advantage of both the semantic and data layer. Not sure how efficient it is but it's worked for me
Der_Einzige
Combining this with structured/constrained generation with grammars/pydantic can supercharge this btw
gcanyon
Can you illustrate this a little? Or should I be asking an LLM for advice? :-)
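Roughly the idea, as a minimal sketch (the wiring to a grammar/constrained-decoding backend is assumed, and all names are invented): define the output shape once, derive a JSON schema from it, and validate whatever comes back before it touches the database.

```python
from pydantic import BaseModel

class GeneratedQuery(BaseModel):
    """Shape the model is forced to produce."""
    tables_used: list[str]
    sql: str
    explanation: str

# Hand this schema to a constrained-decoding backend so the model
# can only emit JSON that matches it
json_schema = GeneratedQuery.model_json_schema()

# Validate the model's output before anything touches the database
llm_output = '{"tables_used": ["users"], "sql": "SELECT count(*) FROM users", "explanation": "row count"}'
query = GeneratedQuery.model_validate_json(llm_output)  # raises if malformed
print(query.sql)
```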
breadwinner
Something like this is what I do also. Is there another way? How are others doing it?
dimitri-vs
Export the DDL of one or more relevant tables, then run a query to get a random sample of 5 records from each table. It's really quick to gather all this, and it's enough context to handle most query-writing tasks with some guidance.
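A quick sketch of that recipe against SQLite (the database file is hypothetical; other engines expose the same through information_schema):

```python
import sqlite3

con = sqlite3.connect("app.db")  # hypothetical app database
chunks = []
for (table,) in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall():
    # DDL straight out of the catalog
    (ddl,) = con.execute(
        "SELECT sql FROM sqlite_master WHERE name = ?", (table,)
    ).fetchone()
    # Random sample of 5 rows for concrete value shapes
    rows = con.execute(
        f'SELECT * FROM "{table}" ORDER BY RANDOM() LIMIT 5'
    ).fetchall()
    chunks.append(f"{ddl}\n-- sample rows: {rows!r}")
llm_context = "\n\n".join(chunks)
```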
sitkack
Schema + samples. Same thing a skilled person would use.
monkeydust
Anyone else wishing their town halls were this sexy?
ilaksh
On the off chance anyone is interested, I just built a Supabase plugin for MindRoot: https://github.com/runvnc/mr_supabase
MindRoot: https://github.com/runvnc/mindroot
Demo of agent using it to update a database after doing research: https://www.youtube.com/watch?v=nXwwSj0KSXI
If you give it the actual database password, it will use a Postgres client to read information_schema and put it in the system message. It uses the normal Supabase library for everything else.
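Pulling the schema into the system message is essentially one catalog query; a rough sketch (DSN hypothetical, psycopg2 for illustration):

```python
import psycopg2

# Hypothetical DSN; ideally a read-only user
con = psycopg2.connect("postgresql://readonly_user@localhost/app")
cur = con.cursor()
cur.execute("""
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position
""")
lines = [f"{t}.{c}: {d}" for t, c, d in cur.fetchall()]
system_message = "You can query these columns:\n" + "\n".join(lines)
```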
mritchie712
We (https://www.definite.app/) solved this a bit differently.
We spin up a data lake and pipelines (we support 500+ integrations / connectors) to populate the data lake for you then put DuckDB on top as a single query engine to access all your data.
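The DuckDB-on-a-lake part looks something like this (paths and columns invented; each connector lands files under its own prefix):

```python
import duckdb

con = duckdb.connect()
# One query engine over every connector's output in the lake
result = con.sql("""
    SELECT channel, count(*) AS conversions
    FROM 'lake/salesforce/opportunities/*.parquet'
    WHERE close_date >= DATE '2025-01-01'
    GROUP BY channel
    ORDER BY conversions DESC
""").df()
```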
yonl
This is really interesting. At my previous company, I built a data lakehouse for operational reporting with recency prioritization (query only recent data, archive the rest). While there was no LLM integration when I left, I've learned from former colleagues that they've since added a lightweight LLM layer on top (though I suspect Dust's implementation is more comprehensive).
Our main requirement was querying recent operational data across daily/weekly/monthly/quarterly timeframes. The data sources included OLTP binlogs, OLAP views, SFDC, and about 15 other marketing platforms. We implemented a data lake with our own query and archival layers. This approach worked well for queries like "conversion rate per channel this quarter" where we needed broad data coverage (all 17 integrations) but manageable depth (a reasonable number of rows scanned).
This architecture also enabled quick solutions for additional use cases, like on-the-fly SFDC data enrichment that our analytics team could handle independently. Later, I learned the team integrated LLMs as they began dumping OLAP views inside the datalake for different query types, and eventually replaced our original query layer with DuckDB.
I believe approaches like these (what I had done as an in-house solution and what Definite may be doing more extensively) are data- and query-pattern-focused first. While it might initially seem like overkill, this approach can withstand organizational complexity, with LLMs serving primarily as an interpretation layer. From skimming the Dust blog, their approach is refreshing, though it seems their product was built primarily for LLM integration rather than focusing first on data management and scale. They likely have internal mechanisms to handle various use cases that weren't detailed in the blog.
mattpope
I feel like this misses the most useful technique for this (or really glossed over any techniques at all): flat, denormalized schemas. If you have an ENUM column, much like a human, the LLM won't have enough context to query it. And much like a human, 4-way JOINs get tricky. Having one large, sparse, matrix-like table with many boolean columns like `is_sale_complete` and `is_sale_cancelled`, data warehouse style, is straight up the key to effectively using an LLM to generate SQL.
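i.e. something like this (all names invented), so the model's query stays trivially flat:

```python
import duckdb

con = duckdb.connect()
# One wide, sparse, denormalized fact table: no ENUMs to decode,
# no 4-way joins to get wrong
con.execute("""
    CREATE TABLE sales (
        sale_id BIGINT,
        customer_name TEXT,
        product_name TEXT,
        amount_usd DOUBLE,
        is_sale_complete BOOLEAN,
        is_sale_cancelled BOOLEAN,
        is_refunded BOOLEAN
    )
""")
# The kind of query the LLM now has to produce:
con.execute("SELECT count(*) FROM sales WHERE is_sale_complete AND NOT is_refunded")
```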
runekaagaard
I use this every day: https://github.com/runekaagaard/mcp-alchemy
compacct27
I’m sorry, does this send your database to Claude?
runekaagaard
Yeah, Claude can query and read the result sets. It does not send the entire database.
gavinray
Interesting -- I work on a similar tool [1], and the JSON IR representation for your query is similar to the internal IR we used for our data connectors.
lennythedev
Just messed around with the concept last week. With a good enough schema explanation, bigger reasoning models did an amazing job. Definitely something I'm going to use in my dataflows.
gergely
How do you solve the following problems?
1. How do you ensure that the same natural language question returns the same result every time you run it?
2. How do you teach an agent/LLM the data model of the application that stores the data in the database?
zurfer
1. By having one obvious documented way of answering the questions or accepting that humans would also come to different results.
2. Investing in good data models and documenting the edge cases. Also, learning from explicit feedback is powerful.
This helps our customers at getdot.ai get highly reliable results.
The models have also gotten a lot better at making sense of weird and complex data models in the last few months.
LaGrange
I mean, this is generally repulsive, but please, I beg of you: run this exclusively against a read-only replica. You should have one for exploratory queries _anyway_, but nobody ever does that.
"Validating the query to ensure it's safe and well-formed" all I can say to that is "ROFL. LMAO."
cdblades
Yep. MCP is a project run by Anthropic (https://github.com/modelcontextprotocol) that "enables seamless integration between LLM applications and external data sources and tools."
They host a repo with lots of what they call "reference" implementations, including this one for postgres which naively takes a query from a request and shoves it at your database with no validation, escaping, anything: https://github.com/modelcontextprotocol/servers/blob/7d6cdb6...
There's an issue calling this out, and it's been labeled a bug, but still.
When you go to their documentation, the only example of actually building a server is this section where they just...use an LLM to build it. Here's their prompt:
> Build an MCP server that:
> - Connects to my company's PostgreSQL database
> - Exposes table schemas as resources
> - Provides tools for running read-only SQL queries
> - Includes prompts for common data analysis tasks
So, I think there's a fairly high chance that all of the "reference implementations" they're hosting are AI generated with basically no security considerations or code reviews made.
RajT88
Thank you.
"AI Bot, summarize the number of logins this week broken down by email address and password hash".
Sayyidalijufri
Cool
bob1029
> This abstraction shields users from the complexity of the underlying systems and allows us to add new data sources without changing the user experience.
Cursed mission. These sorts of things do work amazingly well for toy problem domains. But, once you get into more complex business involving 4-way+ joins, things go sideways fast.
I think it might be possible to have a human in the loop during the SQL authoring phase, but there's no way to do it cleanly without outside interaction in all cases.
95% correct might sound amazing at first, but it might as well be 0% in practice. You need to be perfectly correct when working with data in bulk with SQL operations.