Classic data science pipelines built with LLMs
85 comments
February 9, 2025
owenthejumper
This hits home. I am helping someone analyze medical research data. When I helped before, a few years ago, we spent a few weeks trying to clean the data and figure out how to run the basic analysis (linear regression, etc.), only to arrive at "some" results that were never repeatable, because we learned as we built.
I am doing it again now. I used Claude to import the data from CSV into a database, then asked it to help me normalize it, which output a txt file with a lot of interesting facts about the data. Next, I asked it to write a "fix data" script that would fix all the issues I told it about.
Finally, I said "give me univariate analysis, output the results into CSV / PNG and then write a separate script to display everything in a jupyter notebook".
Weeks of work into about 2 hours...
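To make that concrete, here is a minimal sketch of the kind of script this workflow produces; the database, table, and column names are hypothetical stand-ins, not the actual medical dataset:

    # Hypothetical sketch of a "fix data" + univariate-analysis script.
    import sqlite3
    import pandas as pd
    import matplotlib.pyplot as plt

    con = sqlite3.connect("study.db")                    # the DB the CSV was imported into
    df = pd.read_sql("SELECT * FROM measurements", con)

    # "Fix data" step: the issues described to the model, encoded explicitly
    # so they stay reviewable and repeatable.
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    df = df.drop_duplicates(subset="patient_id")
    df = df[df["age"].between(0, 120)]

    # Univariate analysis: a summary table plus one histogram per numeric column.
    df.describe().T.to_csv("univariate_summary.csv")
    for col in df.select_dtypes("number").columns:
        df[col].hist(bins=30)
        plt.title(col)
        plt.savefig(f"univariate_{col}.png")
        plt.clf()

The point is less the specific lines than that every step lands in a script you can read, rerun, and version, which is what makes the results repeatable this time around.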
mritchie712
We've built a business[0] around this workflow, but for cases where the source data isn't as simple as a CSV. Think Stripe, HubSpot, Salesforce, etc., where you'd normally need to write a ton of API calls or buy something like Fivetran. The flow for Definite is:
1. Add your sources (Postgres, S3, CRM, Quickbooks, Google Sheets, etc.)
2. We deploy standard, pre-baked data models (e.g. how you calculate ARR using Stripe data; see the sketch below)
3. AI answers questions using the standard models and starts updating the model with SQL for anything that's not already answered.
We spin up a data lake to store all the data (similar to this one[1]) for our customers, so it's very cost-effective.
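As a toy illustration of point 2, a pre-baked "ARR from Stripe" model can be as small as the pandas sketch below; the subscriptions extract and its columns are hypothetical stand-ins, and real Stripe data adds edge cases (discounts, trials, usage-based plans):

    import pandas as pd

    # Hypothetical subscriptions extract landed in the data lake by a connector.
    subs = pd.read_parquet("stripe/subscriptions.parquet")

    # ARR: annualize active recurring revenue (Stripe amounts are in cents).
    active = subs[subs["status"] == "active"]
    annualize = active["interval"].map({"month": 12, "year": 1})
    arr = (active["amount_cents"] / 100 * active["quantity"] * annualize).sum()
    print(f"ARR: ${arr:,.0f}")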
DeathArrow
>Weeks of work into about 2 hours...
Only if the output from Claude is correct. If not...
voidhorse
This. I get why people have started using LLMs for this, and I think it's great in theory, but the black-box nature and the possibility of hallucination make it a non-starter for me. Having the LLM generate scripts which you can then validate for correctness seems more plausible.
I also worry that this approach will lead to a sort of further reification of data science. While things have already trended this way, data science is not about applying a few routine formulas to a data set. Done properly, it is far more exploratory and all about building an understanding of the unique properties and significance of a particular data set. I worry the use of these tools will greatly reduce the exploratory phase and lead to analyses that simply confirm biases or typical conclusions rather than yielding new insight.
huijzer
The output is not a black box. I always see myself as responsible for the output. The models give hints.
raducu
> Only if the output from Claude is correct. If not...
Had a task at work to clear unused metrics.
Exported a whole dashboard, thought about regexes to extract the metrics out of the XML (bad, I know), and asked ChatGPT to produce the one-liners to extract the data.
Got 22 used metrics.
The next day I just gave ChatGPT the whole file and asked it to spit out all the used metrics.
46 used metrics.
Asked Claude, DeepSeek, and Gemini the same question. Only Gemini messed it up, missing some and duplicating some.
Re-checked the one-liners ChatGPT produced. Turns out it (or I) messed up when I told it to generate a list of unique metrics from a file containing just the metric names, one per line. What I wanted was a script/one-liner that would print all the metric names just once (de-duplicate), and ChatGPT, taking me literally, produced a script that only prints metrics that show up exactly once in the whole file.
In the end, just asking the LLMs to simply extract the names from the Grafana dashboard worked better, parsing out expressions, producing only unique metric names and all that, but there was no way to know for sure, just that 3/4 of the LLMs producing the same output meant it was most likely correct.
I fixed the programmatic approach and got the same result, but it was a very weird feeling asking the LLMs to just give me the result of what, for me, was a whole process of many steps.
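The ambiguity is easy to reproduce. "A list of unique metrics" has two readings, and the difference is the whole 22-vs-46 discrepancy; a quick sketch (the file name is hypothetical):

    from collections import Counter

    with open("metrics.txt") as f:              # one metric name per line
        names = [line.strip() for line in f if line.strip()]

    counts = Counter(names)
    deduplicated = sorted(set(names))                                 # what I wanted
    only_singletons = sorted(n for n, c in counts.items() if c == 1)  # what the one-liner did

    # For a dump like mine: 46 names in the first list, only 22 in the second.
    print(len(deduplicated), len(only_singletons))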
HumanOstrich
Are you sure you didn't also have a bunch of typos in your prompts? ;)
owenthejumper
But I am not giving Claude a csv and saying 'clean it up'. I am asking it to write me a python script to clean it up. That way I can validate the script myself.
lyu07282
Think about it logically: are you really sure you can validate the script yourself? If it takes you weeks to do what Claude does in a few hours, that seems like misplaced confidence in your capabilities.
noja
Are you aware of this tool? https://openrefine.org
axpy906
I've come to this same conclusion. I've been able to code up something with Claude in 2 hours that would've taken me a week back in the day. I've given it canvas CSVs and seen it run analysis on them in minutes that would've taken me a day when I used to run R scripts and throw them into slides. This is probably just the beginning, too…
fermisea
Can I ask you to beta test my product? I'm building something like this, and I want to focus on medical data (from omics to RCTs).
arscan
“I am doing it again now” is the operative phrase here, I think. I've found LLMs are quite good at helping me build things much better and faster in a case like this. Maybe not so much for stuff I haven't done before, where I don't quite know what I'm trying to accomplish or what a good solution looks like.
squigz
What happens when that 'weeks of work' is just shifted into the future, as you find out the LLM made things up and you have to figure out what went wrong?
fifilura
Humans make mistakes too.
I find this "LLMs can be wrong" argument a bit tiresome, and also a bit lazy.
I feel like we have been here before. With Wikipedia. With Stack Overflow. Or with the whole debate about C/assembler vs. garbage-collected languages.
Yoric
> Humans make mistakes too.
Well, yes, but fortunately, we build computers to automate things using simple algorithms to remove the risk of such mistakes.
Except when we use LLMs, in which case we increase the risk of mistakes.
> I feel like we have been here before. With Wikipedia. With Stack Overflow. Or with the whole debate about C/assembler vs. garbage-collected languages.
Well, Wikipedia is a great tool, but it is permanently weaponized.
C/Assembler vs. garbage-collected languages was about decreasing the risk (at the cost of increasing the resource requirement), so, unless I misunderstand what you write, it kinda feels like you're arguing against your side?
squigz
Funny you mention Wikipedia, since in most professional settings (particularly research roles) you can't just cite Wikipedia. Maybe in high school that's okay, but when there are actual stakes on the table, putting some effort into your research beyond reading the Wikipedia article is probably necessary.
williamcotton
For my ETL pipelines I have not had this issue.
Cheer2171
I really don't mean this in a rude way, but if it took you a few weeks to do that on your own, you are really bad at googling for tutorials and walkthroughs. You could have watched a one hour bootcamp video and learned how to do it yourself.
What you are saying Claude helped you do is like 15 lines of python. A few weeks? 120 hours of effort?
mritchie712
the task above is not 15 lines of python with a real world dataset.
the tutorials you reference? yes, 15 lines of python when you're starting with the titanic.csv. But a real world dataset normally takes hours or days of cleaning before it's ready to run any statistical analysis on.
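the tutorials start from a table that is already clean; a real extract usually needs steps like the sketch below first (the column names and rules are hypothetical):

    import pandas as pd

    df = pd.read_csv("export.csv", dtype=str)   # read everything as text first

    # Typical real-world cleanup that titanic.csv never needs:
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    df["visit_date"] = pd.to_datetime(df["visit_date"], errors="coerce")
    df["dose_mg"] = pd.to_numeric(df["dose_mg"].str.replace(",", ""), errors="coerce")
    df["site"] = df["site"].replace({"N/A": None, "unknown": None, "": None})
    df = df.dropna(subset=["patient_id"]).drop_duplicates()

and every one of those rules comes out of questions about the data, which is where the hours and days go.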
Cheer2171
Data cleaning is hard. That is not what OP said they had Claude do. They just said Claude normalized it. Normalizing data does not take days unless you are learning to do both statistics and programming for the first time ever
erikgahner
Most of these examples/walkthroughs look like they have been generated by LLMs. They might be useful for teaching purposes, but I believe they significantly underestimate the amount of domain knowledge that often goes into data extraction, data cleaning and data wrangling.
dkarl
An LLM would need a lot of integrations to send the emails, Slack messages, and meeting invites to find out all the required domain knowledge. They're basically a full-fledged employee who could take on a management role at that point.
tsumnia
I'm not against that approach (though I am a teacher so guilty as charged).
Toy examples help teach a concept and it helps when the example is relevant to the learner's interest. However at some point, we can't design real world application examples because so much additional mess has to get thrown in there. For example, a blog for learning web development isn't really useful to many but helps outline the basics of URL parameters, GET/POST requests, database management, etc.
It is on the learner to then take those skills and use them elsewhere. Or, like I did when I was learning, ignore the blog and make your own thing while roughly following the example.
galgia
+ I assumed that most people will ctrl+a -> ctrl+c -> ChatGPT -> ctrl+v
tsumnia
I will admit that over-reliance on AI is a major issue that we're coming to terms with right now. However, to play devil's advocate, a person over-relying on stimulants can also be a bad thing.
In moderation, AI can be fine and helpful. If you're assuming AI gets to do all the work while you sit around sipping mai tais and eating bonbons, you're going to have a rough time, which is exactly what we're starting to see with students who have been Copilot-ing and GPT-ing through their classes. They're finally hitting the more complex stuff that needs creative thinking and problem-solving skills that just aren't trained yet.
galgia
You are right! This is here to be used when your resources do not allow you to build full-blown solutions. Yes, I used LLMs to help create examples from my existing code, but they are based on things I have put in production when the client's resources were limited and they wanted to move from point 0 to test out the potential of LLMs on their data.
lmeyerov
Afaict this skips the evals and alignment side of LLMs. We find result quality is where most of our AI time goes when helping teams across industries and building our own LLM tools. Calling an API and getting bad results is the easy part, while ensuring good results is the part we get paid for.
If you look at tools like dspy, even if you disagree with their solutions, much of their effort is on helping get good vs bad results. In practice, I find different LLM use cases to have different correctness approaches, but it's not that many. I'd encourage anyone trying to teach here to always include how to get good results for every method presented, otherwise it is teaching bad & incomplete methods.
plaidfuji
This is where things are headed. All that ridiculous busywork that goes into ETL and modeling pipelines… it’s going to turn into “here’s a pile of data that’s useful for answering a question, here’s a prompt that describes how to structure it and what question I want answered, and here’s my oauth token to get it done.” So much data cleaning and prep code will be scrapped over the next few years…
benrutter
I'm definitely biased because my day job is writing ETL pipelines and supporting software, and my current side project is a data contracts library for helping the above[0]. Still I'm not sure I see this happening.
80% of the focus of an ETL pipeline is on ensuring edge cases are handled appropriately (e.g. not producing models from potentially erroneous data, dead-letter queuing unknown fields, etc.).
I think an LLM would be great for "take this JSON and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables".
For areas that are reliability-focused, LLMs still need a lot more improvements to be useful.
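A minimal sketch of the kind of edge-case handling I mean, assuming a hypothetical billing-events schema: validate each record and dead-letter the failures instead of silently loading them.

    import json

    REQUIRED = {"invoice_id", "amount_cents", "currency"}

    def validate(record):
        """Return an error message, or None if the record is loadable."""
        missing = REQUIRED - record.keys()
        if missing:
            return f"missing fields: {sorted(missing)}"
        if not isinstance(record["amount_cents"], int) or record["amount_cents"] < 0:
            return "amount_cents must be a non-negative integer"
        return None

    good, dead_letter = [], []
    with open("billing_events.jsonl") as f:
        for line in f:
            record = json.loads(line)
            error = validate(record)
            (good if error is None else dead_letter).append((record, error))

    # Failures are kept, with their reason, for inspection and replay.
    with open("dead_letter.jsonl", "w") as out:
        for record, error in dead_letter:
            out.write(json.dumps({"error": error, "record": record}) + "\n")
    # `good` carries on downstream toward the auditable payment tables.

Most of the pipeline ends up being this kind of plumbing, not the transformation itself.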
timr
> I think an LLM would be great for "take this JSON and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables".
Yeah, it's great....so long as you don't care that it randomly screws up the conversion 10% of the time.
My first thought, when I saw the post title, was that this is the 2025 equivalent to people using MapReduce for a 1MB dataset. LLMs certainly have good applications in data pipelines, but cleaning structured data isn't it.
galgia
Yes, LLMs are not always the best option; they are an option. Sometimes the requirements of the project are such that they are also the best option.
There is one example, a browser that does price matching, that is impossible to do without a full-blown data science team right now: https://github.com/Pravko-Solutions/FlashLearn/tree/main/exa...
icedchai
Hah. I remember being forced to use MapReduce for a tiny dataset, back in the early 2010's. Hadoop was all the rage.
miningape
"lemme just fire up a dbt workflow to analyse this CSV file"
kipukun
For your Wimsey library, using “pipe” to validate the contracts would seem to me to drastically slow down the Polars query, because the UDF pushes the query out of Rust into Python. I think a cool direction would be to have a “compiler” which takes in a contract and spits out native queries for a variety of dataframe libraries (pandas/Polars/PySpark). It becomes harder to define how to error with a test contract, but that can be the secret sauce.
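Roughly the difference I mean, sketched with a toy validation check (assuming a recent Polars, where apply was renamed to map_elements):

    import polars as pl

    df = pl.DataFrame({"age": [34, 51, -2, 67]})

    # Python UDF: every value crosses the Rust/Python boundary.
    slow = df.select(
        pl.col("age").map_elements(lambda a: 0 <= a <= 120, return_dtype=pl.Boolean).alias("ok")
    )

    # Native expression: the whole check stays inside Polars' engine.
    fast = df.select(pl.col("age").is_between(0, 120).alias("ok"))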
benrutter
Actually, you're almost 100% describing how Wimsey works! It's using native dataframe code rather than a UDF of some kind. Under the hood it uses Narwhals, which converts Polars-style expressions into native pandas/Polars/Spark/Dask code with minimal overhead.
If you're using a lazy dataframe (via Polars, Spark, etc.), Wimsey will force collection, so that can have speed implications. The reason is that I can't yet find a cross-language way of embedding assertions that fail later down the line.
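For a feel of the pattern, a minimal sketch of writing a check once and running it against different backends via Narwhals (a trivial stand-in for a contract check, assuming current Narwhals APIs):

    import narwhals as nw
    import pandas as pd
    import polars as pl

    def column_mean(native_df, column):
        # Written once, Polars-style; Narwhals dispatches it to whatever
        # backend `native_df` actually is, with no Python UDF involved.
        df = nw.from_native(native_df)
        out = df.select(nw.col(column).mean()).to_native()
        return float(out.to_numpy()[0, 0])

    print(column_mean(pd.DataFrame({"a": [1.0, 2.0, 3.0]}), "a"))  # pandas backend
    print(column_mean(pl.DataFrame({"a": [1.0, 2.0, 3.0]}), "a"))  # Polars backend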
galgia
I believe that LLMs will become better and better in the near future, and LLM-enriched pipelines will replace classic approaches and drastically simplify ETL flows.
benrutter
You may be right! I guess we'll find out soon.
One thing I'd be wary of is what "LLM-enriched pipelines" look like. If it's "write a sentence and get a pipeline", then I think that does massively simplify the amount of work, but there's another reality where people use LLMs to get more features out of existing data, rather than doing the same transformations we do now. Under that one, ETL pipelines would end up taking more time and being more complex.
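To illustrate that second reality, a sketch of an enrichment step dropped into an otherwise ordinary pipeline; the model name, prompt, file names, and column are all placeholders:

    import pandas as pd
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    df = pd.read_csv("support_tickets.csv")

    def classify(ticket_text):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "Label the ticket as billing, bug, or other. Reply with the label only."},
                {"role": "user", "content": ticket_text},
            ],
        )
        return response.choices[0].message.content.strip().lower()

    # A new derived column: more features from the same data, but also more
    # pipeline surface area (cost, latency, non-determinism) to own and monitor.
    df["category"] = df["ticket_text"].map(classify)
    df.to_parquet("support_tickets_enriched.parquet")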
Yoric
But at what cost?
We're in an energy/environmental crisis, and we're replacing simple pipelines with (unreliable) Rube Goldberg machines?
isaacremuant
Not that I don't love LLMs and playing with them and their potential, but if we don't get proper mechanisms that ensure quality and consistency, then it's not really a substitute for what we have.
It's very easy to produce something that seemingly works but you can't attest to its quality. The problem is producing something resilient, that is easy to adapt and describes the domain of what you want to do.
If all these things are so great, then why do I still need to do so many things to integrate a big-tech cloud agent with a popular tool? Why is it so costly or limited?
UX matters, validation matters, reliability matters, cost matters.
You can't simply wish for a problem not to happen. Someone owns the troubleshooting and the modification and they need to understand the system they're trying to modify.
Replacing scrapers with LLMs is an easy and obvious thing, especially when you don't care about quality to a high degree. Other systems, such as financial ones, don't have that luxury.
drunkpotato
This is a head-scratcher of a take. Have you actually done any in-depth work on data pipelines and analytics tooling? If so, what precisely do you see LLMs making easier?
I tried using enterprise ChatGPT to write a query to load some JSON data into a data warehouse. I was impressed with how good a job it did, but it still required several rounds of refinement and hand-holding, and the end result was almost, but not quite, correct. So I'm not coming at this from the perspective of hating LLMs a priori, but I am unimpressed with the hype and over-selling of their capabilities. In the end, it was no faster than writing the query myself, but it wasn't slower either, so I can see it being somewhat helpful in limited conditions.
Unless the technology makes another quantum leap improvement at the same time the price drops like a stone, I don't see LLMs coming anywhere close to your claim.
That said, I expect to see a huge amount of snake oil and enterprise dollars wastefully burned on executive pipe dreams of "here's a pile of data now magic me a better business!" in the next few years of LLM over-hyped nonsense. There's always a quick buck to make in duping clueless execs drooling over replacing pesky, annoying, "over-paid" tech people.
robwwilliams
Let me give you a complementary perspective. Same problems all of you have, but I work in a small lab team of PhD biologists who generate huge omics data sets and even larger light-sheet microscopy and MRI datasets but don't know how to do a VLOOKUP in Excel. And who do not know the exotic acronyms LIMS, QA, QC, or SQL. Yes, really.
What do we typically do in academic biomedical research in this situation?
The lead PI looks around the lab and finds a grad student or postdoc who knows how to turn on a computer and if very lucky also has had 6 months of experience noodling around with R or Python. This grad or postdoc is then charged with running some statistical analyses without any training whatsoever in data science. What is an outlier anyway, what do you mean by “normalize”, what is metadata exactly?
You get my drift: It is newbies in data science and programming (often 40-and 50-year-olds) leading novices (20- and 30-year-olds) to the slaughter. Might contribute to some lack of replicability ;-)
And it has been this way in the majority of academic labs since I started using CP/M on an Apple II in 1980 at UC Davis in an electrophysiology lab in Psychology, to the first Macs I set up at Yale in a developmental neurobiology lab in 1984, and up to the point at which I set up my own lab in neurogenetics at the University of Tennessee with a pair of Mac IIs in 1989 and $150,000 in set-up funds, just enough for me to hire one very inexperienced technician to help me do everything.
So in this context I hope all of you can appreciate that ANY help in bringing some real data science into mom-and-pop laboratories would be a huge huge boon.
And please god, let it be FOSS.
drunkpotato
I feel you, and LLMs are no doubt a boon in tooling to help in this kind of scenario. I'm not poo-pooing LLMs in general; they are very cool! I wish they were allowed to just be very cool while we incorporate them into our tooling and workflows, rather than over-hyped.
icedchai
You have more faith in LLMs than I do. The reality is it will probably get you 70 to 80% there, then you'll spend a ton of time debugging / fixing your pipelines, only to realize it would've been simpler, faster, and more reliable to not involve an LLM in the first place.
drunkpotato
I believe that we'll learn how to incorporate LLMs to improve parts of data pipelines, particularly those that involve extracting unstructured or semistructured data into structured data, especially if it can provide a reliability score or confidence level with the extract. I'm much more skeptical of claims beyond that.
I also think there are unanswered questions about reliability, cost (dollar and energy), and AI business models; I don't think OpenAI can burn $2+ to make a dollar forever.
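Something in the direction of this sketch, where the extract carries a confidence field that gates whether it gets loaded at all; the model name is a placeholder, and a self-reported score is only a weak proxy for real calibration:

    import json
    from openai import OpenAI

    client = OpenAI()

    def extract_invoice(text):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": ("Extract vendor, total, and currency from this invoice as JSON, "
                            'and include a "confidence" value between 0 and 1:\n' + text),
            }],
        )
        record = json.loads(response.choices[0].message.content)
        # Low-confidence extracts go to manual review instead of the warehouse.
        return record if record.get("confidence", 0) >= 0.8 else None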
owenthejumper
Unless you can provide some "citation", I don't think you are right. I do this every day now and it gets me 99% of the way there with very little debugging.
icedchai
As always, "it depends." How simple are your pipelines? Single CSV? Sensible column names that are totally unambiguous? Consistent, clean data? Then LLMs are probably fine...
miningape
This is completely wrong. If anything, an increase in the usage of LLMs to generate small pipelines will lead to increased demand for professionally built pipelines, because if any small thing breaks, the dashboards/features break, which is immediately noticeable. I think you'll see a big increase in the number of models a data scientist can create, but making those Python notebooks production-ready can't be done by an LLM. That is to say, as analysts create more potential use cases, there will be more demand to get those implemented.
There's so much that goes into ensuring the reliability, scalability and monitoring of production ready data pipelines. Not to mention the integration work for each use case. An LLM will give you short term wins at the cost of long term reliability - which is exactly why we already have DE teams to support DA and DS roles.
vharuck
>This is completely wrong. If anything, an increase in the usage of LLMs to generate small pipelines will lead to increased demand for professionally built pipelines, because if any small thing breaks, the dashboards/features break, which is immediately noticeable. I think you'll see a big increase in the number of models a data scientist can create, but making those Python notebooks production-ready can't be done by an LLM. That is to say, as analysts create more potential use cases, there will be more demand to get those implemented.
I agree. There is a lot of data people want that isn't made because of labor costs. Not just in quantity, but difficulty. If you can only afford to hire one analyst, and the analyst's time is only spent on cleaning data and generating basic sums, then that's all you'll get. But if the analyst can save a lot of time with LLMs, they'll have time to handle more complicated statistics using those counts like forecasts or other models.
benjiro
> If you can only afford to hire one analyst, and the analyst's time is only spent on cleaning data and generating basic sums, then that's all you'll get. But if the analyst can save a lot of time with LLMs, they'll have time to handle more complicated statistics using those counts like forecasts or other models.
That applies to so many other jobs.
My productivity as a single IT developer making a rather large and complex system mostly skyrocketed when LLMs became actually useful (around the GPT-4 era).
Work where I might have spent hours dealing with a bug becomes maybe 10 minutes, because my brain was overlooking some obvious issue that an LLM instantly spotted (or it gave suggestions that focused me on the issue).
Implementing features that might have taken days reduces to a few hours.
The time taken to learn things massively reduces because you can ask for specific examples. A lot of open-source projects are poorly documented, missing examples, or just badly structured; just ask the LLM and it points you in the right direction.
Now... this is all from the perspective of a dev with 25+ years of experience. The issue I fear more is people who are starting out, writing code but not understanding why or how things work. I remember people, before LLMs, coming in for senior jobs who did not even have a basic understanding of SQL, because they used ORMs non-stop. But they forgot that some (or a lot) of that knowledge was not transferable to different companies that used SQL or other ORMs that worked differently.
I suspect that we are going to see a generation of employees who are so used to LLMs doing the work that they don't understand how or why specific functions or data structures are needed, and who then get stuck in hours of looping LLM questions because they cannot point the LLM at the actual issue.
At times I think I wish this had been available 20 years ago, but then I question that statement very quickly. Would I be the same dev today if I had relied non-stop on LLMs and had not gritted my teeth on issues to develop this specific skill set?
I see more productivity from senior devs, more code turnout from juniors (or code monkeys), but a gap where the skills are an issue. And let's not forget the potential issue of LLM poisoning, with years of data that feeds back on itself.
galgia
I see it as a gray area: long term there will be a need for both, and it will be just one more tool to choose from when presented with time-budget-quality constraints.
miningape
Yeah I can also see it very much depending on the demands - I'm definitely not saying every pipeline has to be the most reliable, scalable piece of software ever written.
If a small script works for you and your use case / constraints there's nothing I can say against it, but when you do grow past a certain point you'll need pipelines built in a proper way. This is where I see the increased demand since the scrappy pipelines are already proving their value.
ekianjo
This would require massively more compute than regular pipelines...
plaidfuji
(1) that delta will decrease quickly, and (2) corporations will gladly pay for compute over headcount to maintain fragile data pipelines
timr
> (1) that delta will decrease quickly
Is your data pipeline O(n^3) in the number of tokens? If not, then no, it won't.
ekianjo
The price will go down, but LLMs reaching 100% accuracy and reliability is another story. We are nowhere close right now.
galgia
If your problem is compute, you are already optimizing. This is here for all the steps before you start thinking about latency and compute. Not all use cases are made equal.
Keyframe
Not data pipelines, not yet at least, since those usually require a high degree of accuracy (depending on the company, of course). Where I see it (already) moving in is data exploration, which effectively is the data pipeline work that happens before the data pipelines are developed.
galgia
Good point! LLMs are best when you are starting from point 0.
mistrial9
No, not so simple. The simplicity of this idea is like a gravitational pull for the human mental model. Meanwhile, LLMs are like a non-reproducible cotton-candy machine. Quality will be an elusive light at the end of the tunnel, not a result, for non-trivial systems IMHO. Simple systems? Sure, but economics will assign low-skill humans to the task, and other problems emerge.
What is the intoxication that assumes the engineering disciplines are now suddenly auto-automatable?
galgia
Exactly!
fire_lake
A big song and dance to call the OpenAI REST endpoint.
hrpnk
What's missing in these examples are evals and any advice on creating a verification set that can be used to assert that the system continues to work as designed. Models and their prompting patterns change; one cannot just pretend that a one-time automation will continue to work indefinitely when the environment is constantly changing.
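Even something as small as the sketch below, run in CI or on a schedule, catches that kind of drift; golden_set.jsonl and extract_fn stand in for whatever your system actually does:

    import json

    def run_verification(extract_fn, golden_path="golden_set.jsonl", threshold=0.95):
        """Replay a hand-checked golden set and fail loudly if accuracy regresses."""
        with open(golden_path) as f:
            cases = [json.loads(line) for line in f]
        correct = sum(1 for case in cases if extract_fn(case["input"]) == case["expected"])
        accuracy = correct / len(cases)
        assert accuracy >= threshold, f"eval regression: {accuracy:.1%} < {threshold:.0%}"
        return accuracy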
refactor_master
This ETL is nice, but ours is 100k LOC, and spans multiple departments and employments, and I haven’t yet been able to make an LLM write a convincing test that wasn’t already solved by strict typing.
I’m not trying to move the goal post here, but LLMs haven’t replaced a single headcount. In fact, it’s only been helping our business so far.
3abiton
I wonder if those examples can be dumbed down even further for lower age brackets. That's one of the "powers" of LLMs.
ei625
LLMs for ETL are a good idea; they scale well. We need to find ideas that scale well to make the business viable.
android521
Is there a TypeScript equivalent?
For those interested, you can use LLMs to process CSVs in Hal9 and also generate Streamlit apps. In addition, the code is open source, so if you want to help us improve our RAG or add new tools, you are more than welcome.
- https://hal9.ai
- https://github.com/hal9ai/hal9