Show HN: I built an AI dataset generator
35 comments
June 26, 2025
mritchie712
matthewhefferon
Cool, I don’t do customer-specific demos, but I like this idea. I might add this use case as an option. Thanks for sharing!
matthewhefferon
I was tired of digging through Kaggle and writing prompts over and over just to get fake data for dashboards and demos. So I built a little tool to help me out.
It uses GPT-4o to generate a detailed schema and business rules based on a few dropdowns (like business type, schema structure, and row count). Then Faker fills in the rows using those rules, which keeps it fast and cheap.
You can preview the data, export as CSV or SQL, or spin up Metabase with one click to explore the data. It’s open-source and still in its early stages, but I wanted to share it, get feedback, and see how you'd improve it.
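A minimal sketch of the schema-then-fill idea described above, not the tool's actual code. The `SCHEMA` dict stands in for what the LLM step might return, and Python's standard `random` module stands in for Faker so the example has no dependencies. All column names, rule keys, and function names here are hypothetical.

```python
import csv
import io
import random

# Hypothetical schema, standing in for what the LLM step might return.
SCHEMA = {
    "table": "orders",
    "columns": [
        {"name": "order_id", "type": "sequence"},
        {"name": "plan", "type": "choice", "values": ["free", "pro", "enterprise"]},
        {"name": "seats", "type": "int", "min": 1, "max": 50},
        {"name": "monthly_price", "type": "derived"},
    ],
    # Hypothetical business rule: price depends on plan and seat count.
    "price_per_seat": {"free": 0, "pro": 12, "enterprise": 30},
}


def generate_rows(schema, row_count, seed=0):
    """Fill rows locally from the schema; no LLM call is made per row."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for i in range(1, row_count + 1):
        row = {}
        for col in schema["columns"]:
            if col["type"] == "sequence":
                row[col["name"]] = i
            elif col["type"] == "choice":
                row[col["name"]] = rng.choice(col["values"])
            elif col["type"] == "int":
                row[col["name"]] = rng.randint(col["min"], col["max"])
        # Apply the business rule once the random fields are set.
        row["monthly_price"] = schema["price_per_seat"][row["plan"]] * row["seats"]
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Calling `to_csv(generate_rows(SCHEMA, 100))` would produce 100 rows whose prices are consistent with their plan and seat count. The point of the split is that the expensive model call happens once per schema, while row generation stays local.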
thenaturalist
Congrats, thanks for shipping and open sourcing this!
Cool to see Metabase is enabling contributions to the ecosystem this way! :)
matthewhefferon
No problem, thanks for taking a look!
paxys
Feature request - make the URL for the OpenAI API configurable. That way one can swap it out with Anthropic or any other LLM provider of their choice that provides an OpenAI-compatible API.
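Something like the following is one way the request above could be read: a sketch only, assuming the base URL comes from an environment variable. `LLM_BASE_URL` and `chat_completions_url` are hypothetical names picked for illustration, not part of the project.

```python
import os

# OpenAI's public API base; used when nothing else is configured.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def chat_completions_url(env=None):
    """Build the chat endpoint from a configurable base URL."""
    env = os.environ if env is None else env
    # LLM_BASE_URL is a hypothetical variable name chosen for this sketch.
    base = env.get("LLM_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    return f"{base}/chat/completions"
```

Any provider exposing an OpenAI-compatible `/chat/completions` route could then be selected by setting the variable, with no code change.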
matthewhefferon
I was actually thinking about this very feature in the shower this morning :)
MattSayar
I used Anthropic's new Claude API integration with artifacts to make a probably-worse version that you can play with (after logging in of course).
https://claude.ai/public/artifacts/eb7d8256-6d21-4c85-af9b-c...
I used this GitHub repo as context and Claude Opus 4 to create this artifact
NitpickLawyer
Haha, I find this kind of exercise telling for what's coming to the one-size-fits-all SaaS companies out there. I see a future where small teams can in-house the set of features they actually need, and a big drop in SaaS usage. It avoids the big vendor lock-in problems and unwanted features, and bypasses all the Accenture-style consulting fees.
MattSayar
Optimistically, this will allow smaller teams to do more, hopefully incentivizing the consulting places to help out with harder problems.
ChrisMarshallNY
I wrote a Swift CLI app to generate dummy user profiles for an app we wrote (I needed many more than we’ll actually get, and I needed screenshots for the App Store that didn’t have real user data).
It was pretty “dumb,” and used thispersondoesnotexist.com for profile pics.
smcleod
This is a bit confusing, I sort of expected it to be a bit like Kiln https://github.com/Kiln-AI/Kiln to generate datasets for AI, but it looks like the outputs are more just data / files than datasets?
jasonthorsness
AI is really good at this sort of thing; I've been using an LLM with Faker for some time to load data for demos into SingleStore: https://github.com/jasonthorsness/loadit
matthewhefferon
Nice, I like the challenge video!
jasonthorsness
Ha thanks, appreciate that, I regret the video a little as I was going through a short "a more exciting blog with videos is what the people want" phase.
reedlaw
"Dataset" connotes training data, but this seems to generate sample data, maybe for testing an application. Is there any use for synthetic datasets in ML?
Mamawho
Yes, check out Synthbyte.ai. We make training data and have all sorts of datasets, including NIH data.
klntsky
You absolutely do not need Docker as a requirement here.
wiradikusuma
"Stack: OpenAI API (GPT-4o for data generation)" -- I wonder if someday we'll have a generic API like how it's done in Java (e.g., Servlet API implemented by Tomcat, JBoss etc), so everyone can use their favorite LLM instead of having to register each provider like streaming services e.g. Disney+, Netflix, etc.
zild3d
isn't this essentially https://openrouter.ai/
matthewhefferon
I hope so. I'm already subscribed to every streaming service, and my wallet can't handle all these LLMs too.
margotli
Feels like a useful tool for anyone learning analytics or just needing sample data to test with.
hiatus
Are you affiliated with Metabase? https://news.ycombinator.com/item?id=44107584
alienbaby
Good for the shape of data, but what about the actual data? If it's entirely random then it's more of a UI demo tool than a tool to generate useful data.
I use this prompt to spin up demos for customers at https://www.definite.app/:
Then: Only takes a few minutes in Cursor, should work just as well in Claude Code. It works really well for the company's core business, but I still need to create one to populate 3rd party sources (e.g. Stripe, Salesforce, Hubspot, etc.).