The Lost Decade of Small Data?
24 comments · May 19, 2025 · Mortiffer
wodenokoto
A really big part of an in-memory, dataframe-centric workflow is how easy it is to do one step at a time and inspect the result.
With a database it is difficult to run a query, look at the result and then run a query on the result. To me, that is what is missing in replacing pandas/dplyr/polars with DuckDB.
IanCal
I'm not sure I really follow, you can create new tables for any step if you want to do it entirely within the db, but you can also just run duckdb against your dataframes in memory.
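Not in the original comment, but here is a minimal sketch of what that looks like in practice (assuming the Python duckdb package; the DataFrame and column names are made up):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "sales": [10, 20, 5]})

    # DuckDB's replacement scans let SQL reference local Python variables by name,
    # so you can query the in-memory DataFrame without creating any tables.
    result = duckdb.sql("SELECT city, SUM(sales) AS total FROM df GROUP BY city").df()
    print(result)

Each intermediate result comes back as a DataFrame, so you can inspect it and then query it again, step by step.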
rr808
Ugh, I have joined a big data team. 99% of the feeds are less than a few GB, yet we have to use Scala and Spark. It's so slow to develop and slow to run.
threeseed
a) Scala, being a JVM language, is one of the fastest around. Much faster than, say, Python.
b) How large are the remaining 1% of the feeds, and how large are the total joined datasets? Because ultimately that is what you build platforms for, not the simple use cases.
rr808
1) Yes, Scala and the JVM are fast. If we could just use that to clean up a feed on a single box, that would be great. The problem is that calling the Spark API creates a lot of complexity for developers and for the runtime platform, which is super slow. 2) Yes, for the few feeds that are a TB we need Spark. The platform really just loads from Hadoop, transforms, then saves back again.
PotatoNinja
Krazam did a brilliant video on Small Data: https://youtu.be/eDr6_cMtfdA?si=izuCAgk_YeWBqfqN
willvarfar
I only retired my 2014 MBP ... last week! It started transiently not booting and then, after just a few weeks, it switched to be only transiently booting. Figured it was time. My new laptop is actually a very budget buy, and not a mac, and in many things a bit slower than the old MBP.
Anyway, the old laptop is about par with the 'big' VMs that I use for work to analyse really big BQ datasets. My current flow is to do the 0.001% of queries that don't fit on a box on BigQuery, with just enough prepping to make the intermediate result fit on a box. Then I extract that to Parquet stored on the VM and do the analysis on the VM using DuckDB from Python notebooks.
DuckDB has revolutionised not what I can do but how I can do it. All the ingredients were around before, but DuckDB brings it together and makes the ergonomics completely different. Life is so much easier with joins and things than trying to do the same in, say, pandas.
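For illustration only (the file names and columns here are hypothetical, not from the comment above), that last step looks roughly like this in a notebook cell:

    import duckdb

    con = duckdb.connect()  # in-memory database on the VM

    # Join two Parquet extracts straight off disk; DuckDB only reads the columns it needs.
    daily = con.sql("""
        SELECT e.user_id, u.country, SUM(e.value) AS total
        FROM read_parquet('events.parquet') AS e
        JOIN read_parquet('users.parquet') AS u USING (user_id)
        GROUP BY e.user_id, u.country
    """).df()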
Cthulhu_
I still have mine, but it's languishing, I don't know what to do with it / how to get rid of it, it doesn't feel like trash. The Apple stores do returns but for this one you get nothing, they're just like "yeah we'll take care of it".
The screen started to delaminate at the edges, and the screen of its follow-up (a MBP with the Touch Bar) is completely broken (probably just the connector cable).
I don't have a use for it, but it feels wasteful just to throw it away.
HPsquared
eBay is pretty active for that kind of thing. Spares/repair.
fulafel
Related in the big-data-benchmarks-on-old-laptop department: https://www.frankmcsherry.org/graph/scalability/cost/2015/01...
zkmon
A database is not only about disk size and query performance. A database reflects the company's culture, processes, workflows, collaboration, etc. It has an entire ecosystem around it: master data, business processes, transactions, distributed applications, regulatory requirements, resiliency, Ops, reports, tooling, and so on.
The role of a database is not just to deliver query performance. It needs to fit into the ecosystem, serve its overall role across multiple facets, and deliver on a wide range of expectations, both tech and non-tech.
While the useful dataset itself may not outpace the hardware advancements, the ecosystem complexity will definitely outpace any hardware or AI advancements. Overall adaptation to the ecosystem will dictate the database choice, not query performance. Technologies will not operate in isolation.
willvarfar
And it's very much the tech culture at large that influences a company's tech choices. Those techies chasing shiny things and trying to shoehorn them into their job - perhaps cynically to pad their CVs, or perhaps generously thinking it will actually be the right thing to do - have an outsized say in how tech teams think about tech and what they imagine their job is.
Back in 2012 we were just recovering from the everything-is-XML craze, in the middle of the NoSQL craze, and everything was web-scale, distribute-first microservices, etc.
And now, after all that mess, we have learned to love what came before: namely, please please please just give me sql! :D
threeseed
Why don't you just quietly use SQL instead of condescendingly lecturing others about how compromised their tech choices are?
NoSQL (e.g. Cassandra, MongoDB) and microservices were invented to solve real-world problems, which is why they are still so heavily used today. And the criticism of them is exactly the same as what was levelled at SQL back in the day.
It's all just tools at the end of the day and there isn't one that works for all use cases.
kukkeliskuu
Around 20 years ago I was working for a database company. During that time, I attended SIGMOD, which is the top conference for databases.
The keynote speaker for the conference was Stonebraker, who started Postgres, among other things. He talked about the history of relational databases.
At that time, XML databases were all the rage -- now nobody remembers them. Stonebraker explained that there is nothing new in hierarchical databases. There was a significant battle in SIGMOD, I think somewhere in the 1980s (I forget the exact time frame), between network databases and relational databases.
The relational databases won that battle, as they have won against each competing hierarchical database technology since.
The reason is that relational databases are based on relational algebra. This has very practical consequences, for example you can query the data more flexibly.
When you use JSON storage such as MongoDB, once you decide on your root entities you are stuck with that decision. I see very often in practice that new requirements come up that you did not foresee and then have to work around.
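As a toy illustration of "query the data more flexibly" (the tables and columns here are invented, and any relational engine would do; DuckDB is used only to keep the examples consistent):

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    con.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total DOUBLE)")

    # Orders per customer...
    con.sql("""
        SELECT c.name, COUNT(o.id) AS n_orders
        FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
    """)

    # ...or big spenders across all orders: same tables, no privileged root entity,
    # and no re-nesting of documents when a new question comes along.
    con.sql("""
        SELECT c.name
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
        HAVING SUM(o.total) > 100
    """)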
I don't care what other people use, however.
zwnow
No, a database reflects what you make out of it. Reports are just queries after all. I don't know what all the other stuff you named has to do with the database directly. The only purpose of databases is to store and read data, that's what it comes down to. So query performance IS one of the most important metrics.
DonHopkins
You can always make your data bigger without increasing disk space or decreasing performance by making the font size larger!
querez
> The geometric mean of the timings improved from 218 to 12, a ca. 20× improvement.
Why do they use the geometric mean to average execution times?
ayhanfuat
It's a way of saying that twice as fast and twice as slow have equal effect, on opposite sides. If your baseline is 10 seconds, one benchmark takes 5 seconds, and another one takes 20 seconds, then the geometric mean gives you 10 seconds as the result, because they cancel each other out. The arithmetic mean would treat it differently, because in absolute terms a 10-second slowdown is bigger than a 5-second speedup. But that is not fair to speedups, because the absolute speedup you can reach is at most 10 seconds, while the slowdown has no limit.
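Not from the thread, just a quick numeric check of that example in Python, using the numbers above:

    from statistics import geometric_mean, mean

    times = [5, 20]               # baseline is 10 seconds
    print(geometric_mean(times))  # 10.0 -> the 2x speedup and 2x slowdown cancel out
    print(mean(times))            # 12.5 -> the arithmetic mean is pulled up by the slow run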
willvarfar
Squaring is a really good way to make the common-but-small numbers have bigger representation than the outlying-but-large numbers.
I just did a quick google and the first real result was this blog post, with a good explanation and some nice illustrations: https://jlmc.medium.com/understanding-three-simple-statistic...
It's the very first illustration at the top of that blog post that 'clicks' for me. Hope it helps!
The inverse is also good: mean-square-error is a good way to compare how similar two datasets (e.g. two images) are.
yorwba
The geometric mean of n numbers is the n-th root of the product of all numbers. The mean square error is the sum of the squares of all numbers, divided by n. (I.e. the arithmetic mean of the squares.) They're not the same.
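A quick way to see that they are different quantities, using those two definitions and the numbers from earlier in the thread (my own illustration, not from the comment):

    from statistics import geometric_mean

    xs = [5, 20]
    gm = geometric_mean(xs)                 # (5 * 20) ** 0.5 = 10.0
    mse = sum(x * x for x in xs) / len(xs)  # (25 + 400) / 2  = 212.5
    print(gm, mse)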
mediumsmart
I am on the late 2015 version and I have an eBay body stashed for when the time comes to refurbish that small data machine.
selimthegrim
Any good keywords to search?
drewm1980
I mean, not everyone spent their decade on distributed computing. Some devs with a retrogrouch inclination kept writing single-threaded code in native languages on a single node. Single-core clock speed stagnated, but it was still worth buying new CPUs with more cores because they also had more cache, and all the extra cores are useful for running ~other people's bloated code.
The R community has been hard at work on small data. I still much prefer working on in-memory data in R; dplyr and data.table are elegant and fast.
The CRAN packages are all high quality: if the maintainer stops responding to emails for 2 months, your package is automatically removed. Most packages come from university profs who have been doing this their whole career.