HN

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

3 comments

September 19, 2025

secret-noun

> we manually curated a set of over 2,000 YouTube channels that release original openly licensed content containing speech. From these channels, we retrieved and transcribed (using Whisper) over 1.1 million openly licensed videos comprising more than 470,000 hours of content.

This is why Gemini has such an advantage.

Also, here is a link to explore the data: https://huggingface.co/collections/common-pile/common-pile-v...

otherme123

The abstract is open about this data being meant for training models. But a lot of this data comes from models, like Whisper.

ACCount37

What's your concern?