The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

secret-noun

> we manually curated a set of over 2,000 YouTube channels that release original openly licensed content containing speech. From these channels, we retrieved and transcribed (using Whisper) over 1.1 million openly licensed videos comprising more than 470,000 hours of content.

This is why Gemini has such an advantage.

Also, link to explore data: https://huggingface.co/collections/common-pile/common-pile-v...

otherme123

The abstract is open about this data being meant for training models. But a lot of this data comes from models, like Whisper.

ACCount37

What's your concern?