TLDR: I’m expanding the family of text-splitting Chonky models with a new multilingual model.
You can learn more about this neural approach in a previous post: https://news.ycombinator.com/item?id=43652968
Since the release of the first DistilBERT-based model I’ve released two more models based on ModernBERT. All of these models were pre-trained and fine-tuned primarily on English text.
But recently mmBERT (https://huggingface.co/blog/mmbert) was released. This model is pre-trained on a massive dataset covering 1,833 languages, so I had the idea of fine-tuning a new multilingual Chonky model.
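At its core such a fine-tune is just a token-classification head on top of the multilingual encoder. A simplified setup sketch (the base checkpoint id and the two-label boundary scheme here are illustrative, not the exact training code):

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Illustrative id -- check the mmBERT release on the Hub for the exact checkpoint name.
    base = "jhu-clsp/mmBERT-small"

    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForTokenClassification.from_pretrained(
        base,
        num_labels=2,  # per-token labels: "inside chunk" vs. "chunk boundary"
    )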
I’ve expanded the training dataset (which previously consisted of the bookcorpus and minipile datasets) with the Project Gutenberg dataset, which contains books in a number of widespread languages.
To make the model more robust to real-world data, I removed the punctuation from the last word of every training chunk with probability 0.15 (though no ablation was done for this technique).
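Roughly, that augmentation looks like this (a simplified sketch of the idea, not the exact training code):

    import random
    import string

    def drop_final_punct(chunk: str, p: float = 0.15) -> str:
        # With probability p, strip trailing punctuation from the chunk,
        # so the model can't rely on a final "." to spot the boundary.
        if random.random() < p:
            return chunk.rstrip(string.punctuation + " ")
        return chunk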
The hard part is evaluation. Real-world data is typically OCR'ed markdown, call transcripts, meeting notes, etc., not clean book paragraphs, and I didn’t find labeled datasets like that. So I used what I had: validation splits of the already-mentioned bookcorpus and Project Gutenberg datasets, Paul Graham essays, and concatenated 20_newsgroups.
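A generic way to score a splitter on such data is boundary-level F1, assuming the predicted chunks concatenate back to the original text (a sketch, not necessarily the exact metric used):

    def boundary_f1(pred_chunks, gold_chunks):
        # Character offsets where each chunk ends, excluding the end of the text.
        def boundaries(chunks):
            offsets, pos = set(), 0
            for chunk in chunks[:-1]:
                pos += len(chunk)
                offsets.add(pos)
            return offsets

        pred, gold = boundaries(pred_chunks), boundaries(gold_chunks)
        tp = len(pred & gold)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)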
I also tried to fine-tune the bigger mmBERT model (mmbert-base), but unfortunately it didn’t go well: the metrics are oddly lower than those of the small model.
Please give it a try. I’d appreciate any feedback.
The new multilingual model: https://huggingface.co/mirth/chonky_mmbert_small_multilingua...
All the Chonky models: https://huggingface.co/mirth
Chonky wrapper library: https://github.com/mirth/chonky
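Minimal usage, assuming the wrapper interface is unchanged from the earlier releases (pass the multilingual model id from the link above via model_id; check the README for exact names):

    from chonky import ParagraphSplitter

    # Defaults to the original English model; pass model_id=... to use
    # the new multilingual checkpoint listed above.
    splitter = ParagraphSplitter(device="cpu")

    text = "Some long document to be split into semantically coherent chunks..."
    for chunk in splitter(text):
        print(chunk)
        print("--")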