Implementing DeepSeek R1's GRPO algorithm from scratch
3 comments
April 13, 2025
cubefox
xcodevn
Author here: (1) We didn't remove the stddev term. (2) We use a token-level loss (every token gets the same weight), which is very similar to what Dr. GRPO does. However, we compute the mean gradient per token, while Dr. GRPO computes the sum; ordinarily the two are equivalent up to a constant scale. But since we also do gradient accumulation over micro-batches to reduce memory usage during training, this introduced a bug in our implementation: tokens in short sequences get more weight than tokens in long sequences.
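A toy illustration of the weighting issue (made-up numbers, not the actual training code): with a per-micro-batch mean loss, each micro-batch contributes equally to the accumulated loss regardless of how many tokens it contains, so a token in a short sequence ends up with a larger effective weight than a token in a long one.

```python
# Toy numbers, purely illustrative: token counts of two micro-batches
# (e.g. one short completion and one long completion).
micro_batch_token_counts = [10, 1000]
num_micro_batches = len(micro_batch_token_counts)

# Buggy accumulation: each micro-batch's loss is the mean over its own
# tokens, and the accumulated loss averages over micro-batches.
# Effective weight of a single token in micro-batch i:
#   1 / (num_micro_batches * tokens_in_micro_batch_i)
buggy_weights = [1 / (num_micro_batches * n) for n in micro_batch_token_counts]
print(buggy_weights)   # [0.05, 0.0005] -> short-sequence tokens weighted 100x more

# Intended behaviour: every token in the global batch has the same weight,
#   1 / total_tokens
total_tokens = sum(micro_batch_token_counts)
print(1 / total_tokens)  # ~0.00099 for every token, long or short
```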
Interestingly, this is the same bug that most open-source LLM training frameworks (such as HF Trainer) had and only recently fixed.
In short, I'm working on a quick fix; after that, using sum or mean should yield equivalent results.
P.S. Fixed!
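For concreteness, here is a minimal sketch of one way such a fix can look (hypothetical names, not the post's actual code, with a squared-error stand-in for the GRPO token-level objective): each micro-batch contributes the sum of its per-token losses divided by the total token count of the whole accumulation window, so the accumulated gradient matches a single global per-token mean.

```python
import torch

def per_token_loss(model, inputs, targets):
    # Stand-in for the GRPO token-level objective: squared error per token,
    # just so this example is self-contained and runnable.
    return (model(inputs) - targets) ** 2

def accumulate_grads(model, micro_batches, total_tokens):
    """Sum each micro-batch's token losses and divide by the global token
    count, so every token gets weight 1 / total_tokens regardless of how
    tokens are split across micro-batches."""
    model.zero_grad()
    for inputs, targets in micro_batches:
        loss = per_token_loss(model, inputs, targets).sum() / total_tokens
        loss.backward()  # gradients accumulate in .grad across micro-batches

# Tiny demo: one "short" and one "long" micro-batch.
model = torch.nn.Linear(1, 1, bias=False)
short = (torch.randn(10, 1), torch.randn(10, 1))
long_ = (torch.randn(1000, 1), torch.randn(1000, 1))
accumulate_grads(model, [short, long_], total_tokens=10 + 1000)
```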
cubefox
Cool!
I wonder whether they implemented the GRPO correction from this paper, which fixes overly long response lengths: https://arxiv.org/abs/2503.20783
I guess probably not, as they don't mention it.