ASTRA: HackerRank's coding benchmark for LLMs
6 comments
February 11, 2025
sosuke
No Hugging Face models, or did I just miss them? Edit: at the bottom of the page they mention evaluating open models at some point.
bobnamob
Seems like a very limited subset of software development to be basing a benchmark on
Where’s the kernel dev? Where’s the embedded dev? Where’s the throwaway Python script?
rokhayakebe
How will programming change when we reach 99-100%?
Bjorkbat
My belief is that software engineering benchmarks are still a poor proxy for performance on real-world software engineering tasks, and that there's a decent chance a new model could saturate a benchmark while still being kind of underwhelming.
A simple example: if a human scored 50% on SWE-bench Verified, it's fair to say that person is a very competent software engineer. Popular frontier models like Claude Sonnet and OpenAI's o3 can score 50% on SWE-bench, and even higher with special tooling, but compared to an actual human software engineer they still can't seem to competently perform a lot of programming tasks on their own.
Although, if a model did consistently score more than 99% on various software engineering benchmarks, that might be different, as it would imply a very real sense of competence. That's a pretty substantial "if", though. To my knowledge there isn't a single model out there that can consistently score more than 99% on any given benchmark. The o1 model scored very well on certain MMLU categories, 98.1% on college mathematics, but I'm not sure whether that result would hold up on a different benchmark covering similar college-level mathematics.
Also, something else to consider: we take for granted how often we're able to perform tasks with more than 99% accuracy, and how quickly things would fall apart if that weren't the case. If the average human driver made an accident-free trip only 99% of the time, that would imply they'd get in a wreck roughly every 100th time they drive their car.
Granted, software engineering might be the exception to this rule, but then again, it depends on what you're measuring. When it comes to more-or-less discrete steps, we're arguably pretty good at writing programs that capture our intent, and I could foresee an AI that only gets this right 99% of the time being a pain in the ass to work with. If a feature ticket requires 10 different sub-tasks to be done correctly, then an AI that can do each sub-task correctly 99% of the time has a roughly 90% chance of doing the whole feature ticket correctly, which is still good, but compounded over many feature tickets it could be exhausting to deal with. An AI with only a 90% chance of doing each sub-task correctly would fail this hypothetical feature ticket most of the time, with only about a 35% chance of getting all ten sub-tasks right.
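A quick sanity check of that arithmetic, assuming the sub-tasks succeed independently (a minimal C sketch, numbers rounded):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* P(all 10 independent sub-tasks succeed) = p^10 */
        printf("p = 0.99: %.3f\n", pow(0.99, 10)); /* ~0.904 */
        printf("p = 0.90: %.3f\n", pow(0.90, 10)); /* ~0.349 */
        return 0;
    }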
Mind you, statistics is not my domain so if there are any errors in my logic please correct me.
CharlieDigital
My take is that teams should start thinking about selecting for code review skills instead of pure coding skills.
The AI is going to significantly improve coding output, but at least for a while we're still going to need human shepherds to make the call on quality and to check for performance, security, and conformance to the bigger picture of the system. Maybe for longer than we think, given that we still have conductors and pilots.
The startup I was at wound down last December, and I interviewed with a handful of companies. None had a code review as part of their interview process, even though they probably already have engineers using Copilot or Cursor.
Only one company I interviewed with, a YC startup, incorporated a code review (as the first round, no less). After really enjoying that process, I ended up creating a lightweight, open-source app[0] to help teams incorporate code reviews into their interview process more easily.
QuadmasterXLII
Today, in a fit of laziness, I asked Claude for a C function to invert a matrix instead of writing it myself. It gave me a function that is wrong if malloc returns a pointer to non-zeroed memory. If it's 99%, and the remaining 1% continues to be mistakes like that, programming is going to be hell.
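For anyone curious what that class of bug looks like, here's a hypothetical sketch of the pattern (not the actual code Claude produced): building the identity half of a Gauss-Jordan augmented matrix by writing only the diagonal and trusting malloc to hand back zeroed memory.

    #include <stdlib.h>

    /* Buggy: off-diagonal entries are never written, so the "identity"
       matrix is only correct if malloc happens to return zeroed memory. */
    double *identity_buggy(int n) {
        double *id = malloc((size_t)n * n * sizeof *id); /* contents indeterminate */
        if (!id) return NULL;
        for (int i = 0; i < n; i++)
            id[i * n + i] = 1.0;
        return id;
    }

    /* Fixed: calloc zero-initializes the buffer before the diagonal is set. */
    double *identity_fixed(int n) {
        double *id = calloc((size_t)n * n, sizeof *id);
        if (!id) return NULL;
        for (int i = 0; i < n; i++)
            id[i * n + i] = 1.0;
        return id;
    }

The nasty part is that the buggy version often works anyway, since fresh pages from the OS usually arrive zeroed, so it can survive testing and only blow up later.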
We help companies hire & upskill developers. A customer recently asked: What % of HackerRank problems can LLMs solve? That got us thinking—how should hiring evolve when AI can translate natural language to code?
Our belief: AI will handle much of code generation, so developers will be assessed more on SDLC skills with AI assistants.
To explore this, we're benchmarking LLMs on real-world software development scenarios, starting with 65 unseen problems across 10 domains. Beyond correctness, we also evaluate consistency, an often overlooked aspect of AI reliability. We're open-sourcing the dataset on Hugging Face and expanding it to cover more domains, ambiguous specs, and harder challenges.
Would love the HN community’s take on this!