We Can Just Measure Things
19 comments
· June 17, 2025
Marazan
I would argue you can't compare between projects due to the reasons you state. But you can try and improve the metrics within a single project.
Cyclomatic complexity is a terrible metric to obsess over, yet on a project I was on, it was undeniably true that the newer code written by more experienced devs was both subjectively nicer and had lower cyclomatic complexity than the older code worked on by a bunch of juniors (some of those juniors had since become the experienced devs who wrote the newer code).
ToucanLoucan
Still reading TFA, but this made me rage:
> In fact, we as engineers are quite willing to subject each other to completely inadequate tooling, bad or missing documentation and ridiculous API footguns all the time. “User error” is what we used to call this, nowadays it's a “skill issue”. It puts the blame on the user and absolves the creator, at least momentarily. For APIs it can be random crashes if you use a function wrong
I recently implemented Microsoft's MSAL authentication on iOS, which includes, as you might expect, a function that retrieves the authenticated accounts. Oh sorry, I said function, but there are actually two: one that retrieves a single account, and one that retrieves multiple accounts, which is odd but harmless enough, right?
Wrong, because whoever designed this had an absolutely galaxy-brained moment and decided that if you try to retrieve one account when multiple accounts are signed in, instead of, oh I dunno, just returning an error message, or perhaps returning the most recently used account, no no no, what we should do in that case is throw an exception and crash the fucking app.
I just. Why. Why would you design anything this way!? I can't fathom any situation where you would use the one-account function when the multi-account one does the exact same fucking thing, notably WITHOUT the potential to cause a CRASH, and just returns a set of one. And further, if you were REALLY INTENT ON making one available that only returned a single account, why wouldn't it just call the other function and return Accounts.first?
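For illustration, a minimal Swift sketch (allAccounts() is the real MSAL call; the helper name and the rest are mine) of how the multi-account API plus .first already covers the single-account case without the extra crash path:

```swift
import MSAL

// Hypothetical helper: derive a single account from the multi-account API.
// allAccounts() throws for genuine failures (e.g. cache errors), but
// "more than one account signed in" is simply not a failure mode here.
func firstAccount(in application: MSALPublicClientApplication) -> MSALAccount? {
    do {
        let accounts = try application.allAccounts() // zero, one, or many
        return accounts.first // nil if none, otherwise one account
    } catch {
        print("Failed to enumerate accounts: \(error)")
        return nil
    }
}
```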
</rant>
layer8
How is an exception different from “returning an error message”?
dewey
Seems like the main differentiator is that one crashes and one doesn't. Unrelated to error message or exception.
layer8
I understood “crashing” as them not catching the exception.
Most functions can fail, and any user-facing app has to be prepared for it so that it behaves gracefully towards the user. In that sense I agree that the error reporting mechanism doesn’t matter. It’s unclear though what the difference was for the GP.
johnmaguire
I'm not sure I understand how both occurred at once. Typically an uncaught exception will result in a crash, but this would generally be considered an error at the call site (i.e. failing to handle error conditions.)
ToucanLoucan
For one: terminating execution
More importantly: why is having more than one account an "exception" at all? That's not an error or failure condition, at least in my mind. I wouldn't call our use of the framework an edge case by any means: it opens a web form in which one puts authentication details, passes through the flow, and then we are given authentication tokens and the user data we need. It's not unheard of for more than one account to be returned (especially on our test devices, which have many), and I get that the one-account function isn't suitable for handling that. My question is... why even have it then, when the multi-account one performs the exact same function, better, without an extra error condition that might arise?
kfajdsl
> For one: terminating execution
Seems like you should have a generic error handler that will at a minimum catch unexpected, unhandled exceptions with a 'Something went wrong' toast or similar?
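Something like this minimal sketch (all names are mine, not from any framework, and it only helps when the failure is a thrown Swift Error rather than an uncatchable NSException):

```swift
import UIKit

// Hypothetical helper: run a throwing call and surface any failure as a
// user-facing alert instead of letting it propagate up and take the app down.
func withErrorAlert<T>(on viewController: UIViewController,
                       _ work: () throws -> T) -> T? {
    do {
        return try work()
    } catch {
        let alert = UIAlertController(title: "Something went wrong",
                                      message: error.localizedDescription,
                                      preferredStyle: .alert)
        alert.addAction(UIAlertAction(title: "OK", style: .default))
        viewController.present(alert, animated: true)
        return nil
    }
}

// Usage: let accounts = withErrorAlert(on: self) { try application.allAccounts() }
```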
TOGoS
> why is having more than one account an "exception" at all? That's not an error or fail condition
It is if the caller is expecting there to be exactly one account.
This is why I generally like to return a set of things from any function that might possibly return zero or more than one thing. Fewer special cases that way.
But if the API of the function is to return one, then you either give one at random, which is probably not right, or throw an exception. And with the latter, the person programming the caller will be nudged towards using the other API, which is probably what they should have done anyway, and then, as you say, the returns-one-account function should probably just not exist at all.
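Roughly, the two shapes look like this (illustrative Swift types, not MSAL's):

```swift
// The collection-returning API has no plurality failure mode; the
// single-value API must fail whenever its contract can't be met.
struct Account { let username: String }

enum AccountError: Error { case noAccount, multipleAccounts }

// Zero or more accounts; the caller decides what plurality means.
func accounts() -> [Account] {
    // ... fetch from wherever accounts live ...
    return []
}

// Exactly one account, or an error: the throwing path exists precisely
// because "one" is a promise the data may not keep.
func singleAccount() throws -> Account {
    let all = accounts()
    switch all.count {
    case 0: throw AccountError.noAccount
    case 1: return all[0]
    default: throw AccountError.multipleAccounts
    }
}
```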
lostdog
A lot of the "science" we do is experimenting on bunches of humans, giving them surveys, and treating the result as objective. In how many places could we do much better by surveying a specific AI?
It may not be objective, but at least it's consistent, and it reflects something about the default human position.
For example, there are no good ways of measuring the amount of technical debt in a codebase. It's such a fuzzy question that only subjective measures work. But what if we show the AI one file at a time, ask "Rate, 1-10, the comprehensibility, complexity, and malleability of this code," and then average across the codebase? Then we get a measure of tech debt, which we can compare over time to see whether it's rising or falling. The AI makes subjective measurements consistent.
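As a rough Swift sketch of that loop (askModel is a placeholder, not a real API, and it assumes the model replies with a bare 1-10 number):

```swift
import Foundation

// Rough sketch only: rate each file via an LLM, then average the scores.
func rateCodebase(files: [URL]) async throws -> Double {
    var scores: [Double] = []
    for file in files {
        let source = try String(contentsOf: file, encoding: .utf8)
        let prompt = """
        Rate, 1-10, the comprehensibility, complexity, and malleability \
        of this code. Reply with a single number only.

        \(source)
        """
        let reply = try await askModel(prompt) // hypothetical LLM call
        if let score = Double(reply.trimmingCharacters(in: .whitespacesAndNewlines)) {
            scores.append(score)
        }
    }
    // One rating per file, averaged across the codebase; track it over time.
    return scores.isEmpty ? 0 : scores.reduce(0, +) / Double(scores.count)
}

// Placeholder: wire this up to your provider's chat-completion endpoint.
func askModel(_ prompt: String) async throws -> String { return "5" }
```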
This essay gives such a cool new idea, while only scratching the surface.
delusional
> it reflects something about the default human position
No it doesn't. Nothing that comes out of an LLM reflects anything except the corpus it was trained on and the sampling method used. That's definitionally true, since those are the very things it is a product of.
You get NO subjective or objective insight from asking the AI about "technical debt"; you only get an opaque statistical metric that you can't explain.
BriggyDwiggs42
If you knew that the model never changed it might be very helpful, but most of the big providers constantly mess with their models.
cwillu
Even if you used a local copy of a model, it would still just be a semi-quantitative version of “everyone knows ‹thing-you-don't-have-a-grounded-argument-for›”
layer8
Their performance also varies depending on load (concurrent users).
We can just measure things, but then there’s Goodhart's law.
With the proposed way of measuring code quality, it’s also unclear how comparable the resulting numbers would be between different projects. If one project has more essential complexity than another project, it’s bound to yield a worse score, even if the code quality is on par.