
XBOW, an autonomous penetration tester, has reached the top spot on HackerOne

hinterlands

Xbow has really smart people working on it, so they're well-aware of the usual 30-second critiques that come up in this thread. For example, they take specific steps to eliminate false positives.

The #1 spot in the ranking is both more of a deal and less of a deal than it might appear. It's less of a deal in that HackerOne is an economic numbers game. There are countless programs you can sign up for, with varied difficulty levels and payouts. Most of them don't pay a whole lot and don't attract top talent in the industry. Instead, they offer supplemental income to infosec-minded school-age kids in the developing world. So I wouldn't read this as "Xbow is the best bug hunter in the US". That's a bit of a marketing gimmick.

But being "the best bug hunter" is also not a particularly meaningful objective. The real problem is that there are a lot of low-hanging bugs that need squashing, and it's hard to allocate sufficient resources to that. Top infosec talent doesn't want to do it (and there's not enough of it). Consulting companies can do it, but they inevitably end up stretching themselves too thin, so the coverage ends up being hit-and-miss. There's a huge market for tools that can find easy bugs cheaply and without too many false positives.

I personally don't doubt that LLMs and related techniques are well-tailored for this task, completely independent of whether they can outperform leading experts. But there are skeptics, so I think this is an important real-world result.

normie3000

> Top infosec talent doesn't want to do it (and there's not enough of it).

What is the top talent spending its time on?

hinterlands

Vulnerability researchers? For public projects, there's a strong preference for prestige stuff: ecosystem-wide vulnerabilities, new attack techniques, attacking cool new tech (e.g., self-driving cars).

To pay bills: often working for tier A tech companies on intellectually-stimulating projects, such as novel mitigations, proprietary automation, etc. Or doing lucrative consulting / freelance work. Generally not triaging Nessus results 9-to-5.

tptacek

Specialized bug-hunting.

UltraSane

The best paying bug bounties.

bgwalter

Maybe that is because the article is chaotic (like any "AI" article) and does not really address the false-positive issue in a well-presented manner? Or even at all?

Below, people are reading tea leaves to get any clue.

Sytten

100% agree with OP: to make a living in BBH you can't spend all day hunting on VDP programs that don't pay anything. That means those programs will have a lot of low-hanging fruit.

I don't think LLMs will replace humans; they do free up time for nicer tasks.

absurdo

> so they're well-aware of the usual 30-second critiques that come up in this thread.

Succinct description of HN. It’s a damn shame.

moktonar

While impressive, this involved a lot of manual human work to filter both the input and the output, so it is not a "fully" automated workflow, sorry. But, yeah, kudos to them.

tecleandor

First:

> To bridge that gap, we started dogfooding XBOW in public and private bug bounty programs hosted on HackerOne. We treated it like any external researcher would: no shortcuts, no internal knowledge—just XBOW, running on its own.

Is it dogfooding if you're not doing it to yourself? I'd consider it dogfooding only if they were flooding themselves with AI-generated bug reports, not other people. They're not the ones reviewing them.

Also, honest question: what does "best" mean here? The one that has sent the most reports?

jamessinghal

Their success rates on HackerOne seem to vary widely.

  22/24 (Valid / Closed) for Walt Disney

  3/43 (Valid / Closed) for AT&T

pclmulqdq

Walt Disney doesn't pay bug bounties. AT&T's bounties go up to $5k, which is decent but still not much. It's possible that the market for bugs is efficient.

monster_truck

Walt Disney's program covers substantially more surface area; there are six(?) publicly traded companies listed there. AT&T's program, in addition to covering far fewer domains & apps, has conditions and exclusions that disqualify a lot more.

The market for bounties is a circus: breadcrumbs for free work from people trying to 'make it'. It can safely be analogized to the classic trope of people who want to work in games getting paid fractional market rates for absurd amounts of QA effort. The number of vulns with a CVSS score above 8 that have floated across the front page of HN in the past year without anyone getting paid tells you that much.

thaumasiotes

> Their success rates on HackerOne seem to vary widely.

Some of that is likely down to company policies; Snapchat's policy, for example, is that nothing is ever marked invalid.

jamessinghal

Yes, I'm sure anyone with more HackerOne experience can give specifics on the companies' policies. For now, those are the most objective measures of quality we have on the reports.

inhumantsar

I think they mean dogfooding as in putting on the "customer" hat and using the product.

Seems reasonable to call that dogfooding considering that flooding themselves wouldn't be any more useful than synthetic testing and there's only so much ground they could cover using it on their own software.

If this were coming out of Microsoft or IBM or whatever then yeah, not really dogfooding.

skeptrune

I'm confused about whether or not this actually outperformed humans. The more interesting statistic would be how much money it made versus the average top-ranked HackerOne contributor.

mellosouls

Have XBow provided a link to this claim? I could only find:

https://hackerone.com/xbow?type=user

Which shows a different picture. This may not invalidate their claim (best in the US), but a screenshot can be a bit cherry-picked.

zndr

If you scroll down [the leaderboard](https://hackerone.com/leaderboard?year=2025&quarter=2&owasp=...) page to Country and select United States, xbow is currently on top.

mellosouls

Ah thanks, I think it would be useful for them to perhaps add it as a footnote or something.

Sytten

Since I am the cofounder of a mostly-manual testing company in that space, we follow the new AI hackbots closely. There is a lot of money being raised (Horizon3 at 100M, Xbow at 87M, Mindfort will probably raise soon).

The future is definitely a combination of humans and bots, like anything else; it won't replace the humans, just like coding bots won't replace devs. In fact this will allow humans to focus on the fun/creative hacking instead of the basic/boring tests.

What I am worried about is the triage/reproduction side: right now it is still mostly manual, and it is a hard problem to automate.

jp0001

I want to know how much they made in bounties versus how much they spent on compute.

The thing about bug bounties: the only way to win is not to play the game.

martinald

This does not surprise me. In a couple of 'legacy' open source projects, an LLM found DoS attacks for me within 10 minutes, with a working PoC. It crashed the server entirely. I suspect with more prompting it could have found RCE, but it was an idle shower thought to try.

While niche and not widely used, there are at least thousands of publicly available servers for each of these projects.

I genuinely think this is one of the biggest near-term issues with AI. Even if we get great AI "defence" tooling, there are just so many servers and (IoT or otherwise) devices out there, most of which are not trivial to patch. While a few niche services getting pwned probably isn't a big deal, a million niche services all getting pwned in quick succession is likely to cause huge disruption. There is so much code out there that hasn't been remotely security-checked.

Maybe the end solution is some sort of LLM-based "WAF" that ISPs deploy to inspect all traffic.
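
To sketch the shape of that idea (everything below is hypothetical; the scoring is a toy stand-in for a model, since calling a full LLM inline on every request at ISP scale would be far too slow and expensive):

  # Toy sketch of an "LLM WAF": score each request for exploit-likelihood
  # and drop anything above a threshold. A real deployment would swap the
  # keyword heuristic for a small, fast model distilled for this task.

  THRESHOLD = 0.5  # tune against your false-positive tolerance

  def score_request(raw_http_request: str) -> float:
      """Return a rough 0.0-1.0 estimate that this request is an exploit."""
      suspicious = ("../", "<script", "' OR 1=1", "/etc/passwd")
      hits = sum(marker in raw_http_request for marker in suspicious)
      return min(1.0, hits / 2)

  def allow(raw_http_request: str) -> bool:
      return score_request(raw_http_request) < THRESHOLD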

mkagenius

> XBOW submitted nearly 1,060 vulnerabilities.

Yikes, explains why my manually submitted single vulnerability is taking weeks to triage.

tptacek

The XBOW people are not randos.

lcnPylGDnU4H9OF

That's not their point, I think. They're just saying that those nearly 1,060 vulnerabilities are being processed, so theirs is being ignored (hence "triage").

tptacek

If that's all they're saying then there isn't much to do with the sentiment; if you're legit-finding #1061 after legit-findings #1-#1060, that's just life in the NFL. I took instead the meaning that the findings ahead of them were less than legit.

chc4

I'm generally pretty bearish on AI security research, and think most people don't know what they're talking about, but XBOW is frankly one of the few legitimately interesting and competent companies in the space, and their writeups and reports have good, well-thought-out results. Congrats!

ryandrake

Receiving hundreds of AI generated bug reports would be so demoralizing and probably turn me off from maintaining an open source project forever. I think developers are going to eventually need tools to filter out slop. If you didn’t take the time to write it, why should I take the time to read it?

moyix

All of these reports came with executable proof of the vulnerabilities – otherwise, as you say, you get flooded with hallucinated junk like the poor curl dev. This is one of the things that makes offensive security an actually good use case for AI – exploits serve as hard evidence that the LLM can't fake.
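
As a rough sketch of why that works (the harness and proof-token convention below are made up for illustration, not XBOW's actual pipeline), verification can be as simple as executing the submitted PoC in a disposable environment and checking for evidence the model couldn't have fabricated by just describing it:

  import subprocess

  # Made-up verification harness: run the submitted exploit and accept the
  # report only if it surfaces a proof token (e.g. the contents of a canary
  # file it should only be able to reach if the vulnerability is real).

  def verify_poc(poc_cmd: list[str], proof_token: str, timeout: int = 60) -> bool:
      try:
          result = subprocess.run(
              poc_cmd, capture_output=True, text=True, timeout=timeout
          )
      except subprocess.TimeoutExpired:
          return False  # exploit never fired; treat as unverified
      return proof_token in result.stdout

A hallucinated report simply has no command that passes this check.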

tptacek

These aren't like Github Issues reports; they're bug bounty programs, specifically stood up to soak up incoming reports from anonymous strangers looking to make money on their submissions, with the premise being that enough of those reports will drive specific security goals (the scope of each program is, for smart vendors, tailored to engineering goals they have internally) to make it worthwhile.

ryandrake

Got it! The financial incentive will probably turn out to be a double-edged sword. Maybe in the pre-AI age it was By Design to drive those goals, but I bet the ability to automate submissions will inevitably alter the rules of these programs.

I think within the next 5 years or so, we are going to see a societal pattern repeating: any program that rewards human ingenuity and input will become industrialized by AI to the point where it becomes a cottage industry of companies flooding every program with 99% AI submissions. What used to be lone wolves or small groups of humans working on bounties will become truckloads of AI generated “stuff” trying to maximize revenue.

dcminter

I'm wary of a lot of AI stuff, but here:

> What used to be lone wolves or small groups of humans working on bounties will become truckloads of AI generated “stuff” trying to maximize revenue.

You're objecting to the wrong thing. The purpose of a bug bounty programme is not to provide a cottage industry for security artisans - it's to flush out security vulnerabilities.

There are reasonable objections to AI automation in this space, but this is not one of them.

t0mas88

Might be fixable by adding a $100 submission fee that is returned when you provide working exploit code. Would make the curl team a lot of money.

triknomeister

Eventually, projects that can afford the smugness are going to charge people to be able to talk to open source developers.

tough

isn't that called enterprise support / consulting

triknomeister

This is without the enterprise.

bawolff

If you think the AI slop is demoralizing, you should see the human submissions bug bounties get.

There is a reason companies like HackerOne exist - it's because dealing with the submissions is terrible.

Nicook

Open source maintainers have been complaining about this for a while: https://sethmlarson.dev/slop-security-reports. I'm assuming the proliferation of AI will bring some significant changes for open source projects (and already has).

jgalt212

One would think if AI can generate the slop it could also triage the slop.

err4nt

How does it know the difference?

scubbo

I'm still on the AI-skeptic side of the spectrum (though shifting more towards "it has some useful applications"), but I think the easy answer is to use different models/prompts for quality-/correctness-checking than were used in generation.
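
A minimal sketch of that split (the model names and the call_model client below are placeholders, not any particular vendor's API):

  # Generator/checker separation: the model that drafts a finding never
  # grades its own work. call_model() is a placeholder for whatever LLM
  # client you actually use.

  def call_model(model: str, prompt: str) -> str:
      raise NotImplementedError("plug in a real LLM client here")

  def generate_finding(target_code: str) -> str:
      return call_model(
          "generator-model",
          f"Find a security vulnerability in this code:\n{target_code}",
      )

  def check_finding(target_code: str, finding: str) -> bool:
      verdict = call_model(
          "checker-model",  # deliberately a different model and prompt
          "You are a skeptical triager. Reply VALID or INVALID only.\n"
          f"Code:\n{target_code}\n\nClaimed finding:\n{finding}",
      )
      return verdict.strip().upper().startswith("VALID")

Executable PoCs are still the stronger filter; a checker model only helps where reports can't be mechanically verified.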

jgalt212

I think Claude, given enough time to mull it over, could probably come up with some sort of bug severity score.

teeray

You see, the dream is another AI that reads the report and writes the issue in the bug tracker. Then another AI implements the fix. A third AI then reviews the code and approves and merges it. All without human interaction! Once CI releases the fix, the first AI can then find the same vulnerability plus a few new and exciting ones.

dingnuts

This is completely absurd. If generating code is reliable, you can have one generator make the change, and then merge and release it with traditional software.

If it's not reliable, how can you rely on the written issue to be correct, or the review, and so how does that benefit you over just blindly merging whatever changes are created by the model?

tempodox

Making sense is not required as long as “AI” vendors sell subscriptions.

croes

That’s why parent wrote it’s a dream.

It’s not real.

But you can bet someone will sell that as the solution.