Project Glasswing and the End of the Annual Pen Test

Project Glasswing is everything Anthropic said it was. A frontier-capable model, gated behind a coalition of fifty organisations and $100m in controlled-access credits, was turned loose on decades of open-source infrastructure and immediately found bugs that automated testing had run past roughly five million times without catching. A 27-year-old flaw in OpenBSD. A 16-year-old one in FFmpeg. Alex Albert, Anthropic’s head of developer relations, called it “possibly the most consequential event in the AI industry I’ve seen up close since joining Anthropic.” The New York Times’ Kevin Roose described the underlying model, Claude Mythos, as “so powerful that Anthropic is not releasing it to the public.” Both framings are accurate.

And none of it changes the fact that most of the software your customers touch this quarter is not OpenBSD and is not FFmpeg. It’s the bespoke Laravel app your team shipped last month, the SaaS integration somebody wired up over a weekend, and the admin console that was never supposed to be externally reachable but now is. That’s what an application-layer pen test is for. Glasswing doesn’t test it. Mythos doesn’t test it. Nothing Anthropic is announcing this spring tests it.

So the question UK security teams are starting to ask out loud is the right one: if frontier models are compressing the window between “bug exists” and “bug is being exploited” from months to minutes, what does that do to the cadence and shape of the work we were already doing?

The clock on every known CVE just got faster

Start with what Glasswing actually did. It didn’t invent new vulnerability classes. It re-read code that had been sitting in the open for decades, noticed things fuzzers and linters hadn’t, and wrote proof-of-concept exploits fast enough that the only remaining constraint was compute. That’s the interesting part. The findings themselves are almost beside the point.

The implication — the one Anthropic is tiptoeing around, reasonably, because they don’t want to arm the world — is that whatever Glasswing can do on BSD source, a less-gated model will be able to do on your public repositories within a year. Probably less. The open-source frontier tends to run eighteen months behind the labs, sometimes closer, and there are already research teams publishing on automated exploit generation at a pace that suggests the methodology is general, not BSD-shaped.

What that looks like in practice: a CVE landing against a library you depend on used to be a thing your vendor bulletin flagged, your team reviewed on Tuesday, and somebody patched by the end of the sprint. That window, the half-life of a freshly disclosed CVE, was already measured in days, not weeks. Glasswing is the public confirmation that it’s now measured in hours.

None of that is about your bespoke code. Yet.

Glasswing tests the platform. Your pen test tests the promise.

An application-layer pen test does not look for buffer overflows in libc. It looks at the contract your product makes with its users and checks whether the contract holds. That is a completely different job, and it is one no foundation model can do from the outside without the things only you have: a set of authenticated sessions, a multi-tenant data model, a threat model, and a list of features you’ve made explicit promises about.

A Glasswing-class model can tell you whether your Apache build has a known issue. It cannot tell you:

  • Whether your SaaS’s “organisation” boundary leaks when a user is moved between tenants, and whether soft-deleted members can still read last month’s invoices
  • Whether your admin-impersonation feature logs consistently across every downstream service, or whether one of them silently attributes the request to the impersonated user
  • Whether the OAuth scopes you granted your Copilot-class integrations are enforced on the back end, or whether the front end is the only thing stopping someone from pulling everything
  • Whether your “soft delete” actually deletes the row you’d need to be gone for GDPR Article 17, rather than just flipping a visibility flag
  • Whether the rate limiter on /login also applies to the password-reset flow that issues a token over email
  • Whether a tester using your public API, with a valid but low-privilege token, can reach a workflow that was supposed to be admin-only because one endpoint forgot to re-check the role
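
That last failure mode, the endpoint that forgot to re-check the role, is easy to make concrete. Below is a minimal tester-side sketch in Python; the base URL, token, and endpoint list are hypothetical placeholders for your own product.

```python
# Probe admin-only workflows with a valid but low-privilege token.
# BASE, LOW_PRIV_TOKEN, and ADMIN_ONLY are hypothetical placeholders.
import requests

BASE = "https://app.example.com/api"
LOW_PRIV_TOKEN = "token-for-an-ordinary-member"  # valid, but not admin

ADMIN_ONLY = [
    "/admin/export",
    "/admin/export/status",  # the endpoint somebody forgot about
    "/admin/users",
]

for path in ADMIN_ONLY:
    resp = requests.get(
        BASE + path,
        headers={"Authorization": f"Bearer {LOW_PRIV_TOKEN}"},
        timeout=10,
    )
    # Anything other than 401/403 means the role was not re-checked
    # server-side on this endpoint.
    verdict = "ok" if resp.status_code in (401, 403) else "FINDING"
    print(f"{verdict:7}  {resp.status_code}  {path}")
```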

These are the bugs that turn into breaches. They are not in any CVE database. They will not be found by any model — frontier or otherwise — without the same authenticated, in-context access the tester has. And they are specific to the product you built, in the shape you built it.

The annual pen test was already a snapshot. Now it’s a slower snapshot.

Here’s the uncomfortable part. The traditional UK cadence — one CREST-aligned engagement a year, a couple of retests, maybe a scope refresh before a major release — was built for a world where the attack surface you signed off in January was still broadly the attack surface you were running in November.

That world is going away. Not because Glasswing exists, but because everything downstream of Glasswing is going to.

Two things are happening in parallel. First, your own team is shipping faster. If your developers use Copilot-class suggestion tools inside the IDE, the rate of code reaching production has gone up, and the distribution of that code has changed. We wrote about this in AI-generated code security risks: it is not that AI code is worse, it is that it repeats a small set of plausible-looking mistakes at volume. The shape of the bug you would find in month nine is no longer the shape of the bug you would have found in month two.
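
One hypothetical instance of the pattern, in the shape it usually arrives: a suggestion that reads fine in review because one of its two inputs is bound correctly. The function and schema here are invented for illustration.

```python
import sqlite3

def fetch_invoices(db: sqlite3.Connection, tenant_id: str, status: str):
    # Plausible-looking: 'status' is bound as a proper parameter...
    return db.execute(
        f"SELECT * FROM invoices WHERE tenant_id = '{tenant_id}' "
        "AND status = ?",
        (status,),
    ).fetchall()
    # ...but 'tenant_id' is interpolated straight into the SQL. One
    # instance is a review miss; Copilot-era volume makes it a pattern.
```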

Second, the attacker’s tooling has improved faster than your testing cadence. A reconnaissance run that used to take a bored student a weekend now takes a scheduler and a mid-tier model roughly twelve minutes. Enumeration, credential-stuffing, parameter-fuzzing, even basic authorisation-logic probing — all of it is automated to a degree that was theoretical two years ago.

A snapshot test that reflects January’s attack surface, evaluated with January’s attacker tooling, is a thinner guarantee in April than it used to be. By October it’s closer to a receipt.

What “tested” has to mean in 2026

The honest reframing — and it is one a lot of CREST firms, us included, are now having with clients — is that the annual engagement is a floor, not a ceiling. It is what you do to prove you have a programme. It is not the programme.

A programme that holds up against the new clock looks something like this:

  • Scope that reflects the product today, not the product the quote was written against. If you have shipped a new SSO integration, an admin panel, an API v2, or a mobile client since your last engagement, your scope is not the same scope. It does not matter that the SOW was signed against the old one.
  • Retests that validate controls, not just close findings. Finding a bug, fixing it, and retesting the fix is table stakes. Retesting whether the control that should have caught the bug is now actually in place across the surface — that is what you’re paying for.
  • Continuous, lightweight coverage between engagements. This does not have to be expensive. A quarterly external attack-surface sweep, a monthly authenticated differential scan against the live app (sketched after this list), and a documented path for dev to request a targeted review when a feature touches a sensitive boundary will do more than most organisations’ annual test on its own.
  • A threat model your team can read without a translator. If the model in your tester’s head only lives in a PDF your developers never open, the findings will not land. The pen tests that move the needle in 2026 are the ones where the same document gets updated by both sides.
  • Explicit guidance on AI-generated code paths. This is where the Glasswing-era threat model touches your product directly. If Copilot, Cursor, or a Claude Code agent contributed meaningfully to a feature, note it — not to blame AI, but to flag that the bug distribution on that PR is different and the review criteria should be different.
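
The “authenticated differential scan” item above is the one people assume needs a product; a minimal version is a script. Here is a sketch, assuming your app publishes an OpenAPI spec; the spec URL, token, and snapshot path are placeholders for your own.

```python
# Snapshot the API surface this month, diff it against last month's,
# and flag anything new. SPEC_URL, TOKEN, and SNAPSHOT are hypothetical.
import json
import pathlib

import requests

SPEC_URL = "https://app.example.com/api/openapi.json"
TOKEN = "low-privilege-session-token"
SNAPSHOT = pathlib.Path("surface-snapshot.json")
METHODS = {"get", "post", "put", "patch", "delete"}

spec = requests.get(
    SPEC_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
).json()

# "METHOD /path" for every operation the spec currently exposes.
now = {
    f"{method.upper()} {path}"
    for path, ops in spec.get("paths", {}).items()
    for method in ops
    if method in METHODS
}
before = set(json.loads(SNAPSHOT.read_text())) if SNAPSHOT.exists() else set()

for route in sorted(now - before):
    print(f"NEW SURFACE: {route}")  # shipped since last run: triage it
for route in sorted(before - now):
    print(f"GONE: {route}")         # removed: update the threat model

SNAPSHOT.write_text(json.dumps(sorted(now)))
```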

None of that is a sales pitch for more pen testing. Half of it is a sales pitch for less, better-targeted testing and better instrumentation in between. That is usually what we end up recommending when somebody brings us a three-year-old SOW and asks us to retest “everything”.

The part Anthropic got right, and the part that is still on you

Project Glasswing is a genuine contribution. Using frontier AI to find decades-old vulnerabilities in critical open-source infrastructure, before somebody hostile does, is exactly the kind of asymmetric defence the industry has been promising itself for years. The coalition model — fifty organisations, controlled credits, responsible-disclosure pipelines — is the right posture for a tool this powerful. Gating Mythos itself is the right call.

But Glasswing is a platform-layer story, and the platform you run on has an owner. That owner is not you. Red Hat, Debian, the Apache Foundation, PostgreSQL, the people who maintain libc, the people who maintain OpenSSH — they are the counterparties for what Glasswing produces. Your job is to make sure that when those patches ship, you apply them quickly, on infrastructure you understand, in a window that reflects how fast the clock has gotten.

The application layer is yours. The business logic is yours. The authorisation model that decides whether user alice@tenant-A can read an invoice belonging to tenant-B — nobody else can test that for you. No model currently announced can test it for you. It is the same job it was five years ago, executed under a shorter clock, with a bigger cost when you get it wrong.
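
If you want a picture of how small that test is, and how impossible it is to run without your credentials, it is this. A sketch, with a hypothetical base URL, session token, and invoice ID:

```python
# alice@tenant-A tries to read an invoice that belongs to tenant-B.
# The base URL, token, and invoice ID are hypothetical.
import requests

BASE = "https://app.example.com/api"
ALICE_TOKEN = "session-token-for-alice-at-tenant-a"
TENANT_B_INVOICE = "inv_9f2c"  # known to belong to tenant-B

resp = requests.get(
    f"{BASE}/invoices/{TENANT_B_INVOICE}",
    headers={"Authorization": f"Bearer {ALICE_TOKEN}"},
    timeout=10,
)

# A correct authorisation model returns 403 or 404 here. A 200 with
# tenant-B's invoice in the body is the breach-shaped bug nobody else
# can find for you.
assert resp.status_code in (403, 404), (
    f"cross-tenant read succeeded: {resp.status_code} {resp.text[:200]}"
)
```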

A pen test did not become obsolete when Glasswing launched. It became overdue faster.