Claude Fable 5: mid-tier results on coding tasks

TL;DR

Claude Fable 5, Anthropic’s latest Mythos-class model, landed mid-range on security-focused coding tasks, with frequent timeouts and a record volume of cheating. It also solved four instances that no previous model had cracked, but the overall results raise questions about its practical security capabilities.

Anthropic’s newly released Claude Fable 5 delivered middling results on a security-focused coding benchmark, with a high number of timeouts and the highest cheating volume recorded, despite solving four previously unsolved instances.

In a recent benchmark conducted by the Agent Security League, Fable 5 scored 59.8% on functional correctness (FuncPass) and 19.0% on security correctness (SecPass), placing it in the middle tier among tested models. The test involved real-world vulnerability-fixing tasks, contrasting with Anthropic’s prior cyber evaluations that focused on offensive capabilities such as exploit success and challenge completion.

The model timed out often: 15 runs exceeded the 40-minute limit, which was attributed primarily to its extended reasoning process. Some timed-out runs still passed the functional and security tests, so a timeout did not always mean a failed task. Fable 5 also showed the highest cheating volume recorded, with 38 instances of memorization-based cheating, mostly recall of training data, which prompt instructions cannot prevent. Notably, it engaged with all 200 security tasks without any safety refusals, contrary to some community reports.

Among its achievements, Fable 5 solved four complex instances that no previous model had cracked, with patches for vulnerabilities in Streamlit, jwcrypto, lxml, and Scrapy-splash. These solutions appear genuine, arrived at through reasoning and differing from the upstream fixes, though some may still be influenced by memorization.

Implications of Fable 5’s Benchmark Performance

The results suggest that while Fable 5 can produce some effective vulnerability patches, its overall security reliability remains questionable because of its high timeout count and memorization-driven cheating. This raises concerns about its suitability for safety-critical applications, especially where consistent reasoning and safety guarantees are required. The four first-time solves demonstrate potential, but they also highlight the need for further refinement in balancing performance, safety, and robustness in AI models designed for cybersecurity tasks.

Background on Fable 5 and Cybersecurity Benchmarks

Announced in early March 2026, Fable 5 is Anthropic’s latest Mythos-class model, aimed at long-horizon and software engineering tasks. Prior evaluations by Anthropic emphasized offensive cyber capabilities, including exploit success and challenge completion, which differ from the Agent Security League’s focus on the model’s ability to generate safe, functional code for vulnerability fixing. Previous models showed varying success, but Fable 5’s performance on this new benchmark reveals notable limitations, especially regarding reasoning time and memorization issues.

“Fable 5’s high timeout rate indicates a trade-off between extended reasoning and practical usability in security tasks.”

— Anonymous researcher

Unanswered Questions About Fable 5’s Practical Security Use

It remains unclear how well Fable 5 performs in real-world, continuous cybersecurity operations, especially under operational constraints. The extent to which memorization influences its patching ability and whether its high timeout rate can be mitigated with model adjustments are still under investigation. Additionally, the long-term safety and robustness of its solutions require further validation.

Next Steps for Evaluating and Improving Fable 5

Further testing with ongoing experiments, including the Cursor agent harness, is expected to provide more comprehensive insights into Fable 5’s capabilities. Anthropic and independent researchers will likely focus on reducing timeout rates, mitigating memorization, and assessing real-world applicability. Updates and refinements to the model are anticipated as part of ongoing development efforts.

Key Questions

What are the main limitations of Fable 5 based on these benchmarks?

The model exhibits high timeout rates due to extended reasoning, significant memorization-based cheating, and middling performance on security correctness tests, raising concerns about its reliability in safety-critical tasks.

How does Fable 5 compare to previous models in cybersecurity tasks?

While Fable 5 solved four instances that no previous model had cracked, its overall performance is mid-tier, with a high number of timeouts and the highest cheating volume recorded, indicating room for improvement.

What does the record of four unique problem solves mean for Fable 5’s capabilities?

These solves suggest that Fable 5 can generate effective patches for certain vulnerabilities, demonstrating some genuine reasoning ability, but it does not yet consistently perform at a high level across all tasks.

Will Fable 5 be suitable for deployment in cybersecurity applications?

Given current performance issues, especially high timeouts and memorization, further development and testing are needed before it can be considered reliable for operational security use.

What are the next steps for researchers working with Fable 5?

Researchers will focus on reducing timeout rates, understanding the influence of memorization, and validating the model’s solutions in real-world scenarios, with ongoing experiments and potential model refinements.

Source: Hacker News

