TL;DR
Recent observations indicate that the newest Anthropic language models generate malformed tool calls more often than older models did, especially in complex multi-step interactions. This drop in accuracy may affect AI tool integration.
Recent testing shows that the latest Anthropic language models, including Opus 4.8 and Sonnet 5, generate malformed tool calls more often than their predecessors, especially when the tool uses complex nested schemas. This raises questions about how reliable these models are in practical tool integration.
The issue was identified through detailed analysis of model outputs during multi-turn interactions involving file editing tasks. The newer models often produce tool call payloads that contain invented fields the schema never defined, such as type, id, and notes, and these extra fields cause the tool invocation to fail. The problem is more prevalent in Opus 4.8 and Sonnet 5 than in earlier versions, which tend to produce cleaner, schema-compliant calls.
The failure appears to be context-dependent, with complex interactions involving file diagnostics and multi-step reasoning more likely to trigger malformed calls. Researchers note that the actual payloads, such as oldText and newText, are often correct, but extraneous keys are added, leading to validation errors. The phenomenon seems to worsen with increased model sophistication, possibly due to training artifacts related to the inclusion of code and tool-harness data during post-training.
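To make the failure mode concrete, here is a minimal sketch of how a harness might spot the extraneous keys described above. The allowed key set and the sample payload are invented for illustration; the article names the keys oldText, newText, type, id, and notes but does not publish the actual schema or a real failing call.

```python
from typing import Dict, List

# Hypothetical schema for one edit item: the only keys the tool accepts.
ALLOWED_EDIT_KEYS = {"oldText", "newText"}


def extra_keys(edit: Dict[str, object]) -> List[str]:
    """Return keys the schema does not define, sorted for stable output."""
    return sorted(set(edit) - ALLOWED_EDIT_KEYS)


# Illustrative payload (invented for this example): the real content is
# correct, but three keys the schema never defined have been added.
malformed_edit = {
    "oldText": "retries = 3",
    "newText": "retries = 5",
    "type": "replace",
    "id": "edit-1",
    "notes": "bump retry count",
}

print(extra_keys(malformed_edit))  # ['id', 'notes', 'type']
```

A schema that forbids additional properties rejects this payload even though the oldText and newText values are usable, which matches the validation errors the article describes.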
Implications for AI Tool Reliability and Development
This trend suggests that advancements in model complexity may come with unintended side effects, such as decreased accuracy in tool invocation. As models like Opus 4.8 and Sonnet 5 increasingly generate malformed calls, the reliability of AI systems that depend on precise tool interactions could be compromised, impacting applications from coding assistants to automated editing tools.
The deterioration indicates potential issues in the training process or schema enforcement mechanisms, raising concerns about future scalability and safety of AI systems integrated with external tools. Developers may need to refine training methods or implement stricter validation to prevent such errors.
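As one possible shape for stricter validation on the harness side, the sketch below walks a nested payload and reports every unexpected or missing key with its path. The schema format, the EDIT_TOOL_SCHEMA layout, and the validate helper are assumptions made for this example, not the validation any particular tool harness performs.

```python
from typing import Any, List

# Hypothetical nested schema: a dict maps each key to its spec, a one-element
# list means "array of items matching that spec", and a type is a leaf.
EDIT_TOOL_SCHEMA = {
    "path": str,
    "edits": [{"oldText": str, "newText": str}],
}


def validate(value: Any, spec: Any, where: str = "$") -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []
    if isinstance(spec, dict):
        if not isinstance(value, dict):
            return [f"{where}: expected object"]
        # Reject keys the schema does not define (the failure the article describes).
        for key in sorted(value.keys() - spec.keys()):
            errors.append(f"{where}.{key}: unexpected key")
        for key, sub_spec in spec.items():
            if key not in value:
                errors.append(f"{where}.{key}: missing key")
            else:
                errors.extend(validate(value[key], sub_spec, f"{where}.{key}"))
    elif isinstance(spec, list):
        if not isinstance(value, list):
            return [f"{where}: expected array"]
        for index, item in enumerate(value):
            errors.extend(validate(item, spec[0], f"{where}[{index}]"))
    elif not isinstance(value, spec):
        errors.append(f"{where}: expected {spec.__name__}")
    return errors


call = {
    "path": "app/config.py",
    "edits": [{"oldText": "x = 1", "newText": "x = 2", "notes": "tweak"}],
}
print(validate(call, EDIT_TOOL_SCHEMA))  # ['$.edits[0].notes: unexpected key']
```

Reporting the path of each offending key, rather than a bare pass or fail, gives the harness something specific to send back to the model when it asks for a corrected call.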

Evolution of AI Tool Call Handling in Anthropic Models
Earlier versions of Anthropic’s models produced tool calls that adhered to predefined schemas with relatively high accuracy and few malformed outputs. According to the discussion, those models were trained on datasets that included examples of tool invocation but not the more complex code and harness data used in later post-training. Recent models are said to have been trained on more sophisticated data, including code and tool-harness environments like Claude Code, which may have introduced new patterns of errors.
The shift towards models trained on code-like data and integrated tool harnesses has coincided with the observed decline in schema compliance, especially in multi-turn, context-rich interactions. This suggests, without proving it, that the training environment and post-training methods may influence the models’ ability to generate valid tool calls.
“The newer models are producing more nonsensical keys in their tool calls, which was not an issue with earlier versions. It’s likely a training artifact.”
— Anonymous researcher

Unclear Causes Behind the Decline in Tool Call Accuracy
It is not yet confirmed whether the deterioration is due to model training artifacts, schema enforcement issues, or other factors such as changes in post-training environments. The exact mechanisms causing the addition of nonsensical keys remain under investigation.
Researchers are still analyzing whether this is a temporary artifact or indicative of a broader trend affecting future model versions.

Next Steps for Addressing Model Tool Call Failures
Researchers plan to conduct systematic testing across different model versions and training setups to identify the root causes. There is also an ongoing effort to improve schema validation and constrain model outputs through grammar-aware decoding techniques.
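The idea behind grammar-aware decoding can be illustrated with a toy sketch: at each step, candidates the schema does not allow are masked out before the highest-scoring one is picked. Real implementations apply the mask to token logits inside the decoder; this example works on whole key names with made-up scores, and pick_next_key and decode_keys are hypothetical helpers, not part of any library.

```python
from typing import Dict, List, Optional, Set


def pick_next_key(
    scores: Dict[str, float], allowed: Set[str], emitted: Set[str]
) -> Optional[str]:
    """Pick the best-scoring key that the schema allows and is not yet emitted."""
    legal = {k: s for k, s in scores.items() if k in allowed and k not in emitted}
    if not legal:
        return None  # nothing legal is left, so the object must be closed
    return max(legal, key=legal.get)


def decode_keys(step_scores: List[Dict[str, float]], allowed: Set[str]) -> List[str]:
    """Run the masked choice once per step and collect the emitted keys."""
    emitted: List[str] = []
    for scores in step_scores:
        key = pick_next_key(scores, allowed, set(emitted))
        if key is None:
            break
        emitted.append(key)
    return emitted


# Made-up scores in which an unconstrained model would prefer invented keys.
steps = [
    {"type": 2.0, "oldText": 1.5, "newText": 0.5},
    {"notes": 1.9, "newText": 1.2, "oldText": 1.0},
    {"id": 3.0},
]
print(decode_keys(steps, {"oldText", "newText"}))  # ['oldText', 'newText']
```

Because illegal candidates are never eligible, the output cannot contain keys such as type, id, or notes, whatever scores the model assigns to them. The cost is that the constraint has to be applied inside the decoding loop, which a harness calling a hosted model cannot do on its own.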
Model developers may release updates or patches to mitigate these issues, and further transparency on training data and post-processing methods is expected to clarify the problem’s scope.

Key Questions
Why are newer models producing more malformed tool calls?
The leading hypothesis is a change in training data, especially the inclusion of code and tool-harness environments, which may introduce new error patterns or schema violations. This has not been confirmed.
Does this affect all AI models or only specific versions?
The issue appears to be more prominent in the latest versions like Opus 4.8 and Sonnet 5, while earlier models tend to produce cleaner outputs. The phenomenon is still under investigation.
What are the practical implications of this decline?
Decreased accuracy in tool invocation can impact AI applications that depend on precise tool calls, such as coding assistants, automated editors, and other productivity tools, potentially reducing their reliability.
Can this issue be fixed through better training or validation?
Possibly. The root cause is not yet confirmed, but ongoing efforts include refining training datasets, applying grammar-aware decoding, and implementing stricter validation, all aimed at improving tool call correctness in future models.
Source: Hacker News