Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

TL;DR

Orthrus-Qwen3 introduces a dual-architecture framework that enables up to 7.8× faster token generation without sacrificing output fidelity. It combines autoregressive and diffusion decoding, preserving the base model's exact output distribution while improving efficiency.

Orthrus-Qwen3 is a newly announced dual-architecture framework for large language models. It reports up to 7.8× faster token generation while keeping the output distribution identical to that of the base model, a combination of speed and fidelity that matters for both AI research and deployment.

Orthrus-Qwen3 employs a dual-view diffusion approach that unifies the exact generation fidelity of autoregressive LLMs with the high-speed parallel token generation of diffusion models. All models use a Qwen3 backbone and guarantee strictly lossless output, matching the original model’s predictive distribution.
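
To make the idea of parallel proposal with lossless output concrete, here is a minimal sketch of a generic draft-and-verify loop under greedy decoding. It is not Orthrus's actual algorithm, which the announcement does not spell out: `base_next` and `draft_block` are hypothetical toy stand-ins for the frozen base model and a parallel drafter, and the block size `k` is an assumed parameter. The point is only that emitting the base model's own token at every position keeps the output identical to sequential decoding while needing fewer verification passes.

```python
from typing import List, Tuple

Token = int

def base_next(ctx: List[Token]) -> Token:
    """Toy stand-in (hypothetical) for the frozen base model's greedy next token."""
    return (sum(ctx) * 31 + len(ctx) * 7) % 11

def draft_block(ctx: List[Token], k: int) -> List[Token]:
    """Toy stand-in (hypothetical) for a parallel drafter.

    It is simulated by copying the base model and injecting an error
    whenever the context length is a multiple of 5.
    """
    out, cur = [], list(ctx)
    for _ in range(k):
        tok = base_next(cur)
        if len(cur) % 5 == 0:
            tok = (tok + 1) % 11  # deliberately wrong proposal
        out.append(tok)
        cur.append(tok)
    return out

def sequential_decode(prompt: List[Token], n: int) -> List[Token]:
    """Plain autoregressive greedy decoding: one forward pass per token."""
    out = list(prompt)
    for _ in range(n):
        out.append(base_next(out))
    return out[len(prompt):]

def draft_and_verify(prompt: List[Token], n: int, k: int) -> Tuple[List[Token], int]:
    """Propose k tokens at once, then keep only what the base model agrees with."""
    out, passes, target = list(prompt), 0, len(prompt) + n
    while len(out) < target:
        draft = draft_block(out, k)
        passes += 1  # in a real model, one pass would score all k positions
        for tok in draft:
            expected = base_next(out)
            out.append(expected)  # always emit the base model's own token
            if tok != expected or len(out) == target:
                break  # first mismatch: discard the rest of the draft
    return out[len(prompt):], passes

prompt = [1, 2, 3]
tokens, passes = draft_and_verify(prompt, n=40, k=4)
print("identical:", tokens == sequential_decode(prompt, 40))
print("tokens per verification pass:", len(tokens) / passes)
```

In this toy the ratio of tokens to verification passes depends entirely on how often the drafter agrees with the base model, which is why a "tokens/forward" figure is best read as an upper bound rather than a guaranteed wall-clock speedup.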

The framework reports an inference speedup of up to 7.8×, surpassing previous methods like speculative decoding, and it does so without duplicating the cache. Both views share a single exact Key-Value (KV) cache, so the additional memory is only O(1). Orthrus fine-tunes only 16% of the total model parameters and keeps the base LLM frozen, which makes training parameter-efficient.
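
The memory claim can be illustrated with the standard KV cache size formula. The sketch below is a back-of-the-envelope comparison, not a measurement: the layer counts, head counts, and block size are hypothetical values chosen for illustration (they are not Qwen3's real configuration), and modeling the O(1) overhead as "cache entries for one in-flight block" is an assumption about what the announcement means.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical shapes, for illustration only.
BASE = dict(n_layers=32, n_kv_heads=8, head_dim=128)
DRAFTER = dict(n_layers=4, n_kv_heads=8, head_dim=128)
BLOCK = 8  # hypothetical number of positions proposed in parallel

def separate_cache_overhead(seq_len: int) -> int:
    """Extra memory if a drafter keeps its own cache: grows with context length."""
    return kv_cache_bytes(seq_len, **DRAFTER)

def shared_cache_overhead(seq_len: int) -> int:
    """Extra memory if both views reuse one cache (assumed model):
    only the in-flight block is added, so seq_len is deliberately ignored."""
    return kv_cache_bytes(BLOCK, **BASE)

for n in (1_000, 10_000, 100_000):
    print(n, separate_cache_overhead(n), shared_cache_overhead(n))
```

Under these assumptions the separate-cache overhead scales linearly with context length while the shared-cache overhead stays flat, which is consistent with the reported advantage at long contexts.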

Compared with other state-of-the-art diffusion models, Orthrus maintains high accuracy and fidelity, especially on complex reasoning tasks. It also outperforms acceleration methods such as EAGLE-3 and DFlash in token throughput and inference speed, particularly as context length grows.

Why It Matters

Orthrus-Qwen3 addresses three longstanding challenges in LLM inference at once: speed, efficiency, and fidelity. Generating tokens faster without losing accuracy can accelerate AI deployment in real-time applications, reduce computational costs, and improve scalability for large models. It also opens new avenues for research into combining autoregressive and diffusion techniques effectively.

Background

Before Orthrus-Qwen3, diffusion models offered parallel decoding but often suffered from accuracy degradation and conditional drift, which limited their use in precise language tasks. Traditional autoregressive models are accurate but slow, because they generate one token at a time. Recent efforts focused on speculative decoding and other approximations to improve speed, but these often introduced redundancy and fidelity issues. Orthrus builds on recent advances in diffusion and caching mechanisms to overcome these limitations.

“Orthrus-Qwen3 demonstrates that combining dual-view diffusion with exact caching can deliver both speed and fidelity at scale.”

— Chien Van Nguyen, lead researcher

“Our framework guarantees strictly lossless generation while achieving unprecedented inference acceleration.”

— Official Orthrus announcement

What Remains Unclear

It is not yet clear how Orthrus-Qwen3 performs across a broader range of tasks beyond initial benchmarks, or how it integrates with existing large language model deployment pipelines. Details on real-world latency and resource consumption in diverse environments remain to be seen.

What’s Next

Next steps include wider adoption and testing of Orthrus-Qwen3 in practical applications, further optimization for different hardware setups, and integration with tools like vLLM and SGLang. Researchers are also expected to publish more detailed benchmarks and explore extending the approach to larger models.

Key Questions

How does Orthrus-Qwen3 achieve faster inference?

It uses a dual-view diffusion approach that enables parallel token generation, breaking the sequential bottleneck of traditional autoregressive decoding, and shares an exact cache to avoid redundancy.

Is the output of Orthrus-Qwen3 identical to the original Qwen3 model?

Yes, Orthrus-Qwen3 guarantees strictly lossless generation, meaning its output distribution matches that of the base Qwen3 model exactly.
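
For intuition about how a fast proposer can still yield exactly the base model's distribution under sampling, the sketch below works through the standard accept-or-resample rule used in speculative sampling. Whether Orthrus uses this specific rule is an assumption, since the announcement does not say; the distributions `p` (base model) and `q` (proposer) are made-up numbers. Exact fractions are used so the equality check is not blurred by floating-point rounding.

```python
from fractions import Fraction
from typing import Dict

Dist = Dict[str, Fraction]

def accept_reject_output(p: Dist, q: Dist) -> Dist:
    """Exact output distribution of the rule:
    draw x ~ q, accept with probability min(1, p(x)/q(x)),
    otherwise resample from the normalized residual max(p - q, 0).
    """
    # q(x) * min(1, p(x)/q(x)) simplifies to min(p(x), q(x)).
    accepted = {x: min(p[x], q[x]) for x in p}
    reject_mass = 1 - sum(accepted.values())
    residual = {x: max(p[x] - q[x], Fraction(0)) for x in p}
    norm = sum(residual.values())
    out = {}
    for x in p:
        resampled = reject_mass * residual[x] / norm if norm else Fraction(0)
        out[x] = accepted[x] + resampled
    return out

# Hypothetical next-token distributions over a three-token vocabulary.
p = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
q = {"a": Fraction(1, 5), "b": Fraction(1, 2), "c": Fraction(3, 10)}

print(accept_reject_output(p, q) == p)
```

The proposer's quality affects only how often a draw is accepted, not what distribution comes out, which is the sense in which such schemes are called lossless.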

What are the hardware requirements for running Orthrus-Qwen3?

It requires a GPU that supports flash attention and can run the provided model checkpoints, along with dependencies such as torch and transformers.
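
A quick way to check an environment before downloading checkpoints is sketched below. The package names `torch` and `transformers` come from the answer above; `flash_attn` is an assumed import name for the flash attention dependency, and the helper functions are hypothetical, not part of the Orthrus release.

```python
import importlib.util
from typing import Dict, Iterable

def check_packages(names: Iterable[str]) -> Dict[str, bool]:
    """Report which top-level packages are importable, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

def cuda_available() -> bool:
    """True only if torch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check works on machines without torch
    return torch.cuda.is_available()

if __name__ == "__main__":
    # "flash_attn" is an assumed package name for flash attention support.
    for name, ok in check_packages(["torch", "transformers", "flash_attn"]).items():
        print(f"{name}: {'found' if ok else 'missing'}")
    print("CUDA available:", cuda_available())
```

This only confirms that the packages can be found; it does not verify version compatibility or that the GPU has enough memory for a given checkpoint.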

Will Orthrus-Qwen3 be available for public use?

Yes, the implementation and checkpoints are available on GitHub, with upcoming native integrations planned for vLLM and SGLang.
