LLM Inference Systems

EAGLE and EAGLE-3 Drafting

EAGLE improves speculative decoding by using the target model's internal hidden states instead of an external draft model.

Published June 1, 2026 · By MortalApps · 3 min read · ~517 words

TL;DR

EAGLE improves speculative decoding by using the target model's internal hidden states instead of an external draft model.
EAGLE-3 discards feature prediction entirely, utilizing multi-layer feature fusion and direct token prediction.
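EAGLE-3 relies on a "training-time test" simulation to align the draft model with the errors it accumulates during real autoregressive drafting.
EAGLE-3 achieves up to a 6.5x latency speedup over standard autoregressive decoding, and about 1.4x over the previous EAGLE version (EAGLE-2).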


Why This Matters

A completely separate draft model consumes extra VRAM and is trained independently of the target, so its token distribution drifts from the target's and acceptance rates are capped. EAGLE architectures replace the standalone draft model with a lightweight "draft head" attached to the target model and fed by its hidden states. This yields some of the highest reported acceptance rates, including on complex reasoning tasks, with little added memory overhead.
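To see why acceptance rate matters so much, here is a minimal sketch of the standard speculative-decoding estimate of tokens produced per verification pass. It assumes each drafted token is accepted independently with the same probability, which is a simplification; the names `alpha` and `gamma` are illustrative and do not come from the EAGLE papers.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption: each of the `gamma` drafted tokens is accepted independently
    with probability `alpha`, and drafting stops at the first rejection.
    The target model always contributes one token of its own per pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft accepted, plus the target's bonus token.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_step(alpha, gamma=4), 3))
```

Under this simplified model, raising the acceptance rate increases tokens per pass far more than lengthening the draft does, which is why a mismatched draft model is so costly.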

Core Intuition

Instead of having an intern write the draft, you attach a lightweight predictive circuit directly to the senior engineer's brain. This circuit reads the engineer's intermediate thoughts (hidden-layer activations) and quickly guesses the next few words, so the engineer only has to check those guesses in one pass instead of reasoning out each word in turn.

Technical Deep Dive

The original EAGLE predicted the target model's top-layer features, and this feature-prediction objective became a ceiling on how far the draft head could scale. EAGLE-3 instead extracts hidden states from multiple depths of the target model, fusing low-level syntactic, mid-level structural, and high-level semantic features. To train this head, EAGLE-3 employs a "training-time test" (TTT). TTT feeds ground-truth features for early sequence positions but forces the draft head to use its own imperfect predictions as inputs for subsequent positions, simulating the errors that accumulate during real inference.
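The multi-layer fusion described above can be sketched as concatenating three hidden-state vectors and projecting them back to the hidden size with a single linear layer. This is a minimal pure-Python illustration, not the EAGLE-3 implementation: `fuse_features`, the weight layout, and the averaging weights in the example are all assumptions made for clarity.

```python
from typing import List, Sequence


def fuse_features(
    low: Sequence[float],
    mid: Sequence[float],
    high: Sequence[float],
    weight: Sequence[Sequence[float]],
) -> List[float]:
    """Concatenate three hidden-state vectors and project to hidden size.

    `weight` is a hypothetical fusion matrix with shape
    [hidden][3 * hidden]; a real draft head would learn it during training.
    """
    concat = list(low) + list(mid) + list(high)
    for row in weight:
        if len(row) != len(concat):
            raise ValueError("each weight row must have 3 * hidden entries")
    # Plain matrix-vector product: one output value per weight row.
    return [sum(w * x for w, x in zip(row, concat)) for row in weight]


def averaging_weight(hidden: int) -> List[List[float]]:
    """Illustrative weight that averages the three layers per dimension."""
    weight = [[0.0] * (3 * hidden) for _ in range(hidden)]
    for i in range(hidden):
        for block in range(3):
            weight[i][block * hidden + i] = 1.0 / 3.0
    return weight


if __name__ == "__main__":
    fused = fuse_features([1.0, 2.0], [4.0, 5.0], [7.0, 8.0], averaging_weight(2))
    print(fused)
```

The averaging matrix is only a stand-in to make the data flow visible; a trained fusion layer would weight the low, middle, and high features differently per dimension.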

Key Takeaways

EAGLE-3 abandons feature prediction for direct token prediction.

Taps hidden states from several depths of the model instead of just the top layer.

"Training-time test" forces the draft model to learn recovery from its own autoregressive errors.

Delivers up to a 6.5x latency speedup with a negligible VRAM increase.
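The "training-time test" takeaway can be illustrated with a toy unroll: the first few positions receive ground-truth features, and every later position receives the draft head's own previous output. This is a hedged sketch of the idea only. The function name, the scalar "features", and the deliberately imperfect `draft_head` are hypothetical, and the actual EAGLE-3 training procedure is more involved.

```python
from typing import Callable, List, Sequence


def unroll_with_own_predictions(
    true_features: Sequence[float],
    draft_head: Callable[[float], float],
    num_teacher_forced: int,
) -> List[float]:
    """Return the inputs the draft head sees at each position.

    Positions below `num_teacher_forced` get the ground-truth feature.
    Later positions get the draft head's output from the previous position,
    so any prediction error carries forward, as it would at inference time.
    """
    inputs: List[float] = []
    prev_output = None
    for pos, true_feat in enumerate(true_features):
        if pos < num_teacher_forced or prev_output is None:
            x = true_feat
        else:
            x = prev_output
        inputs.append(x)
        prev_output = draft_head(x)
    return inputs


if __name__ == "__main__":
    # Toy setup: the true next feature is x + 1, but the head predicts x + 1.5.
    truth = [1.0, 2.0, 3.0, 4.0]
    seen = unroll_with_own_predictions(truth, lambda x: x + 1.5, num_teacher_forced=2)
    errors = [s - t for s, t in zip(seen, truth)]
    print(seen)
    print(errors)
```

In this toy run the input error is zero while ground truth is supplied and then grows position by position, which is the compounding drift that training on self-generated inputs is meant to teach the draft head to tolerate.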
