EAGLE and EAGLE-3 Drafting
EAGLE improves speculative decoding by using the target model's internal hidden states instead of an external draft model.
Source: mortalapps.com
- EAGLE-3 discards feature prediction entirely, utilizing multi-layer feature fusion and direct token prediction.
- Relies on "training-time test" simulation, which exposes the draft head during training to the same autoregressive errors it will make at inference.
- Achieves up to a 6.5x latency speedup over standard autoregressive decoding, and roughly 1.4x over previous EAGLE architectures.
Why This Matters
Using a completely separate draft model wastes VRAM and introduces distribution mismatches, which cap acceptance rates. EAGLE architectures remove the secondary model and attach a lightweight "draft head" directly to the target model, yielding some of the highest reported acceptance rates for complex reasoning tasks without bloating memory overhead.
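To make "acceptance rate" concrete, here is a minimal sketch of one greedy draft-and-verify step. It assumes greedy decoding and a single linear draft (no token tree), and the function name accept_prefix is hypothetical rather than taken from any EAGLE codebase.

```python
def accept_prefix(draft_tokens, target_tokens):
    """Return the tokens kept after one greedy draft-and-verify step.

    draft_tokens:  k tokens proposed by the draft head.
    target_tokens: k + 1 tokens the target model picks greedily at the same
                   positions (the extra one follows the last drafted token).
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    # The target's own token at the first mismatch (or after a full match)
    # is always kept, so every step yields at least one new token.
    accepted.append(target_tokens[len(accepted)])
    return accepted


good = accept_prefix([5, 9, 2, 7], [5, 9, 2, 7, 1])  # every draft matches
poor = accept_prefix([5, 3, 2, 7], [5, 9, 2, 7, 1])  # mismatch at index 1
print(len(good), len(poor))  # prints "5 2"
```

Both calls cost one target forward pass, but the well-aligned draft yields five tokens and the mismatched one yields two. That gap is why distribution mismatch caps the achievable speedup.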
Core Intuition
Instead of having an intern write the draft, you attach a lightweight predictive circuit directly to the senior engineer's brain. This circuit samples early thoughts (pre-layer activations) and instantly projects what the final words will be, bypassing the deep, slow reasoning layers entirely.
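In code, the "predictive circuit" can be pictured roughly as follows. This is a toy sketch, not the actual EAGLE head: all weights are random, the names embed, lm_head, w_draft, and draft_next_token are hypothetical, and a single dense layer stands in for the small network a real draft head would use. The point is only that the head consumes a hidden state the target has already computed and reuses the target's output projection.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 20

# Hypothetical stand-ins for pieces of the target model.
embed = rng.normal(size=(vocab, d_model))    # token embedding table
lm_head = rng.normal(size=(d_model, vocab))  # target's output projection

# The only new parameters: one small layer standing in for the draft head.
w_draft = rng.normal(size=(2 * d_model, d_model)) * 0.1


def draft_next_token(hidden, last_token):
    """Guess the next token from a target hidden state and the last token."""
    x = np.concatenate([hidden, embed[last_token]])  # (2 * d_model,)
    feature = np.tanh(x @ w_draft)                   # (d_model,)
    logits = feature @ lm_head                       # (vocab,)
    return int(np.argmax(logits))


hidden = rng.normal(size=d_model)  # would come from the target's forward pass
token = draft_next_token(hidden, last_token=3)
```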
Technical Deep Dive
The original EAGLE relied on top-layer feature prediction, which imposed an artificial scaling ceiling. EAGLE-3 extracts states from multiple depths (specifically layers […], […], and 10), fusing low-level syntactic, mid-level structural, and high-level semantic features. To train this head, EAGLE-3 employs a "training-time test" (TTT). TTT feeds ground-truth target features for early sequence positions but forces the draft head to consume its own imperfect predictions at subsequent positions, simulating the errors that accumulate during real inference.
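The two ideas in the paragraph above, multi-layer fusion and training-time test, can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the EAGLE-3 implementation: the weights are random, the functions fuse, draft_step, and ttt_inputs are hypothetical names, fusion is assumed to be concatenation followed by a linear projection, and the draft head is reduced to a single dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 4, 6

# Hypothetical weights: a fusion projection and a one-layer draft head.
w_fuse = rng.normal(size=(3 * d_model, d_model)) * 0.1
w_draft = rng.normal(size=(d_model, d_model)) * 0.1


def fuse(low, mid, high):
    """Concatenate three layers' features and project back to d_model."""
    return np.concatenate([low, mid, high], axis=-1) @ w_fuse


def draft_step(feature):
    """Toy draft head: predict the next position's feature from this one."""
    return np.tanh(feature @ w_draft)


def ttt_inputs(true_features, n_teacher):
    """Build draft-head inputs in the training-time-test style.

    Positions before n_teacher (which must be at least 1) use features taken
    from the target model; later positions use the draft head's own output
    from the previous position.
    """
    inputs = [true_features[t] for t in range(n_teacher)]
    for t in range(n_teacher, len(true_features)):
        inputs.append(draft_step(inputs[t - 1]))
    return np.stack(inputs)


# Stand-ins for hidden states read from three depths of the target model.
low, mid, high = (rng.normal(size=(seq_len, d_model)) for _ in range(3))
fused = fuse(low, mid, high)             # (seq_len, d_model)
inputs = ttt_inputs(fused, n_teacher=3)  # (seq_len, d_model)
```

The first three rows of inputs are exact target features, while the remaining rows are the head's own drifting predictions. Training on such mixed inputs is what lets the head see, ahead of time, the kind of error it will face when drafting several tokens in a row.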