Google's TurboQuant: Squeeze LLMs into 3-bit for 8x Speed Boost

Google Research has unveiled TurboQuant, a groundbreaking compression algorithm that dramatically reduces the memory footprint of large language models (LLMs) without sacrificing accuracy. This innovation promises to accelerate AI processing and make advanced models more accessible.
TurboQuant: Compressing LLMs for Speed
The relentless growth in the size of LLMs presents significant challenges, particularly concerning memory requirements and computational bottlenecks. TurboQuant addresses these issues head-on, offering a pathway to more efficient and scalable AI deployments.
The Key-Value Cache Bottleneck
Transformer models maintain a key-value (KV) cache that stores the attention keys and values of already-processed tokens so they need not be recomputed at every generation step. As input sequences lengthen, this cache grows linearly with context length, becoming a major memory and bandwidth bottleneck. TurboQuant tackles this by compressing the KV cache, enabling faster processing of longer sequences.
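To see why the cache grows so quickly, consider a back-of-envelope estimate. The model dimensions below are illustrative assumptions roughly matching an 8B-class model with grouped-query attention, not published specs for any particular system:

```python
# Back-of-envelope KV cache size for a single request.
# All figures below are illustrative assumptions, not published specs.
num_layers = 32          # decoder layers
num_kv_heads = 8         # grouped-query attention KV heads
head_dim = 128           # dimension per head
bytes_per_value = 2      # fp16/bf16 storage

def kv_cache_bytes(seq_len: int) -> int:
    # Both keys and values are cached, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

gib = 1024 ** 3
print(f"{kv_cache_bytes(128_000) / gib:.1f} GiB at 128k tokens")  # → 15.6 GiB
```

At 16 bits per value, this hypothetical cache dwarfs the activations of the model itself at long context; dropping to 3 bits per value would shrink it by roughly 5x.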
How TurboQuant Achieves Compression
TurboQuant achieves its impressive compression rates through a combination of two innovative techniques: PolarQuant and QJL (Quantized Johnson-Lindenstrauss).
PolarQuant: Compressing with Polar Coordinates
PolarQuant departs from traditional vector quantization methods by operating in polar coordinates. Instead of storing each vector as Cartesian coordinates (distances along axes), it represents it as a radius (signal strength) and angles (encoding direction and meaning). The resulting angle distributions are highly concentrated and predictable, eliminating the need for normalization and its associated memory overhead. PolarQuant handles the bulk of the compression workload.
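The idea can be illustrated with a toy round trip. The sketch below is not the published PolarQuant algorithm; it simply pairs up vector dimensions, stores each pair as a radius plus a coarsely quantized angle, and measures the reconstruction error, with all parameter choices being assumptions for illustration:

```python
import numpy as np

# Illustrative polar-coordinate quantization (NOT the published PolarQuant
# algorithm): split a vector into 2-D pairs, keep each pair's radius in
# full precision and its angle at only a few bits.
rng = np.random.default_rng(0)

def polar_quantize(v: np.ndarray, angle_bits: int = 4):
    x, y = v[0::2], v[1::2]            # interpret the vector as 2-D pairs
    r = np.hypot(x, y)                 # radius per pair (kept full precision)
    theta = np.arctan2(y, x)           # angle per pair, in [-pi, pi]
    levels = 2 ** angle_bits
    q = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(int)
    return r, q, levels

def polar_dequantize(r, q, levels):
    theta = q / (levels - 1) * 2 * np.pi - np.pi
    out = np.empty(2 * len(r))
    out[0::2] = r * np.cos(theta)
    out[1::2] = r * np.sin(theta)
    return out

v = rng.standard_normal(128)
r, q, levels = polar_quantize(v)
v_hat = polar_dequantize(r, q, levels)
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Even with just 4 bits per angle, the relative reconstruction error in this toy setup stays modest, because the angle quantization error is bounded independently of the vector's scale.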
QJL: Error Correction with Minimal Overhead
QJL acts as a mathematical error corrector, addressing the residual errors left by PolarQuant. It applies a Johnson-Lindenstrauss transformation to the high-dimensional error data and keeps only a single sign bit per projected value. This preserves the essential inner-product relationships and eliminates systematic biases in attention scores at negligible memory cost.
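A minimal sketch conveys the flavor of a sign-bit JL estimator. The code below is a simplified illustration in the spirit of QJL, not the production algorithm: a key is stored as one sign bit per random projection plus its norm, and inner products with a query are then estimated from those bits alone (the projection size and scaling constant are assumptions for this toy setting):

```python
import numpy as np

# Simplified sign-bit Johnson-Lindenstrauss estimator in the spirit of
# QJL (illustrative only, not the production algorithm).
rng = np.random.default_rng(1)
d, m = 64, 4096                       # original dim, projection dim
S = rng.standard_normal((m, d))       # shared random Gaussian projection

def encode_key(k):
    # Store only one sign bit per projected coordinate, plus the key's norm.
    return np.sign(S @ k), np.linalg.norm(k)

def estimate_dot(q, signs, k_norm):
    # For Gaussian projections, E[sign(S k) . (S q)] is proportional to
    # <q, k> / ||k||; the sqrt(pi/2)/m factor undoes that scaling.
    return np.sqrt(np.pi / 2) / m * k_norm * (signs @ (S @ q))

q = rng.standard_normal(d)
k = rng.standard_normal(d)
signs, k_norm = encode_key(k)
approx = estimate_dot(q, signs, k_norm)
exact = q @ k
```

The estimate concentrates around the true inner product as the projection dimension grows, which is why a single bit per projected coordinate can suffice to correct residual attention-score errors without bias.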
Performance and Accuracy
Google rigorously tested TurboQuant using open-source models like Llama-3.1-8B-Instruct and Ministral-7B-Instruct on established long-context benchmarks, including LongBench and Needle in a Haystack. The results are compelling:
- Significant Memory Reduction: TurboQuant reduced KV memory by at least a factor of 6 in Needle-in-a-Haystack tests.
- Maintained Accuracy: Models using TurboQuant maintained accuracy levels comparable to full-precision baselines in tasks such as question answering, code generation, and summarization. In Needle-in-a-Haystack tests, TurboQuant achieved a score of 0.997, matching the full-precision baseline.
Importantly, TurboQuant requires no model training or fine-tuning, simplifying its integration into existing AI workflows.
Real-World Applications
Google envisions TurboQuant playing a crucial role in optimizing models like Gemini and accelerating semantic vector search. By minimizing memory requirements and preprocessing overhead, TurboQuant enables the creation and querying of large vector indexes more efficiently. This technology has the potential to significantly enhance various AI applications, including:
- Improved Search: Faster and more accurate semantic search capabilities.
- Enhanced Chatbots: More responsive and context-aware conversational AI.
- Efficient Data Analysis: Accelerated processing of large datasets for insights and predictions.
Looking Ahead
TurboQuant represents a significant step forward in optimizing LLMs for performance and efficiency. By compressing the KV cache to as little as 3 bits per value, Google has demonstrated the potential to achieve substantial speed gains without compromising accuracy. The full details of TurboQuant will be presented at ICLR 2026, with PolarQuant and QJL being showcased at AISTATS 2026. More information can be found on the Google Research blog.
This breakthrough could pave the way for wider adoption of AI technology, making powerful models more accessible and practical for a range of applications. The ability to drastically reduce memory footprint while maintaining accuracy is a game-changer for the future of AI.