fdke.vin

TurboQuant

We introduce a set of advanced, theoretically grounded quantization algorithms that enable massive compression for large language models and vector search engines.

From "TurboQuant: Redefining AI efficiency with extreme compression" by Google Research.

The practical bit is the combination: PolarQuant handles most of the compression, then QJL spends a single residual bit per value to correct quantization bias. Google claims lossless or near-lossless KV-cache compression on long-context benchmarks, at least a 6x memory reduction, and up to an 8x attention-logit speedup on H100s.
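The shape of that two-stage idea can be sketched in a few lines. This is not the actual PolarQuant or QJL algorithm (the post doesn't spell those out); it's a minimal stand-in, assuming a generic coarse uniform quantizer for stage one and a single sign bit plus one shared scale for the residual in stage two:

```python
import numpy as np

def two_stage_quantize(x, bits=4):
    # Stage 1: coarse uniform quantization (a generic stand-in for
    # PolarQuant, whose actual scheme is not reproduced here).
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    coarse = np.round((x - lo) / scale) * scale + lo
    # Stage 2: one extra bit per element stores the residual's sign;
    # its magnitude is summarized by a single shared scalar.
    residual = x - coarse
    sign_bit = np.sign(residual)          # the "single residual bit"
    mag = float(np.abs(residual).mean())  # shared de-bias scale
    return coarse, sign_bit, mag

def two_stage_dequantize(coarse, sign_bit, mag):
    # Nudge each value toward the truth by the average residual size.
    return coarse + sign_bit * mag

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
coarse, sign_bit, mag = two_stage_quantize(x)
err_coarse = float(np.abs(x - coarse).mean())
err_fixed = float(np.abs(x - two_stage_dequantize(coarse, sign_bit, mag)).mean())
```

The sign-corrected reconstruction has a lower mean absolute error than the coarse stage alone, which is the point of spending the extra bit: cheap per-element storage, disproportionate accuracy gain.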

Perhaps graphics-card and memory prices can finally come down.