Vector Quantization vs Vector Reduction
Vector quantization and vector reduction both shrink a vector index, but in fundamentally different ways. Quantization makes each vector smaller by lowering its precision, for example storing each number in fewer bits, which saves space at the cost of some accuracy. Vector reduction makes vectors fewer by eliminating semantic redundancy, keeping each remaining vector at full precision. Quantization changes the size of every vector; reduction changes how many vectors exist. The two operate on different axes and address different causes of index bloat.
The two ways everyone shrinks a vector index
High-dimensional embeddings are expensive to store and search. A single 1536-dimension float32 vector uses about 6 KB (1536 values at 4 bytes each), so a hundred million vectors come to roughly 614 GB before any index overhead. Almost every technique for managing this cost works in one of two ways: making each vector use fewer bits, or making each vector use fewer dimensions. Both shrink the size of an individual vector.
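The storage figures above follow from simple arithmetic, sketched below. The function name index_size_bytes is hypothetical, and the calculation counts only the raw vector payload, not graph links, metadata, or other index overhead.

```python
def index_size_bytes(num_vectors: int, dims: int, bits_per_value: int) -> int:
    """Raw payload size of a vector index, ignoring all index overhead."""
    return num_vectors * dims * bits_per_value // 8


# One 1536-dimension float32 vector: 1536 values * 4 bytes each.
print(index_size_bytes(1, 1536, 32))                  # 6144 bytes, about 6 KB

# A hundred million such vectors, expressed in decimal gigabytes.
print(index_size_bytes(100_000_000, 1536, 32) / 1e9)  # 614.4
```

Changing bits_per_value models quantization, changing dims models dimensionality reduction, and changing num_vectors models vector reduction.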
Vector quantization, in detail
Quantization reduces the number of bits used to represent each value in a vector. The main forms are:
Scalar quantization converts each dimension from a 32-bit float to a lower-precision integer, commonly int8. This cuts storage roughly fourfold with modest accuracy loss, and works well for most general-purpose embedding models out of the box.
Product quantization splits each vector into subvectors and replaces each subvector with the nearest entry in a learned codebook. It can achieve higher compression than scalar quantization, but results are more dataset-dependent.
Binary quantization reduces each dimension to a single bit, keeping only its sign. This is extreme compression, thirty-twofold relative to float32, and works well for some embedding types, particularly those from contrastive training, while degrading badly on others. It suits high-throughput, cost-sensitive applications where some precision loss is acceptable.
RaBitQ and Better Binary Quantization (BBQ) are modern refinements of binary quantization. They apply a random rotation before binarizing and add a correction step to recover accuracy, with theoretical error bounds. Elastic's BBQ is built on this family and is one of the most widely deployed quantization methods in production.
TurboQuant, introduced in 2025 by researchers at Google Research and NYU and presented at ICLR 2026, is among the most advanced quantizers available. It randomly rotates each vector so that an optimal quantizer can be applied to each coordinate independently, achieving distortion that is provably within a small constant factor of the theoretical limit, without any per-dataset training. Its significance goes beyond performance: by coming this close to the theoretical lower bound on distortion, TurboQuant suggests that quantization as a category has little headroom left. There is only so far you can compress an individual vector before accuracy suffers, and TurboQuant is already close to that limit.
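The two simplest forms above, scalar and binary quantization, can be sketched in a few lines of plain Python. This is a minimal illustration rather than any library's implementation: the symmetric int8 scheme with one scale factor per vector is an assumption (production systems often calibrate ranges differently), and all function names are hypothetical. Product quantization, RaBitQ, BBQ, and TurboQuant involve codebooks, rotations, or correction terms that are not shown here.

```python
from typing import List, Tuple


def scalar_quantize(vec: List[float]) -> Tuple[List[int], float]:
    """Map each float to an integer in [-127, 127] (fits in int8).

    Assumption: a single symmetric scale per vector, derived from
    the largest absolute value in that vector.
    """
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def scalar_dequantize(codes: List[int], scale: float) -> List[float]:
    """Approximate the original floats from the integer codes."""
    return [c * scale for c in codes]


def binary_quantize(vec: List[float]) -> List[int]:
    """Keep only the sign of each dimension: 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]


def hamming_distance(a: List[int], b: List[int]) -> int:
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


vec = [0.5, -1.0, 0.25, 0.0]

codes, scale = scalar_quantize(vec)
restored = scalar_dequantize(codes, scale)
# Each restored value differs from the original by at most about scale / 2.
print(codes, restored)

print(binary_quantize(vec))  # [1, 0, 1, 0]
```

Binary codes are typically compared with Hamming distance, which is why hamming_distance is included; the per-value rounding error of the scalar scheme is bounded by half the scale factor.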
A related but distinct approach: Matryoshka dimensionality reduction
Matryoshka Representation Learning (MRL) shrinks vectors along a different axis: it reduces the number of values per vector rather than the bits per value. Models trained with MRL front-load the most important information into the earliest dimensions, so a vector can be truncated to a fraction of its length with limited accuracy loss. It is not quantization, but it shares the same fundamental property: it makes each individual vector smaller.
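Truncation itself is mechanically trivial, as the sketch below shows. The function name truncate_and_normalize is hypothetical, and re-normalizing to unit length after truncation is an assumption that fits cosine-similarity search. The step is only meaningful for embeddings from a model trained with MRL; truncating an ordinary embedding this way would discard information arbitrarily.

```python
import math
from typing import List


def truncate_and_normalize(vec: List[float], keep_dims: int) -> List[float]:
    """Keep the first keep_dims values, then rescale to unit length."""
    head = vec[:keep_dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Toy 4-dimension vector truncated to its first 2 dimensions.
short = truncate_and_normalize([3.0, 4.0, 1.0, 2.0], keep_dims=2)
print(short)  # [0.6, 0.8]
```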
What all of these have in common
Scalar, product, and binary quantization, RaBitQ, BBQ, TurboQuant, and Matryoshka all make each vector smaller, whether by using fewer bits or fewer dimensions, and all accept some loss of accuracy in exchange for space. None of them changes the number of vectors in the index. If your index contains many near-duplicate or redundant vectors, every one of these techniques faithfully shrinks all of them, redundancy included.
Vector reduction: a third axis
Vector reduction takes a different approach. Instead of making each vector smaller, it makes the set of vectors smaller by eliminating semantic redundancy. Many vector indexes contain large numbers of near-duplicate vectors representing overlapping meaning. Vector reduction removes that redundancy, collapsing semantically redundant vectors into single representations, while keeping each remaining vector at full precision and full dimensionality. Green Vectors performs this reduction at ingestion through patent-pending semantic transformation, identifying redundant signal before it is ever stored.
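To make the idea of reducing the vector count concrete, here is a generic sketch that greedily collapses near-duplicate vectors by cosine similarity. It is not the Green Vectors method, whose details are not described here; the greedy strategy, the 0.95 threshold, and the function names are all assumptions chosen purely for illustration.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def reduce_vectors(
    vectors: List[List[float]], threshold: float = 0.95
) -> Tuple[List[List[float]], List[int]]:
    """Greedy near-duplicate collapse (illustrative assumption only).

    Returns the kept representatives, untouched at full precision, and
    for each input vector the index of the representative it maps to.
    """
    kept: List[List[float]] = []
    assignment: List[int] = []
    for vec in vectors:
        for idx, rep in enumerate(kept):
            if cosine(vec, rep) >= threshold:
                assignment.append(idx)  # redundant: reuse representative
                break
        else:
            kept.append(vec)            # novel: store as-is
            assignment.append(len(kept) - 1)
    return kept, assignment


vectors = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
kept, assignment = reduce_vectors(vectors)
print(len(kept), assignment)  # 2 [0, 0, 1]
```

Note that the kept vectors are stored unchanged, so the saving comes entirely from the count of vectors rather than from the size of each one.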