Project Gutenberg: Green Vectors at 15-Million-Vector Scale
Morphos AI benchmarked Green Vectors against traditional vectorization using the complete Project Gutenberg library: over 50,000 books containing billions of words. The goal was to measure storage efficiency, query speed, and search accuracy on a corpus that produces more than 15 million vectors under traditional vectorization.
The challenge
Validating a vectorization technology at scale requires a dataset large and complex enough to expose real limits. The complete Project Gutenberg library provided that testbed: a vast public-domain corpus that, under traditional vectorization, produced more than 15 million vectors requiring 260GB of storage.
The approach
Two vector databases were built from the same corpus, one using standard vectorization and one using Green Vectors. Identical queries were run against both to compare storage, latency, and accuracy. The benchmark also compared against aggressive quantization to distinguish true efficiency from lossy compression.
The results
Green Vectors reduced more than 15 million vectors to 76,000 and cut storage from 260GB to 1.3GB, a 99.5% reduction in storage. Query latency improved by up to 4x. Search quality improved by up to 59% across domains. For comparison, aggressive 1-bit quantization on the same dataset still required 8.1GB, and it sacrificed accuracy to get there.
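The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. It assumes decimal units (1 GB = 10^9 bytes), and the variable names are illustrative rather than part of the benchmark. The 32x ratio between the baseline and the 1-bit figure is consistent with 32-bit floats in the baseline, but the benchmark does not state the original precision, so treat that as an inference.

```python
# Figures as reported in the benchmark. Decimal units are an assumption.
BASELINE_VECTORS = 15_000_000  # "more than 15 million", so this is a lower bound
GREEN_VECTORS = 76_000
BASELINE_GB = 260.0
GREEN_GB = 1.3
ONE_BIT_GB = 8.1


def reduction_pct(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (1.0 - after / before)


storage_reduction = reduction_pct(BASELINE_GB, GREEN_GB)
vector_reduction = reduction_pct(BASELINE_VECTORS, GREEN_VECTORS)

# Average bytes per stored vector in each database.
bytes_per_vector_baseline = BASELINE_GB * 1e9 / BASELINE_VECTORS
bytes_per_vector_green = GREEN_GB * 1e9 / GREEN_VECTORS

# How much smaller the 1-bit quantized index is than the baseline.
one_bit_ratio = BASELINE_GB / ONE_BIT_GB

print(f"storage reduction:      {storage_reduction:.2f}%")
print(f"vector count reduction: {vector_reduction:.2f}% (at least)")
print(f"bytes/vector, baseline: {bytes_per_vector_baseline:,.0f}")
print(f"bytes/vector, green:    {bytes_per_vector_green:,.0f}")
print(f"baseline / 1-bit size:  {one_bit_ratio:.1f}x")
```

The per-vector sizes come out within about 2% of each other, which matches the claim that the remaining vectors keep full precision: the savings come from storing fewer vectors, not smaller ones.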
Why this matters
The significant point is not only the storage and speed gains, but that they were achieved while improving accuracy. Quantization reduces storage by lowering the precision of every vector, which discards information. Green Vectors reduces storage by eliminating redundant vectors through semantic transformation, preserving full precision in those that remain. The result is a fundamentally more efficient data structure, not a compressed copy of the original.
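To make the quantization side of that contrast concrete, here is a minimal sketch of 1-bit (sign) quantization in plain Python. It illustrates generic sign-based quantization only. It is not the quantizer used in the benchmark, and it says nothing about how Green Vectors works internally. The example vectors and helper names are hypothetical.

```python
import math
import struct


def quantize_1bit(vec):
    """Keep only the sign of each component: one bit per dimension."""
    return [1 if x >= 0.0 else 0 for x in vec]


def float32_bytes(vec):
    """Size of the vector when stored as 32-bit floats."""
    return len(struct.pack(f"<{len(vec)}f", *vec))


def packed_bits_bytes(bits):
    """Size of the bit code when packed eight bits to a byte."""
    return (len(bits) + 7) // 8


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two hypothetical embeddings that point in clearly different directions
# but happen to share the same sign in every dimension.
a = [0.9, -0.1, 0.4, -0.8, 0.05, 0.7, -0.3, 0.2]
b = [0.1, -0.9, 0.8, -0.05, 0.6, 0.02, -0.7, 0.5]

code_a = quantize_1bit(a)
code_b = quantize_1bit(b)

print("cosine(a, b) at full precision:", round(cosine(a, b), 3))
print("identical after 1-bit quantization:", code_a == code_b)
print("bytes at float32:", float32_bytes(a))
print("bytes at 1 bit per dimension:", packed_bits_bytes(code_a))
```

The two vectors are far apart at full precision, yet they collapse to the same code once only the signs are kept. That is the information loss the paragraph describes: the storage saving is paid for in every vector, whether or not that vector was redundant.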