Scaling Data Ingestion: How I Pruned 84% of Redundant Web Data Before It Hit Our Vector Database
During my internship at a tech consulting firm, I discovered that a single RBI circular had been stored as 67 separate vectors. Here's how I built a MinHash + LSH deduplication pipeline that cut 7.8M vectors down to 1.2M, and made our RAG system actually useful.
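For readers new to the technique, here is a minimal sketch of MinHash + LSH deduplication in plain Python. It is not the production pipeline: the function names are hypothetical, and the shingle size (5 words), number of hash functions (64), band count (16), and similarity threshold (0.8) are illustrative assumptions rather than the values used in the project.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # assumed number of hash functions per signature
BANDS = 16     # assumed number of LSH bands (64 / 16 = 4 rows per band)


def shingles(text, k=5):
    """Return the set of k-word shingles, ignoring case and extra whitespace."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(shingle, seed):
    """64-bit hash of a shingle, salted so each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(shingle_set, num_perm=NUM_PERM):
    """MinHash signature: the minimum hash value under each of num_perm hash functions."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]


def candidate_pairs(signatures, bands=BANDS):
    """LSH banding: documents that share any identical band become candidate pairs."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Return the ids to keep, dropping the later member of each near-duplicate pair."""
    sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
    dropped = set()
    for a, b in sorted(candidate_pairs(sigs)):
        if a in dropped or b in dropped:
            continue
        if estimated_jaccard(sigs[a], sigs[b]) >= threshold:
            dropped.add(b)
    return [doc_id for doc_id in docs if doc_id not in dropped]
```

Calling `deduplicate({"a": text_a, "b": text_b})` returns the ids worth embedding. The point of the banding step is that only documents landing in a shared bucket are compared, which avoids scoring every pair in a corpus of millions.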