Someone just built a RAG pipeline over the Epstein files. Yup, you heard that right. And while this project might not land you an interview at Microsoft, the engineering behind it is worth understanding if you want to build pipelines at scale. The raw data was over 2 million pages. And the problem most tutorials completely skip is that when you search a database that large, the standard similarity search returns the top chunks closest to your query. That sounds right, but those top results are almost always from the same document. And that is not context, that is echo.

The fix is switching to MMR, or maximal marginal relevance. Instead of just finding the most similar chunks, it finds chunks that are similar to the query but also least similar to everything that has already been retrieved, so each result adds new information. When you ask a question, you get diverse, grounded context, and the LLM answers strictly from that. That one swap is the difference between a RAG that hallucinates and one that actually surfaces what is in these documents.

The pipeline runs in three stages: first, cleaning and reconstructing the raw files; second, chunking them semantically; and finally, embedding everything into a Chroma database. The developer has already uploaded the pre-computed embeddings to HuggingFace, so you can skip the heavy lifting and jump straight to inference. I'll leave the repo link in the comments below. Follow me for more and I'll catch you guys in the next one.
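To make the "relevance minus redundancy" trade-off concrete, here is a minimal sketch of MMR selection in plain Python. This is an illustration, not the repo's code: the `mmr` function, the toy vectors, and the `lam` weighting parameter are all assumptions for the example.

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=3, lam=0.5):
    """Greedy MMR: pick k docs, each maximizing
    lam * sim(doc, query) - (1 - lam) * max similarity to docs already picked."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    sim_q = [cos(d, query_vec) for d in doc_vecs]   # relevance to the query
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in candidates:
            # redundancy: closest match among chunks already retrieved
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            score = lam * sim_q[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 says something different.
query = np.array([1.0, 1.0])
docs = [np.array([0.9, 0.1]), np.array([0.88, 0.12]), np.array([0.1, 0.9])]
print(mmr(query, docs, k=2, lam=0.5))  # picks doc 1, then skips its near-twin for doc 2
```

With plain top-k similarity, the two near-duplicate documents would crowd out the distinct one; MMR penalizes the second copy and surfaces the diverse chunk instead.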
Summary
Maximal Marginal Relevance (MMR) enhances chunk retrieval by balancing relevance and diversity, ensuring each retrieved chunk adds new information rather than repeating content from the same document. The pipeline utilizes LangChain for chunking, MiniLM for embeddings, and ChromaDB for storage, with a focus on lightweight cleaning and efficient processing of over 2 million pages of data. This approach prevents redundancy and improves the quality of context provided to language models.
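The "chunk semantically" stage can be sketched as: embed consecutive sentences and cut a chunk boundary wherever similarity to the previous sentence drops. Everything below is a stand-in for illustration only; the bag-of-words `toy_embed` and the `0.3` threshold are assumptions, where the real pipeline would use MiniLM embeddings.

```python
import re
import numpy as np

def toy_embed(sentence, vocab):
    # Bag-of-words vector: a toy stand-in for a real embedding model (assumption).
    v = np.zeros(len(vocab))
    for w in re.findall(r"\w+", sentence.lower()):
        if w in vocab:
            v[vocab[w]] += 1.0
    return v

def semantic_chunks(sentences, embed, threshold=0.3):
    """Group consecutive sentences; start a new chunk when cosine similarity
    to the previous sentence falls below threshold."""
    chunks, current = [], [sentences[0]]
    prev = embed(sentences[0])
    for s in sentences[1:]:
        v = embed(s)
        denom = np.linalg.norm(prev) * np.linalg.norm(v)
        sim = float(prev @ v / denom) if denom else 0.0
        if sim < threshold:          # topic shift detected: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(s)
        prev = v
    chunks.append(" ".join(current))
    return chunks

sents = ["the court filed documents",
         "documents filed by the court",
         "flight logs list passengers"]
vocab = {w: i for i, w in enumerate(sorted({w for s in sents
                                            for w in re.findall(r"\w+", s.lower())}))}
print(semantic_chunks(sents, lambda s: toy_embed(s, vocab)))
```

The first two sentences share vocabulary, so they merge into one chunk; the third is dissimilar and starts a new one. Unlike fixed-size splitting, this keeps related material together, which matters when chunks later compete in retrieval.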