INSTAGRAM

A developer just built a fully searchable RAG pipeline over 2M+ pages of the Epstein Files, and the architecture decisions are worth studying. The dataset is public on Hugging Face (teyler/epstein-files-20k).

The MMR swap is the real lesson here. Pure cosine similarity at 100K+ chunks returns redundant results from the same document. Maximal Marginal Relevance balances relevance with diversity: each retrieved chunk has to earn its place by adding new information, not just scoring high.

Stack breakdown: LangChain RecursiveCharacterTextSplitter for chunking, all-MiniLM-L6-v2 for embeddings, ChromaDB as the vector store, Groq + LLaMA 3.3 for inference. Runs on 8GB RAM minimum; free Groq API key required.

Cleaning was deliberately lightweight: regex boundary detection, line-buffer reconstruction, dropping docs under 100 characters. Nothing exotic. The point was pipeline architecture, not preprocessing complexity.

Repo linked in comments. #ai #epsteinfiles #artificialintelligence #technews #algorithm
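The MMR idea the caption highlights is simple to sketch from scratch. This is a hypothetical pure-Python illustration of the greedy MMR selection rule, not the repo's code (which would go through LangChain's retriever interface): each pick maximizes `lambda * relevance - (1 - lambda) * redundancy`, where redundancy is the similarity to anything already selected.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mmr(query, docs, k=3, lambda_mult=0.5):
    """Maximal Marginal Relevance: greedily pick k doc indices, trading
    relevance to the query against redundancy with docs already picked."""
    candidates = list(range(len(docs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate documents and one diverse one, plain cosine ranking returns both duplicates, while MMR swaps the second duplicate for the diverse document, which is exactly the "echo vs. context" distinction the post is making.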

Feb 17, 2026 10,999
@arnitly
266 words 90% confidence
Someone just built a RAG pipeline over the Epstein files. Yup, you heard that right. And while this project might not land you an interview at Microsoft, the engineering behind it is worth understanding if you want to build pipelines at scale. The raw data was over 2 million pages. And the problem that most tutorials completely skip over is that when you search a database that large, the standard similarity search returns the top chunks closest to your query. That sounds about right, but those top results are almost always from the same document. And that is not context, that is echo. The fix is switching to MMR, or Maximal Marginal Relevance. Instead of just finding the most similar chunks, it finds chunks that are similar to the query but also least similar to everything that has already been retrieved. So each result adds new information. When you ask a question, you get diverse, grounded context, and the LLM answers strictly from that. That one swap is the difference between a RAG that hallucinates and one that actually surfaces what is in these documents. The cleanup pipeline runs in three stages: first, cleaning and reconstructing the raw files; second, chunking them semantically; and finally, embedding everything into a ChromaDB vector store. The pre-computed embeddings have already been uploaded to Hugging Face by the developer, so you can skip the heavy lifting and jump straight to inference. I'll leave the repo link in the comments below. Follow me for more and I'll catch you guys in the next one.

This video discusses a RAG pipeline built over the Epstein files, emphasizing the importance of using Maximal Marginal Relevance for diverse search results. It outlines the architecture and tools used in the project.

  1. A RAG pipeline was built over 2 million pages of Epstein files.
  2. Maximal Marginal Relevance improves search results by ensuring diversity.
  3. Standard similarity search often returns redundant results from the same document.
  4. The pipeline architecture focuses on efficiency rather than complex preprocessing.
  5. Stack includes LangChain, MiniLM, ChromaDB, Groq, and LLaMA.
  6. Pre-computed embeddings are available on Hugging Face for easy access.
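For the chunking step named in the stack, the real pipeline uses LangChain's RecursiveCharacterTextSplitter, which prefers paragraph and sentence boundaries. As a much simpler stand-in that shows the core windowing idea (my sketch, not the project's code), a fixed-size sliding window with overlap looks like this:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size sliding-window chunker: each chunk repeats the last
    `overlap` characters of the previous one, so context spanning a
    chunk boundary is never lost to the retriever."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap is the important design choice: without it, a sentence cut at a chunk boundary is unrecoverable from either side; boundary-aware splitters like the one in the stack refine this by cutting at separators instead of raw character offsets.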
