Ground-up search systems course
Build a search engine. Understand every layer.

IndexZero

A ground-up search systems course. From raw text to ranked results to vector retrieval — one layer at a time.

Why does this course exist?

Most engineering programs teach databases or machine learning. Almost none teach search as a systems discipline. Students graduate knowing SQL and transformers, but with no mental model of how a query becomes a ranked list of results.

Search is the most deployed non-trivial backend system in production. Every e-commerce site has one. Every SaaS product has one. Every RAG pipeline starts with retrieval. Most engineers know how to call the API. Few understand why it returns what it returns.

Why now

RAG (Retrieval-Augmented Generation) put search infrastructure back in the spotlight. Every LLM application needs a retrieval layer. If you're building one, you need to understand what happens below the API.

Why build from scratch

You don't understand a system until you've made its mistakes yourself. Using Elasticsearch as a black box teaches you nothing about why recall drops after an index merge, or why BM25 outperforms a neural model on short queries.

Who is this for?

| Track | Profile | What they get | Depth |
| --- | --- | --- | --- |
| UG | B.Tech final year | Understand how search works; build a working system from scratch | M0 → M7 |
| PG | MTech / MS | IR foundations; starting point for search research | M0 → M9 + design reviews |
| IND | Practitioners | Understand the tools you already use at work | Self-paced, any module |
| OPEN | Self-learners | Work at your own pace, all code on GitHub | Full course, open access |

Module map

One codebase, IndexZero, extended incrementally across 10 modules. Each module produces a working system, not just a subsystem.

| Module | Name | What gets built | Core concept |
| --- | --- | --- | --- |
| M0 | The Problem | Ranking audit — no code | Why search is hard; forming hypotheses |
| M1 | Text as Data | Tokenizer + vocabulary builder | Normalization; Zipf's law; preprocessing decisions |
| M2 | The Index | Inverted index on disk | Postings lists; disk layout; lookup cost |
| M3 | Ranking | BM25 scorer | TF-IDF → BM25; why frequency alone fails |
| M4 | Did It Work? | Eval harness | nDCG, MRR, precision@k; benchmark discipline |
| M5 | Smarter Queries | Query processor | Boolean, phrase, proximity; query decomposition |
| M6 | Meaning, Not Words | Vector search (HNSW / flat) | Embeddings; semantic vs lexical; recall-latency frontier |
| M7 | Both Together | Hybrid retrieval | RRF; score fusion; cross-encoder re-ranking |
| M8 | Keeping It Alive | Index pipeline | Incremental updates; deletes; segment merges |
| M9 | The Full System | End-to-end search API | Latency budgeting; revisit M0 hypothesis |
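To make the M1–M3 arc concrete, here is a minimal sketch (not course code) of the three layers working together: a deliberately naive tokenizer, an in-memory inverted index, and a BM25 scorer. The corpus, function names, and parameter values (`k1=1.5`, `b=0.75`) are illustrative defaults, not the course's actual implementation.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative three-document corpus; the course uses real product titles.
docs = [
    "nike air max running shoes",
    "blue running shorts",
    "nike air force sneakers blue",
]

def tokenize(text):
    # M1: lowercase + split on non-letters (a deliberately naive normalizer)
    return re.findall(r"[a-z]+", text.lower())

# M2: inverted index, mapping term -> list of (doc_id, term_frequency)
index = defaultdict(list)
doc_len = []
for doc_id, doc in enumerate(docs):
    counts = Counter(tokenize(doc))
    doc_len.append(sum(counts.values()))
    for term, tf in counts.items():
        index[term].append((doc_id, tf))

N = len(docs)
avgdl = sum(doc_len) / N

def bm25(query, k1=1.5, b=0.75):
    # M3: BM25 adds tf saturation (k1) and length normalization (b)
    # on top of an IDF weight, fixing "frequency alone fails".
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, [])
        df = len(postings)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings:
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avgdl)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda x: -x[1])

print(bm25("nike blue"))  # doc 2 ranks first: it matches both terms
```

The point of the sketch is the separation of concerns: the tokenizer decides what a term is, the index decides where terms live, and the scorer decides what matters. Each module in the course replaces one of these naive layers with a real one.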

M0: The Problem

No code. No setup. Just observation, curiosity, and a hypothesis document that the rest of the course will systematically prove or disprove.

Module 0 · The Problem

Before you build anything, you have to feel the problem.

Search looks simple from the outside. You type words. Results appear. The illusion breaks the moment you ask: why this result, and not that one? This module is about breaking that illusion — before you have the vocabulary to explain it.

What students will learn

What they will get wrong (and that's the point)

| Common assumption | What's actually happening |
| --- | --- |
| "The site shows the most popular product first." | Popularity is one of many signals — weighted against recency, margin, inventory, query match, and personalisation. |
| "Better search means more results." | Precision and recall trade off. Showing more results lowers average relevance. Good search is ruthlessly selective. |
| "AI / semantic search is just better." | Keyword search outperforms vector search on exact queries and fresh content. Neither dominates universally. |
| "The search box just queries a database." | A separate index, built offline and structured for retrieval, is what gets queried. It's not the same store as the product database. |
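The precision/recall trade-off is easy to see with a toy example. The relevance judgments below are made up, and the helper names are ours, not the course's:

```python
# Toy relevance judgments for one query: 1 = relevant, 0 = not.
# Assume the whole corpus contains 4 relevant documents.
ranked = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]  # system's ranking, top to bottom
TOTAL_RELEVANT = 4

def precision_at_k(ranking, k):
    # Fraction of the top-k results that are relevant.
    return sum(ranking[:k]) / k

def recall_at_k(ranking, k):
    # Fraction of all relevant documents found in the top k.
    return sum(ranking[:k]) / TOTAL_RELEVANT

for k in (3, 5, 10):
    print(k, precision_at_k(ranked, k), recall_at_k(ranked, k))
# Showing more results raises recall (0.5 -> 1.0) but dilutes
# precision (0.67 -> 0.40): "more results" is not "better search".
```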

The M0 Exercise: Ranking Audit

Exercise · M0 · The Ranking Audit

Reverse-engineer a real search result page

Pick any Indian e-commerce site you actually use — Flipkart, Meesho, Nykaa, Zepto, whatever. You will observe, hypothesize, and document.

  1. Run 3 different searches on the same site. Choose one broad query ("shoes"), one specific query ("Nike Air Max size 10"), one ambiguous query ("blue").
  2. Screenshot the top 10 results for each search.
  3. For each of the top 3 results per query: write one hypothesis for why it ranked there. Be specific — not "it's popular" but "it ranked here because X."
  4. Find one result that surprises you — either too high or too low. Hypothesize why the system got it wrong.
  5. Write a one-paragraph answer to: "What signals do you think this search engine is using?" You will revisit this answer at M9.
Deliverable: Ranking Hypothesis Doc — 1 page, submitted before M1 begins. No code. No right answers.

The narrative thread

The Ranking Hypothesis Doc is not graded for correctness. It is a baseline artifact. At M9, after building a full search system, students revisit it. The delta between their M0 hypothesis and their M9 understanding is the most honest measure of what they learned.

Most students will find their M0 hypotheses were partially right and fundamentally incomplete. That gap is the course.

What data will students work with?

One dataset, used progressively deeper across all modules. Students build familiarity with the corpus the same way they build familiarity with the system.

Primary dataset: Amazon ESCI — real product queries with human relevance labels (Exact, Substitute, Complement, Irrelevant). Familiar domain, real queries, proper ground truth for eval.

| Module | Dataset used | Why this data at this stage | What becomes possible |
| --- | --- | --- | --- |
| M0 | Live site (student choice) | Immediate familiarity; personal relevance | Forms intuition before formalism |
| M1–M2 | Flipkart product titles | Short, messy, Indian English — great for normalization edge cases | Tokenizer stress testing |
| M3–M9 | Amazon ESCI subset | Real queries + relevance labels; consistent across modules | Proper eval; apples-to-apples comparison of all approaches |
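ESCI's graded labels are what make metrics like nDCG (M4) possible. A minimal sketch of how graded labels feed nDCG@k follows; the gain mapping here (E=3, S=2, C=1, I=0) is one common convention, not necessarily the one the course adopts:

```python
import math

# One possible gain mapping for ESCI labels (an assumption, not an
# official standard): Exact > Substitute > Complement > Irrelevant.
GAIN = {"E": 3, "S": 2, "C": 1, "I": 0}

def ndcg_at_k(labels, k):
    """labels: ESCI labels of a query's results, in ranked order."""
    gains = [GAIN[l] for l in labels[:k]]
    # DCG: discount each gain by its position in the ranking.
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # IDCG: the DCG of the best possible ordering of the same labels.
    ideal = sorted((GAIN[l] for l in labels), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# A ranking that puts a Substitute above an Exact match scores < 1.0:
print(round(ndcg_at_k(["S", "E", "I", "C"], 4), 3))  # 0.908
```

Because every module from M3 onward is evaluated against the same labeled queries, a BM25 run, a vector run, and a hybrid run all produce directly comparable nDCG numbers.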

Interested?

This course is in active development. If you'd like early access or want to use it in a classroom setting, get in touch.

Reach out on X · Email