Retrieval-augmented generation is table stakes for AI engineering interviews. If you are interviewing at an AI-native startup, a foundation model lab, or a F500 building internal copilots, you will get RAG questions in at least two of your four rounds. Comp sits between $130k and $280k base depending on seniority and city.
This guide is a 35-question Q&A bank covering what actually gets asked. If you want timed mocks that drill these patterns under pressure, Interview Coder runs sessions with live feedback.
Why RAG Is Now Baseline in AI Engineering Interviews
Three years ago, RAG was niche. You could get an ML offer without touching a vector database. That window is closed. Every team building on LLMs hits the same wall around month three: the model does not know their data. The fix is some flavor of RAG, and hiring managers test it directly.
What a typical loop looks like:
The trap is treating RAG like a coding problem. It is a system design problem with ML evaluation baked in. The candidates who pass talk about tradeoffs, cite real papers, and have numbers in their head for latency and recall.
RAG Fundamentals (8 Questions)
Warm-ups. If you cannot answer in 60 seconds, the interviewer marks you "not serious" and never recovers.
1. What is retrieval-augmented generation, and why does it exist?
RAG is a pattern where you fetch relevant documents at query time and stuff them into the LLM prompt as context. It exists because LLMs have a fixed knowledge cutoff, no access to private data, and a tendency to hallucinate on anything they have not seen. Frame it as a workaround for parametric memory limits, not a magic trick.
2. RAG vs fine-tuning. When do you pick which?
RAG is for knowledge that changes often, is too large to fit in weights, or needs auditability. Fine-tuning is for style, format, or specialized reasoning patterns you want baked into the model. Most production systems use both. Treat them as orthogonal tools, not alternatives.
3. Walk me through the full RAG pipeline.
Ingest, clean, chunk, embed each chunk, store embeddings plus metadata in a vector index. At query time, embed the query, similarity search, optionally rerank, format chunks into a prompt with the user question, send to the LLM, return the answer with citations. Do not skip reranking or citations when you list the stages.
4. What chunking strategies have you used, and how do you pick chunk size?
Fixed-size is simple but breaks sentences. Recursive character splitting respects paragraph boundaries. Semantic chunking groups by embedding similarity. Sliding windows with overlap preserve cross-chunk context. Size depends on workload: 256-512 tokens for QA (with 50-token overlap as a reasonable default), 1024-2048 for long-form synthesis.
5. How do text embeddings work at a high level?
An embedding model maps text to a dense vector in high-dimensional space (typically 768 to 3072 dims). Vectors close in that space are semantically similar. Modern models are transformer encoders trained with contrastive loss on related/unrelated text pairs. Do not confuse them with one-hot encodings or word2vec from 2014.
6. Which embedding model do you reach for first, and why?
For English, OpenAI text-embedding-3-large or Cohere embed-v3. For open source, BGE or E5. For multilingual, multilingual-e5. Check the MTEB leaderboard but benchmark on your own data before committing.
7. Why does dimensionality matter for embeddings?
Higher dimensions mean better representation quality but more memory, slower search, and larger indexes. Matryoshka embeddings let you truncate with graceful degradation, useful when fitting billions of vectors in RAM. Frame the answer as a storage/latency/quality tradeoff.
8. What is cosine similarity, and why is it the default for embeddings?
Dot product of two vectors divided by the product of their magnitudes. It measures angle, ignoring magnitude. Embedding models normalize outputs to unit length, which makes cosine equivalent to dot product and kills sensitivity to text length. Be ready to write the formula on a whiteboard.
Retrieval (8 Questions)
Retrieval is the part of RAG that interviewers probe hardest. Get this section right and you can survive a weak generation round.
9. Pinecone vs Weaviate vs Chroma vs pgvector. Walk me through the tradeoffs.
Pinecone: fully managed, fast, scales to billions, expensive, vendor lock-in. Weaviate: open source, hybrid search built in, decent self-hosted operator. Chroma: great for prototypes, weak past a few million vectors. Pgvector: the right call when you already run Postgres and your workload is under 10M vectors with moderate QPS. Give a real recommendation, not a feature checklist.
10. What is HNSW, and why is it the standard index for ANN search?
Hierarchical navigable small world is a graph-based approximate nearest neighbor index. Multi-layer graph: higher layers have long-range edges, lower layers have local neighborhoods. Search greedily descends the layers, giving log-time complexity in practice with high recall. Know it is graph-based (not tree-based) and why it beats IVF for most workloads.
11. When would you use IVF or product quantization instead of HNSW?
IVF is faster to build and uses less memory, so it wins at billion-scale where rebuild time matters. Product quantization compresses vectors to 8-bit or 4-bit codes, cutting memory 8-32x at the cost of some recall. Most systems at scale use IVF-PQ for cold storage and HNSW for hot.
12. Explain hybrid search. Why combine BM25 with dense vectors?
BM25 is a sparse keyword method that nails exact matches, rare terms, and acronyms. Dense embeddings nail semantic similarity but miss literal matches like product SKUs and error codes. Hybrid runs both and fuses results with reciprocal rank fusion. Lifts recall@10 by 10-30 percent versus dense alone.
13. What is reciprocal rank fusion, and how does it work?
RRF takes the rank of each document in each result list, computes 1/(k+rank), and sums across lists. The constant k (usually 60) dampens top-rank influence. It works because it does not require score normalization across retrievers, which is what makes naive BM25+dense fusion painful. Know the formula and why simple score addition fails.
14. How does reranking improve RAG, and what reranker would you pick?
A reranker takes the top n candidates and rescores them with a cross-encoder that sees query and document together. Much more accurate than the bi-encoder used for retrieval because it attends across both sides. Cohere Rerank, BGE reranker, Voyage rerank are the common picks. Retrieve top 50-100, rerank to top 5-10. Know the cross-encoder vs bi-encoder distinction cold.
15. What is query rewriting, and when is it worth the latency?
An LLM reformulates the user query before retrieval. Common patterns: HyDE (generate a hypothetical answer and embed that), multi-query (generate several rewrites, union results), decomposition (break complex queries into sub-questions). Costs an extra LLM call but lifts recall on ambiguous or short queries.
16. How do you handle retrieval over structured data, like tables or product catalogs?
Pure vector search struggles here. Extract structured fields at ingest, store as filterable metadata, combine semantic search with metadata filters. For complex queries, route to a text-to-SQL pipeline. The hard part is the routing decision: when does the query need SQL vs RAG vs both? Do not embed your way out of a SQL problem.
Generation (5 Questions)
Most candidates over-prepare here. The bar is lower than you think, but the details matter.
17. What prompt structure do you use for RAG?
System message scoping the assistant's role and grounding rules, retrieved chunks with clear separators and source IDs, the user query, and an instruction to cite sources or say "I do not know" if the answer is not in context. Context-before-question is standard for most models, though Claude prefers the inverse. Have a template you reuse.
18. How do you handle context window overflow when retrieval returns too many tokens?
Rank by relevance and truncate. Summarize each chunk with a smaller model first. Map-reduce: answer each chunk independently, then synthesize. Or use a long-context model like Gemini 1.5 Pro and pay the latency cost. Have more than one strategy and know long-context is not always cheaper.
19. How do you do citation and attribution well?
Embed source IDs in the formatted context, instruct the model to cite by ID after each claim, post-process to resolve IDs to URLs. For higher accuracy, use structured output (JSON mode) to force citations as separate fields. Watch for citation hallucination. Anthropic's Citations API ships this as a first-class feature. Treat citations as a verification problem.
20. How do you reduce hallucination in a RAG system?
Fix retrieval first, because most hallucinations come from missing or wrong context. Add explicit "say I do not know" instructions and few-shot examples. Lower temperature for factual tasks. Add a post-generation faithfulness check that verifies each claim against the retrieved chunks. Start with retrieval, not prompt tricks.
21. Generation latency vs end-to-end latency in RAG?
Generation latency is just the LLM call. End-to-end includes query embedding, retrieval, reranking, prompt construction, generation, post-processing. In a tuned system, generation dominates; in a poorly tuned one, retrieval and reranking each add 100-500ms. Streaming the first token hides most perceived latency, and it is the cheapest UX win you have.
Evaluation (6 Questions)
If you memorize one section, make it this one. Most candidates blank here, which is the easiest way to stand out.
22. How do you evaluate retrieval quality?
Labeled set of (query, relevant doc IDs) pairs. Measure precision@k, recall@k, MRR, NDCG@k. Precision@k tells you how clean the top results are, recall@k tells you whether you found everything, MRR rewards getting the right answer first, NDCG handles graded relevance. Define each and pick the right one for the task.
23. How do you evaluate generation quality in RAG?
Faithfulness (answer matches retrieved context), answer relevance (addresses the question), context precision/recall (retriever surfaced the right context). Ragas and TruLens are the common frameworks. For high-stakes domains, human-grade a sample and use it to calibrate the automated metrics, since automated metrics need human anchoring.
24. What is LLM-as-judge, and what are its failure modes?
Use a strong LLM to grade outputs against a rubric. Cheaper than human eval and correlates well at population level. Failure modes: position bias, verbosity bias, self-preference for same model family, rubric drift. Mitigate by swapping positions, blinding the source, and calibrating against humans.
25. How do you build a golden set for RAG evaluation?
Start with real user queries from logs, dedupe, sample for diversity. Label answers with SMEs, capture the ideal retrieved documents, version the set so you can track regression. Aim for 100-500 queries to start. Start from real traffic, not synthetic queries.
26. Offline vs online evaluation. When do you use each?
Offline runs on a fixed golden set: fast, reproducible, good for catching regressions in CI. Online runs in production via A/B tests: captures real user behavior, slow and noisy. Offline for every PR, online for any change that touches user-facing quality. Neither alone is enough.
27. How do you detect retrieval drift over time?
Track precision@k and recall@k on a rolling golden set. Monitor distribution of retrieved scores: if average top-1 similarity drops, your corpus has shifted away from the queries. Sample real queries weekly and re-grade with LLM-as-judge to catch slow degradation. Treat production as a thing that decays.
Production System Design (5 Questions)
The round where senior candidates earn the title. Real architecture, not "I would use Pinecone."
28. Design a multi-tenant RAG system for 10,000 enterprise customers.
Each tenant needs strict data isolation. Options: separate index per tenant (high isolation, painful at scale), shared index with tenant ID as metadata filter (cheap, leakage risk), hybrid with high-value tenants on dedicated indexes. Row-level security at the vector store, encrypt embeddings at rest, audit every retrieval. Latency budget is 1-2 seconds end-to-end, so cache at every layer. Address the cost-vs-isolation tradeoff explicitly.
29. Walk me through a latency budget for a 2-second RAG response.
Query embedding: 50-100ms. Retrieval: 50-150ms. Rerank: 100-300ms. Prompt construction and network: 50ms. LLM first token: 300-800ms. Streaming the rest: 500-1000ms. Budget gone. Compress by parallelizing embedding and metadata lookup, skipping rerank for high-confidence queries, and streaming the first token immediately.
30. How do you optimize RAG cost at scale?
Cache query embeddings (queries repeat more than you think). Cache retrieval results for high-frequency queries with short TTL. Route easy queries to a smaller model. Use prompt caching (Anthropic, OpenAI both ship it). Reduce context size by reranking aggressively. Batch embeddings at ingest. Quantize the index. Track cost per query as a first-class metric.
31. What caching strategies work for RAG?
Three layers: query embedding cache, retrieval result cache (key: query+filters, value: doc IDs), full response cache (key: query+user context, value: answer). Each has different hit rates and invalidation needs. Semantic caching, using embedding similarity to find near-duplicate queries, lifts hit rates 2-5x over exact-match. It is more than "Redis in front of OpenAI."
32. How do you handle freshness when your corpus updates constantly?
Incremental indexing with a queue: new docs hit a write-ahead log, get embedded in a worker pool, then upserted. For real-time freshness, run a secondary hot index for the last N hours and merge at query time. Invalidate caches on update. For deletion, soft-delete via metadata flag and run periodic compactions to actually remove vectors. Real ingest pipeline, not "I re-index nightly."
Advanced (3 Questions)
Differentiators in senior loops. You need at least one you can go deep on.
33. What is agentic RAG, and when is it worth the complexity?
The LLM plans retrieval: decides what to search for, runs multiple queries, evaluates results, decides when to stop. Useful for multi-hop questions where one round cannot find the answer. Costs more LLM calls per query and adds latency, so reserve it for complex queries and route simple ones to standard RAG. ReAct and Plan-and-Execute are the common patterns. Lilian Weng's LLM-powered autonomous agents post is the starting reference. It is not free.
34. What is GraphRAG, and what does it solve that vector RAG cannot?
GraphRAG builds a knowledge graph from your corpus (entities and relationships extracted with an LLM), then traverses it at query time to connect multiple facts. Vector RAG struggles with multi-hop questions about relationships ("who reports to the head of engineering for the team that built X"). GraphRAG handles those at the cost of an expensive ingest pipeline (roughly 10x more than vector RAG). Microsoft's GraphRAG paper and Neo4j's tooling are the common references.
35. What is contextual retrieval, and what does it solve?
Anthropic's contextual retrieval prepends a short LLM-generated context to each chunk before embedding ("This chunk discusses Q3 revenue from the 2024 annual report"). Fixes the lost-context problem where a chunk pulled from its document loses meaning. Anthropic reports retrieval failure rates dropped 35-50 percent when combined with reranking. Read the original contextual retrieval post before any senior interview, it gets cited constantly.
How to Prepare
Reading this is the warm-up. The actual prep is reps with feedback. What works:
If you want timed mock sessions with rubric-based feedback, Interview Coder is what I built for exactly this. Practice the format until it stops feeling like a performance.