AI engineer job postings grew 143% year over year on LinkedIn's Economic Graph through 2025, and the role pays $110K to $210K depending on seniority and city per Levels.fyi. The catch: most interview prep content on the internet is two years out of date and still teaches you how to derive softmax. That is not what the loop tests anymore.
The 2026 AI engineer interview is a different beast. You will get questions on RAG chunking, agent failure modes, LLM-as-judge evals, and how to keep latency under 800ms when every call hits a frontier model. This guide is 50 questions across the five clusters that cover almost every loop, with a model answer and what the interviewer is actually grading. If you want timed reps in the same format, Interview Coder runs mock AI engineering sessions.
What Is an AI Engineer in 2026
An AI engineer builds products on top of foundation models. That is the whole job description. You wire LLMs into retrieval systems, design agent loops, write evals, ship inference endpoints, and keep the bill under control. You do not train models from scratch.
This is different from an ML engineer, who owns training pipelines, feature stores, and model deployment for in-house models. It is also different from a data scientist, who runs experiments and ships dashboards. The AI engineer role exploded after GPT-4 launched in early 2023 because companies suddenly needed people who could turn an API into a product without setting money on fire.
Compensation in 2026: $110K to $210K depending on seniority and city, per Levels.fyi.
The 143% YoY growth in postings is concentrated in two buckets: scaleups building agent products (think customer support, coding agents, ops automation) and incumbents bolting LLMs onto existing software. The questions reflect that split. Scaleups grill you on agent design and evals. Incumbents grill you on RAG and integration.
Foundational Questions (LLM and Transformer Basics)
These 10 are warm-ups. If you stumble on more than two, the rest of the loop gets harder.
1. Explain attention in one paragraph
Answer: Attention lets each token in a sequence look at the other tokens it is allowed to see (every earlier token, in a decoder-only LLM) and decide which ones matter for predicting the next token. You compute query, key, and value projections, take the softmax of QK^T divided by √d_k (the key dimension), then multiply by V to get a weighted sum of values.
What they want: Whether you can explain a transformer without reading from a slide.
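The formula in that answer fits in a few lines. Here is a minimal single-head sketch in plain Python, assuming Q, K, and V are already projected and skipping masking and batching; the function names are illustrative and not taken from any library.

```python
import math


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors (single head, no mask)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out
```

A quick sanity check worth saying out loud: if every key is identical, the weights come out uniform and the output is just the average of the value vectors.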
2. What is the KV cache and why does it matter?
Answer: The KV cache stores key and value tensors from previous decoding steps so you do not recompute attention over the entire prefix on every new token. It is the reason each new token costs work proportional to the prefix length instead of a full quadratic recomputation, and the reason long contexts blow up your GPU memory.
What they want: Whether you understand inference cost, not just training math.
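The memory side of that answer is plain arithmetic. Here is a rough sizing sketch, assuming a standard decoder that keeps one key tensor and one value tensor per layer; the model shape in the usage note is hypothetical, not a specific released model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes (bytes_per_value=2 means FP16/BF16)."""
    # The leading 2 accounts for storing both keys and values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, 8192-token context.
size_gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
```

For that hypothetical shape the cache is 1 GiB per sequence, and it grows linearly with both context length and batch size, which is why long contexts and high concurrency fight over the same VRAM.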
3. Difference between fine-tuning and RAG
Answer: Fine-tuning bakes new knowledge or behavior into model weights through gradient updates on a labeled dataset. RAG keeps the model frozen and injects relevant documents at inference time through a retrieval step, which is cheaper to update and easier to audit.
What they want: Whether you reach for the right tool. Most "we need to fine-tune" requests are actually RAG problems.
4. What does temperature do, and when would you set it to 0?
Answer: Temperature divides the logits before softmax, so low values make the distribution sharper and high values make it flatter. Set it to 0 (greedy decoding) for near-deterministic outputs (classification, structured extraction, anything you will eval), and raise it for creative tasks where you want variation.
What they want: Whether you understand sampling enough to debug a flaky output.
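To see the sharpening effect concretely, here is a small sketch of temperature-scaled softmax. Treating temperature 0 as "put all mass on the argmax" is an assumption about how serving stacks usually handle it, since dividing by zero is undefined.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy decoding.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # more mass on the top token
flat = softmax_with_temperature(logits, 2.0)   # mass spread more evenly
```

Comparing `sharp[0]` with `flat[0]` shows the top token gaining probability as temperature drops, which is the behavior you lean on when debugging a flaky output.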
5. How does tokenization affect cost and behavior?
Answer: Models charge per token, not per character, so a poorly tokenized language (Thai, Korean, code with rare symbols) can cost 3-5x more than English. It also affects context window math, prompt design, and where the model splits words for output streaming.
What they want: Whether you have actually shipped to non-English users or done unit economics on an API product.
6. What is a context window, and what happens when you exceed it?
Answer: The context window is the maximum number of tokens the model can attend to in one call, typically 128K to 2M for frontier models in 2026. Exceeding it either truncates the input silently (older APIs) or throws a 400 error (newer ones), and either way your retrieval and history strategy needs to fit the budget.
What they want: Whether you have hit this in production and built backpressure for it.
7. Why do LLMs hallucinate, and how do you mitigate it?
Answer: LLMs predict the next token from a distribution learned over training data, with no internal mechanism to flag uncertainty about facts. Mitigations include grounding through RAG with citation enforcement, structured output validation, lower temperature, and LLM-as-judge checks against a source document.
What they want: Whether you see hallucinations as an engineering problem with knobs, not a model defect.
8. What is an embedding, and how do you pick a model?
Answer: An embedding is a dense vector representation of text where semantic similarity shows up as high cosine similarity (small cosine distance). Pick a model based on MTEB benchmark performance for your domain, dimension count (768 vs 1536 vs 3072 changes storage cost 2-4x), and whether you need a multilingual or code-specific variant.
What they want: Whether you treat embedding model choice as a real decision or just default to OpenAI.
9. What is quantization, and what tradeoff does it make?
Answer: Quantization reduces the precision of model weights (FP16 to INT8 or INT4) to shrink memory and speed up inference, at the cost of small quality regressions. INT8 typically costs you 1-2% on benchmarks but cuts weight memory in half (roughly 140 GB down to 70 GB for a 70B model), which can decide how many GPUs you need to serve it.
What they want: Whether you have run self-hosted inference and made deployment tradeoffs.
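The memory tradeoff is easy to estimate on the spot. This sketch counts weights only and ignores the KV cache, activations, and any quantization overhead such as scales and zero points, so treat the numbers as a floor rather than a deployment plan.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


params_70b = 70e9
fp16_gb = weight_memory_gb(params_70b, "fp16")  # 140.0
int8_gb = weight_memory_gb(params_70b, "int8")  # 70.0
int4_gb = weight_memory_gb(params_70b, "int4")  # 35.0
```

Each halving of precision halves the weight footprint, which is the number you compare against per-GPU VRAM before arguing about benchmark deltas.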
10. What is mixture of experts, and why does it matter for cost?
Answer: MoE routes each token to a small subset of expert subnetworks instead of activating the full model, so a 400B parameter MoE might only run 40B active params per token. This is why frontier models in 2026 can have huge parameter counts without proportional inference cost.
What they want: Whether you read papers or just read tweets.
RAG Architecture Questions
Almost every AI engineer loop in 2026 has at least one deep RAG round. These 10 are the core.
11. How would you chunk a 200-page PDF for retrieval?
Answer: Start with semantic chunking around 500-1000 tokens with 10-20% overlap, keeping sections and tables intact instead of splitting on raw character counts. Add document-level and section-level metadata to each chunk so you can filter and rerank by structure later.
What they want: Whether you have actually shipped a RAG pipeline or just read a tutorial.
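The overlap arithmetic is the part people get wrong, so here is a minimal sketch of it. This is a plain fixed-size sliding window, not semantic chunking: it assumes the document is already tokenized into a list and does nothing to keep sections or tables intact, which you would layer on top.

```python
def chunk_tokens(tokens, chunk_size=800, overlap=120):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, so we do not emit a
        # trailing chunk that is fully contained in the previous one.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With `chunk_size=800` and `overlap=120` the overlap is 15%, inside the 10-20% range above. In a real pipeline you would run this per section and attach the section metadata to every chunk it produces.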
12. BM25 versus dense retrieval. When does each win?
Answer: BM25 wins on exact keyword matches, rare terms, and acronyms because it scores literal term overlap and up-weights rare terms through inverse document frequency. Dense retrieval wins on semantic paraphrase and conceptual queries because embeddings capture meaning rather than surface form.
What they want: Whether you understand why pure dense retrieval fails on "what is the SLA in section 4.3."
13. What is hybrid search, and how do you combine the scores?
Answer: Hybrid search runs BM25 and dense retrieval in parallel and merges the result sets, typically with reciprocal rank fusion or a learned weighted sum. RRF is the default because it does not require score normalization across two scoring systems.
What they want: Whether you have built this or only heard the term.
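Reciprocal rank fusion is short enough to write from memory. A minimal sketch, assuming each retriever returns an ordered list of document IDs; `k=60` is the constant commonly used with RRF, and the IDs in the usage example are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters, so raw BM25 and cosine scores
            # never need to be normalized against each other.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that appear in both lists ("b" and "c" here) float to the top, even though neither retriever's raw scores were ever compared.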
14. Why do you need a reranker, and what does it cost?
Answer: First-stage retrieval optimizes for recall (find anything relevant in the top 100), and a cross-encoder reranker optimizes for precision (put the right answer at position 1-5). It adds 50-200ms of latency per query and roughly 10-30% in cost, but typically lifts answer quality by 15-40% on hard queries.
What they want: Whether you know the two-stage retrieval pattern is the default for any serious RAG system.
15. How do you evaluate a RAG system?
Answer: Build a golden set of 100-500 query-answer-source triples, then measure retrieval (recall@k, MRR) separately from generation (faithfulness, answer relevance, citation accuracy). Run an LLM-as-judge against the golden set on every change and gate deploys on it.
What they want: Whether you have an eval loop or you ship blind.
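The two retrieval metrics named above are a few lines each. A minimal sketch, assuming the golden set gives you, for every query, the IDs the retriever returned in order and the IDs a labeler marked as relevant.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_reciprocal_rank(results):
    """results: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    if not results:
        return 0.0
    total = 0.0
    for retrieved, relevant in results:
        relevant = set(relevant)
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(results)
```

Generation metrics such as faithfulness need a judge model or human labels, so they are not sketched here; the point is that retrieval quality can be measured cheaply and separately.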
16. Top three failure modes of a RAG pipeline?
Answer: Wrong chunk retrieved (fix with reranker or better chunking), right chunk but model ignores it (fix with prompt structure and citation enforcement), and out-of-distribution query the index does not cover (fix with query rewriting or a fallback). Most production RAG bugs are one of these three.
What they want: Whether you have debugged a real RAG system at 3am.
17. What is query rewriting, and when do you use it?
Answer: Query rewriting uses an LLM to expand, clarify, or decompose the user query before retrieval, often into multiple subqueries. Use it for conversational RAG (resolve coreferences), multi-hop questions (decompose), or short ambiguous queries (expand with synonyms).
What they want: Whether you treat the user query as raw input or as a signal to be processed.
18. How do you handle citations and avoid the model fabricating sources?
Answer: Pass document chunks with explicit IDs in the prompt, instruct the model to cite IDs inline, then validate every cited ID exists in the retrieved set before returning the response. If a citation is missing or fabricated, either strip it or regenerate with a stricter prompt.
What they want: Whether you treat hallucinated citations as a bug class to defend against, not an unsolvable LLM quirk.
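The validation step can be a plain post-processing check. This sketch assumes you told the model to cite as `[doc:ID]`; that marker format and the chunk IDs are illustrative choices, not a standard.

```python
import re

# Assumed inline citation format: [doc:<chunk-id>]
CITATION_PATTERN = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")


def check_citations(answer, retrieved_ids):
    """Find cited chunk IDs and flag any that were never retrieved."""
    cited = CITATION_PATTERN.findall(answer)
    allowed = set(retrieved_ids)
    fabricated = [c for c in cited if c not in allowed]
    # "ok" requires at least one citation and no fabricated ones.
    return {"cited": cited, "fabricated": fabricated, "ok": bool(cited) and not fabricated}


report = check_citations(
    "The SLA is 99.9% [doc:c12]. Refunds take 5 days [doc:c99].",
    retrieved_ids=["c12", "c40"],
)
```

Here `report["fabricated"]` contains `c99`, so the caller would strip that citation or regenerate with a stricter prompt, as described above.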
19. What is multi-vector retrieval (ColBERT style)?
Answer: Instead of one embedding per document, you store one embedding per token and compute late interaction at query time using MaxSim. It is more accurate than single-vector dense retrieval but 10-50x more expensive in storage and compute.
What they want: Whether you read the actual literature or only the marketing.
20. Your RAG works on 1K docs. How do you scale to 10M?
Answer: Move from in-memory or naive vector store to a managed vector DB with HNSW or IVF indexes, partition by metadata (tenant, region, doc type), and add a query router that hits the right partition. Add a caching layer for frequent queries and precompute reranked top-K for hot documents.
What they want: Whether you can think about RAG as a distributed systems problem.
Agentic Systems Questions
Agents are where most 2026 startups are spending their headcount. Expect a full round on this.
21. Explain ReAct in one paragraph
Answer: ReAct interleaves reasoning ("Thought") and action ("Action") steps in a single LLM loop, where the model writes its plan, calls a tool, observes the result, and decides the next step. It is the simplest agent loop and the foundation under most tool-using agents shipped in 2024-2026.
What they want: Whether you know the canonical agent paper or are just buzzwording.
22. Planner-executor versus single-loop agent. When do you pick which?
Answer: Single-loop (ReAct) works when each step is cheap, the search space is small, and you can recover from one bad step. Planner-executor splits work into a planning LLM that produces a structured plan upfront and an executor that runs it, which is better for expensive tools, longer horizons, and tasks where committing to a plan early reduces cost.
What they want: Whether you have built more than one agent and learned why ReAct breaks at scale.
23. How do you design a function schema for tool calling?
Answer: Use JSON schema with required fields, strict types, and short clear descriptions because the LLM reads the description like a system prompt. Keep parameter names self-explanatory, avoid optional booleans (the model gets them wrong), and put examples in the description for any non-obvious format.
What they want: Whether you have debugged a model that keeps passing the wrong argument shape.
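As an illustration of those rules, here is a JSON-Schema-style tool definition written as a Python dict, plus a tiny argument check. The `issue_refund` tool and its fields are hypothetical, and `validate_args` is a hand-rolled sketch that covers only required fields, basic types, and enums, not a full JSON Schema validator.

```python
REFUND_TOOL = {
    "name": "issue_refund",
    "description": "Refund a paid order. amount_cents is an integer in cents, e.g. 1999 for $19.99.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order ID, e.g. 'ord_123'."},
            "amount_cents": {"type": "integer", "description": "Refund amount in cents."},
            "reason": {"type": "string", "enum": ["damaged", "late", "other"]},
        },
        # Everything is required: no optional flags for the model to guess at.
        "required": ["order_id", "amount_cents", "reason"],
        "additionalProperties": False,
    },
}

PY_TYPES = {"string": str, "integer": int}


def validate_args(schema, args):
    """Return a list of problems with model-supplied arguments (empty if valid)."""
    params = schema["parameters"]
    errors = [f"missing: {name}" for name in params["required"] if name not in args]
    for name, value in args.items():
        spec = params["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected: {name}")
        elif not isinstance(value, PY_TYPES[spec["type"]]) or isinstance(value, bool):
            errors.append(f"wrong type: {name}")  # bool is rejected even though it subclasses int
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"invalid value: {name}")
    return errors
```

Feeding the error list back to the model as the tool result is usually enough for it to correct the call on the next step.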
24. How do you give an agent memory?
Answer: Short-term memory is just conversation history truncated to fit context. Long-term memory is a separate store (vector DB for semantic recall, key-value for facts, episodic log for past sessions) that the agent queries through a tool call, not by stuffing everything into the prompt.
What they want: Whether you understand that "context window" and "memory" are different problems.
25. How do you coordinate multiple agents?
Answer: Pick one of three patterns: hierarchical (one orchestrator delegates to specialists), peer-to-peer with a shared scratchpad (each agent reads and writes a common state), or pipeline (output of one is input of next). Hierarchical is the default for product use cases because it is the easiest to debug.
What they want: Whether you have actually run multi-agent in production or just watched a demo.
26. Your agent fails 20% of the time. How do you debug?
Answer: Log every trace (input, every tool call, every model output, final result), bucket failures by failure mode (wrong tool picked, right tool but wrong args, tool succeeded but model ignored result, infinite loop), and fix the largest bucket first. Then add an automated eval against a fixed set of failed traces to catch regressions.
What they want: Whether you treat agent debugging as observability work, not vibes.
27. How do you prevent an agent from running forever?
Answer: Hard cap on max iterations (typically 10-25), per-iteration timeout, total wall-clock budget, and a cost ceiling that aborts when token spend exceeds threshold. Also a "no progress" check that compares state across iterations and breaks if the agent is looping.
What they want: Whether you have been on call for an agent that burned $400 in one user session.
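Most of those guards fit in one small object that the agent loop consults after every step. A minimal sketch with made-up default limits; `RunBudget` is a hypothetical helper, and per-iteration timeouts are assumed to live in the tool client rather than here.

```python
import time


class RunBudget:
    """Tracks iteration, wall-clock, cost, and no-progress limits for one agent run."""

    def __init__(self, max_iterations=15, max_seconds=120.0, max_cost_usd=2.0):
        self.max_iterations = max_iterations
        self.max_seconds = max_seconds
        self.max_cost_usd = max_cost_usd
        self.started = time.monotonic()
        self.iterations = 0
        self.cost_usd = 0.0
        self.last_state = None

    def check(self, state, step_cost_usd):
        """Call once per step. Returns a stop reason, or None to keep going."""
        self.iterations += 1
        self.cost_usd += step_cost_usd
        if self.iterations > self.max_iterations:
            return "max_iterations"
        if time.monotonic() - self.started > self.max_seconds:
            return "timeout"
        if self.cost_usd > self.max_cost_usd:
            return "cost_ceiling"
        if state == self.last_state:
            return "no_progress"  # same state twice in a row: likely looping
        self.last_state = state
        return None
```

The `state` argument can be anything comparable, for example a tuple of the last tool name and its arguments; comparing only against the previous step is the crudest possible loop detector, and a real one would look further back.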
28. How do you sandbox tool execution?
Answer: Untrusted code goes in an isolated container or VM with no network, no filesystem access outside a working directory, and a CPU/memory cap. For tool calls that touch external APIs, scope credentials to read-only or per-tenant, log every call, and require human approval for high-risk actions like writes or deletes.
What they want: Whether you have shipped an agent without inviting a security incident.
29. How do you evaluate an agent end to end?
Answer: Build a benchmark of task scenarios with success criteria (did it complete the task, how many steps, how much it cost), run each scenario N times to capture stochasticity, and use LLM-as-judge for tasks where the success criterion is qualitative. Track success rate, mean iterations, and p95 cost over time.
What they want: Whether you can prove your agent got better or worse, not just feel it.
30. What is structured output enforcement, and why does it matter for agents?
Answer: Constrained decoding (JSON mode, grammar-based sampling, regex constraints) forces the model to emit valid structured output token by token, eliminating parsing failures. For agents that loop on tool outputs, even a 1% parse failure rate compounds into roughly 10% task failure over 10 steps (1 - 0.99^10 ≈ 9.6%).
What they want: Whether you understand failure compounding in multi-step systems.
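The compounding math is worth being able to reproduce live. A one-function sketch, assuming every step fails independently with the same probability, which real agents only approximate.

```python
def task_failure_rate(step_failure_rate, steps):
    """Probability that at least one of `steps` independent steps fails."""
    return 1.0 - (1.0 - step_failure_rate) ** steps


ten_steps = task_failure_rate(0.01, 10)     # about 0.096
twenty_five = task_failure_rate(0.01, 25)   # about 0.222
```

A 1% per-step failure rate already loses close to one task in ten at 10 steps and more than one in five at 25, which is the argument for driving parse failures to zero instead of retrying them.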
Prompt Engineering and Evaluation Questions
These overlap with RAG and agents but get asked as a distinct round at companies that ship LLM-heavy products.
31. Few-shot versus chain-of-thought. When do you use each?
Answer: Few-shot gives the model 2-8 input-output examples to imitate a format or style, and it works for classification, extraction, and tone matching. Chain-of-thought asks the model to reason step by step before answering, and it works for math, multi-step logic, and any task where the model gets the right answer with thinking and the wrong one without.
What they want: Whether you reach for the right prompting pattern by reflex.
32. Why might few-shot examples hurt performance?
Answer: Examples bias the model toward the patterns shown, so if your few-shot set is unrepresentative or contains errors, the model imitates the errors. Also, every example eats context and inference cost, so on tasks where the model already gets it right, few-shot adds cost without lift.
What they want: Whether you measure the impact of prompt changes or just stack patterns.
33. What is an LLM-as-judge, and what are its failure modes?
Answer: LLM-as-judge uses a model to score another model's output against a rubric, which is how you scale eval beyond what humans can label. Failure modes include position bias (prefers the first option), length bias (prefers longer answers), self-bias (judges its own family more favorably), and rubric drift (vague rubrics get inconsistent scores).
What they want: Whether you trust LLM-as-judge blindly or you calibrate it.
34. How do you build a golden eval set?
Answer: Start with 50-200 real user queries sampled from production, label expected behavior with a domain expert, and stratify by query type so each bucket has enough examples. Grow it over time by adding every regression and every customer complaint as a new test case.
What they want: Whether you have built one or just read the LangSmith docs.
35. What is prompt injection, and how do you defend against it?
Answer: Prompt injection is when user input contains instructions that override the system prompt ("ignore previous instructions and..."). Defenses include treating user input as untrusted data with clear delimiters, instruction hierarchy in the prompt, output filtering, and for agent systems, scoping permissions so a successful injection cannot cause damage.
What they want: Whether you take LLM security seriously.
36. How would you A/B test two prompts in production?
Answer: Random-assign each request to variant A or B at the request level, log both the output and a downstream success metric (user thumbs-up, task completion, conversion), and run until you have statistical significance on the metric you care about. Do not eyeball outputs and pick the one that "looks better."
What they want: Whether you run experiments or write opinions.
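"Statistical significance" here usually means something as simple as a two-proportion z-test on the success metric. A minimal sketch using only the standard library; the counts in the usage example are invented, and a real experiment would also fix the sample size up front rather than peeking.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in success rate between variants A and B."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all, nothing to distinguish
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 500/1000 thumbs-up on prompt A, 560/1000 on prompt B.
z, p_value = two_proportion_z_test(500, 1000, 560, 1000)
```

With those invented counts the p-value lands below 0.05, so B's lift would count as significant at the usual threshold; with a tenth of the traffic the same rates would not.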
37. How do you keep a system prompt from drifting as the product grows?
Answer: Treat the system prompt as versioned code with a changelog, run the full eval set on every change, and require a PR review before merging prompt diffs. Most prompt regressions come from someone tacking on "and also..." without checking the impact on the existing test cases.
What they want: Whether you have lived through a prompt regression that broke prod.
38. What is constrained decoding?
Answer: Constrained decoding restricts the model's sampling distribution at each token to only valid next tokens given a grammar or schema, guaranteeing parseable output. It is implemented at the inference layer (vLLM, Outlines, OpenAI structured outputs) and is the only reliable way to get JSON out of an LLM at scale.
What they want: Whether you know there is a better answer than "just retry the parse."
39. How do you handle non-determinism in evals?
Answer: Run each eval N times (typically 3-10) and report mean and standard deviation, or set temperature to 0 for deterministic outputs and accept that you are not testing the full distribution. For agent evals, you almost always need N>1 because tool ordering varies even at temperature 0.
What they want: Whether you have been bitten by "my eval passed yesterday."
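Reporting mean and spread takes a few lines with the standard library. A minimal sketch, assuming each eval run produces a single score between 0 and 1; the scores in the example are made up.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) across repeated eval runs."""
    mean = statistics.mean(scores)
    # stdev needs at least two data points; report 0.0 for a single run.
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, stdev


mean_score, spread = summarize_runs([0.8, 0.9, 1.0])
```

If the difference between two prompt versions is smaller than the spread across runs of a single version, you have not measured an improvement yet.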
40. What is the difference between an offline eval and an online eval?
Answer: Offline evals run against a fixed golden set and gate deploys, optimizing for catching regressions before users see them. Online evals run against live traffic (success rate, user feedback, conversion) and catch the gaps your golden set missed, especially distribution shift.
What they want: Whether you ship with both, or just one and a prayer.
System Design Questions for AI Roles
This round is closest to a classic system design loop but with LLM-specific failure modes and cost math.
41. Design a ChatGPT-style assistant with multi-turn conversation
Answer: Frontend streams tokens via SSE, backend orchestrator manages session state (Redis), retrieval layer pulls relevant past messages and external docs, prompt assembler builds the final context, inference layer hits the model, and a logging layer captures every turn for eval. Add per-user rate limiting, cost tracking, and a fallback model for when the primary is rate limited.
What they want: Whether you can architect an LLM product, not just a CRUD app.
42. Design semantic search over 10M product descriptions
Answer: Batch embed all products with a strong embedding model, store in a managed vector DB (Pinecone, Qdrant, pgvector) with HNSW index, expose a query endpoint that embeds the query and returns top-K with metadata filters. Add a reranker for the top 50, a cache for popular queries, and a backfill pipeline for new products.
What they want: Whether you can size storage, latency, and cost.
43. Design a customer support agent
Answer: Intake classifier routes the conversation to "answerable by docs" (RAG path) or "needs action" (agent path with tools for refunds, account lookup, ticket creation). Both paths log full traces, escalate to human after N failed attempts or low confidence, and write back to the CRM. Eval set covers the top 50 ticket types.
What they want: Whether you understand that agents should not be in the critical path of every conversation.
44. Design multi-tenant LLM serving
Answer: Per-tenant API keys, per-tenant rate limits, per-tenant cost budgets enforced at the gateway, and tenant ID propagated into every retrieval filter and cache key so cross-tenant data leaks are blocked by construction rather than by prompt wording. Serve from a shared model with batched inference for throughput, and route premium tenants to a faster lane.
What they want: Whether you have shipped a B2B LLM product.
45. How would you reduce cost on an LLM product by 50%?
Answer: First, route easy queries to a smaller cheaper model with a classifier or a model cascade (start cheap, escalate on low confidence). Then cache common queries, compress prompts (remove redundant instructions, switch to terse formats), and batch where latency permits. Track cost per query as a deploy-gated metric.
What they want: Whether you treat unit economics as an engineering problem.
46. Your p95 latency is 4 seconds and target is 1 second. What do you change?
Answer: Stream the response so time-to-first-token drops to 200-400ms even if total stays 4s, then look at the latency budget breakdown (retrieval, prompt assembly, model call, post-processing) and attack the biggest bucket. Often the win is a smaller model, parallel retrieval, or skipping the reranker on easy queries.
What they want: Whether you understand streaming and budget decomposition.
47. How do you handle rate limits from a third-party model API?
Answer: Add a retry layer with exponential backoff plus jitter, queue requests with a token bucket per API key, and fall back to a secondary provider (different model family or self-hosted) when the primary is hard-limited. Pre-warm capacity at peak times if your provider supports provisioned throughput.
What they want: Whether you have shipped against a rate-limited dependency.
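The retry layer is the piece interviewers most often ask you to write out. A minimal sketch of exponential backoff with full jitter; `RateLimitError` is a stand-in for whatever exception your provider's client raises on a 429, and the `sleep` and `rng` parameters exist only to make the function testable.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for the provider-specific rate limit exception."""


def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RateLimitError with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: let the caller fall back to another provider
            # The cap doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the cap so clients
            # that were limited together do not retry in lockstep.
            sleep(cap * rng())
```

The final `raise` is where the secondary-provider fallback hooks in: catch it one level up and reroute the request.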
48. Design observability for an LLM product
Answer: Log every request with input, full prompt, model response, latency, token counts, and cost. Add structured spans for retrieval, reranking, tool calls, and post-processing. Surface user feedback (thumbs, edits, conversions) as a separate stream joined back to traces, and feed the bad ones into the eval set.
What they want: Whether you ship observability before or after the first incident.
49. How do you cache LLM responses without breaking personalization?
Answer: Cache at the embedding level (semantic cache: if the new query embedding is within ε of a cached one, return the cached answer), keyed by user-invariant inputs only. For per-user content, cache the expensive prefix (retrieval results, document embeddings) and only call the model on the final composition step.
What they want: Whether you understand both caching layers in an LLM product.
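A semantic cache is a nearest-neighbor lookup with a cutoff. This sketch uses a linear scan and a cosine similarity threshold in place of the ε distance above (the two are equivalent up to how you set the number); `SemanticCache` is a hypothetical class, the 0.95 threshold is an arbitrary starting point, and embeddings are assumed to be computed elsewhere from user-invariant inputs only.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a cached answer when a new query embedding is close enough to an old one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, embedding):
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        # Only serve the cached answer if the closest match clears the threshold.
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

At any real volume the linear scan would be replaced by the same vector index you already run for retrieval, and the threshold would be tuned against an eval set, since a cutoff that is too loose serves wrong answers confidently.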
50. How do you decide between an open-source model and a frontier API?
Answer: Frontier APIs win on quality and time to market with zero infra. Open source wins on cost above ~1B tokens/month, data residency or privacy requirements, fine-tuning needs, and latency-sensitive workloads where you can co-locate inference. Run both in shadow mode on real traffic before committing.
What they want: Whether you make the call with data instead of ideology.
How to Prepare: A 3-Week Study Plan
Three weeks of focused prep, 90 minutes a day, is enough to be ready for most AI engineer loops. More than three weeks usually means you are avoiding mock interviews.
Week 1: Foundations. Read one transformer explainer (Jay Alammar's still works), one RAG paper (the original RAG paper plus one recent survey), and one agent paper (ReAct, plus one read on planner-executor patterns). Each day, write a 200-word summary of one concept in your own words. The questions in section 2 of this guide cover the foundations you need to recall cold.
Week 2: Build something. Ship a small RAG app over a corpus you care about (your Notion, a codebase, a textbook). Then bolt a tool-using agent onto it. Write five evals for each, run them in CI, and break your own system with adversarial inputs. This is where 80% of the learning happens. If you cannot build a working RAG in a week, the system design round will eat you alive.
Week 3: Mocks and system design. Two mock interviews on agent design, two on RAG, two on system design for LLM products. Time-box each to 45 minutes, record yourself, and after each session write down the three concepts you fumbled and one thing you would say differently. Pair this with timed reps on the 50 questions in this guide.
Daily metrics to track: how many questions you can answer in under two minutes, how many concepts you had to look up, and your mock interview success rate. If those numbers are not moving by end of week 2, your prep plan is wrong, not the field.
Land the AI Engineer Role
The 50 questions above are the floor, not the ceiling. The candidates who get offers in 2026 are the ones who have built something with their hands, can explain tradeoffs without reaching for jargon, and treat evals as a first-class engineering discipline instead of an afterthought.
If you want timed reps in the same environment as a real loop, Interview Coder runs mock AI engineer sessions with live feedback on your answers and full session logs so you can see where you actually improved. 75% of users landed offers within three months. The grind is still the grind, but you can stop guessing whether your prep is working.
