Agentic AI moved from research demo to production stack in about 18 months. Senior agentic engineer comp now sits at $140-300k base depending on city and stage, with the top bands at frontier labs and well-funded AI infra startups. Hiring managers stopped asking "what is an agent" two years ago. Now they ask whether you've shipped one that survived a week in prod without burning $10k on runaway tool calls.
This guide is a Q&A walkthrough of the rounds you'll face. For timed mocks against this rubric, Interview Coder runs the same drills with live feedback.
Agentic AI in 2026

Two years ago "agent" meant a notebook that called an API and pretended to think. In 2026 it's a loop that plans, calls tools, observes results, decides what's next, and runs without a human babysitting every step. The rubric:
Anthropic's Building Effective Agents is the most-cited reference this year. Read it before any onsite.
Comp ranges: frontier labs $220-300k + equity, AI infra startups $180-260k, big tech AI teams $180-250k, Series A-C product companies $140-200k.
Fundamentals (7 Questions)
First 30 minutes of the screen. Short answers, clean tradeoffs.
1. Walk me through ReAct. Why does it work?
ReAct interleaves Reasoning and Acting: model writes a thought, picks a tool, observes the result, repeats. It works because the observation grounds the next step in real data instead of letting the model hallucinate forward. Risk is verbosity and token cost since every loop adds context.
Testing: Whether you understand the loop primitive, not whether you memorized the paper.
2. Plan-and-execute vs ReAct. When do you pick which?
Plan-and-execute writes the full plan upfront, then executes. ReAct decides each step in the loop. Plan-and-execute is cheaper when the task decomposes cleanly. ReAct wins when later steps genuinely depend on what earlier steps return (debugging, exploratory analysis).
Testing: Whether you justify architecture from task structure, not preference.
3. What's reflection in an agent loop?
Reflection is when the agent reviews its own output or trajectory and decides whether to retry, refine, or accept. Two patterns: self-critique on the final answer, or step-level reflection that catches a bad tool call before the next step compounds the error. Roughly doubles cost.
Testing: Whether you've actually built one. Anyone who's shipped reflection knows the cost hit.
4. Define "tool use" without marketing fluff.
The model emits a structured call (JSON matching a schema), your runtime executes it, the result feeds the next model call. Function calling is the API mechanic; tool use is the pattern. The hard part is schema design and error handling.
Testing: Whether you separate the API feature from the design problem.
5. Autonomous vs assisted agents. Where's the line?
Assisted agents wait for confirmation before consequential actions (sending email, writing to a database, spending money). Autonomous agents act inside a sandbox with budget and capability limits. Draw the line by blast radius: if a wrong action costs money, leaks data, or wakes someone up, gate it behind a human.
Testing: Whether you'll ship something reckless.
6. Why do agents loop forever, and how do you stop it?
Three causes: model can't tell the task is done, a tool keeps returning an error the model can't handle, or the model fixates on a sub-goal. Stop it with hard step limits, budget caps (tokens and dollars), repeated-state detection, and an explicit termination check. Step cap is the cheapest defense.
Testing: Whether you've debugged a runaway loop at 3am.
7. What's the simplest agent you'd build today?
A single-tool ReAct loop with a 5-step cap and structured logging. One tool, a system prompt that defines termination, a wrapper that records every step. Anything more without a reason is over-engineering.
Testing: Whether you reach for complexity or simplicity by default.
Frameworks (8 Questions)
Every team picked sides and is justifying the call. The trap is naming a favorite without owning the tradeoffs.
8. LangGraph vs CrewAI vs AutoGen vs custom. How do you choose?
Testing: Whether you can defend a pick without bashing the others.
9. When would you build a custom agent loop?
When the loop is small enough that the framework adds more than it saves (~50 lines for single-tool ReAct). Also when you need control the framework doesn't expose: custom retry logic, non-standard state, specific observability hooks. Frameworks earn their weight on multi-agent and long-running stateful systems.
Testing: Whether you reach for frameworks reflexively.
10. Explain LangGraph's state graph model.
A directed graph: nodes are functions (LLM call, tool call, decision), edges are transitions. State is a typed dict that flows through nodes; each node returns updates that get merged. Edges can be conditional. Checkpoints persist state so you can resume after a crash. See LangGraph docs.
Testing: Whether you've used it past the quickstart.
11. CrewAI uses roles. What's the tradeoff?
Roles give you fast prototyping (researcher + writer + critic in 20 lines) and a mental model that ports to non-engineers. Tradeoff: role boundaries are convention, not enforcement. Agents leak responsibilities and debugging gets murky. Fine for demos, painful in prod beyond a certain scale.
Testing: Whether you've felt the pain.
12. What's wrong with AutoGen's conversational pattern?
Nothing inherent, but conversation as orchestration means every message gets re-ingested by every agent, inflating token cost and latency. Works for research where the conversation IS the artifact. Bad fit for high-volume production where structured handoffs win.
Testing: Whether you understand the cost model, not just the feature list.
13. How do you handle framework lock-in?
Keep business logic (tools, prompts, eval harness) out of the framework. The framework wraps orchestration; everything else lives in your own modules. Switching costs a week, not a quarter. I migrated a prod agent from LangChain to LangGraph in 3 days because tools and prompts were portable.
Testing: Whether you've been burned by it.
14. Pick one framework you'd never use in prod. Why?
Any framework where the abstraction hides the prompt or the loop semantics. If I can't see what the model is being asked and what the runtime does between calls, I can't debug it. Rules out a few "no-code agent" platforms regardless of demo polish.
Testing: Whether you have taste, not whether you trash-talk.
15. Message-passing vs shared-state architectures?
Shared state (one dict every node reads and writes) is simpler for short flows. Message passing (typed messages, no shared mutable state) scales better with many agents or distributed execution. Default to shared state until pain forces a move.
Testing: Whether you reach for distributed systems vocab without justification.
Tool Use and Function Calling (6 Questions)
The round where the screen shares a JSON schema and asks you to fix it.
16. How do you design a tool schema the model will call correctly?
Three rules: descriptive parameter names, examples in the description, strict typing. If a parameter could be string or number, pick one and validate. Use enum for known values. Keep the schema small; models call short schemas more reliably than 15-parameter monsters. The Berkeley Function Calling Leaderboard shows model-specific failure patterns worth knowing.
Testing: Whether you've debugged a tool the model wouldn't call.
17. The model calls a tool and gets an error. What do you feed back?
Structured: error type, message, hint about valid input. Don't paste the stack trace. Don't swallow and return success. I return {"error": "<type>", "message": "<short>", "hint": "<recovery>"}.
Testing: Whether you understand error handling shapes the next loop iteration.
18. Parallel tool calls. When and how?
When the model emits multiple tool calls in one turn and the tools are independent, execute in parallel. Saves latency proportional to the longest call. Anthropic and OpenAI APIs support this natively. Don't parallelize across turns; that's concurrency, not parallelism.
Testing: Whether you know the API supports it and you've used it.
19. Tool results exceed the context window. What do you do?
Summarize before feeding back, store the full result in a side-channel the model can re-query, or paginate. Cheapest is summarization with a small model. Most accurate is full storage + re-query.
Testing: Whether you've hit this in production.
20. Structured outputs vs tool use. What's the difference?
Structured outputs constrain the final answer shape. Tool use lets the model call functions. Both use JSON schema, which is why they get confused. Use structured outputs when you want a typed response. Use tool use when you want the model to take actions and observe results.
Testing: Whether you can keep the two patterns straight.
21. The model is making up tool names. Why?
Three causes: tools weren't passed to that API call, the prompt mentions tools that aren't available, or the model's tool-use training is weak for that schema shape. Fix in that order. Usually it's the prompt promising capabilities you didn't wire up.
Testing: Whether you debug systematically or guess.
Human-in-the-Loop and Checkpointing (5 Questions)
Design round questions. They want to know if you'd let your agent send the email.
22. When do you interrupt for human approval?
Before any high blast radius action: spending money, external communication, writing to prod data, anything irreversible. Also when confidence on a critical decision drops below a threshold. Cheap actions (search, read, internal queries) don't need approval. Rule of thumb: would you let a junior engineer do this unsupervised?
Testing: Whether you have judgment about blast radius.
23. How do you persist agent state for resumption?
Two layers: a checkpoint of full agent state (messages, intermediate results, tool history) and a separate event log for audit. Checkpoints go in a durable store keyed by run_id; LangGraph's checkpointer is the cleanest reference. Checkpoint after every node.
Testing: Whether you've thought about crash recovery.
24. The agent crashes mid-run. Walk me through resuming.
Load the latest checkpoint for the run_id, validate the state schema still matches, re-enter the loop at the next node. The tricky part is partial side effects: if the agent already sent the email, you don't want to send it twice. Idempotency keys on every outbound tool call solve this.
Testing: Whether you understand idempotency in agent context.
25. How do you let a human edit the agent's plan mid-run?
Pause at a checkpoint, surface state and proposed next step in a UI, let the human edit the plan or any intermediate result, resume with edited state. LangGraph's interrupt mechanism is built for this. The UX is harder than the engineering.
Testing: Whether you've thought about the human side.
26. What's the cost of checkpointing?
Storage and write latency. For most agents both are small relative to model cost, so default to checkpoint-everything and optimize on pressure. The trap is checkpointing large tool results inline; reference them by ID instead.
Testing: Whether you measure before optimizing.
Eval and Observability (5 Questions)
The round that filters senior candidates. Can you tell whether your agent is actually getting better?
27. How do you eval an agent vs a single-LLM app?
Single-LLM: input -> output, score the output. Agent: input -> trajectory -> output, score both the final answer AND the trajectory (which tools were called, in what order, whether intermediate steps were correct). Trajectory eval catches agents that get the right answer the wrong way, which usually means they'll fail differently next run.
Testing: Whether you understand the eval shape change.
28. What's a golden trajectory?
A reference trajectory for a known input: the ideal sequence of tool calls and intermediate results an expert (or your best agent) produces. Score new runs against the golden by step-level match. Building goldens is expensive (humans author them) but it's the only way to get reliable regression signal.
Testing: Whether you've actually run this kind of eval.
29. How do you track cost per task?
Instrument every model call and tool call with cost (input tokens * input price + output tokens * output price + tool overhead). Aggregate by run_id, then by task type. Dashboards: p50/p95 cost per task, cost per successful task (failures cost money too), trend over time. Helicone and LangSmith both have this out of the box.
Testing: Whether you've owned a cost line item.
30. LangSmith vs Helicone vs custom. What do you use?
LangSmith if you want trace UI and eval together (LangChain ecosystem). Helicone for a proxy-based approach across providers with minimal code change. Custom for compliance needs or integration with existing observability (Datadog, Honeycomb). I default to Helicone and add LangSmith if the team is already on LangGraph.
Testing: Whether you've shipped observability, not just installed it.
31. Agent passes eval in dev but fails in prod. Where do you look?
Distribution drift first: prod input distribution vs eval set? Then tool reliability (eval tools are mocks, prod tools fail differently). Then model version (did the provider silently update?). Then long-tail prompts (eval covers happy path, prod has the weird stuff). Add prod sampling to your eval set every week.
Testing: Whether you've debugged the eval-prod gap before.
Production System Design (4 Questions)
Final round. Whiteboard, design something real.
32. Design a research agent that handles 10k requests per day.
Queue + worker. API takes the request, drops a job on a queue (SQS, Redis), workers pick up jobs and run the agent loop. Each worker has a budget cap (tokens + wall time). Checkpointing to a durable store. Results back via webhook or polling. Capacity math: if each agent takes 30s and costs $0.10, 10k/day = ~3.5 concurrent workers, ~$1000/day in model cost. Scale workers horizontally on queue depth.
Testing: Whether you can do capacity math and pick the right primitives.
33. Multi-agent coordination. When do you actually need it?
Rarely. Most "multi-agent" systems are one orchestrator that calls specialized sub-routines. True multi-agent makes sense when tasks decompose cleanly along agent boundaries, agents need different tool access (security), or you want to parallelize genuinely independent sub-problems. If you can solve it with one agent and a tool list, do that.
Testing: Whether you'll over-engineer.
34. Latency budget for an interactive agent. How do you set it?
Budget = user tolerance, typically 5-15s for chat, 30s+ for batch. Decompose: model latency (by model and token count), tool latency, loop count (each loop is one round-trip plus tool calls). Set per-step budgets and fail fast when exceeded. If total exceeds tolerance, switch to streaming with progress updates or go async.
Testing: Whether you understand latency as a budget, not a property.
35. How do you contain costs when an agent goes off the rails?
Five layers: hard step cap (e.g., 10 max), token budget per run, dollar budget per run, repeated-state detection (same tool call twice triggers a halt), circuit breakers on tools that error repeatedly. Plus alerting on cost-per-task p95 to catch drift before it becomes a bill. Every prod agent I've shipped has all five.
Testing: Whether you've actually had a runaway and learned from it.
How to Prepare
Fastest way to interview well for agentic roles: ship one. Reading docs gets you through screens. Onsites filter on whether you've run a loop in anger.
Projects worth building
Ship one to a public URL with cost tracking and trajectory logging. That's the artifact you reference in interviews.
Resources
Drill the rounds, not the theory
Reading won't get you past the eval round. You need reps explaining trajectory eval out loud, designing tool schemas live, and defending framework choices under pushback. Interview Coder runs timed mocks against this rubric with live feedback. Two mocks a week for three weeks and the answers stop feeling rehearsed.