Key takeaways

Prompt engineer roles pay $120k to $250k base in 2026, with Anthropic, OpenAI, Scale AI, and Cohere posting most senior listings. Smaller labs hire generalists who can also run evals and tool plumbing.
Loops usually run 4 to 6 rounds: recruiter screen, live prompt-writing exercise, structured-output round with function calling, eval design round, behavioral case study.
Interviewers grade taste. They want you reaching for chain-of-thought when reasoning matters, JSON mode when structure matters, and small evals when you don't yet know if a prompt actually works.
Most candidates fail the eval round. They write a prompt, ship it, and never explain how to catch regressions when the model version flips. Bring a story about a golden set you built and what it caught.
This guide stacks 40 questions across fundamentals, reasoning, tool use, eval, and system-prompt patterns, with the signal each one tests.

Prompt engineer interviews in 2026 follow a recognizable loop: recruiter screen, live prompt-writing, structured output with function calling, eval design, behavioral. The full loop runs 2 to 4 weeks at frontier labs and a week or less at smaller AI startups.

Below is a Q&A bank covering the 40 questions that come up most often. If you want timed mocks in the same format, Interview Coder runs prompt-writing and eval drills under a clock.

What Is A Prompt Engineer In 2026

A prompt engineer owns the natural-language interface between a product and one or more language models: system prompts, few-shot examples, output schemas, tool definitions, eval suites, and the playbook for when a model version flips.

The role overlaps with two adjacent ones. AI engineers ship the full stack (retrieval, vector stores, orchestration). LLM engineers lean toward research (fine-tuning, RLHF data, pretraining infra). Prompt engineers sit between them.

US base comp runs $120k to $250k for individual contributors, plus equity. Anthropic, OpenAI, Scale AI, and Cohere post most senior listings. Notion, Linear, Ramp, Vercel, and Perplexity hire generalists who can write a clean system prompt and stand up a basic eval pipeline.

The loop: recruiter screen (30 min), live prompt-writing (60 min), structured output + tool use (60 min), eval design (60 min), system prompt case study or take-home, behavioral (45 min). The middle three rounds are where most candidates lose.

Prompt Fundamentals (10 Questions)

1. System prompt vs user prompt?

System prompts set persistent behavior: role, tone, output format, refusal policy. User prompts are per-turn input. Production setups put policies, schemas, and tool descriptions in the system slot and treat the user slot as untrusted.

Testing: whether you understand the trust boundary and why injection defenses live in how you handle the user slot.

2. When does few-shot prompting actually help?

When the task has a specific output shape the model doesn't infer from instructions alone, or a niche style convention. It rarely helps on simple reasoning and can hurt by anchoring the model to surface patterns in your examples.

Testing: whether you reach for few-shot reflexively or only when zero-shot fails on your eval set.

3. How do you pick few-shot examples?

Diversity over count. Three or four covering real variation beat ten near-duplicates. Sort so the most relevant example is closest to the query, because recency bias is real.

Testing: whether you've built a few-shot prompt at scale or just read about it.

4. What is role prompting and when does it backfire?

Role prompting cues register and vocabulary. It helps on stylistic tasks. It backfires when the role implies expertise the model can't ground (medical, legal, financial), because it makes hallucinations sound confident.

Testing: whether you separate stylistic priming from actual capability.

5. How do you write a constraint the model actually follows?

Place it where the model attends most: last in the system prompt, or restated right before user input. State it positively ("Respond in exactly two sentences") not negatively. Add a one-shot example that obeys it.

Testing: whether you've debugged a model ignoring instructions and figured out why.

6. How do you handle a prompt that has to do five things at once?

Decompose into a pipeline of smaller prompts, or use a structured output that forces the model to address each step in a named field. Monolithic prompts work on easy tasks but fail when any sub-task is hard.

Testing: whether you think in pipelines or hope a single big prompt will work.

7. How do delimiters affect prompt quality?

Triple backticks, XML tags, and labeled headers help the model separate instructions from input. Anthropic recommends XML tags around content blocks when a prompt has multiple sections. The win is small but consistent on long prompts. (Anthropic prompt engineering docs)

Testing: whether you've internalized tactical conventions.

8. What is prompt chaining?

Using the output of one prompt as input to the next. Use it when a task has clear stages (extract, classify, summarize) or when intermediate outputs need inspection. Cost: more tokens, more failure modes. Chain only when a single prompt underperforms.

Testing: whether you can name the tradeoff, not just the technique.

9. How does temperature affect prompt design?

Low (0 to 0.3) for deterministic tasks: extraction, classification, structured generation. Higher (0.7 to 1.0) for creative output. For evals, fix at 0 so you can compare runs.

Testing: whether you treat temperature as a per-task dial.

10. Zero-shot CoT vs few-shot CoT?

Zero-shot prepends "Let's think step by step" and lets the model generate its own reasoning. Few-shot shows worked examples. Zero-shot is cheaper and strong on modern models; few-shot wins on unusual or domain-specific reasoning styles.

Testing: whether you know the Wei et al. 2022 result and have used both.

Chain Of Thought And Reasoning (8 Questions)

11. When should you use chain-of-thought?

Multi-step reasoning: math, logical deduction, multi-hop questions. CoT hurts on direct tasks (single-fact lookup, simple classification): adds latency and can drift.

Testing: whether you can name a task where CoT hurt and you removed it.

12. What is tree-of-thoughts and when is it worth the cost?

ToT explores multiple reasoning paths in parallel and prunes by a score function. Worth it on hard search problems where a single chain dead-ends. Overkill for most production tasks because latency and token cost are 5 to 10x a single CoT.

Testing: whether you reach for ToT to sound smart or only when the problem needs it.

13. Explain ReAct.

ReAct interleaves reasoning with actions (tool calls). The model thinks, acts, observes, thinks again. It's the basis for most agent loops because it grounds reasoning in external state instead of hallucinating it.

Testing: whether you can map the original paper to a production agent.

14. What is self-consistency?

Generate multiple reasoning chains at high temperature, take the majority answer. Beats single-chain CoT on math by a few points but costs N times more tokens. Use it for high-stakes single-shot decisions, not high-volume tasks.

Testing: whether you can name the cost/benefit.

15. How do you handle a CoT that arrives at a confident wrong answer?

Add a verification step where the model checks its reasoning. Decompose and verify each sub-step. Route to a tool (calculator, code interpreter, retrieval) for parts the model is bad at. Don't trust the chain just because it's long.

Testing: whether you've seen confident garbage and built a defense.

16. Reasoning-model thinking tokens vs standard CoT?

Reasoning models (Claude extended thinking, OpenAI o-series) produce internal thinking outside the final response, with a tunable compute budget. Standard CoT is just text in the response. Reasoning models usually beat CoT prompting on hard problems but cost more.

Testing: whether you've kept up with the model lineup.

17. When does CoT make safety worse?

A chain can rationalize into outputs the model would refuse zero-shot. Keep refusal logic outside the chain (in a router or post-processor) and don't let the model see its own past justifications.

Testing: whether you think about CoT failure modes, not just capability gains.

18. How do you debug a reasoning failure?

Log the full chain. Find the wrong step. Three common failures: a hallucinated early fact that propagates, an unjustified logical leap, or a sub-task the model is bad at (arithmetic, date math). Fixes map to grounding, decomposition, or tool use.

Testing: whether you treat prompts like code that needs debugging.

Structured Output And Tool Use (8 Questions)

19. What is JSON mode and when is it not enough?

JSON mode forces syntactically valid JSON. Not enough when you need a specific schema, because valid JSON can still miss the shape your code expects. For schema enforcement, use function calling or constrained decoding.

Testing: whether you've shipped JSON-mode output and seen valid-but-wrong shapes.

20. How does function calling work under the hood?

The model gets tool definitions (name, description, parameter schema). When it decides to call one, it emits structured output matching a schema. Your code parses, executes, feeds the result back. Models are trained to emit calls when descriptions match query intent. (OpenAI cookbook.)

Testing: whether you understand the loop, not just that it exists.

21. How do you write a tool description the model calls correctly?

Treat it like a prompt. Name with a verb (search_orders, not orders_api). Describe what it returns. List inputs with examples. State when the model should NOT call this tool. The biggest win is usually a "When to use this" section.

Testing: whether you've debugged a model picking the wrong tool.

22. How do you handle a model that hallucinates a function it doesn't have?

Validate the function name against your registry and return a structured error on mismatch. Add a closed-world instruction in the system prompt naming the available tools. If hallucination persists, switch to a model with stronger tool calling.

Testing: whether you handle the failure in code or hope it doesn't happen.

23. What is constrained decoding?

The model is only allowed to emit tokens matching a grammar (JSON schema, regex, BNF). Libraries like Outlines enforce this at sampling. Trade-off: guaranteed valid output, but quality drops because the model gets locked into a path early.

Testing: whether you know the technique and when to use it vs. soft prompting.

24. How do you reduce hallucinations in a RAG pipeline?

By impact: better retrieval (recall matters more than rerank), require span-level citations, refuse when retrieval returns nothing relevant, add a verification step that re-reads cited spans, use a cheaper model to fact-check the larger one.

Testing: whether you've built RAG and watched it lie.

25. Parallel vs sequential tool calls?

Parallel: the model emits multiple calls at once, your code runs them concurrently. Sequential: each call depends on the previous result. Parallel is faster but only works when calls are independent. Modern models handle parallel natively.

Testing: whether you've optimized agent latency.

26. How do you handle a tool result too long for the context window?

Summarize before returning. Chunk and paginate via follow-up calls. Filter to only the fields the model needs. Most agent bugs trace back to dumping raw API responses into context.

Testing: whether you treat context economy as a first-class concern.

Evaluation And Testing (8 Questions)

27. How do you build a golden set for a prompt?

Start with 20 to 50 hand-labeled examples covering the variation you care about: common, edge, adversarial, and the cases you saw fail in prod. Version in git. Re-run on every prompt change. Grow it every time a regression slips through.

Testing: whether you've built and maintained one, not just read about evals.

28. What is LLM-as-judge and when does it break?

A separate model call grades output against a rubric, scaling human review to thousands of examples. Breaks when the judge is biased (prefers verbose answers, prefers its own style), the rubric is ambiguous, or the judge is the same model being judged. Calibrate against human labels.

Testing: whether you trust LLM judges blindly or treat them as a noisy signal.

29. How do you detect a prompt regression after a model version flip?

Run the golden set on the new model and diff against the previous run. Flag examples where output changed materially or judge score dropped. For production, ship the new model behind a flag and shadow-compare on real traffic for 1 to 7 days.

Testing: whether you have a defense for the inevitable version change.

30. How do you A/B test prompts in production?

Hash the user ID to a variant, log it alongside the response, define a downstream success metric (thumbs up, follow-up, task completion). Avoid proxy metrics like response length. Hold each variant a week minimum. Bonferroni-correct beyond two variants.

Testing: whether you've actually run a prompt A/B test.

31. What metrics matter for a chatbot prompt?

Task-specific beats generic. Support: resolution rate, escalation rate, CSAT. Sales: qualified-lead rate. Internal tools: time-to-answer. BLEU and ROUGE rarely matter outside summarization research. If you have one metric, use thumbs-up calibrated against human review.

Testing: whether you know prompt eval is product eval, not NLP eval.

32. How do you eval an agent loop?

Score the full trajectory, not just the final answer. Three signals: right tools, right order, right output. For each failed trajectory, label the step where it went wrong. Most agent failures are early-step routing errors, not late-step generation errors.

Testing: whether you decompose agent failures or treat the loop as a black box.

33. Right size for an eval set?

Early development: 20 to 50 hand-labeled. Stable production: 200 to 2000, stratified across categories you care about. Beyond a few thousand, marginal value drops fast; invest in eval quality over quantity.

Testing: whether you know diminishing returns kick in early.

34. How do you handle subjective outputs in evals?

Define a rubric with explicit criteria. Two humans label a subset; measure inter-rater agreement. Below ~70%, the rubric is too vague. Use agreed labels to calibrate an LLM judge. Lilian Weng's posts cover eval patterns worth reading first.

Testing: whether you can ground subjective evals in measurable agreement.

System Prompt Engineering (6 Questions)

35. What goes in a production system prompt?

Roughly: identity and role, current date and context variables, available tools, output format, refusal rules, edge case handling, examples. Keep policy in system; treat user input as untrusted. Production system prompts at scaled companies usually run 500 to 3000 tokens.

Testing: whether you've read a real production system prompt.

36. How do you defend against prompt injection?

Layered defenses. Separate user content from instructions with XML tags or JSON envelopes. Never concatenate user text into instructions. For agents, validate tool calls against an allowlist. For RAG, treat retrieved content as untrusted and tell the model to ignore embedded instructions. Assume any single defense will fail.

Testing: whether you have multiple layers, not a single magic phrase.

37. How do you manage a long context window without losing the plot?

Three patterns. Summarize history into a running summary. Use retrieval over conversation history so only relevant past turns are included. For agents, separate working memory from long-term memory. Long-context models hold 200k+ tokens but degrade on retrieval in the middle.

Testing: whether you've fought a long-context bug.

38. How do you version a system prompt?

Like code. Store in git, tag with a semantic version, pin prompt version alongside model version in production calls, log both. Run the eval set on every change. Roll back like a deploy.

Testing: whether you treat prompts as production artifacts.

39. How do you handle a system prompt that's grown to 5000 tokens?

Audit it. Most have dead instructions from past bugs, duplicated rules, and examples that don't move the eval needle. Run an ablation: remove each section, measure eval delta, cut what doesn't help. Split into a base prompt plus per-route prompts.

Testing: whether you'd refactor a bloated prompt like code.

40. What is prompt caching and when does it pay off?

Frontier models cache the prefix across calls. If your system prompt is 3000 tokens and you serve 100k calls a day, caching saves 90%+ of prompt cost. The win shows up when the cached portion is large and stable and you call the same prompt many times in a short window. (Anthropic prompt caching)

Testing: whether you think about cost at production scale.

How To Prepare

Reading question banks gets you 30% of the way. The other 70% is reps.

Build a real prompt suite. Pick a small task (summarize a support ticket, classify GitHub issues, extract data from invoices). Write zero-shot, few-shot, and CoT versions. Build a 30-example golden set. Score with an LLM judge. Iterate. Now you have something to talk about in the case study.

Practice the structured output round. Take an API you know. Define five tools. Build an agent loop. Log every tool call. Walk through it out loud. This is the exercise most candidates skip and the one that comes up most in senior loops.

Read the canonical references. The Anthropic prompt engineering docs, the OpenAI cookbook, and Lilian Weng's posts on agents cover most of what comes up.

Run mock interviews. Live prompt writing under a 60-minute clock is its own skill. The model output is unpredictable, the interviewer watches how you react, and time goes fast. Interview Coder runs AI mock sessions matching the prompt-writing and eval-design rounds at frontier labs. Its answers run on the latest Claude models — useful when the question is literally about how frontier models behave.

Prompt engineering interviews reward people who have shipped and can talk through the failures. Bring two or three stories: a prompt you rewrote three times before it stopped hallucinating, an eval set that caught a regression, a tool description you fixed and watched call accuracy jump from 60% to 95%. That's the texture interviewers remember.