Blog

How Claude Code is dealing with prompts, two cases analysed

I intercepted every API call that Claude Code makes and compared two sessions side by side: a simple greeting ("hi, how are you?") and a full security review of an Express API. The difference in what happens behind the scenes is revealing. Same tool, same system prompt, wildly different execution paths. If you want to understand how Claude Code actually works under the hood, this is the post for you.

The setup: intercepting Claude Code's API calls

I used Vistaclair, a control-room server I built, to proxy and record every HTTP request Claude Code makes to Anthropic's API. Each request/response pair gets logged as a JSON file with full headers, body, SSE events, timing, and token usage. The greeting session produced 8 recorded interactions. The security review produced 81. That ratio alone tells you something.

What happens when you type a single message

Before Claude even sees your prompt, Claude Code fires off multiple operations in parallel. Here is the exact sequence for both sessions:

Claude Code: What fires before your prompt runs t=0ms 1. Quota Check (Sonnet, max_tokens=1) 341ms, got 429 rate limit 2. SessionStart Hook 0ms (local event) 3. UserPromptSubmit Hook 0ms (local event) 4. Title Generation (Sonnet) JSON schema output, temperature=1 441 in / 11 out tokens 5. Main LLM Call (Opus) Adaptive thinking, all tools loaded 19,788 cache_read 1,390 cache_create 22 output tokens 6. Stop Hook 7. Notification Hook Total: ~2.3 seconds
The lifecycle of a single Claude Code interaction, even for 'hi, how are you?'

Seven distinct operations to answer a greeting. The quota check alone takes 341ms. The title generation adds another 1.5 seconds. And the main call, the one that actually produces the response, takes 2.3 seconds total but only generates 22 tokens: "Hi! I'm doing well, thanks for asking. How can I help you today?"

The quota check trick
Claude Code sends a Sonnet request with max_tokens: 1 and the literal message "quota" to check if you're rate-limited. It expects a 429 (rate limit). If it gets one, it knows you're within limits (yes, that's counterintuitive). If it gets a different error, it means something else is wrong. This burns almost no tokens but adds ~340ms of latency to every session.

The system prompt: 19,000 tokens you never see

This is the most surprising finding. Both sessions, the greeting and the security review, send the exact same system prompt. It's roughly 19,000 tokens and contains:

  1. The full Claude Code identity and behavioral instructions (~3,000 tokens covering task execution, code style, tone, security practices)
  2. Complete tool definitions with JSON schemas for all 15+ tools: Agent, Bash, Edit, Read, Write, WebFetch, WebSearch, TaskCreate/Update/List/Get, Monitor, EnterPlanMode, ExitPlanMode, AskUserQuestion, Skill (~12,000 tokens)
  3. Git safety protocols, commit and PR creation workflows (~2,000 tokens)
  4. Session-specific context: working directory, git status, recent commits, platform info, model version (~1,000 tokens)
  5. A list of available skills (security-review, code-review, init, verify, etc.)
  6. Custom project rules (like 'never kill the Vistaclair server')

For "hi, how are you?", roughly 99.9% of those instructions are irrelevant. The system prompt describes git protocols, PR creation workflows, deployment safety, task tracking, and sub-agent orchestration. None of that gets used. But it all gets sent and processed anyway.

Why doesn't this cost a fortune?

Prompt caching. Look at the token counts from the greeting session's main call:

Token CategoryCountCost per MTokCost
Input (fresh)3$5.00$0.000015
Cache read19,788$0.50$0.009894
Cache create1,390$6.25$0.008688
Output22$25.00$0.000550
Total21,203~$0.019
Token usage for the Opus main call in the greeting session

Of the ~21,000 input tokens, 19,788 were cache reads at 0.50/MTokinsteadof0.50/MTok instead of 5.00/MTok. That's a 10x discount. The system prompt was already cached from a prior session (cache TTL is 1 hour). Only 1,390 tokens needed fresh caching: the user-specific parts like system reminders and the prompt itself. Without caching, that same call would cost ~0.105 in input alone. With caching, it's 0.019 total.

Cache placement matters
Claude Code sets cache_control: { type: 'ephemeral', ttl: '1h' } on the system prompt's last chunk and on the user's last message. This creates cache breakpoints that maximize reuse. The system prompt (which is identical across sessions) caches at the 1-hour tier. User messages cache at the same tier. Subsequent turns in the same session hit the cache nearly 100%.

The title generation sidecar

Both sessions fire a separate Sonnet call just to generate a session title. This call is architecturally interesting because it's completely stripped down:

  • Uses Sonnet (not Opus), cheaper at 3/3/15 per MTok
  • No tools are provided (empty tools array)
  • No thinking is enabled
  • Enforces JSON schema output: { "title": "string" }
  • Uses temperature: 1 (more creative than the main call)
  • System prompt is only ~400 tokens (just the title generation instructions)

The greeting produced {"title": "General greeting and conversation"}. The security review produced {"title": "Security review of API for internet exposure"}. Both took about 1 second and cost fractions of a cent. This is a nice pattern: use a cheap, fast model for metadata tasks that don't need reasoning power.

The security review: where it gets interesting

Now let's look at what happens when Claude Code actually has to think. The user asked: "make a security review of the API. is it ready for exposing on the internet?"

The main Opus call activated adaptive thinking (the thinking block exists but was empty in the intercepted data, likely redacted by the API). It immediately spawned a sub-agent to explore the codebase.

Security Review: Multi-Agent Orchestration Orchestrator (Opus) Adaptive thinking, all 15 tools, 19K sys prompt spawns spawns Explore Agent (Sonnet) Read-only, 7 tools, 9.6K sys prompt Security Agent (Sonnet) File analysis, vulnerability scanning find (structure) ls (root dir) grep (routes) Read (app.js) Read (routes/api) Read (.env check) Bash (deploy.sh) Read (package.json) Opus Synthesizes Findings Combines subagent reports into verdict Final Security Report "Not ready: missing rate limiting, timing-unsafe key comparison, no failsafe for empty API keys" 81 API calls total interactions ~6.5 minutes wall clock time 2 subagents parallel exploration 3 findings critical issues
The multi-agent architecture behind a Claude Code security review

Subagents get different prompts and cheaper models

The Explore subagent runs on Sonnet (3/3/15 per MTok vs Opus at 5/5/25). But the interesting part isn't the model, it's the system prompt. The subagent gets a completely different persona:

Explore subagent system prompt (excerpt)
You are a file search specialist for Claude Code, Anthropic's official CLI for Claude.
You excel at thoroughly navigating and exploring codebases.

=== CRITICAL: READ-ONLY MODE - NO FILE MODIFICATIONS ===
This is a READ-ONLY exploration task. You are STRICTLY PROHIBITED from:
- Creating new files (no Write, touch, or file creation of any kind)
- Modifying existing files (no Edit operations)
- Deleting files (no rm or deletion)
...
The Explore agent gets a locked-down, search-focused system prompt

The subagent also gets a reduced tool set. Where the orchestrator has 15+ tools, the Explore agent only gets: Bash (read-only), Read, Monitor, Skill, TaskStop, WebFetch, WebSearch. No Edit, no Write, no Agent (can't spawn sub-sub-agents), no TaskCreate. And its Bash tool is conceptually restricted to read-only operations, though that's enforced via the system prompt rather than technically.

The subagent's system prompt is also smaller: ~9,600 tokens vs ~19,000 for the orchestrator. Less context to process, faster responses. The subagent's first call created a fresh 9,649-token cache (at the 5-minute tier, not 1-hour) because it's a new prompt.

Side-by-side: greeting vs. security review

Comparison: Two Interactions, Same Tool "hi, how are you?" "security review of the API" Intercepted calls 8 81 Wall clock time ~3 sec ~396 sec LLM calls 2 (quota + main) 20+ (orchestrator + subagents) Models used Sonnet + Opus Sonnet + Opus + Sonnet subs Subagents spawned 0 2+ Tools used None Agent, Bash, Read Thinking enabled Adaptive (unused) Adaptive (active) Sys prompt tokens ~21,000 ~21,000 (same) Output tokens (main) 22 1000+ Hook events fired 5 40+ Cache efficiency 93% cache hit 93% + subagent caches
Both sessions send the same 21K system prompt. The difference is entirely in what happens after.

The hook system: Claude Code's nervous system

Every interaction fires lifecycle hooks. These are local events (zero API cost) but they reveal how deeply instrumented Claude Code is. The greeting session fired these hooks:

  1. SessionStart (source: resume)
  2. UserPromptSubmit (prompt: "hi, how are you?")
  3. Stop (includes last_assistant_message, background_tasks, session_crons)
  4. Notification (type: idle_prompt, message: "Claude is waiting for your input")
  5. ConfigChange (source: user_settings)

The security review added many more: PreToolUse (before each tool call with full input), PostToolBatch (after tool results come back), SubagentStart, SubagentStop. The hook data includes everything: the tool name, inputs, outputs, the full agent transcript path, and timing. If you're running custom hooks (shell commands triggered by these events), every single tool call in a long session will execute your hook script.

bypassPermissions mode
Both sessions ran with permission_mode: 'bypassPermissions'. This means every tool call (Bash commands, file reads, agent spawning) executed without user confirmation prompts. In the security review, this allowed dozens of Bash commands and file reads to fire without interruption. Convenient for power users, but it means any tool call, including destructive ones, would execute silently.

What the system prompt reveals about Claude Code's design philosophy

Reading the full system prompt is like reading the engineering team's design manifesto. A few things stand out:

Conservative by default, overridable by instruction

The prompt is packed with "don't do X unless the user explicitly asks" rules. Don't commit unless asked. Don't push unless asked. Don't use --no-verify. Don't amend commits. Don't create documentation files. The default posture is passive and cautious, which makes sense for a tool that can execute arbitrary shell commands.

The 'measure twice, cut once' principle

Carefully consider the reversibility and blast radius of actions. The cost of pausing to confirm is low, while the cost of an unwanted action (lost work, unintended messages sent, deleted branches) can be very high.— Claude Code system prompt

The system prompt explicitly categorizes risky actions: destructive operations, hard-to-reverse operations, actions visible to others. It instructs Claude to confirm before proceeding with any of these. This is why Claude Code sometimes feels overly cautious. It's by design.

Anti-patterns are explicitly banned

The prompt bans specific coding anti-patterns: no premature abstractions ("three similar lines is better than a premature abstraction"), no speculative error handling ("don't add error handling for scenarios that can't happen"), no comments by default ("only add one when the WHY is non-obvious"). It even bans specific phrases in comments ("don't reference the current task, fix, or callers"). This level of specificity suggests these were real problems the team observed and corrected.

Cost and performance: where it's good, where it's not

Where Claude Code shines

  • Caching is aggressive and effective. 93%+ of system prompt tokens hit cache on repeat sessions. For the greeting, this turned a 0.10+callinto0.10+ call into 0.02.
  • Model routing is smart. Opus for orchestration and reasoning, Sonnet for exploration and metadata. Subagents don't need the expensive model.
  • Parallel tool calls. The Explore subagent fired find and ls in parallel on its first turn. The orchestrator launched subagents concurrently.
  • The security review found real issues. It identified timing-unsafe API key comparison, missing rate limiting, and empty API key failsafe problems. Not theoretical concerns, actual code-level findings.

Where it's wasteful

  • The system prompt is one-size-fits-all. A greeting gets the same 19K-token prompt about git protocols, PR creation, and deployment safety. There's no lightweight mode for simple queries.
  • Tool definitions dominate the prompt. The JSON schemas for 15 tools account for roughly 12,000 of the 19,000 system prompt tokens. That's 63% of the context budget spent describing capabilities the model may not use.
  • The quota check adds fixed latency. ~340ms on every session start, even when you're already well within limits.
  • Subagent prompts repeat boilerplate. The Explore agent still gets Bash tool documentation about git commits and PR creation, even though it can't write files. That's wasted context.

The hidden cost of "adaptive thinking"

Both sessions configured thinking: { type: 'adaptive' }. For the greeting, the thinking block was empty (the model decided it didn't need to think). For the security review, thinking was active. The thinking tokens don't show up in the intercepted data (they're represented as empty strings with a cryptographic signature), but they do consume compute time.

The TTFB (time to first byte) tells the story: 1.96 seconds for the greeting, vs a similar baseline for the security review's first turn, plus ongoing thinking on subsequent turns. Adaptive thinking is a good tradeoff: skip thinking for trivial queries, engage it for complex ones. But there's no way to force it off from the client side.

The context management escape hatch

Both sessions include this configuration:

"context_management": {
  "edits": [
    {
      "type": "clear_thinking_20251015",
      "keep": "all"
    }
  ]
}
Context management configuration from both sessions

This tells the API to keep all thinking content when compressing context. The system prompt also mentions: "When the conversation grows long, some or all of the current context is summarized." This is how Claude Code handles long sessions without hitting context limits. It compresses earlier turns while preserving the system prompt and recent context. The security review, spanning 81 interactions, likely hit this at least once.

Key takeaways

  1. Every Claude Code interaction sends ~19K tokens of system prompt, regardless of task complexity. Caching makes this affordable but not free.
  2. Three models, three roles: Sonnet for quota checks and metadata, Opus for orchestration and reasoning, Sonnet for subagent exploration. The routing is automatic and cost-aware.
  3. Subagents get constrained personas: fewer tools, smaller prompts, read-only restrictions. The orchestrator stays in control.
  4. Hooks fire everywhere: every tool call, every agent lifecycle event, every state change. If you're running hook scripts, expect them to fire dozens of times per complex task.
  5. Prompt caching is the economic backbone: without it, a simple greeting would cost 5x more. The 1-hour cache TTL means rapid-fire sessions are cheap.
  6. The system prompt is a design document: reading it reveals intentional tradeoffs around safety, code quality, and user experience. The cautious defaults aren't bugs, they're features.

The thing that surprised me most: the ratio of overhead to actual work. For a greeting, ~21,000 input tokens produce 22 output tokens. That's a 955:1 ratio. For the security review, the ratio improves dramatically because the system prompt cost is amortized across dozens of turns. Claude Code is optimized for sessions, not one-shots. Use it accordingly.

Comments

No comments yet.

Leave a comment

We use analytics cookies. Privacy