
Token Savings Guide

Why you're burning tokens faster than you think — and practical ways to cut your AI bill by 60%+ across any LLM tool or API.

Most developers using AI tools focus on model pricing — $3/M input tokens vs $15/M output tokens. But the real cost story isn't about price per token. It's about how many tokens you waste without realizing it. After analyzing real usage data from Claude Code, API workflows, and agentic loops, a clear picture emerges: more than half of the tokens you pay for are completely unnecessary.

The Five Ways You're Wasting Tokens

1. Tool Schema Bloat

Every AI tool sends its full schema — descriptions, parameter types, enums — with every single API call. For a typical agentic setup with file operations, bash access, web search, and code editing, that's roughly 25-30% of your context window consumed before you've even asked a question. A single Claude Code turn loads ~45,000 tokens of tool definitions alone. The fix? Enable tool search so only relevant tools are loaded per turn, or trim your tool definitions aggressively if you're building custom agents.

# Claude Code: load only relevant tools per turn (~14k tokens saved)
ENABLE_TOOL_SEARCH=true

This single setting saves ~14,000 tokens per turn. At scale, that's the difference between a $50/month bill and a $200/month bill.
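If you're building a custom agent, it's worth measuring this overhead directly before trimming. A rough sketch — the tool definition below and the ~4-characters-per-token heuristic are illustrative, not exact:

```python
import json

# Hypothetical tool definition, sized like a typical agent schema entry.
read_file_tool = {
    "name": "read_file",
    "description": "Read a file from the workspace and return its contents.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path to the file"},
        },
        "required": ["path"],
    },
}

def estimate_schema_tokens(tools, chars_per_token=4):
    """Rough estimate: JSON-heavy text runs ~4 characters per token."""
    return sum(len(json.dumps(tool)) // chars_per_token for tool in tools)

# One small tool is cheap; a few dozen rich ones add up fast.
per_tool_tokens = estimate_schema_tokens([read_file_tool])
```

Run this against your real tool list: multiply the per-tool figure by your tool count and you'll see how much of every turn goes to schemas alone.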

2. Cache Expiry — The Silent Killer

Prompt caching saves 90% on repeated tokens — but only if the cache hasn't expired. Anthropic's prompt cache TTL is 5 minutes by default, with an extended 1-hour option available on the API. Here's the problem: if you step away for 6 minutes and come back, the entire conversation history, all tool schemas, and all system prompts get re-processed at full price. In real-world data, 54% of turns hit an expired cache, causing a 10x cost spike on those turns.

The solution isn't to type faster — it's to structure your workflow around cache-friendly patterns. Batch related questions together. Use session-based tools instead of restarting. If you're building an agent, implement checkpointing so it can resume without re-sending the full history.
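Here's what checkpointing can look like in a custom agent loop — a minimal sketch, assuming a JSON file as the checkpoint store and a summary string you've generated elsewhere (with a cheap model, say):

```python
import json
import pathlib
import tempfile

def save_checkpoint(path, history, summary):
    """Persist a compressed session view: a one-line summary plus the last
    few turns, so a restarted agent resumes without replaying everything."""
    path.write_text(json.dumps({"summary": summary, "tail": history[-4:]}))

def resume_checkpoint(path):
    """Rebuild a short context from the checkpoint instead of full history."""
    state = json.loads(path.read_text())
    opener = {"role": "user", "content": f"Session summary: {state['summary']}"}
    return [opener] + state["tail"]

# Demo: a 20-message history resumes as 5 messages.
history = [{"role": "user", "content": f"turn {i}"} for i in range(20)]
with tempfile.TemporaryDirectory() as d:
    ckpt = pathlib.Path(d) / "agent_checkpoint.json"
    save_checkpoint(ckpt, history, "refactored auth module; tests passing")
    resumed = resume_checkpoint(ckpt)
```

The resumed context is a fraction of the original, and because it's small, it's also cheap to re-send even on a cold cache.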

3. Redundant File Reads

AI agents read files. A lot. In a typical coding session, the same file gets read 3-7 times across different turns — once for context, again for a diff, again to verify changes, again after a failed edit. Each read sends the full file contents as input tokens. For a 500-line file, that's 2,000-3,000 tokens per read, repeated unnecessarily.

Better approaches: keep a working memory of already-read files, use diff-only operations instead of full reads, and implement file caching in your agent loop. If you're an API user, track what's in context and avoid re-sending static content.
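A minimal sketch of that working-memory idea, keyed on path plus modification time (the class and its API are illustrative, not from any particular framework):

```python
import os
import tempfile

class FileReadCache:
    """Avoid re-sending unchanged file contents as input tokens.
    A sketch of the 'working memory' idea, keyed on path + mtime."""
    def __init__(self):
        self._cache = {}

    def read(self, path):
        key = (path, os.path.getmtime(path))
        if key in self._cache:
            return self._cache[key], True   # hit: content already in context
        with open(path) as f:
            text = f.read()
        self._cache[key] = text
        return text, False                  # miss: pay for these tokens once

# Demo: the second read of an unchanged file is a cache hit.
cache = FileReadCache()
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print('hello')\n")
    path = f.name
first, hit1 = cache.read(path)
second, hit2 = cache.read(path)
os.unlink(path)
```

On a hit, the agent can skip the read entirely — or send a one-line "file unchanged" note instead of 2,000-3,000 tokens of contents.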

4. Stateless Conversation Rebuild

LLM conversations are stateless — every turn rebuilds the full history from scratch. Turn 10 of a conversation doesn't just send message 10; it sends messages 1-9 plus the new one. For long sessions, this means your cumulative token usage grows quadratically with turn count. A 20-turn session might consume 500k+ tokens just in history replay.

Mitigation: use conversation summarization at checkpoints (compress turns 1-8 into a summary before sending turn 9). Structure your prompts so earlier context can be safely dropped. For agentic loops, implement a sliding window that keeps only the most relevant N turns in full.
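A sliding window with summarization fits in a few lines. In this sketch the summarize step is a stand-in you'd replace with a cheap-model call:

```python
def compact_history(messages, keep_last=4,
                    summarize=lambda msgs: f"{len(msgs)} earlier turns elided"):
    """Sliding-window compaction: collapse old turns into one summary
    message and keep only the newest `keep_last` verbatim. In practice,
    `summarize` would call a cheap model; here it's injectable."""
    if len(messages) <= keep_last:
        return messages
    head, tail = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "user",
               "content": f"Summary of earlier turns: {summarize(head)}"}
    return [summary] + tail

# Demo: turn 9 sends 5 messages instead of 8.
history = [{"role": "user", "content": f"turn {i}"} for i in range(1, 9)]
compacted = compact_history(history)
```

Run the compaction at checkpoints (every N turns), not every turn — summarizing too often churns the cache and costs more than it saves.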

5. Over-Engineering the System Prompt

System prompts are sent with every API call. A verbose 4,000-token system prompt costs the same as 4,000 input tokens every single turn. Over a 50-turn session, that's 200,000 tokens just on instructions. Audit your system prompts ruthlessly — remove examples that the model already understands, compress multi-paragraph instructions into concise rules, and move rarely-needed instructions into a separate lookup mechanism.
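The arithmetic is worth making concrete. A back-of-envelope estimator, using the common ~4-characters-per-token heuristic:

```python
def system_prompt_cost(prompt: str, turns: int, chars_per_token: int = 4) -> int:
    """A system prompt is re-sent as input tokens on every turn,
    so its session cost scales linearly with turn count."""
    return (len(prompt) // chars_per_token) * turns

# A ~4,000-token prompt (roughly 16k characters) over a 50-turn session:
verbose_prompt = "x" * 16_000   # stand-in for a real system prompt
cost = system_prompt_cost(verbose_prompt, turns=50)  # 200,000 tokens
```

Cutting that prompt in half pays off on every single turn for the life of the session.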


This Isn't Just a Claude Code Problem

These patterns apply everywhere: OpenAI Agents, LangChain workflows, custom API integrations, Cursor, Copilot, any tool that maintains conversation state. The underlying economics are the same — you pay for every token sent, and wasted tokens compound over session length.

If you're building with the Anthropic API directly, the same principles apply: cache your system prompts, minimize tool definitions per call, implement conversation compression, and batch operations to maximize cache hits. The API gives you more control than any wrapper tool — use it.
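As a sketch, here's one way to shape a cache-friendly Messages API payload — static content (system prompt, tool definitions) up front and marked cacheable, volatile messages last. No request is sent, and the model id is illustrative:

```python
def build_cached_request(system_prompt, tools, messages):
    """Shape a Messages API payload so the static prefix is cache-eligible.
    The cache_control marker makes everything up to that point cacheable
    across calls; keep that prefix byte-identical to maximize hits."""
    return {
        "model": "claude-sonnet-4-5",   # illustrative model id
        "max_tokens": 1024,
        "tools": tools,                  # static: part of the cached prefix
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # cache breakpoint
            }
        ],
        "messages": messages,            # volatile: changes every turn
    }

request = build_cached_request(
    "Answer tersely.", [], [{"role": "user", "content": "hi"}]
)
```

The key discipline: anything before the cache breakpoint must not change between calls, or every call becomes a cache write instead of a cache read.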

Quick Wins Checklist

Enable tool search — cut 14k tokens per turn in Claude Code

Batch questions in one turn instead of spreading across many

Use shorter system prompts — compress instructions, drop examples

Implement conversation summarization at checkpoints

Track file reads — cache what's already in context

Match your cache TTL to your workflow cadence

Use diff operations instead of full file re-reads

For API users: implement sliding window context management

The Math That Matters

Let's say you do 100 turns per day across your AI tools. Without optimization, each turn averages ~50k input tokens. With the fixes above — tool search, cache-friendly batching, compressed prompts, file read caching — you can bring that down to ~20k per turn. That's 3M tokens saved per day. At standard pricing, that's real money back in your pocket every month.
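That estimate, spelled out:

```python
turns_per_day = 100
tokens_before, tokens_after = 50_000, 20_000   # avg input tokens per turn
daily_savings = (tokens_before - tokens_after) * turns_per_day  # 3,000,000
```

Plug in your own per-million input price to turn that into a monthly figure for your stack.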

The biggest insight isn't about using a cheaper model — it's about not paying for tokens you don't need. Optimize the waste first, then evaluate whether you even need to switch models.

If You're on Claude Pro or Max (Not Just API)

The five patterns above are engineering-level fixes. But if you pay for Claude Pro ($20/mo) or Max ($100–200/mo) and work in the chat interface, the same waste happens — just in a different form. These six fixes require no code.

1. Caveman Mode — 65–87% Token Reduction

A Claude plugin that strips the conversational filler. Instead of "I'd be happy to help! Let me walk you through this step by step..." you get the answer, then it stops. Average savings: 75% per request.

Real benchmark numbers from API testing:

explain React re-render bug:        1,180 → 159 tokens  (87% saved)
fix auth middleware:                   704 → 121 tokens  (83% saved)
set up PostgreSQL connection pool:  2,347 → 380 tokens  (84% saved)
implement React error boundary:     3,454 → 456 tokens  (87% saved)
debug PostgreSQL race condition:    1,200 → 232 tokens  (81% saved)

Average across 10 tasks: 75% savings

Three compression levels: lite (professional, no filler), full (fragments, grunt mode), ultra (maximum compression). Start with lite. Switch to full for agentic tasks where you're reading structured output anyway.

2. Window Anchoring — Zero Dead Hours

Claude usage runs on a sliding 5-hour window. If your first real message is at 8:30am, the window anchors at 8:00 and runs until 13:00. Hit the limit at 11:00 — you wait 2 hours.

Fix: send a throwaway message to Haiku at 6:15am. Window anchors at 6:00, runs until 11:00. At 11:00 it resets immediately — next window is 11:00–16:00. Zero dead hours. Same total budget, better distribution.

# .github/workflows/claude-warmup.yml
on:
  schedule:
    - cron: '15 6 * * 1-5'  # 6:15am weekdays
jobs:
  warmup:
    runs-on: ubuntu-latest
    steps:
      - name: Ping Haiku
        run: |
          curl -X POST https://api.anthropic.com/v1/messages \
            -H "x-api-key: ${{ secrets.CLAUDE_API_KEY }}" \
            -H "anthropic-version: 2023-06-01" \
            -H "content-type: application/json" \
            -d '{"model":"claude-haiku-4-5-20251001","max_tokens":1,"messages":[{"role":"user","content":"hi"}]}'

Or use Claude's built-in scheduling: /schedule "send 'hi' to haiku at 6:15 AM every weekday"

3. Edit, Don't Follow Up

Every new message adds to context. Claude re-reads the entire history on every turn. The cost compounds fast:

Turn  5: ~7,500 tokens of history
Turn 10: ~27,500 tokens
Turn 20: ~105,000 tokens
Turn 30: ~232,000 tokens  ← turn 30 costs 31x more than turn 5

When Claude misunderstands, don't type "No, I meant X" as a new message. Hit Edit on the original, fix it, regenerate. The bad exchange is replaced, not stacked.

4. Batch Your Questions

Three questions in one message = one context load. Three questions in three messages = three context loads.

# Bad: three separate turns
"Summarize this article"          → wait
"Now list the main points"        → wait
"Now suggest a headline"          → wait

# Good: one turn
"Summarize this article, list the main points, and suggest a headline."  → wait once

Answers are often better too — Claude sees the full picture at once instead of rebuilding context between turns.

5. Projects — Upload Once, Reference Forever

Uploading the same PDF to multiple chats re-tokenizes it every time. A 100-page document is ~75,000 tokens. Uploaded five times = 375,000 tokens burned.

In Projects: upload once, cached. Every conversation inside the project references it without burning tokens. Contracts, briefs, style guides — this single habit can save $15–40/month in repeat uploads alone.

6. Memory — Zero Setup Tokens

Every new chat without saved context = 3–5 setup messages. "I'm a developer, I want short answers, always use TypeScript..." — repeated across every new conversation.

5 messages × 500 tokens × 10 new chats per day = 25,000 tokens/day
just repeating the same setup information

Settings → Memory. Save your role and preferences once. Claude applies them to every new chat automatically. Zero setup tokens.

Resources

Anthropic Prompt Caching docs — how cache TTL works and how to maximize hits

Claude Code settings reference — ENABLE_TOOL_SEARCH and other token-saving options

Token counting tools — tiktoken, Anthropic tokenizer — know your actual usage

Conversation compression patterns — LangChain, LlamaIndex, and custom implementations