Your AI Bill Has a Per-Token Leak: 40-60% of API Tokens Are Wasted

The $450/Month You Didn’t Know You Were Losing

A 10-person engineering team making 1,000 API calls per day on Claude Opus 4.7 spends roughly $1,125/month on input tokens alone. According to research spanning 86,000 developers and multiple production audits, 40-60% of that spend goes to tokens that contribute nothing to the answer.

That’s $450–$675 per month paying for redundant context, re-sent system prompts, verbose reasoning, and tool schemas that never get called. The same team running the same models on-premises — where tokens are free — would never notice the waste. On a per-token API, every wasted token costs real money.

40-60%

Of LLM API budgets go to operational inefficiencies, not necessary usage

1,000x

More tokens consumed by agentic tasks vs. simple code reasoning

30x

Token usage variation across repeated runs of the same task

The first systematic study of token consumption in agentic coding — "How Do AI Agents Spend Your Money?" (arXiv:2604.22750) — tested 8 frontier LLMs on SWE-bench Verified. The findings: agentic tasks consume 1,000x more tokens than simple code reasoning. Input tokens drive cost, not output. And models differ by 1.5M+ tokens in efficiency on the same tasks. More tokens does not mean better accuracy — accuracy often peaks at intermediate cost.

"Agentic tasks consume orders of magnitude more tokens than simple reasoning tasks, and token usage varies enormously across repeated runs of the same task — up to 30x. More tokens does not reliably produce better results." — Bai et al., "How Do AI Agents Spend Your Money?" (April 2026)

Where the Tokens Leak

Token waste isn’t one problem. It’s ten problems, each compounding the others. Ranked by dollar impact on a typical production workload:

1. Tool/Function Schema Bloat — 55K–134K tokens before work starts

Every agentic API call includes JSON schemas for every tool the model might use. 40+ tools means 55,000–134,000 tokens of schema definitions sent before any actual work begins. On Claude Opus 4.7, that’s $0.28–$0.67 per call in tool definitions alone. Most of those tools are never called.

2. System Prompt Repetition — Stateless = Re-sent Every Call

LLM APIs are stateless. A 2K–8K token system prompt gets billed on every single request. A 10-turn conversation means that system prompt is billed 10 times. At 10,000 daily calls, that’s 20M+ tokens/month in system prompt repetition — $100+/month on Opus 4.7 just for saying the same thing over and over.

3. Context Accumulation — 50K+ Tokens of History Per Call

Multi-turn agents re-send full conversation history on each step. A 20-turn debugging session accumulates 50K+ tokens of history — billed in full on every subsequent call. Research shows the last few turns typically contain all the relevant context, but you pay for all 20.

4. Verbose Reasoning — 55–63% "Word Salad"

The EMNLP 2025 paper "Word Salad Chopper" found that 55–63% of reasoning tokens in DeepSeek-R1-Distill models are semantically redundant — repetitive loops that add no value. Models are self-aware when trapped in these loops (detectable with >93% accuracy), but they can’t stop themselves from producing them.

5. Review/Rework Loops — 59% of Agentic Tokens

Not initial generation, but iteration. Agentic review-and-fix cycles consume ~59% of total tokens. The model writes code, reviews it, finds issues, fixes them, reviews again. Most of this is productive — but a significant portion is the model correcting its own verbose output.

6. Context Stuffing — 3,729 Tokens vs. 67 Tokens for the Same Answer

The "context stuffing antipattern" — dumping all available information into a prompt rather than retrieving selectively — can inflate a query from 67 tokens to 3,729 tokens, a 55x waste factor. Accuracy also degrades in the "lost-in-the-middle" zone where models ignore information buried in long contexts.

7. Retry Amplification — 34% of Calls Are Retries

One production audit found 34% of API calls were retries — failed attempts that produced nothing usable. JSON parse errors, format violations, and tool-call failures all trigger automatic retries that bill again at full price.

8. Runaway Loops — The $47,000 Bill

In November 2025, a documented case showed an agentic coding tool stuck in an infinite loop, accumulating a $47,000 bill over 11 days. Usage caps and spend limits are essential defenses — but they’re reactive, not preventive.

9. RAG Retrieval Bloat — 5,000+ Tokens Per Call

Stuffing 10+ document chunks into every prompt adds 5,000+ input tokens per call — often more than the query and response combined. Selective retrieval with relevance thresholds can cut this by 70% without quality loss.

10. Few-Shot Bloat — 85% Input Reduction Achievable

Static prompts with many examples carry 5–20x more tokens than needed. Keyword-style compressed prompts achieve 85% input reduction with no accuracy loss. Most teams never audit their few-shot counts.

The compounding effect is the real problem. Tool schema bloat means you start 55K tokens in the hole. System prompt repetition multiplies that baseline by every call. Context accumulation adds another 50K per turn. By the 10th turn of a debugging session, you’re paying for 500K+ tokens per call — most of which the model has already seen and doesn’t need again.

The Same Model, Different Bill

The agent framework wrapped around the model determines how many tokens get spent. The same underlying model can produce wildly different bills depending on the framework’s token efficiency:

85%

Token savings from Claude Code’s programmatic tool calling vs. sequential approaches

80-98%

Token reduction via Aider’s AST repo-map and unified diffs

37-62%

Token reduction GitHub achieved across production workflows with token auditors

Token-Efficient Frameworks

Aider: AST repo-map selects only relevant files; unified diffs eliminate full-file re-sends; $0.01–$0.10 per feature
Claude Code: Grep+glob exploration avoids file dumps; exact-string-replace edits avoid re-sending entire files; accuracy improves 79.5% → 88.1%
Cursor: Apply model 5–10x cheaper per call; vector DB index adds negligible ongoing cost

Token-Heavy Frameworks

Unoptimized agents: Full file dumps on every call; 30-step session re-sends original prompt 30 times
Copilot with MCP: 40 MCP tools = 10–15KB schema per turn; each tool definition billed on every call
Naive RAG pipelines: 10+ document chunks per prompt regardless of relevance

The math is stark: a 200K-token session on Claude Opus 4.7, replayed across 30 agent turns, produces ~6M input tokens — roughly $18 in input costs alone. The same task with Aider’s selective file inclusion might use 200K–400K total tokens — $1–$2. The model is the same. The bill is 10x different.

GitHub’s own production optimization is instructive. After deploying token auditors across their agentic workflows, they achieved: Auto-Triage Issues 62% reduction, Security Guard 43% reduction, Smoke Claude 59% reduction, Daily Community Attribution 37% reduction. Their "Effective Tokens" formula weights cache-read tokens at 0.1x — matching Anthropic’s pricing structure and acknowledging that cached tokens are worth a fraction of fresh ones.

What Prompt Caching Actually Saves

Prompt caching is the most visible API-level defense against token waste. Both Anthropic and OpenAI offer it, but the economics differ significantly:

Anthropic Prompt Caching

Cache reads: 10% of base input price
Cache writes: 25% surcharge
90% cost reduction on cached portions
85% latency reduction
Requires explicit cache markers
5-minute TTL on cache entries

OpenAI Prompt Caching

Automatic (no code changes)
50–90% off input price
No write surcharge
Minimum 1,024 tokens
Cache hit rate: 60–87%
Varies by model tier

The "Don’t Break the Cache" study (arXiv:2601.06007) tested caching strategies across providers on the DeepResearch Bench. The key finding: strategic caching outperforms naive caching. Caching only the system prompt achieved 78–80% cost reduction. Naively caching the full context — including dynamic tool results that change every call — paradoxically increased latency by caching content that won’t be reused.

Caching helps with stable prefixes. It does nothing for the dynamic waste sources — context accumulation, verbose reasoning, retry loops, tool schema bloat on uncached calls. And it only works on API models where you’re paying per token. On-premises, caching is a performance optimization, not a cost necessity — because tokens are free.

The Chinese Tokenizer Curiosity: Open Models Have a Hidden Efficiency Edge

Here’s a curiosity that most English-speaking teams never encounter — but that reveals a deeper truth about why open models have efficiency advantages beyond raw benchmark scores.

LLM tokenizers are not language-neutral. The way a model splits text into tokens depends entirely on its training data and vocabulary design. And the differences are enormous:

+67%

More tokens for Chinese vs. English on GPT-4’s cl100k_base tokenizer

-35%

Fewer tokens for Chinese vs. English on DeepSeek’s tokenizer

39

Total CJK tokens in GPT-3’s tokenizer (out of 50K vocabulary)

Mark Huang’s benchmark tells the story. The same document — a CLAUDE.md file translated to Chinese — tokenized across six models:

Western Tokenizers: Chinese Costs More

GPT-3 (p50k_base): 1,329 EN → 3,753 ZH (+182%)
GPT-4 (cl100k_base): 1,200 EN → 2,001 ZH (+67%)
GPT-4o (o200k_base): 1,196 EN → 1,479 ZH (+24%)

Chinese-Origin Tokenizers: Chinese Costs Less

Qwen 2.5: 1,203 EN → 1,351 ZH (+12%)
GLM-4: 1,202 EN → 1,307 ZH (+9%)
DeepSeek-V2: 1,324 EN → 1,427 ZH (+8%)
Kimi: 0.81x ratio (Chinese cheaper)

The reason is vocabulary design. GPT-3’s tokenizer had only 39 CJK tokens total — nearly every Chinese character was split into 2–3 byte-level tokens. Chinese-origin models learned BPE merges from Chinese-heavy training data, so common characters and phrases exist as whole tokens. Qwen’s 152K vocabulary, DeepSeek’s 128K vocabulary, and GLM’s 130K vocabulary all dedicate significant space to CJK characters that Western models fragment.

The cost implication is direct. On a Western tokenizer, a Chinese-language team pays 24–67% more per query for the same semantic content. On DeepSeek’s tokenizer, Chinese is 35% cheaper than English. Aran Komatsuzaki’s cross-model test confirmed the pattern: Claude’s old tokenizer charged 1.65x for Chinese; Kimi’s charged 0.81x.

"Each additional token per word reduces MCQA accuracy by 8–18 percentage points. Chinese users don’t just pay more — they get worse results." — Lundin et al., "The Token Tax: Systematic Bias in Multilingual Tokenization" (arXiv:2509.05486)

Now, before you start translating your English prompts to Chinese to save tokens — don’t. A SWE-bench study (arXiv:2604.14210) tested exactly this: Chinese prompts on coding tasks cost more tokens and dropped 9.5 percentage points in task success rate. The tokenizer savings are real, but the language-switch penalty erases them for English-coded workloads.

The real insight isn’t about language. It’s about tokenizer control. Open models ship their tokenizers as code. You can inspect them, understand their efficiency characteristics, and — critically — modify them. Chinese-origin models like GLM-5.1 and DeepSeek V4 have vocabularies optimized for multilingual efficiency because their creators needed that. If you have multilingual workloads, those tokenizers save money at the tokenizer level before the model even runs.

You can’t change Claude’s tokenizer. You can’t audit GPT-4o’s vocabulary allocation. With open models, tokenizer efficiency is another dimension of the cost stack you can inspect and optimize — just like scaffolding, just like model selection, just like on-premises deployment.

Big Task, Big Bill

The waste compounds fast at scale. Real-world agentic coding projects show just how many tokens these systems burn — and what they cost:

2.6B

Tokens processed by Antigravity 2.0 to build an entire OS in 12 hours — for under $1,000 on Gemini 3.5 Flash

603B

Tokens consumed by OpenClaw’s 100 Codex instances over 30 days — at $1.3M on GPT-5.5 Fast Mode

~$80K

Estimated API cost for Cursor’s browser-from-scratch build — 1M+ lines of code, 1,000+ parallel agents

Google’s Antigravity 2.0 demo at I/O 2026 ran 93 parallel agents for 12 hours, processing 2.6 billion tokens across 15,000 model calls — and built a complete operating system from scratch. The entire run cost under $1,000 in API credits, because Gemini 3.5 Flash charges $1.50/$9.00 per million tokens and aggressive caching brought the effective input cost to $0.15/M. Run the same 2.6B tokens through Claude Opus 4.7 without caching: $20,800. With caching at 50% hit rate: $10,400. Same task, 10–20x the price.

At the other extreme, Peter Steinberger’s OpenClaw project ran 100 GPT-5.5 Codex instances for 30 days, consuming 603 billion tokens at a cost of $1.3 million in Fast Mode (or ~$300K standard). Jason Hoffman’s startup used 6–15 parallel Claude Code instances for 27 days, burning ~1 billion input tokens for $1,800 — shipping 309K lines of code across 165 screens. Alair J.T. processed 26.8 billion tokens in 46 days across 39 projects, with a 46:1 cache leverage ratio meaning 97.9% of tokens were cheap cache reads.

These are not hypotheticals. They are production data from real teams shipping real software. And they show a consistent pattern: the bill scales with token volume, not with output quality. The Antigravity OS build was cheap because the model was cheap and caching was aggressive. The OpenClaw build was expensive because neither was true.

For individual developers, the monthly numbers are equally revealing:

Developer Token Profiles

Light (1–2 sessions/day): 50–100M tokens/month — $50–$100
Medium (3–5 hrs/day): 200–500M tokens/month — $130–$260
Heavy (all-day multi-agent): 2–5B tokens/month — $2,000–$12,500
Extreme (autonomous fleet): 5B–603B tokens/month — $12,500–$1.3M

Waste at Scale

Heavy user at 50% waste on Opus 4.7: $5,000/month thrown away
Team of 10 heavy users: $50,000/month in waste
Year: $600,000 paying for tokens that contributed nothing
On-premises: the same waste is free

Run the numbers for your own team. Use the cost calculator →

The Fix: Where Open Models Win on Efficiency

Token waste is a solvable problem. The fix combines architectural choices with deployment strategy:

Route simple tasks to cheaper models. FrugalGPT (Stanford) demonstrated up to 98% cost reduction through LLM cascades — routing queries from cheapest to most expensive with reliability scoring. No single model is universally superior; cheap models correctly answer queries that expensive models get wrong.
Use prompt caching strategically. Cache system prompts only, not full context. The "Don’t Break the Cache" study showed this achieves 78–80% cost reduction without the latency penalty of naive caching.
Trim tool schemas to what’s needed per step. 85% of tool schema overhead is removable via on-demand loading. Only send the 3–5 tools relevant to the current step, not the full 40+ tool catalog.
Compress system prompts. 85% input reduction is achievable via keyword-style compression without accuracy loss. Audit your system prompts — most are 3–5x longer than necessary.
Use open models with efficient tokenizers for multilingual workloads. DeepSeek and Kimi tokenizers make Chinese 35% cheaper; GLM’s tokenizer narrows the gap to 9%. If your team works in multiple languages, tokenizer choice is a cost factor proprietary APIs don’t let you control.
Run on-premises to eliminate per-token billing entirely. Waste doesn’t cost extra when tokens are free. A Faraday Machines cluster running Kimi K2.6 or GLM-5.1 pays only for hardware amortization and electricity. Verbose reasoning, retry loops, context accumulation — none of it appears on a bill.

The combined effect of these optimizations is 60–80% cost reduction on API spend. But the most powerful fix is the last one: on-premises deployment makes token waste an engineering problem instead of a financial one. You still want efficient code — but you’re optimizing for speed and quality, not for billing.

The Bottom Line

40–60% of your API bill is waste you can measure, optimize, and eliminate. The waste sources are well-documented: tool schema bloat, system prompt repetition, context accumulation, verbose reasoning, retry loops. The fixes are available: model routing, strategic caching, schema trimming, prompt compression.

But the most fundamental fix is architectural. Per-token billing makes every inefficiency expensive. On-premises deployment makes every inefficiency free — and lets you focus on what matters: getting the right answer, not minimizing the token count to get there.

Open models compound this advantage. Their tokenizers can be inspected and optimized for your language mix. Their scaffolding can be customized without vendor lock-in. And on a Faraday Machines cluster, they run at full speed with zero per-token costs, zero data leakage, and zero usage caps.

The question isn’t whether your API bill has a leak. It does. The question is whether you want to keep paying for it — or fix it at the architecture level.

References

[1] Bai et al. (2026). "How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks." arXiv:2604.22750. Available at: arxiv.org

[2] Lundin et al. (2025). "The Token Tax: Systematic Bias in Multilingual Tokenization." arXiv:2509.05486. Available at: arxiv.org

[3] Chen et al. (2023). "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." Stanford. arXiv:2305.05176. Available at: arxiv.org

[4] "Word Salad Chopper" (EMNLP 2025). "Chopping Word Salad: Detecting and Reducing Redundancy in Reasoning Tokens." Available at: aclanthology.org

[5] "Don’t Break the Cache" (2026). "Prompt Caching Strategies for Production LLM Pipelines." arXiv:2601.06007. Available at: arxiv.org

[6] Mark Huang. (2026). "No, Chinese Is Not More Token-Efficient Than English for LLMs." Available at: markhuang.ai

[7] GitHub Blog. (2026). "Improving Token Efficiency in GitHub Agentic Workflows." Available at: github.blog

[8] "Mythbuster: Chinese Language Is Not More Efficient Than English in Vibe Coding." arXiv:2604.14210. Available at: arxiv.org

[9] Tian Pan. (2026). "The Hidden Token Tax: How Production LLM Pipelines Waste 30–60% of Your Context Window." Available at: tianpan.co

[10] Sparkco AI. (2026). "The Token Waste Problem: How Modern AI Agents Are Cutting Context Costs by 38%." Available at: sparkco.ai