Your AI Bill Has a Per-Token Leak: 40-60% of API Tokens Are Wasted
Research across 86,000 developers shows 40–60% of LLM API budgets are consumed by operational inefficiencies — redundant system prompts, underutilized tool schemas, and semantically empty reasoning tokens. Open models with efficient tokenizers and on-premises deployment eliminate the per-token cost of waste entirely.
The $450/Month You Didn’t Know You Were Losing
A 10-person engineering team making 1,000 API calls per day on Claude Opus 4.7 spends roughly $1,125/month on input tokens alone. According to research spanning 86,000 developers and multiple production audits, 40-60% of that spend goes to tokens that contribute nothing to the answer.
That’s $450–$675 per month paying for redundant context, re-sent system prompts, verbose reasoning, and tool schemas that never get called. The same team running the same models on-premises — where tokens are free — would never notice the waste. On a per-token API, every wasted token costs real money.
The first systematic study of token consumption in agentic coding — "How Do AI Agents Spend Your Money?" (arXiv:2604.22750) — tested 8 frontier LLMs on SWE-bench Verified. The findings: agentic tasks consume 1,000x more tokens than simple code reasoning. Input tokens drive cost, not output. And models differ by 1.5M+ tokens in efficiency on the same tasks. More tokens does not mean better accuracy — accuracy often peaks at intermediate cost.
"Agentic tasks consume orders of magnitude more tokens than simple reasoning tasks, and token usage varies enormously across repeated runs of the same task — up to 30x. More tokens does not reliably produce better results." — Bai et al., "How Do AI Agents Spend Your Money?" (April 2026)
Where the Tokens Leak
Token waste isn’t one problem. It’s ten problems, each compounding the others. Ranked by dollar impact on a typical production workload:
1. Tool/Function Schema Bloat — 55K–134K tokens before work starts
Every agentic API call includes JSON schemas for every tool the model might use. 40+ tools means 55,000–134,000 tokens of schema definitions sent before any actual work begins. On Claude Opus 4.7, that’s $0.28–$0.67 per call in tool definitions alone. Most of those tools are never called.
2. System Prompt Repetition — Stateless = Re-sent Every Call
LLM APIs are stateless. A 2K–8K token system prompt gets billed on every single request. A 10-turn conversation means that system prompt is billed 10 times. At 10,000 daily calls, that’s 20M+ tokens/month in system prompt repetition — $100+/month on Opus 4.7 just for saying the same thing over and over.
3. Context Accumulation — 50K+ Tokens of History Per Call
Multi-turn agents re-send full conversation history on each step. A 20-turn debugging session accumulates 50K+ tokens of history — billed in full on every subsequent call. Research shows the last few turns typically contain all the relevant context, but you pay for all 20.
4. Verbose Reasoning — 55–63% "Word Salad"
The EMNLP 2025 paper "Word Salad Chopper" found that 55–63% of reasoning tokens in DeepSeek-R1-Distill models are semantically redundant — repetitive loops that add no value. Models are self-aware when trapped in these loops (detectable with >93% accuracy), but they can’t stop themselves from producing them.
5. Review/Rework Loops — 59% of Agentic Tokens
Not initial generation, but iteration. Agentic review-and-fix cycles consume ~59% of total tokens. The model writes code, reviews it, finds issues, fixes them, reviews again. Most of this is productive — but a significant portion is the model correcting its own verbose output.
6. Context Stuffing — 3,729 Tokens vs. 67 Tokens for the Same Answer
The "context stuffing antipattern" — dumping all available information into a prompt rather than retrieving selectively — can inflate a query from 67 tokens to 3,729 tokens, a 55x waste factor. Accuracy also degrades in the "lost-in-the-middle" zone where models ignore information buried in long contexts.
7. Retry Amplification — 34% of Calls Are Retries
One production audit found 34% of API calls were retries — failed attempts that produced nothing usable. JSON parse errors, format violations, and tool-call failures all trigger automatic retries that bill again at full price.
8. Runaway Loops — The $47,000 Bill
In November 2025, a documented case showed an agentic coding tool stuck in an infinite loop, accumulating a $47,000 bill over 11 days. Usage caps and spend limits are essential defenses — but they’re reactive, not preventive.
9. RAG Retrieval Bloat — 5,000+ Tokens Per Call
Stuffing 10+ document chunks into every prompt adds 5,000+ input tokens per call — often more than the query and response combined. Selective retrieval with relevance thresholds can cut this by 70% without quality loss.
10. Few-Shot Bloat — 85% Input Reduction Achievable
Static prompts with many examples carry 5–20x more tokens than needed. Keyword-style compressed prompts achieve 85% input reduction with no accuracy loss. Most teams never audit their few-shot counts.
The compounding effect is the real problem. Tool schema bloat means you start 55K tokens in the hole. System prompt repetition multiplies that baseline by every call. Context accumulation adds another 50K per turn. By the 10th turn of a debugging session, you’re paying for 500K+ tokens per call — most of which the model has already seen and doesn’t need again.
The Same Model, Different Bill
The agent framework wrapped around the model determines how many tokens get spent. The same underlying model can produce wildly different bills depending on the framework’s token efficiency:
Token-Efficient Frameworks
- Aider: AST repo-map selects only relevant files; unified diffs eliminate full-file re-sends; $0.01–$0.10 per feature
- Claude Code: Grep+glob exploration avoids file dumps; exact-string-replace edits avoid re-sending entire files; accuracy improves 79.5% → 88.1%
- Cursor: Apply model 5–10x cheaper per call; vector DB index adds negligible ongoing cost
Token-Heavy Frameworks
- Unoptimized agents: Full file dumps on every call; 30-step session re-sends original prompt 30 times
- Copilot with MCP: 40 MCP tools = 10–15KB schema per turn; each tool definition billed on every call
- Naive RAG pipelines: 10+ document chunks per prompt regardless of relevance
The math is stark: a 200K-token session on Claude Opus 4.7, replayed across 30 agent turns, produces ~6M input tokens — roughly $18 in input costs alone. The same task with Aider’s selective file inclusion might use 200K–400K total tokens — $1–$2. The model is the same. The bill is 10x different.
GitHub’s own production optimization is instructive. After deploying token auditors across their agentic workflows, they achieved: Auto-Triage Issues 62% reduction, Security Guard 43% reduction, Smoke Claude 59% reduction, Daily Community Attribution 37% reduction. Their "Effective Tokens" formula weights cache-read tokens at 0.1x — matching Anthropic’s pricing structure and acknowledging that cached tokens are worth a fraction of fresh ones.
What Prompt Caching Actually Saves
Prompt caching is the most visible API-level defense against token waste. Both Anthropic and OpenAI offer it, but the economics differ significantly:
Anthropic Prompt Caching
- Cache reads: 10% of base input price
- Cache writes: 25% surcharge
- 90% cost reduction on cached portions
- 85% latency reduction
- Requires explicit cache markers
- 5-minute TTL on cache entries
OpenAI Prompt Caching
- Automatic (no code changes)
- 50–90% off input price
- No write surcharge
- Minimum 1,024 tokens
- Cache hit rate: 60–87%
- Varies by model tier
The "Don’t Break the Cache" study (arXiv:2601.06007) tested caching strategies across providers on the DeepResearch Bench. The key finding: strategic caching outperforms naive caching. Caching only the system prompt achieved 78–80% cost reduction. Naively caching the full context — including dynamic tool results that change every call — paradoxically increased latency by caching content that won’t be reused.
Caching helps with stable prefixes. It does nothing for the dynamic waste sources — context accumulation, verbose reasoning, retry loops, tool schema bloat on uncached calls. And it only works on API models where you’re paying per token. On-premises, caching is a performance optimization, not a cost necessity — because tokens are free.
The Chinese Tokenizer Curiosity: Open Models Have a Hidden Efficiency Edge
Here’s a curiosity that most English-speaking teams never encounter — but that reveals a deeper truth about why open models have efficiency advantages beyond raw benchmark scores.
LLM tokenizers are not language-neutral. The way a model splits text into tokens depends entirely on its training data and vocabulary design. And the differences are enormous:
Mark Huang’s benchmark tells the story. The same document — a CLAUDE.md file translated to Chinese — tokenized across six models:
Western Tokenizers: Chinese Costs More
- GPT-3 (p50k_base): 1,329 EN → 3,753 ZH (+182%)
- GPT-4 (cl100k_base): 1,200 EN → 2,001 ZH (+67%)
- GPT-4o (o200k_base): 1,196 EN → 1,479 ZH (+24%)
Chinese-Origin Tokenizers: Chinese Costs Less
- Qwen 2.5: 1,203 EN → 1,351 ZH (+12%)
- GLM-4: 1,202 EN → 1,307 ZH (+9%)
- DeepSeek-V2: 1,324 EN → 1,427 ZH (+8%)
- Kimi: 0.81x ratio (Chinese cheaper)
The reason is vocabulary design. GPT-3’s tokenizer had only 39 CJK tokens total — nearly every Chinese character was split into 2–3 byte-level tokens. Chinese-origin models learned BPE merges from Chinese-heavy training data, so common characters and phrases exist as whole tokens. Qwen’s 152K vocabulary, DeepSeek’s 128K vocabulary, and GLM’s 130K vocabulary all dedicate significant space to CJK characters that Western models fragment.
The cost implication is direct. On a Western tokenizer, a Chinese-language team pays 24–67% more per query for the same semantic content. On DeepSeek’s tokenizer, Chinese is 35% cheaper than English. Aran Komatsuzaki’s cross-model test confirmed the pattern: Claude’s old tokenizer charged 1.65x for Chinese; Kimi’s charged 0.81x.
"Each additional token per word reduces MCQA accuracy by 8–18 percentage points. Chinese users don’t just pay more — they get worse results." — Lundin et al., "The Token Tax: Systematic Bias in Multilingual Tokenization" (arXiv:2509.05486)
Now, before you start translating your English prompts to Chinese to save tokens — don’t. A SWE-bench study (arXiv:2604.14210) tested exactly this: Chinese prompts on coding tasks cost more tokens and dropped 9.5 percentage points in task success rate. The tokenizer savings are real, but the language-switch penalty erases them for English-coded workloads.
The real insight isn’t about language. It’s about tokenizer control. Open models ship their tokenizers as code. You can inspect them, understand their efficiency characteristics, and — critically — modify them. Chinese-origin models like GLM-5.1 and DeepSeek V4 have vocabularies optimized for multilingual efficiency because their creators needed that. If you have multilingual workloads, those tokenizers save money at the tokenizer level before the model even runs.
You can’t change Claude’s tokenizer. You can’t audit GPT-4o’s vocabulary allocation. With open models, tokenizer efficiency is another dimension of the cost stack you can inspect and optimize — just like scaffolding, just like model selection, just like on-premises deployment.
Big Task, Big Bill
The waste compounds fast at scale. Real-world agentic coding projects show just how many tokens these systems burn — and what they cost:
Google’s Antigravity 2.0 demo at I/O 2026 ran 93 parallel agents for 12 hours, processing 2.6 billion tokens across 15,000 model calls — and built a complete operating system from scratch. The entire run cost under $1,000 in API credits, because Gemini 3.5 Flash charges $1.50/$9.00 per million tokens and aggressive caching brought the effective input cost to $0.15/M. Run the same 2.6B tokens through Claude Opus 4.7 without caching: $20,800. With caching at 50% hit rate: $10,400. Same task, 10–20x the price.
At the other extreme, Peter Steinberger’s OpenClaw project ran 100 GPT-5.5 Codex instances for 30 days, consuming 603 billion tokens at a cost of $1.3 million in Fast Mode (or ~$300K standard). Jason Hoffman’s startup used 6–15 parallel Claude Code instances for 27 days, burning ~1 billion input tokens for $1,800 — shipping 309K lines of code across 165 screens. Alair J.T. processed 26.8 billion tokens in 46 days across 39 projects, with a 46:1 cache leverage ratio meaning 97.9% of tokens were cheap cache reads.
These are not hypotheticals. They are production data from real teams shipping real software. And they show a consistent pattern: the bill scales with token volume, not with output quality. The Antigravity OS build was cheap because the model was cheap and caching was aggressive. The OpenClaw build was expensive because neither was true.
For individual developers, the monthly numbers are equally revealing:
Developer Token Profiles
- Light (1–2 sessions/day): 50–100M tokens/month — $50–$100
- Medium (3–5 hrs/day): 200–500M tokens/month — $130–$260
- Heavy (all-day multi-agent): 2–5B tokens/month — $2,000–$12,500
- Extreme (autonomous fleet): 5B–603B tokens/month — $12,500–$1.3M
Waste at Scale
- Heavy user at 50% waste on Opus 4.7: $5,000/month thrown away
- Team of 10 heavy users: $50,000/month in waste
- Year: $600,000 paying for tokens that contributed nothing
- On-premises: the same waste is free
Run the numbers for your own team. Use the cost calculator →
The Fix: Where Open Models Win on Efficiency
Token waste is a solvable problem. The fix combines architectural choices with deployment strategy:
- Route simple tasks to cheaper models. FrugalGPT (Stanford) demonstrated up to 98% cost reduction through LLM cascades — routing queries from cheapest to most expensive with reliability scoring. No single model is universally superior; cheap models correctly answer queries that expensive models get wrong.
- Use prompt caching strategically. Cache system prompts only, not full context. The "Don’t Break the Cache" study showed this achieves 78–80% cost reduction without the latency penalty of naive caching.
- Trim tool schemas to what’s needed per step. 85% of tool schema overhead is removable via on-demand loading. Only send the 3–5 tools relevant to the current step, not the full 40+ tool catalog.
- Compress system prompts. 85% input reduction is achievable via keyword-style compression without accuracy loss. Audit your system prompts — most are 3–5x longer than necessary.
- Use open models with efficient tokenizers for multilingual workloads. DeepSeek and Kimi tokenizers make Chinese 35% cheaper; GLM’s tokenizer narrows the gap to 9%. If your team works in multiple languages, tokenizer choice is a cost factor proprietary APIs don’t let you control.
- Run on-premises to eliminate per-token billing entirely. Waste doesn’t cost extra when tokens are free. A Faraday Machines cluster running Kimi K2.6 or GLM-5.1 pays only for hardware amortization and electricity. Verbose reasoning, retry loops, context accumulation — none of it appears on a bill.
The combined effect of these optimizations is 60–80% cost reduction on API spend. But the most powerful fix is the last one: on-premises deployment makes token waste an engineering problem instead of a financial one. You still want efficient code — but you’re optimizing for speed and quality, not for billing.
The Bottom Line
40–60% of your API bill is waste you can measure, optimize, and eliminate. The waste sources are well-documented: tool schema bloat, system prompt repetition, context accumulation, verbose reasoning, retry loops. The fixes are available: model routing, strategic caching, schema trimming, prompt compression.
But the most fundamental fix is architectural. Per-token billing makes every inefficiency expensive. On-premises deployment makes every inefficiency free — and lets you focus on what matters: getting the right answer, not minimizing the token count to get there.
Open models compound this advantage. Their tokenizers can be inspected and optimized for your language mix. Their scaffolding can be customized without vendor lock-in. And on a Faraday Machines cluster, they run at full speed with zero per-token costs, zero data leakage, and zero usage caps.
The question isn’t whether your API bill has a leak. It does. The question is whether you want to keep paying for it — or fix it at the architecture level.
References
[1] Bai et al. (2026). "How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks." arXiv:2604.22750. Available at: arxiv.org
[2] Lundin et al. (2025). "The Token Tax: Systematic Bias in Multilingual Tokenization." arXiv:2509.05486. Available at: arxiv.org
[3] Chen et al. (2023). "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." Stanford. arXiv:2305.05176. Available at: arxiv.org
[4] "Word Salad Chopper" (EMNLP 2025). "Chopping Word Salad: Detecting and Reducing Redundancy in Reasoning Tokens." Available at: aclanthology.org
[5] "Don’t Break the Cache" (2026). "Prompt Caching Strategies for Production LLM Pipelines." arXiv:2601.06007. Available at: arxiv.org
[6] Mark Huang. (2026). "No, Chinese Is Not More Token-Efficient Than English for LLMs." Available at: markhuang.ai
[7] GitHub Blog. (2026). "Improving Token Efficiency in GitHub Agentic Workflows." Available at: github.blog
[8] "Mythbuster: Chinese Language Is Not More Efficient Than English in Vibe Coding." arXiv:2604.14210. Available at: arxiv.org
[9] Tian Pan. (2026). "The Hidden Token Tax: How Production LLM Pipelines Waste 30–60% of Your Context Window." Available at: tianpan.co
[10] Sparkco AI. (2026). "The Token Waste Problem: How Modern AI Agents Are Cutting Context Costs by 38%." Available at: sparkco.ai
Stop Paying for Wasted Tokens
Faraday Machines clusters run Kimi K2.6 and GLM-5.1 at full speed with zero per-token costs. Verbose reasoning, retry loops, and context accumulation don’t appear on a bill — because there is no bill. Audit your current API spend, then see what on-premises saves.
Get a Cost Comparison