How to Reduce LLM API Costs: Token Optimization Strategies
A practical, battle-tested guide to cutting your LLM API bills by 60-90% without sacrificing output quality. Seven strategies I have refined over years of building production AI features.
Table of Contents
- Introduction: Why LLM API Costs Spiral Out of Control
- Understanding Token Economics
- Strategy 1: Prompt Engineering for Fewer Tokens
- Strategy 2: Context Window Management
- Strategy 3: Code Context Optimization
- Strategy 4: Caching and Memoization
- Strategy 5: Model Routing and Tiering
- Strategy 6: Batching and Rate Management
- Strategy 7: Output Token Control
- Measuring and Monitoring Costs
- Real-World Cost Optimization Case Study
- Conclusion
Introduction: Why LLM API Costs Spiral Out of Control
The first time I integrated an LLM into a production application, the bill at the end of month one was $2,400. I had budgeted $300. That was three years ago, and the lesson cost me more than money. It cost me sleep, a very awkward conversation with my manager, and a weekend spent rewriting half the integration layer. Since then, I have built LLM-powered features across five different products, and cost optimization has become one of my core competencies. This guide distills everything I have learned.
Here is the uncomfortable truth: LLM APIs are priced per token, and tokens accumulate far faster than most developers expect. Let us look at the current pricing landscape for the most popular models as of early 2026:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o mini | $0.15 | $0.60 | 128K |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K |
| Claude Opus 4 | $15.00 | $75.00 | 200K |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M |
| DeepSeek V3 | $0.27 | $1.10 | 128K |
Notice something critical: output tokens cost 3-5x more than input tokens across nearly every provider. This is not a minor detail; it is the single most important fact that should shape your entire optimization strategy. A chatbot generating verbose 800-token responses when 200 tokens would suffice is burning money at a 4x multiplier.
In a typical SaaS application with 10,000 daily active users, each making an average of 5 LLM requests per session with a 2,000-token prompt and 500-token response, your daily token consumption looks like this: 10,000 users times 5 requests equals 50,000 requests per day. With a 2,000-token prompt and 500-token response each, that is 100 million input tokens and 25 million output tokens daily. At GPT-4o rates ($2.50 per million input, $10.00 per million output), that is $250 per day for input and $250 per day for output — roughly $15,000 per month. Scale that to 50,000 users, and you are looking at $75,000 monthly.
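That back-of-envelope math is worth encoding once so you can re-run it as traffic or pricing changes. A minimal sketch, with the scenario's traffic numbers and GPT-4o rates plugged in as assumptions:

```javascript
// Rough monthly cost estimator for the scenario above. All figures are
// illustrative assumptions, not a real billing calculation.
function estimateMonthlyCost({
  dailyUsers,
  requestsPerUser,
  inputTokensPerRequest,
  outputTokensPerRequest,
  inputPricePer1M,   // USD per 1M input tokens
  outputPricePer1M,  // USD per 1M output tokens
}) {
  const dailyRequests = dailyUsers * requestsPerUser;
  const dailyInputTokens = dailyRequests * inputTokensPerRequest;
  const dailyOutputTokens = dailyRequests * outputTokensPerRequest;
  const dailyCost =
    (dailyInputTokens / 1_000_000) * inputPricePer1M +
    (dailyOutputTokens / 1_000_000) * outputPricePer1M;
  return { dailyCost, monthlyCost: dailyCost * 30 };
}

const { dailyCost, monthlyCost } = estimateMonthlyCost({
  dailyUsers: 10_000,
  requestsPerUser: 5,
  inputTokensPerRequest: 2_000,
  outputTokensPerRequest: 500,
  inputPricePer1M: 2.50,   // GPT-4o input
  outputPricePer1M: 10.00, // GPT-4o output
});
// dailyCost → 500, monthlyCost → 15000
```

Swap in the cheaper models from the table to see how dramatically the picture changes: the same traffic on GPT-4o mini comes out around $900/month.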
The good news: every strategy in this guide is something I have personally implemented. Individually, each one shaves 10-30% off the bill. Combined, I have consistently achieved 60-90% cost reductions on production workloads. Let us get into it.
Understanding Token Economics
Before optimizing anything, you need a solid mental model of what you are actually paying for. Tokens are not characters, and they are not words. They are subword units produced by a tokenizer, typically based on Byte Pair Encoding (BPE). For English text, a rough rule of thumb is that 1 token is approximately 4 characters or 0.75 words. But this varies significantly based on content type.
How Tokenization Actually Works
Each model family uses its own tokenizer. OpenAI's GPT-4o uses o200k_base (earlier GPT-4 and GPT-3.5 models use cl100k_base), while Anthropic uses its own tokenizer for Claude models. The tokenizer breaks text into chunks from a learned vocabulary. Common words like "the" or "and" are single tokens, while rare words get split into multiple tokens. Code is particularly token-hungry because variable names, syntax characters, and indentation all consume tokens.
Here are some concrete examples to build your intuition:
// English prose: efficient tokenization
"The quick brown fox jumps over the lazy dog"
// ~10 tokens (close to word count)
// Code: less efficient tokenization
"const handleUserAuthentication = async (req, res) => {"
// ~15 tokens (variable names get split)
// JSON: surprisingly expensive
{"user_id": 12345, "email": "john@example.com", "preferences": {"theme": "dark"}}
// ~28 tokens (all those quotes, colons, and braces)
// Minified code: even worse per semantic unit
"a.b(c,d,{e:f,g:h})"
// ~14 tokens but carries less meaning per token
"user_email": "john@example.com" uses about 10 tokens but carries the same meaning as the natural language phrase "user email is john@example.com" which uses about 8 tokens. When you are stuffing structured data into prompts, the format you choose directly impacts cost.
Input vs. Output Token Pricing
The asymmetric pricing between input and output tokens reflects the computational reality. Generating output tokens requires sequential inference through the model, while input tokens can be processed in parallel during the prefill phase. This is why output tokens cost 3-5x more. The practical implication is clear: invest more effort in reducing output tokens than input tokens. A 100-token reduction in output saves you 3-5x more than a 100-token reduction in input.
The Hidden Costs: System Prompts and Conversation History
Many developers forget that system prompts are sent with every single API call. A 500-token system prompt that seems trivial in isolation becomes significant at scale. If your application makes 1 million API calls per month, that system prompt alone costs you 500 million input tokens, which is $1,250 per month on GPT-4o. Even worse, in multi-turn conversations, the entire conversation history is re-sent with each new message. A 10-turn conversation means the first message gets billed 10 times.
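A sketch of that system-prompt arithmetic, handy for deciding whether a compression pass is worth the effort (GPT-4o input rate assumed):

```javascript
// Monthly cost of a fixed system prompt that rides along on every call.
function systemPromptMonthlyCost(promptTokens, callsPerMonth, inputPricePer1M) {
  return (promptTokens * callsPerMonth / 1_000_000) * inputPricePer1M;
}

// 500-token system prompt, 1M calls/month, GPT-4o input at $2.50 per 1M tokens
const cost = systemPromptMonthlyCost(500, 1_000_000, 2.50); // → 1250
```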
Strategy 1: Prompt Engineering for Fewer Tokens
This is the single highest-impact optimization you can make, and it requires zero infrastructure changes. I have seen prompt engineering alone cut token usage by 40-60% on real projects.
System Prompt Compression
Most system prompts are written in verbose, natural language because that is how we think. But LLMs are remarkably good at interpreting compressed instructions. Here is a real-world example from a code review bot I built:
// BEFORE: 187 tokens
`You are a senior code reviewer. When reviewing code, please follow
these guidelines carefully:
1. Look for potential security vulnerabilities including SQL injection,
XSS attacks, and authentication bypasses
2. Check for performance issues such as N+1 queries, unnecessary
re-renders, and memory leaks
3. Evaluate code readability and suggest improvements to naming
conventions and code structure
4. Identify any violations of SOLID principles
5. Always provide specific, actionable feedback with code examples
Please format your response as a structured review with sections for
each category of feedback.`
// AFTER: 72 tokens
`Senior code reviewer. Check for:
- Security: SQLi, XSS, auth bypass
- Perf: N+1, re-renders, mem leaks
- Readability: naming, structure
- SOLID violations
Give specific fixes w/ code examples.
Format: structured sections per category.`
The compressed version produces nearly identical output quality. I have A/B tested this across hundreds of reviews. The key techniques are: remove filler words ("please", "carefully", "always"), use abbreviations the model understands, use lists instead of sentences, and drop redundant context (the model already knows what code review means).
Few-Shot vs. Zero-Shot Tradeoffs
Few-shot prompting (providing examples in the prompt) dramatically improves output quality for structured tasks, but each example costs tokens. The tradeoff calculation is straightforward:
// Zero-shot: cheaper per call, less consistent output
// Cost: ~50 input tokens for instruction only
"Extract the product name and price from this text. Return JSON."
// One-shot: moderate cost, much better consistency
// Cost: ~120 input tokens
"Extract the product name and price from this text. Return JSON.
Example input: 'The new MacBook Pro starts at $1,999'
Example output: {\"product\": \"MacBook Pro\", \"price\": 1999}
Now process:"
// Few-shot (3 examples): highest cost, best consistency
// Cost: ~250 input tokens
// ... three examples covering edge cases ...
My rule of thumb: use zero-shot when the task is simple and well-defined (classification, sentiment). Use one-shot when you need specific output formatting. Use few-shot only when dealing with ambiguous or domain-specific tasks where the model needs calibration. For production systems, I run a quick experiment with 100 test cases to find the minimum number of examples that achieve my quality threshold, then stick with that.
Instruction Tuning and Prompt Templates
Build a library of optimized prompt templates for your common operations. I maintain a prompt registry in our codebase that looks like this:
// prompt-registry.ts
export const PROMPTS = {
CLASSIFY_INTENT: {
system: `Classify user intent. Categories: question|complaint|request|feedback. Return JSON: {"intent":"...","confidence":0-1}`,
tokens: 28, // tracked for cost estimation
version: "2.3",
},
SUMMARIZE_TICKET: {
system: `Summarize support ticket in 2 sentences. Focus: issue, impact, urgency.`,
tokens: 16,
version: "1.7",
},
EXTRACT_ENTITIES: {
system: `Extract named entities. Return JSON array: [{"text":"...","type":"person|org|product|location"}]`,
tokens: 22,
version: "3.1",
},
} as const;
Version tracking is important because when you update a prompt, you want to A/B test it against the previous version for both quality and token usage. A prompt that uses 20% more tokens but produces 50% fewer errors might still be the right choice.
Strategy 2: Context Window Management
The biggest waste of tokens I see in production applications is sending irrelevant context. This is especially egregious in RAG (Retrieval Augmented Generation) systems and multi-turn conversations.
Only Send What Is Needed
I once audited an internal tool that was passing the entire user profile (including address history, payment methods, and login logs) into every LLM call, even though 90% of requests only needed the user's name and role. The fix was trivial and cut input tokens by 65%:
// BEFORE: sending everything
const context = await db.users.findById(userId);
const response = await llm.chat({
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: `User context: ${JSON.stringify(context)}\n\nQuestion: ${question}` }
]
});
// AFTER: send only relevant fields
const context = await db.users.findById(userId, {
select: ['name', 'role', 'department', 'recentTickets']
});
const slimContext = {
name: context.name,
role: context.role,
dept: context.department,
// Only include recent tickets if the question seems support-related
...(isSupportQuery(question) && { tickets: context.recentTickets.slice(0, 3) })
};
const response = await llm.chat({
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: `Ctx: ${JSON.stringify(slimContext)}\nQ: ${question}` }
]
});
Sliding Window for Conversations
For multi-turn conversations, the naive approach of sending the entire history is a cost disaster. Each new message re-sends every previous message. A 20-turn conversation means the first user message gets billed 20 times. Here is the sliding window approach I use:
function buildConversationContext(history, systemPrompt, maxTokens = 4000) {
// Always include: system prompt + current message
// Then fill remaining budget with recent history, newest first
const systemTokens = estimateTokens(systemPrompt);
const currentTokens = estimateTokens(history[history.length - 1]);
let budget = maxTokens - systemTokens - currentTokens;
const includedMessages = [];
// Walk backwards through history (skip current message)
for (let i = history.length - 2; i >= 0; i--) {
const msgTokens = estimateTokens(history[i]);
if (msgTokens > budget) break;
budget -= msgTokens;
includedMessages.unshift(history[i]);
}
// If conversation is long, add a summary of dropped messages
if (includedMessages.length < history.length - 1) {
const droppedCount = history.length - 1 - includedMessages.length;
includedMessages.unshift({
role: "system",
content: `[${droppedCount} earlier messages omitted. Key topics discussed: ${extractTopics(history.slice(0, droppedCount))}]`
});
}
return [
{ role: "system", content: systemPrompt },
...includedMessages,
history[history.length - 1]
];
}
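The sketch above leans on an estimateTokens helper I did not show. For budget decisions, the chars/4 rule of thumb from the token-economics section is usually close enough; here is a hypothetical implementation (a real tokenizer library would give exact counts):

```javascript
// Crude token estimate using the ~4-characters-per-token rule of thumb for
// English text. Fine for trimming decisions; use the model's real tokenizer
// when you need exact counts.
function estimateTokens(message) {
  const text = typeof message === 'string' ? message : message.content;
  return Math.ceil(text.length / 4);
}

estimateTokens({ role: 'user', content: 'How do I reset my password?' }); // → 7
```

Deliberately overestimating slightly (dividing by 3.5, say) buys you a safety margin against blowing past the context budget.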
RAG vs. Context Stuffing
Context stuffing (dumping entire documents into the prompt) is the lazy approach. RAG (retrieving only relevant chunks) is the cost-effective approach. But even RAG can be wasteful if not tuned properly. Here are the numbers from one of my projects:
| Approach | Avg Input Tokens | Answer Quality | Monthly Cost (100K queries) |
|---|---|---|---|
| Full document stuffing | 12,000 | Good (but noisy) | $3,000 |
| RAG (top-10 chunks) | 4,500 | Good | $1,125 |
| RAG (top-5 chunks, reranked) | 2,200 | Better (less noise) | $550 |
| RAG (top-3 chunks, reranked + compressed) | 1,400 | Best | $350 |
The reranking step (using a cross-encoder model like Cohere Rerank or a smaller local model) costs pennies but lets you retrieve fewer, more relevant chunks. Fewer chunks means fewer input tokens, which means lower cost and often better quality because the model has less noise to filter through.
Strategy 3: Code Context Optimization
If you are building developer tools, AI-assisted coding features, or any application that sends code to an LLM, this section is worth paying close attention to. Code is one of the most token-intensive content types, and most of those tokens carry zero semantic value for the LLM.
Stripping Comments and Docstrings
Comments are written for human developers. The LLM does not need them to understand the code. In my testing, removing comments before sending code to an LLM has zero measurable impact on comprehension quality while saving 10-30% of tokens depending on the codebase:
// BEFORE: 89 tokens
/**
* Calculates the total price including tax and discounts.
* @param items - Array of cart items with price and quantity
* @param taxRate - Tax rate as a decimal (e.g., 0.08 for 8%)
* @param discount - Optional discount amount in dollars
* @returns The final total as a number
*/
function calculateTotal(items, taxRate, discount = 0) {
// Sum up all item prices multiplied by their quantities
const subtotal = items.reduce((sum, item) => sum + item.price * item.qty, 0);
// Apply tax to the subtotal
const withTax = subtotal * (1 + taxRate);
// Subtract any applicable discount
return Math.max(0, withTax - discount);
}
// AFTER: 47 tokens (47% reduction)
function calculateTotal(items, taxRate, discount = 0) {
const subtotal = items.reduce((sum, item) => sum + item.price * item.qty, 0);
const withTax = subtotal * (1 + taxRate);
return Math.max(0, withTax - discount);
}
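This strip step is easy to automate for bulk use. A naive regex-based sketch (it will mangle string literals that happen to contain comment markers, so treat it as a rough pass, not a production minifier):

```javascript
// Naive comment stripper for JS/TS-like code: removes /* ... */ blocks
// (including JSDoc) and full-line // comments, then collapses the blank
// lines left behind.
function stripComments(code) {
  return code
    .replace(/\/\*[\s\S]*?\*\//g, '') // block comments
    .replace(/^\s*\/\/.*$/gm, '')     // full-line // comments
    .replace(/\n{2,}/g, '\n')         // collapse resulting blank runs
    .trim();
}

const before = `/** Adds tax. */
function withTax(subtotal, rate) {
  // apply the rate
  return subtotal * (1 + rate);
}`;
const after = stripComments(before);
// after === 'function withTax(subtotal, rate) {\n  return subtotal * (1 + rate);\n}'
```

For anything beyond quick scripts, an AST-based pass (Babel, ts-morph) is safer because it understands strings and regex literals.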
Removing Import Boilerplate and Type Noise
In languages like TypeScript or Java, import statements and verbose type declarations can account for 20-40% of a file's tokens. If the LLM task does not specifically require understanding the full type system, strip the noise:
// BEFORE: Typical TypeScript file header (62 tokens of imports)
import { Request, Response, NextFunction } from 'express';
import { Injectable, HttpException, HttpStatus } from '@nestjs/common';
import { InjectRepository } from '@nestjs/typeorm';
import { Repository, FindOptionsWhere, In } from 'typeorm';
import { User } from '../entities/user.entity';
import { CreateUserDto, UpdateUserDto } from '../dto/user.dto';
import { PaginationDto } from '../common/dto/pagination.dto';
import { LoggerService } from '../common/services/logger.service';
// AFTER: Relevant context only (12 tokens)
// Uses: express, NestJS, TypeORM
// Entities: User (id, email, name, role)
// DTOs: CreateUserDto, UpdateUserDto
Using AI Context Optimizer Tools
Manually stripping code context is tedious and error-prone. This is exactly the kind of task that should be automated. Before sending code to an LLM API, run it through a context optimizer that removes comments, trims whitespace, strips unnecessary imports, and compresses the code while preserving its semantic meaning.
In my workflow, I pipe code through a context optimizer as a preprocessing step before every LLM API call that involves code analysis. The typical savings are 30-50% of input tokens for code-heavy prompts. Over a month of heavy usage, that adds up to hundreds or thousands of dollars saved.
Selective File Inclusion
When asking an LLM to analyze or modify code, resist the urge to send entire files. Use AST parsing or simple heuristics to extract only the relevant functions, classes, or blocks:
// Instead of sending an entire 500-line file, extract the relevant function
import * as parser from '@babel/parser';
import traverse from '@babel/traverse';
function extractFunction(source, functionName) {
const ast = parser.parse(source, {
sourceType: 'module',
plugins: ['typescript', 'jsx']
});
let result = null;
traverse(ast, {
FunctionDeclaration(path) {
if (path.node.id?.name === functionName) {
result = source.slice(path.node.start, path.node.end);
path.stop();
}
},
VariableDeclarator(path) {
if (path.node.id?.name === functionName &&
(path.node.init?.type === 'ArrowFunctionExpression' ||
path.node.init?.type === 'FunctionExpression')) {
const decl = path.parentPath;
result = source.slice(decl.node.start, decl.node.end);
path.stop();
}
}
});
return result;
}
Strategy 4: Caching and Memoization
Caching is the second-highest-impact optimization after prompt engineering. In many applications, a significant percentage of LLM requests are semantically identical or nearly identical. Why pay for the same answer twice?
Exact Match Caching
The simplest and most effective cache is an exact match on the prompt hash. If the same prompt has been seen before, return the cached response:
import { createHash } from 'crypto';
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL);
const CACHE_TTL = 3600; // 1 hour
async function cachedLLMCall(messages, model, options = {}) {
// Create a deterministic hash of the request
const cacheKey = createHash('sha256')
.update(JSON.stringify({ messages, model, ...options }))
.digest('hex');
// Check cache first
const cached = await redis.get(`llm:${cacheKey}`);
if (cached) {
const parsed = JSON.parse(cached);
parsed._cached = true; // Flag for monitoring
return parsed;
}
// Cache miss: call the API
const response = await openai.chat.completions.create({
model,
messages,
...options
});
// Cache the response
await redis.setex(
`llm:${cacheKey}`,
CACHE_TTL,
JSON.stringify(response)
);
return response;
}
In a customer support application I worked on, exact match caching alone had a 23% hit rate. That is nearly a quarter of all API calls eliminated. Common questions like "How do I reset my password?" or "What are your business hours?" hit the cache reliably.
Semantic Caching
Exact match caching misses near-duplicates. "How do I change my password?" and "I need to update my password" are semantically identical but produce different hashes. Semantic caching uses embedding similarity to catch these cases:
import { cosineSimilarity } from './utils';
const SIMILARITY_THRESHOLD = 0.95;
async function semanticCachedLLMCall(prompt, model, options = {}) {
// Generate embedding for the prompt
const embedding = await getEmbedding(prompt);
// Search for similar cached prompts using vector similarity
const similar = await vectorDB.search({
vector: embedding,
topK: 1,
threshold: SIMILARITY_THRESHOLD,
collection: 'llm_cache'
});
if (similar.length > 0) {
return {
...similar[0].metadata.response,
_cached: true,
_similarity: similar[0].score
};
}
// Cache miss: call API and store result with embedding
const response = await callLLM(prompt, model, options);
await vectorDB.upsert({
id: generateId(),
vector: embedding,
metadata: {
prompt,
response,
model,
createdAt: Date.now()
},
collection: 'llm_cache'
});
return response;
}
Semantic caching typically adds another 8-15% cache hit rate on top of exact match caching. The embedding call costs a fraction of the LLM call (around $0.02 per million tokens for embedding models like text-embedding-3-small), so the economics work out strongly in your favor.
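The cosineSimilarity import in the sketch above is small enough to show in full; understanding what the 0.95 threshold actually measures helps when tuning it:

```javascript
// Cosine similarity between two equal-length embedding vectors:
// dot(a, b) / (|a| * |b|). Ranges from -1 to 1; 1 means same direction.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0], [1, 0]); // → 1 (identical direction)
cosineSimilarity([1, 0], [0, 1]); // → 0 (orthogonal)
```

In practice the vector database computes this for you; the threshold is the knob you own. Set it too low and users get cached answers to different questions, so err high and lower it gradually while watching quality.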
Cache Hit Rate Benchmarks
Here are typical cache hit rates I have observed across different application types:
- Customer support bot: 30-45% hit rate (many repeated questions)
- Code review tool: 10-15% hit rate (code varies significantly)
- Content classification: 50-70% hit rate (limited input space)
- Document Q&A: 15-25% hit rate (depends on document diversity)
- Translation: 20-35% hit rate (common phrases repeat)
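Hit rate translates almost directly into savings, because a cache hit costs nearly nothing compared to the API call it replaces. A sketch, using the 23% exact-match rate from the support app above and an illustrative $10,000/month baseline:

```javascript
// Effective LLM spend after caching: misses pay full price, hits pay ~0
// (the Redis or vector lookup cost is negligible by comparison).
function effectiveMonthlyCost(baseCost, cacheHitRate) {
  return baseCost * (1 - cacheHitRate);
}

effectiveMonthlyCost(10_000, 0.23); // ≈ $7,700, i.e. roughly $2,300/month saved
```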
Strategy 5: Model Routing and Tiering
Not every request deserves your most expensive model. This is probably the most underutilized optimization I see in production systems. Teams default to GPT-4o or Claude Sonnet for everything, when 60-80% of their requests could be handled by a model that costs 10-20x less.
Task-Based Routing
The idea is simple: classify the complexity of each request and route it to the appropriate model. Here is how I implement this:
// model-router.ts
type ModelTier = 'fast' | 'standard' | 'premium';
interface RouteConfig {
model: string;
costPer1MInput: number;
costPer1MOutput: number;
}
const MODEL_TIERS: Record<ModelTier, RouteConfig> = {
fast: {
model: 'gpt-4o-mini',
costPer1MInput: 0.15,
costPer1MOutput: 0.60,
},
standard: {
model: 'gpt-4o',
costPer1MInput: 2.50,
costPer1MOutput: 10.00,
},
premium: {
model: 'claude-opus-4',
costPer1MInput: 15.00,
costPer1MOutput: 75.00,
},
};
function routeRequest(task: string, context: RequestContext): ModelTier {
// Simple classification tasks: use the cheapest model
if (['classify', 'sentiment', 'extract_entities', 'format'].includes(task)) {
return 'fast';
}
// Standard generation, summarization, Q&A
if (['summarize', 'qa', 'generate_email', 'translate'].includes(task)) {
return 'standard';
}
// Complex reasoning, code generation, multi-step analysis
if (['code_review', 'debug', 'architecture', 'legal_analysis'].includes(task)) {
return 'premium';
}
// Default to standard if unknown
return 'standard';
}
async function smartLLMCall(task: string, messages: Message[], context: RequestContext) {
const tier = routeRequest(task, context);
const config = MODEL_TIERS[tier];
const response = await callLLM({
model: config.model,
messages,
});
// Log for cost tracking
logUsage(task, tier, response.usage);
return response;
}
Model Cascading
An even smarter approach is model cascading: start with the cheapest model, and only escalate to a more expensive one if the output does not meet quality criteria. This works exceptionally well for tasks where you can programmatically verify the output:
async function cascadingLLMCall(messages, validators) {
const cascade = [
{ model: 'gpt-4o-mini', maxRetries: 1 },
{ model: 'gpt-4o', maxRetries: 1 },
{ model: 'claude-opus-4', maxRetries: 1 },
];
for (const { model, maxRetries } of cascade) {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
const response = await callLLM({ model, messages });
const output = response.choices[0].message.content;
// Validate the output
const isValid = validators.every(v => v(output));
if (isValid) {
console.log(`Resolved by ${model} on attempt ${attempt + 1}`);
return { output, model, attempts: attempt + 1 };
}
} catch (err) {
if (attempt === maxRetries) break;
}
}
}
throw new Error('All models failed to produce valid output');
}
// Example: JSON extraction with cascading
const result = await cascadingLLMCall(messages, [
(output) => { try { JSON.parse(output); return true; } catch { return false; } },
(output) => {
const parsed = JSON.parse(output);
return parsed.name && parsed.price !== undefined;
},
]);
In production, I have found that 70-80% of requests resolve at the cheapest tier, 15-20% at the middle tier, and only 3-5% need the premium model. The blended cost ends up being 75-85% lower than using the premium model for everything.
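Those resolution rates map straight to a blended price. A sketch using the output rates from the router config above and illustrative midpoint shares (75% / 20% / 5%); note it ignores the cost of failed cheap attempts before escalation, which is part of why the realized savings land in the 75-85% range rather than higher:

```javascript
// Blended output-token price across cascade tiers, weighted by the share of
// requests that resolve at each tier (shares here are illustrative).
function blendedPricePer1M(tiers) {
  return tiers.reduce((sum, t) => sum + t.share * t.pricePer1M, 0);
}

const blended = blendedPricePer1M([
  { share: 0.75, pricePer1M: 0.60 },  // gpt-4o-mini output
  { share: 0.20, pricePer1M: 10.00 }, // gpt-4o output
  { share: 0.05, pricePer1M: 75.00 }, // claude-opus-4 output
]);
// blended ≈ $6.20 per 1M output tokens, vs $75.00 premium-only
```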
Strategy 6: Batching and Rate Management
Both OpenAI and Anthropic offer batch processing APIs that provide significant discounts. OpenAI's Batch API gives you a 50% discount on all models in exchange for accepting up to 24-hour turnaround time. If your use case does not require real-time responses, this is free money on the table.
Identifying Batchable Workloads
Many LLM workloads are not actually real-time. Here are common batchable tasks I have encountered:
- Content moderation queues (review within hours, not seconds)
- Nightly report generation and summarization
- Bulk classification of incoming support tickets
- SEO content generation and optimization
- Data enrichment pipelines (extracting entities from records)
- Test data generation for QA environments
Implementing Batch Processing with OpenAI
import OpenAI from 'openai';
import { writeFileSync, createReadStream } from 'fs';
const openai = new OpenAI();
// Step 1: Prepare batch file (JSONL format)
function prepareBatchFile(requests) {
const lines = requests.map((req, index) => JSON.stringify({
// Preserve a caller-supplied id (used by the batch queue below) so results
// can be matched back to their callbacks; fall back to a positional id
custom_id: req.id ?? `request-${index}`,
method: "POST",
url: "/v1/chat/completions",
body: {
model: "gpt-4o",
messages: req.messages,
max_tokens: req.maxTokens || 500,
}
}));
const filePath = '/tmp/batch_input.jsonl';
writeFileSync(filePath, lines.join('\n'));
return filePath;
}
// Step 2: Upload and create batch
async function submitBatch(filePath) {
const file = await openai.files.create({
file: createReadStream(filePath),
purpose: 'batch',
});
const batch = await openai.batches.create({
input_file_id: file.id,
endpoint: '/v1/chat/completions',
completion_window: '24h',
});
return batch.id;
}
// Step 3: Poll for completion
async function waitForBatch(batchId) {
while (true) {
const batch = await openai.batches.retrieve(batchId);
if (batch.status === 'completed') {
const results = await openai.files.content(batch.output_file_id);
const text = await results.text();
return text.split('\n').filter(Boolean).map(JSON.parse);
}
if (batch.status === 'failed' || batch.status === 'expired') {
throw new Error(`Batch ${batchId} ${batch.status}`);
}
// Wait 60 seconds before checking again
await new Promise(r => setTimeout(r, 60000));
}
}
Async Processing Architecture
For applications that need to batch requests as they come in, I use a queue-based architecture. Requests accumulate in a queue, and a worker processes them in batches on a schedule:
// Accumulate requests and flush every 5 minutes or when batch reaches 100
class LLMBatchQueue {
constructor(options = {}) {
this.queue = [];
this.maxBatchSize = options.maxBatchSize || 100;
this.flushInterval = options.flushInterval || 300000; // 5 min
this.callbacks = new Map();
setInterval(() => this.flush(), this.flushInterval);
}
async enqueue(messages, options = {}) {
return new Promise((resolve, reject) => {
const id = crypto.randomUUID();
this.callbacks.set(id, { resolve, reject });
this.queue.push({ id, messages, options });
if (this.queue.length >= this.maxBatchSize) {
this.flush();
}
});
}
async flush() {
if (this.queue.length === 0) return;
const batch = this.queue.splice(0, this.maxBatchSize);
const batchId = await submitBatch(prepareBatchFile(batch));
// Process results when ready
const results = await waitForBatch(batchId);
for (const result of results) {
const callback = this.callbacks.get(result.custom_id);
if (callback) {
callback.resolve(result.response.body);
this.callbacks.delete(result.custom_id);
}
}
}
}
The 50% discount from batch processing is substantial. On a $10,000/month LLM bill, shifting even 40% of workloads to batch processing saves $2,000/month immediately.
Strategy 7: Output Token Control
Remember: output tokens are 3-5x more expensive than input tokens. Controlling output length is one of the most direct ways to reduce costs.
Setting max_tokens Aggressively
Always set max_tokens to the smallest value that accommodates your expected output. Most developers either leave it at the default (which can be 4,096 or more) or set it way too high "just in case." Be precise:
// Task: classify sentiment (output: 1 word)
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: 'Classify sentiment: "Great product!"' }],
max_tokens: 5, // "positive" is 1-2 tokens
});
// Task: extract JSON with 3 fields
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages,
max_tokens: 100, // Enough for a small JSON object
});
// Task: generate a code review (moderate output)
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages,
max_tokens: 800, // Prevents runaway verbose reviews
});
Structured Output Enforcement
Both OpenAI and Anthropic now support structured output through JSON mode or function calling. Using structured output not only ensures parseable responses but also prevents the model from generating unnecessary prose:
// OpenAI structured output with response_format
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{
role: 'user',
content: `Analyze this code for issues: ${code}`
}],
response_format: {
type: 'json_schema',
json_schema: {
name: 'code_review',
schema: {
type: 'object',
properties: {
issues: {
type: 'array',
items: {
type: 'object',
properties: {
severity: { type: 'string', enum: ['critical', 'warning', 'info'] },
line: { type: 'number' },
message: { type: 'string' },
fix: { type: 'string' }
},
required: ['severity', 'line', 'message']
}
},
overallScore: { type: 'number' },
summary: { type: 'string' }
},
required: ['issues', 'overallScore', 'summary']
}
}
},
max_tokens: 1000,
});
Structured output typically reduces output tokens by 30-50% compared to free-form text responses because the model does not generate preamble, explanations, or formatting that you did not ask for.
Stop Sequences
Stop sequences tell the model to stop generating when it encounters a specific string. This is useful for preventing the model from appending unwanted content:
// Prevent the model from adding explanations after JSON
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Return only JSON: extract name and email from...' }],
stop: ['\n\n', 'Note:', 'Explanation:'],
max_tokens: 200,
});
Measuring and Monitoring Costs
You cannot optimize what you do not measure. Every production LLM integration needs cost monitoring from day one. Here is the monitoring setup I deploy on every project.
Per-Request Cost Tracking
// llm-cost-tracker.ts
interface UsageRecord {
timestamp: Date;
model: string;
feature: string;
inputTokens: number;
outputTokens: number;
costUSD: number;
latencyMs: number;
cached: boolean;
}
const PRICING: Record<string, { input: number; output: number }> = {
'gpt-4o': { input: 2.50, output: 10.00 },
'gpt-4o-mini': { input: 0.15, output: 0.60 },
'claude-sonnet': { input: 3.00, output: 15.00 },
'claude-haiku': { input: 0.80, output: 4.00 },
};
function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
const pricing = PRICING[model];
if (!pricing) throw new Error(`Unknown model: ${model}`);
return (inputTokens * pricing.input + outputTokens * pricing.output) / 1_000_000;
}
async function trackLLMCall(feature: string, llmCallFn: () => Promise<any>) {
const start = Date.now();
const response = await llmCallFn();
const latencyMs = Date.now() - start;
const record: UsageRecord = {
timestamp: new Date(),
model: response.model,
feature,
inputTokens: response.usage.prompt_tokens,
outputTokens: response.usage.completion_tokens,
costUSD: calculateCost(
response.model,
response.usage.prompt_tokens,
response.usage.completion_tokens
),
latencyMs,
cached: response._cached || false,
};
// Store for dashboarding
await metricsDB.insert('llm_usage', record);
// Alert if single request cost exceeds threshold
if (record.costUSD > 0.50) {
await alerting.warn(`High-cost LLM call: $${record.costUSD.toFixed(4)} for ${feature}`);
}
return response;
}
Building a Cost Dashboard
I aggregate usage records into a simple dashboard that shows cost broken down by feature, model, and time period. The key metrics I track are:
- Daily total cost and trend (is it growing?)
- Cost per feature (which product features are most expensive?)
- Cost per user action (what does a single chat message cost?)
- Cache hit rate (is caching working?)
- Model tier distribution (are we routing correctly?)
- Average tokens per request (are prompts getting bloated?)
- P95 cost per request (are there outlier requests burning money?)
-- Example: SQL query for daily cost breakdown by feature
SELECT
DATE(timestamp) as day,
feature,
model,
COUNT(*) as request_count,
SUM(input_tokens) as total_input_tokens,
SUM(output_tokens) as total_output_tokens,
SUM(cost_usd) as total_cost,
AVG(cost_usd) as avg_cost_per_request,
SUM(CASE WHEN cached THEN 1 ELSE 0 END)::float / COUNT(*) as cache_hit_rate
FROM llm_usage
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY DATE(timestamp), feature, model
ORDER BY day DESC, total_cost DESC;
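The P95 metric from the list above needs a percentile computation that plain `SUM`/`AVG` aggregates do not give you. A minimal nearest-rank sketch, assuming the per-request costs are already loaded into memory:

```typescript
// Nearest-rank percentile: the value at or below which p% of samples fall.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('percentile of empty array');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// One expensive outlier dominates the tail of this cost sample.
const costs = [0.001, 0.002, 0.002, 0.003, 0.003, 0.004, 0.005, 0.09];
const p95 = percentile(costs, 95);
```

If you are on Postgres you can skip the application-side helper and compute `percentile_cont(0.95) WITHIN GROUP (ORDER BY cost_usd)` directly in the dashboard query.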
Setting Cost Budgets and Alerts
Implement hard and soft limits. A soft limit triggers an alert. A hard limit stops non-critical LLM features to prevent runaway costs:
class CostBudgetManager {
  constructor(
    private dailyBudget: number,
    private monthlyBudget: number
  ) {}

  async checkBudget(feature: string): Promise<boolean> {
    // getDailySpend / getMonthlySpend aggregate cost_usd from the
    // llm_usage table (implementations omitted here)
    const dailySpend = await this.getDailySpend();
    const monthlySpend = await this.getMonthlySpend();
    if (monthlySpend >= this.monthlyBudget) {
      // Hard stop: only critical features keep running past the monthly budget
      if (!CRITICAL_FEATURES.includes(feature)) {
        throw new BudgetExceededError('Monthly LLM budget exceeded');
      }
    }
    if (dailySpend >= this.dailyBudget * 0.8) {
      // Soft alert at 80% of daily budget
      await alerting.warn(`LLM daily spend at ${((dailySpend / this.dailyBudget) * 100).toFixed(0)}%`);
    }
    return true;
  }
}
Real-World Cost Optimization Case Study
Let me walk you through a real optimization project I completed last year. The application was an internal developer productivity tool that used LLMs for code review, documentation generation, and bug triage. Here are the before-and-after numbers.
Before: The Unoptimized State
| Metric | Value |
|---|---|
| Monthly LLM spend | $18,400 |
| Model used | GPT-4o for everything |
| Avg input tokens per request | 6,200 |
| Avg output tokens per request | 1,800 |
| Monthly requests | 420,000 |
| Cache hit rate | 0% (no caching) |
| System prompt size | 890 tokens |
Optimization Steps Applied
Step 1: Prompt compression. Rewrote all system prompts using compression techniques. The main code review prompt went from 890 tokens to 340 tokens. Estimated savings: 15% of input token costs.
Step 2: Context stripping. Implemented an automated pipeline that stripped comments, collapsed whitespace, removed import boilerplate, and extracted only relevant functions before sending code to the LLM. Average input tokens dropped from 6,200 to 3,100. Savings: another 25% of input costs.
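A minimal version of that stripping pass can be sketched as a pure function. The regexes below are naive and will mangle string literals that contain comment markers, so treat this as an illustration of the idea rather than the production pipeline:

```typescript
// Naive code-context stripper: removes comments, trailing whitespace,
// and runs of blank lines before code is sent to the LLM.
// Regex-based, so it ignores edge cases like "//" inside strings.
function stripCodeContext(source: string): string {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, '') // block comments, including JSDoc
    .replace(/^\s*\/\/.*$/gm, '')     // full-line // comments
    .replace(/[ \t]+$/gm, '')         // trailing whitespace
    .replace(/\n{3,}/g, '\n\n')       // collapse blank-line runs
    .trim();
}

const before = `
/** Adds two numbers. */
// TODO: overflow check
function add(a: number, b: number): number {
  return a + b;
}
`;
const after = stripCodeContext(before);
// "after" keeps only the function itself; the comments are gone.
```

A real implementation would parse the code (e.g. with a language-aware tokenizer) so string contents survive, and would also do the function-extraction step described above.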
Step 3: Model routing. Classified requests into three tiers. Bug triage (simple classification) went to GPT-4o mini, and documentation generation moved down to GPT-4o mini as well. Only complex code review stayed on GPT-4o (some of it later moved to Claude 3.5 Sonnet for better results). Roughly 55% of requests moved to cheaper models.
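The routing itself can be as simple as a task-type lookup plus an escalation flag. The tier assignments below are illustrative, not the exact production table:

```typescript
type TaskType = 'bug-triage' | 'doc-generation' | 'code-review';

// Illustrative tier table: cheap tier for classification and docs,
// premium tier reserved for code review.
const ROUTES: Record<TaskType, string> = {
  'bug-triage': 'gpt-4o-mini',
  'doc-generation': 'gpt-4o-mini',
  'code-review': 'gpt-4o',
};

// Escalate reviews flagged as complex to a stronger model.
function routeModel(task: TaskType, opts: { complex?: boolean } = {}): string {
  if (task === 'code-review' && opts.complex) return 'claude-3-5-sonnet';
  return ROUTES[task];
}
```

The hard part is the classifier that sets `complex`; simple heuristics (diff size, number of files touched, presence of security-sensitive paths) got us most of the way before we needed anything smarter.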
Step 4: Exact match and semantic caching. Deployed Redis-based exact match cache and Pinecone-based semantic cache. Combined cache hit rate: 28%. That is 28% of all API calls eliminated entirely.
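The exact-match half of that setup is only a few lines: hash the full request identity and memoize the response. A minimal in-memory sketch follows; the project used Redis for the store, and the Pinecone-backed semantic layer is a separate, larger component:

```typescript
import { createHash } from 'node:crypto';

type LLMCall = (prompt: string) => Promise<string>;

// Exact-match cache: identical (model, prompt) pairs skip the API entirely.
class ExactMatchCache {
  private store = new Map<string, string>();
  hits = 0;
  misses = 0;

  private key(model: string, prompt: string): string {
    // NUL separator prevents ("a", "bc") colliding with ("ab", "c")
    return createHash('sha256').update(`${model}\u0000${prompt}`).digest('hex');
  }

  async get(model: string, prompt: string, call: LLMCall): Promise<string> {
    const k = this.key(model, prompt);
    const cached = this.store.get(k);
    if (cached !== undefined) {
      this.hits++;
      return cached;
    }
    this.misses++;
    const result = await call(prompt);
    this.store.set(k, result);
    return result;
  }
}
```

In production you also want a TTL (prompts referencing "current" code go stale) and you must include every generation parameter that affects output, not just the prompt, in the hashed key.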
Step 5: Output control. Added strict max_tokens limits per task type. Switched bug triage to structured JSON output. Average output tokens dropped from 1,800 to 650.
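Those per-task limits work best centralized in one small table so every call site picks them up consistently. The token budgets below are illustrative, not the exact values from the project:

```typescript
// Per-task output budgets: tight caps for classification, looser for prose.
const OUTPUT_LIMITS: Record<string, { maxTokens: number; json: boolean }> = {
  'bug-triage': { maxTokens: 300, json: true },       // structured verdict only
  'doc-generation': { maxTokens: 1200, json: false }, // prose needs headroom
  'code-review': { maxTokens: 800, json: true },
};

// Build request options for a task; unknown tasks get a conservative cap.
function outputOptionsFor(task: string): {
  max_tokens: number;
  response_format?: { type: 'json_object' };
} {
  const limit = OUTPUT_LIMITS[task] ?? { maxTokens: 400, json: false };
  return limit.json
    ? { max_tokens: limit.maxTokens, response_format: { type: 'json_object' } }
    : { max_tokens: limit.maxTokens };
}
```

Spreading `...outputOptionsFor(task)` into each `chat.completions.create` call means a single edit to the table retunes the whole system.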
Step 6: Batch processing. Moved nightly documentation generation runs (about 15% of total requests) to OpenAI's Batch API for the 50% discount.
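OpenAI's Batch API takes a JSONL file where each line is one self-contained request. A sketch of the file builder — the `custom_id` naming scheme here is an assumption, but the line shape follows the Batch API's documented format:

```typescript
interface BatchMessage {
  role: 'system' | 'user';
  content: string;
}

// One JSONL line per request, in the shape the OpenAI Batch API expects.
function toBatchLine(
  customId: string,
  model: string,
  messages: BatchMessage[],
  maxTokens: number
): string {
  return JSON.stringify({
    custom_id: customId,           // your key for matching results back
    method: 'POST',
    url: '/v1/chat/completions',
    body: { model, messages, max_tokens: maxTokens },
  });
}

// Nightly documentation run: one line per file to document.
const jsonl = ['src/cache.ts', 'src/router.ts']
  .map((file, i) =>
    toBatchLine(`doc-${i}`, 'gpt-4o-mini', [{ role: 'user', content: `Document ${file}` }], 1200)
  )
  .join('\n');
```

The finished file gets uploaded with `files.create` (purpose `batch`) and submitted via `batches.create` with a 24-hour completion window; anything that tolerates overnight latency is a candidate for the discount.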
After: The Optimized State
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly LLM spend | $18,400 | $3,200 | -82.6% |
| Avg input tokens/request | 6,200 | 2,800 | -54.8% |
| Avg output tokens/request | 1,800 | 650 | -63.9% |
| Effective requests (non-cached) | 420,000 | 302,400 | -28.0% |
| Blended cost per 1M input tokens | $2.50 | $0.68 | -72.8% |
| Output quality (human eval score) | 4.1/5 | 4.0/5 | -2.4% |
The output quality barely moved. That 0.1-point drop on a 5-point scale was within the margin of error of our evaluation. We achieved an 82.6% cost reduction while maintaining effectively identical quality. The total optimization effort took about three weeks of focused engineering work.
Conclusion
LLM API costs do not have to be the budget black hole that most teams experience. The seven strategies covered in this guide — prompt engineering, context window management, code context optimization, caching, model routing, batching, and output control — form a comprehensive optimization toolkit. You do not need to implement all of them at once. Start with the highest-impact, lowest-effort wins:
- Set max_tokens on every API call. This takes five minutes and immediately prevents runaway output costs.
- Compress your system prompts. An afternoon of work that typically saves 15-30% on input tokens.
- Add exact match caching. A day of work that eliminates 15-30% of all API calls.
- Implement model routing. Two to three days of work that can cut your blended model cost by 50-70%.
- Strip code context before sending to LLMs. Use a tool to automate this and save 30-50% on code-heavy prompts.
The remaining strategies (semantic caching, batch processing, advanced context management) are worth pursuing once you have the fundamentals in place and need to squeeze out the next 20-30% of savings.
One final piece of advice: always measure before and after. Set up cost tracking from the beginning of your LLM integration, not after you realize the bill is too high. The monitoring infrastructure pays for itself many times over by catching cost regressions early and giving you the data you need to make informed optimization decisions.
The LLM cost optimization game is not about being cheap. It is about being efficient. Every dollar you save on API costs is a dollar you can reinvest in building better features, running more experiments, or scaling to more users. The teams that master this will be the ones who can afford to ship AI features that their competitors cannot.
Start optimizing your LLM token usage today
Use our free AI Context Optimizer to strip comments, remove boilerplate, and compress code before sending it to any LLM API. Everything runs locally in your browser.
Open AI Context Optimizer