How to Reduce LLM API Costs: Token Optimization Strategies

A practical, battle-tested guide to cutting your LLM API bills by 60-90% without sacrificing output quality. Seven strategies I have refined over years of building production AI features.

18 min read

Table of Contents

  1. Introduction: Why LLM API Costs Spiral Out of Control
  2. Understanding Token Economics
  3. Strategy 1: Prompt Engineering for Fewer Tokens
  4. Strategy 2: Context Window Management
  5. Strategy 3: Code Context Optimization
  6. Strategy 4: Caching and Memoization
  7. Strategy 5: Model Routing and Tiering
  8. Strategy 6: Batching and Rate Management
  9. Strategy 7: Output Token Control
  10. Measuring and Monitoring Costs
  11. Real-World Cost Optimization Case Study
  12. Conclusion

Introduction: Why LLM API Costs Spiral Out of Control

The first time I integrated an LLM into a production application, the bill at the end of month one was $2,400. I had budgeted $300. That was three years ago, and the lesson cost me more than money. It cost me sleep, a very awkward conversation with my manager, and a weekend spent rewriting half the integration layer. Since then, I have built LLM-powered features across five different products, and cost optimization has become one of my core competencies. This guide distills everything I have learned.

Here is the uncomfortable truth: LLM APIs are priced per token, and tokens accumulate far faster than most developers expect. Let us look at the current pricing landscape for the most popular models as of early 2026:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o mini | $0.15 | $0.60 | 128K |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K |
| Claude Opus 4 | $15.00 | $75.00 | 200K |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M |
| DeepSeek V3 | $0.27 | $1.10 | 128K |

Notice something critical: output tokens cost 3-5x more than input tokens across nearly every provider. This is not a minor detail. It is the single most important fact that should shape your entire optimization strategy. A chatbot generating verbose 800-token responses when 200 tokens would suffice is literally burning money at a 4x multiplier.

In a typical SaaS application with 10,000 daily active users, each making an average of 5 LLM requests per session with a 2,000-token prompt and 500-token response, your daily token consumption looks like this: 10,000 users times 5 requests equals 50,000 requests per day. With a 2,000-token prompt and 500-token response each, that is 100 million input tokens and 25 million output tokens daily. At GPT-4o rates ($2.50 per million input, $10.00 per million output), that is $250 per day for input and $250 per day for output — roughly $15,000 per month. Scale that to 50,000 users, and you are looking at $75,000 monthly.
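
If you want to run this arithmetic for your own traffic, a back-of-envelope calculator takes a few lines. This is a minimal sketch using the hypothetical numbers from the example above; swap in your own request volume, token averages, and pricing:

// Back-of-envelope LLM cost estimate (numbers from the example above)
const requestsPerDay = 10_000 * 5;        // users x requests per user per day
const inputTokensPerRequest = 2_000;
const outputTokensPerRequest = 500;

// GPT-4o list prices, USD per 1M tokens
const inputPricePer1M = 2.50;
const outputPricePer1M = 10.00;

const dailyInputCost = (requestsPerDay * inputTokensPerRequest / 1_000_000) * inputPricePer1M;    // $250
const dailyOutputCost = (requestsPerDay * outputTokensPerRequest / 1_000_000) * outputPricePer1M; // $250
const monthlyCost = (dailyInputCost + dailyOutputCost) * 30;                                      // ~$15,000

console.log({ dailyInputCost, dailyOutputCost, monthlyCost });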

The good news: every strategy in this guide is something I have personally implemented. Individually, each one shaves 10-30% off the bill. Combined, I have consistently achieved 60-90% cost reductions on production workloads. Let us get into it.

Understanding Token Economics

Before optimizing anything, you need a solid mental model of what you are actually paying for. Tokens are not characters, and they are not words. They are subword units produced by a tokenizer, typically based on Byte Pair Encoding (BPE). For English text, a rough rule of thumb is that 1 token is approximately 4 characters or 0.75 words. But this varies significantly based on content type.

How Tokenization Actually Works

Each model family uses its own tokenizer. OpenAI's GPT-4o models use the o200k_base encoding (older GPT-4 and GPT-3.5 Turbo models use cl100k_base), while Anthropic uses its own tokenizer for Claude models. The tokenizer breaks text into chunks from a learned vocabulary. Common words like "the" or "and" are single tokens, while rare words get split into multiple tokens. Code is particularly token-hungry because variable names, syntax characters, and indentation all consume tokens.

Here are some concrete examples to build your intuition:

// English prose: efficient tokenization
"The quick brown fox jumps over the lazy dog"
// ~10 tokens (close to word count)

// Code: less efficient tokenization
"const handleUserAuthentication = async (req, res) => {"
// ~15 tokens (variable names get split)

// JSON: surprisingly expensive
{"user_id": 12345, "email": "john@example.com", "preferences": {"theme": "dark"}}
// ~28 tokens (all those quotes, colons, and braces)

// Minified code: even worse per semantic unit
"a.b(c,d,{e:f,g:h})"
// ~14 tokens but carries less meaning per token

Key insight: Not all tokens carry equal semantic value. A JSON key-value pair like "user_email": "john@example.com" uses about 10 tokens but carries the same meaning as the natural language phrase "user email is john@example.com", which uses about 8 tokens. When you are stuffing structured data into prompts, the format you choose directly impacts cost.
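
For budgeting purposes you rarely need exact counts. Here is a heuristic estimator based on the 4-characters-per-token rule of thumb above; it is a sketch, not a real tokenizer, and the sliding-window example later in this guide assumes a similar helper that works on whole messages:

// Rough token estimate using the ~4 characters per token rule of thumb.
// Use an actual tokenizer (e.g. a tiktoken port) when exact counts matter.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Chat messages carry a small amount of role/formatting overhead per message.
function estimateMessageTokens(message: { role: string; content: string }): number {
  return estimateTokens(message.content) + 4; // the 4-token overhead is an approximation
}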

Input vs. Output Token Pricing

The asymmetric pricing between input and output tokens reflects the computational reality. Generating output tokens requires sequential inference through the model, while input tokens can be processed in parallel during the prefill phase. This is why output tokens cost 3-5x more. The practical implication is clear: invest more effort in reducing output tokens than input tokens. A 100-token reduction in output saves you 3-5x more than a 100-token reduction in input.

The Hidden Costs: System Prompts and Conversation History

Many developers forget that system prompts are sent with every single API call. A 500-token system prompt that seems trivial in isolation becomes significant at scale. If your application makes 1 million API calls per month, that system prompt alone costs you 500 million input tokens, which is $1,250 per month on GPT-4o. Even worse, in multi-turn conversations, the entire conversation history is re-sent with each new message. A 10-turn conversation means the first message gets billed 10 times.
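
The history arithmetic is worth making concrete, because total billed input grows roughly quadratically with conversation length. A small illustrative calculation, with hypothetical per-message sizes:

// Total input tokens billed across an N-turn conversation when the full history
// is re-sent each turn. Assumes every user and assistant message is about the same size.
function historyTokensBilled(turns: number, tokensPerMessage: number, systemPromptTokens: number): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    // The request for turn N contains the system prompt plus the 2N-1 messages so far
    total += systemPromptTokens + (2 * turn - 1) * tokensPerMessage;
  }
  return total;
}

console.log(historyTokensBilled(10, 300, 500)); // 35,000 input tokens for a single 10-turn chat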

Strategy 1: Prompt Engineering for Fewer Tokens

This is the single highest-impact optimization you can make, and it requires zero infrastructure changes. I have seen prompt engineering alone cut token usage by 40-60% on real projects.

System Prompt Compression

Most system prompts are written in verbose, natural language because that is how we think. But LLMs are remarkably good at interpreting compressed instructions. Here is a real-world example from a code review bot I built:

// BEFORE: 187 tokens
`You are a senior code reviewer. When reviewing code, please follow
these guidelines carefully:
1. Look for potential security vulnerabilities including SQL injection,
   XSS attacks, and authentication bypasses
2. Check for performance issues such as N+1 queries, unnecessary
   re-renders, and memory leaks
3. Evaluate code readability and suggest improvements to naming
   conventions and code structure
4. Identify any violations of SOLID principles
5. Always provide specific, actionable feedback with code examples
Please format your response as a structured review with sections for
each category of feedback.`

// AFTER: 72 tokens
`Senior code reviewer. Check for:
- Security: SQLi, XSS, auth bypass
- Perf: N+1, re-renders, mem leaks
- Readability: naming, structure
- SOLID violations
Give specific fixes w/ code examples.
Format: structured sections per category.`

The compressed version produces nearly identical output quality. I have A/B tested this across hundreds of reviews. The key techniques are: remove filler words ("please", "carefully", "always"), use abbreviations the model understands, use lists instead of sentences, and drop redundant context (the model already knows what code review means).

Few-Shot vs. Zero-Shot Tradeoffs

Few-shot prompting (providing examples in the prompt) dramatically improves output quality for structured tasks, but each example costs tokens. The tradeoff calculation is straightforward:

// Zero-shot: cheaper per call, less consistent output
// Cost: ~50 input tokens for instruction only
"Extract the product name and price from this text. Return JSON."

// One-shot: moderate cost, much better consistency
// Cost: ~120 input tokens
"Extract the product name and price from this text. Return JSON.

Example input: 'The new MacBook Pro starts at $1,999'
Example output: {\"product\": \"MacBook Pro\", \"price\": 1999}

Now process:"

// Few-shot (3 examples): highest cost, best consistency
// Cost: ~250 input tokens
// ... three examples covering edge cases ...

My rule of thumb: use zero-shot when the task is simple and well-defined (classification, sentiment). Use one-shot when you need specific output formatting. Use few-shot only when dealing with ambiguous or domain-specific tasks where the model needs calibration. For production systems, I run a quick experiment with 100 test cases to find the minimum number of examples that achieve my quality threshold, then stick with that.
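
The experiment does not need to be elaborate. Here is a sketch of the sweep, assuming you already have a labeled test set (testCases), a prompt builder (buildPrompt), a quality scorer (scoreOutput), and a wrapped API call (callLLM); all four are hypothetical helpers:

// Sweep few-shot example counts and compare quality against token cost.
async function sweepExampleCounts(exampleCounts = [0, 1, 3, 5]) {
  const results = [];
  for (const n of exampleCounts) {
    let totalScore = 0;
    let totalInputTokens = 0;
    for (const testCase of testCases) {
      const messages = buildPrompt(testCase.input, n); // prompt containing n examples
      const response = await callLLM({ model: 'gpt-4o-mini', messages });
      totalScore += scoreOutput(response, testCase.expected);
      totalInputTokens += response.usage.prompt_tokens;
    }
    results.push({
      examples: n,
      avgScore: totalScore / testCases.length,
      avgInputTokens: totalInputTokens / testCases.length,
    });
  }
  return results; // pick the smallest n that clears your quality threshold
}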

Instruction Tuning and Prompt Templates

Build a library of optimized prompt templates for your common operations. I maintain a prompt registry in our codebase that looks like this:

// prompt-registry.ts
export const PROMPTS = {
  CLASSIFY_INTENT: {
    system: `Classify user intent. Categories: question|complaint|request|feedback. Return JSON: {"intent":"...","confidence":0-1}`,
    tokens: 28,  // tracked for cost estimation
    version: "2.3",
  },
  SUMMARIZE_TICKET: {
    system: `Summarize support ticket in 2 sentences. Focus: issue, impact, urgency.`,
    tokens: 16,
    version: "1.7",
  },
  EXTRACT_ENTITIES: {
    system: `Extract named entities. Return JSON array: [{"text":"...","type":"person|org|product|location"}]`,
    tokens: 22,
    version: "3.1",
  },
} as const;

Version tracking is important because when you update a prompt, you want to A/B test it against the previous version for both quality and token usage. A prompt that uses 20% more tokens but produces 50% fewer errors might still be the right choice.

Strategy 2: Context Window Management

The biggest waste of tokens I see in production applications is sending irrelevant context. This is especially egregious in RAG (Retrieval Augmented Generation) systems and multi-turn conversations.

Only Send What Is Needed

I once audited an internal tool that was passing the entire user profile (including address history, payment methods, and login logs) into every LLM call, even though 90% of requests only needed the user's name and role. The fix was trivial and cut input tokens by 65%:

// BEFORE: sending everything
const context = await db.users.findById(userId);
const response = await llm.chat({
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: `User context: ${JSON.stringify(context)}\n\nQuestion: ${question}` }
  ]
});

// AFTER: send only relevant fields
const context = await db.users.findById(userId, {
  select: ['name', 'role', 'department', 'recentTickets']
});
const slimContext = {
  name: context.name,
  role: context.role,
  dept: context.department,
  // Only include recent tickets if the question seems support-related
  ...(isSupportQuery(question) && { tickets: context.recentTickets.slice(0, 3) })
};
const response = await llm.chat({
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: `Ctx: ${JSON.stringify(slimContext)}\nQ: ${question}` }
  ]
});

Sliding Window for Conversations

For multi-turn conversations, the naive approach of sending the entire history is a cost disaster. Each new message re-sends every previous message. A 20-turn conversation means the first user message gets billed 20 times. Here is the sliding window approach I use:

function buildConversationContext(history, systemPrompt, maxTokens = 4000) {
  // Always include: system prompt + current message
  // Then fill remaining budget with recent history, newest first
  const systemTokens = estimateTokens(systemPrompt);
  const currentTokens = estimateTokens(history[history.length - 1]);
  let budget = maxTokens - systemTokens - currentTokens;

  const includedMessages = [];

  // Walk backwards through history (skip current message)
  for (let i = history.length - 2; i >= 0; i--) {
    const msgTokens = estimateTokens(history[i]);
    if (msgTokens > budget) break;
    budget -= msgTokens;
    includedMessages.unshift(history[i]);
  }

  // If conversation is long, add a summary of dropped messages
  if (includedMessages.length < history.length - 1) {
    const droppedCount = history.length - 1 - includedMessages.length;
    includedMessages.unshift({
      role: "system",
      content: `[${droppedCount} earlier messages omitted. Key topics discussed: ${extractTopics(history.slice(0, droppedCount))}]`
    });
  }

  return [
    { role: "system", content: systemPrompt },
    ...includedMessages,
    history[history.length - 1]
  ];
}

RAG vs. Context Stuffing

Context stuffing (dumping entire documents into the prompt) is the lazy approach. RAG (retrieving only relevant chunks) is the cost-effective approach. But even RAG can be wasteful if not tuned properly. Here are the numbers from one of my projects:

| Approach | Avg Input Tokens | Answer Quality | Monthly Cost (100K queries) |
|---|---|---|---|
| Full document stuffing | 12,000 | Good (but noisy) | $3,000 |
| RAG (top-10 chunks) | 4,500 | Good | $1,125 |
| RAG (top-5 chunks, reranked) | 2,200 | Better (less noise) | $550 |
| RAG (top-3 chunks, reranked + compressed) | 1,400 | Best | $350 |

The reranking step (using a cross-encoder model like Cohere Rerank or a smaller local model) costs pennies but lets you retrieve fewer, more relevant chunks. Fewer chunks means fewer input tokens, which means lower cost and often better quality because the model has less noise to filter through.
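
Here is a sketch of the retrieve-then-rerank step. The vectorStore, reranker, and getEmbedding clients below are assumed interfaces rather than a specific SDK; the shape is what matters: over-retrieve cheaply, rerank for precision, send only the top few chunks:

// Retrieve broadly, rerank with a cross-encoder, keep only the best chunks for the prompt.
async function retrieveContext(query: string, finalK = 3): Promise<string> {
  // 1. Recall step: pull more candidates than you intend to use
  const candidates = await vectorStore.search({
    vector: await getEmbedding(query),
    topK: 20,
  });

  // 2. Precision step: rerank candidates against the query with a cross-encoder
  const reranked = await reranker.rerank({
    query,
    documents: candidates.map(c => c.text),
  });

  // 3. Keep only the top finalK chunks
  return reranked
    .slice(0, finalK)
    .map(r => candidates[r.index].text)
    .join('\n---\n');
}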

Strategy 3: Code Context Optimization

If you are building developer tools, AI-assisted coding features, or any application that sends code to an LLM, this section deserves close attention. Code is one of the most token-intensive content types, and a large share of those tokens (comments, boilerplate imports, repeated whitespace) carry little or no semantic value for the LLM.

Stripping Comments and Docstrings

Comments are written for human developers. The LLM does not need them to understand the code. In my testing, removing comments before sending code to an LLM has zero measurable impact on comprehension quality while saving 10-30% of tokens depending on the codebase:

// BEFORE: 89 tokens
/**
 * Calculates the total price including tax and discounts.
 * @param items - Array of cart items with price and quantity
 * @param taxRate - Tax rate as a decimal (e.g., 0.08 for 8%)
 * @param discount - Optional discount amount in dollars
 * @returns The final total as a number
 */
function calculateTotal(items, taxRate, discount = 0) {
  // Sum up all item prices multiplied by their quantities
  const subtotal = items.reduce((sum, item) => sum + item.price * item.qty, 0);
  // Apply tax to the subtotal
  const withTax = subtotal * (1 + taxRate);
  // Subtract any applicable discount
  return Math.max(0, withTax - discount);
}

// AFTER: 47 tokens (47% reduction)
function calculateTotal(items, taxRate, discount = 0) {
  const subtotal = items.reduce((sum, item) => sum + item.price * item.qty, 0);
  const withTax = subtotal * (1 + taxRate);
  return Math.max(0, withTax - discount);
}

Removing Import Boilerplate and Type Noise

In languages like TypeScript or Java, import statements and verbose type declarations can account for 20-40% of a file's tokens. If the LLM task does not specifically require understanding the full type system, strip the noise:

// BEFORE: Typical TypeScript file header (62 tokens of imports)
import { Request, Response, NextFunction } from 'express';
import { Injectable, HttpException, HttpStatus } from '@nestjs/common';
import { InjectRepository } from '@nestjs/typeorm';
import { Repository, FindOptionsWhere, In } from 'typeorm';
import { User } from '../entities/user.entity';
import { CreateUserDto, UpdateUserDto } from '../dto/user.dto';
import { PaginationDto } from '../common/dto/pagination.dto';
import { LoggerService } from '../common/services/logger.service';

// AFTER: Relevant context only (12 tokens)
// Uses: express, NestJS, TypeORM
// Entities: User (id, email, name, role)
// DTOs: CreateUserDto, UpdateUserDto

Using AI Context Optimizer Tools

Manually stripping code context is tedious and error-prone. This is exactly the kind of task that should be automated. Before sending code to an LLM API, run it through a context optimizer that removes comments, trims whitespace, strips unnecessary imports, and compresses the code while preserving its semantic meaning.

Optimize your code context before sending it to LLMs

Our AI Context Optimizer strips comments, removes boilerplate, and compresses code to reduce token usage. Process everything locally in your browser — no data sent to any server.

Try AI Context Optimizer

In my workflow, I pipe code through a context optimizer as a preprocessing step before every LLM API call that involves code analysis. The typical savings are 30-50% of input tokens for code-heavy prompts. Over a month of heavy usage, that adds up to hundreds or thousands of dollars saved.
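
If you want to roll a lightweight version of this preprocessing yourself, the sketch below is a starting point. The regex approach is deliberately naive (it can mangle strings that happen to contain comment markers), so treat it as an approximation; the AST-based extraction in the next section is more robust:

// Naive code compressor: strip block and full-line comments, collapse blank lines.
// Heuristic only; it does not parse the language.
function compressCodeContext(source: string): string {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, '')  // block comments and JSDoc
    .replace(/^\s*\/\/.*$/gm, '')      // full-line // comments
    .replace(/\n{3,}/g, '\n\n')        // collapse runs of blank lines
    .trimEnd();
}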

Selective File Inclusion

When asking an LLM to analyze or modify code, resist the urge to send entire files. Use AST parsing or simple heuristics to extract only the relevant functions, classes, or blocks:

// Instead of sending an entire 500-line file, extract the relevant function
import * as parser from '@babel/parser';
import traverse from '@babel/traverse';

function extractFunction(source, functionName) {
  const ast = parser.parse(source, {
    sourceType: 'module',
    plugins: ['typescript', 'jsx']
  });

  let result = null;
  traverse(ast, {
    FunctionDeclaration(path) {
      if (path.node.id?.name === functionName) {
        result = source.slice(path.node.start, path.node.end);
        path.stop();
      }
    },
    VariableDeclarator(path) {
      if (path.node.id?.name === functionName &&
          (path.node.init?.type === 'ArrowFunctionExpression' ||
           path.node.init?.type === 'FunctionExpression')) {
        const decl = path.parentPath;
        result = source.slice(decl.node.start, decl.node.end);
        path.stop();
      }
    }
  });

  return result;
}

Strategy 4: Caching and Memoization

Caching is the second-highest-impact optimization after prompt engineering. In many applications, a significant percentage of LLM requests are semantically identical or nearly identical. Why pay for the same answer twice?

Exact Match Caching

The simplest and most effective cache is an exact match on the prompt hash. If the same prompt has been seen before, return the cached response:

import { createHash } from 'crypto';
import Redis from 'ioredis';
import OpenAI from 'openai';

const openai = new OpenAI();
const redis = new Redis(process.env.REDIS_URL);
const CACHE_TTL = 3600; // 1 hour

async function cachedLLMCall(messages, model, options = {}) {
  // Create a deterministic hash of the request
  const cacheKey = createHash('sha256')
    .update(JSON.stringify({ messages, model, ...options }))
    .digest('hex');

  // Check cache first
  const cached = await redis.get(`llm:${cacheKey}`);
  if (cached) {
    const parsed = JSON.parse(cached);
    parsed._cached = true; // Flag for monitoring
    return parsed;
  }

  // Cache miss: call the API
  const response = await openai.chat.completions.create({
    model,
    messages,
    ...options
  });

  // Cache the response
  await redis.setex(
    `llm:${cacheKey}`,
    CACHE_TTL,
    JSON.stringify(response)
  );

  return response;
}

In a customer support application I worked on, exact match caching alone had a 23% hit rate. That is nearly a quarter of all API calls eliminated. Common questions like "How do I reset my password?" or "What are your business hours?" hit the cache reliably.

Semantic Caching

Exact match caching misses near-duplicates. "How do I change my password?" and "I need to update my password" are semantically identical but produce different hashes. Semantic caching uses embedding similarity to catch these cases:

// Assumes helpers defined elsewhere: getEmbedding() wraps the embedding API,
// vectorDB is a vector store client, callLLM() wraps the chat completion call,
// and generateId() creates a unique cache entry id.

const SIMILARITY_THRESHOLD = 0.95;

async function semanticCachedLLMCall(prompt, model, options = {}) {
  // Generate embedding for the prompt
  const embedding = await getEmbedding(prompt);

  // Search for similar cached prompts using vector similarity
  const similar = await vectorDB.search({
    vector: embedding,
    topK: 1,
    threshold: SIMILARITY_THRESHOLD,
    collection: 'llm_cache'
  });

  if (similar.length > 0) {
    return {
      ...similar[0].metadata.response,
      _cached: true,
      _similarity: similar[0].score
    };
  }

  // Cache miss: call API and store result with embedding
  const response = await callLLM(prompt, model, options);

  await vectorDB.upsert({
    id: generateId(),
    vector: embedding,
    metadata: {
      prompt,
      response,
      model,
      createdAt: Date.now()
    },
    collection: 'llm_cache'
  });

  return response;
}

Semantic caching typically adds another 8-15% cache hit rate on top of exact match caching. The embedding call costs a fraction of the LLM call (around $0.02 per million tokens for embedding models like text-embedding-3-small), so the economics work out strongly in your favor.

Cache Hit Rate Benchmarks

Here are typical cache hit rates I have observed across different application types:

Cache invalidation strategy: Set TTLs based on how frequently the underlying data changes. For factual queries against a static knowledge base, a 24-hour TTL works well. For real-time data (stock prices, live dashboards), use a shorter TTL of 5-15 minutes or skip caching entirely. I also recommend implementing a manual cache purge endpoint for cases when you update your system prompt or model version.
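
The purge endpoint does not need to be elaborate. Here is a minimal sketch using Express and the ioredis client from the exact match caching example; the route path and key prefix are just the conventions used earlier in this guide:

// Manual cache purge endpoint, e.g. for system prompt or model version changes.
// Protect this behind admin authentication in practice.
import express from 'express';

const app = express();

app.post('/admin/llm-cache/purge', async (_req, res) => {
  // `redis` is the ioredis client from the exact match caching example.
  // SCAN in batches instead of KEYS so large keyspaces do not block Redis.
  let cursor = '0';
  let deleted = 0;
  do {
    const [next, keys] = await redis.scan(cursor, 'MATCH', 'llm:*', 'COUNT', 1000);
    cursor = next;
    if (keys.length > 0) {
      deleted += await redis.del(...keys);
    }
  } while (cursor !== '0');
  res.json({ deleted });
});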

Strategy 5: Model Routing and Tiering

Not every request deserves your most expensive model. This is probably the most underutilized optimization I see in production systems. Teams default to GPT-4o or Claude Sonnet for everything, when 60-80% of their requests could be handled by a model that costs 10-20x less.

Task-Based Routing

The idea is simple: classify the complexity of each request and route it to the appropriate model. Here is how I implement this:

// model-router.ts
type ModelTier = 'fast' | 'standard' | 'premium';

interface RouteConfig {
  model: string;
  costPer1MInput: number;
  costPer1MOutput: number;
}

const MODEL_TIERS: Record<ModelTier, RouteConfig> = {
  fast: {
    model: 'gpt-4o-mini',
    costPer1MInput: 0.15,
    costPer1MOutput: 0.60,
  },
  standard: {
    model: 'gpt-4o',
    costPer1MInput: 2.50,
    costPer1MOutput: 10.00,
  },
  premium: {
    model: 'claude-opus-4',
    costPer1MInput: 15.00,
    costPer1MOutput: 75.00,
  },
};

function routeRequest(task: string, context: RequestContext): ModelTier {
  // Simple classification tasks: use the cheapest model
  if (['classify', 'sentiment', 'extract_entities', 'format'].includes(task)) {
    return 'fast';
  }

  // Standard generation, summarization, Q&A
  if (['summarize', 'qa', 'generate_email', 'translate'].includes(task)) {
    return 'standard';
  }

  // Complex reasoning, code generation, multi-step analysis
  if (['code_review', 'debug', 'architecture', 'legal_analysis'].includes(task)) {
    return 'premium';
  }

  // Default to standard if unknown
  return 'standard';
}

async function smartLLMCall(task: string, messages: Message[], context: RequestContext) {
  const tier = routeRequest(task, context);
  const config = MODEL_TIERS[tier];

  const response = await callLLM({
    model: config.model,
    messages,
  });

  // Log for cost tracking
  logUsage(task, tier, response.usage);

  return response;
}

Model Cascading

An even smarter approach is model cascading: start with the cheapest model, and only escalate to a more expensive one if the output does not meet quality criteria. This works exceptionally well for tasks where you can programmatically verify the output:

async function cascadingLLMCall(messages, validators) {
  const cascade = [
    { model: 'gpt-4o-mini', maxRetries: 1 },
    { model: 'gpt-4o', maxRetries: 1 },
    { model: 'claude-opus-4', maxRetries: 1 },
  ];

  for (const { model, maxRetries } of cascade) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        const response = await callLLM({ model, messages });
        const output = response.choices[0].message.content;

        // Validate the output
        const isValid = validators.every(v => v(output));
        if (isValid) {
          console.log(`Resolved by ${model} on attempt ${attempt + 1}`);
          return { output, model, attempts: attempt + 1 };
        }
      } catch (err) {
        if (attempt === maxRetries) break;
      }
    }
  }

  throw new Error('All models failed to produce valid output');
}

// Example: JSON extraction with cascading
const result = await cascadingLLMCall(messages, [
  (output) => { try { JSON.parse(output); return true; } catch { return false; } },
  (output) => {
    const parsed = JSON.parse(output);
    return parsed.name && parsed.price !== undefined;
  },
]);

In production, I have found that 70-80% of requests resolve at the cheapest tier, 15-20% at the middle tier, and only 3-5% need the premium model. The blended cost ends up being 75-85% lower than using the premium model for everything.

Strategy 6: Batching and Rate Management

Both OpenAI and Anthropic offer batch processing APIs that provide significant discounts. OpenAI's Batch API gives you a 50% discount on all models in exchange for accepting up to 24-hour turnaround time. If your use case does not require real-time responses, this is free money on the table.

Identifying Batchable Workloads

Many LLM workloads are not actually real-time. In the systems I have worked on, nightly documentation generation runs, bulk ticket and bug triage backfills, entity extraction over historical data, and offline evaluation passes have all been good candidates for batching.

Implementing Batch Processing with OpenAI

import OpenAI from 'openai';
import { writeFileSync, createReadStream } from 'fs';

const openai = new OpenAI();

// Step 1: Prepare batch file (JSONL format)
function prepareBatchFile(requests) {
  const lines = requests.map((req, index) => JSON.stringify({
    custom_id: req.id ?? `request-${index}`,  // use the caller-supplied id so results can be matched back
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model: "gpt-4o",
      messages: req.messages,
      max_tokens: req.maxTokens || 500,
    }
  }));

  const filePath = '/tmp/batch_input.jsonl';
  writeFileSync(filePath, lines.join('\n'));
  return filePath;
}

// Step 2: Upload and create batch
async function submitBatch(filePath) {
  const file = await openai.files.create({
    file: createReadStream(filePath),
    purpose: 'batch',
  });

  const batch = await openai.batches.create({
    input_file_id: file.id,
    endpoint: '/v1/chat/completions',
    completion_window: '24h',
  });

  return batch.id;
}

// Step 3: Poll for completion
async function waitForBatch(batchId) {
  while (true) {
    const batch = await openai.batches.retrieve(batchId);

    if (batch.status === 'completed') {
      const results = await openai.files.content(batch.output_file_id);
      return results.text().then(t =>
        t.split('\n').filter(Boolean).map(JSON.parse)
      );
    }

    if (batch.status === 'failed' || batch.status === 'expired') {
      throw new Error(`Batch ${batchId} ${batch.status}`);
    }

    // Wait 60 seconds before checking again
    await new Promise(r => setTimeout(r, 60000));
  }
}

Async Processing Architecture

For applications that need to batch requests as they come in, I use a queue-based architecture. Requests accumulate in a queue, and a worker processes them in batches on a schedule:

// Accumulate requests and flush every 5 minutes or when batch reaches 100
class LLMBatchQueue {
  constructor(options = {}) {
    this.queue = [];
    this.maxBatchSize = options.maxBatchSize || 100;
    this.flushInterval = options.flushInterval || 300000; // 5 min
    this.callbacks = new Map();

    setInterval(() => this.flush(), this.flushInterval);
  }

  async enqueue(messages, options = {}) {
    return new Promise((resolve, reject) => {
      const id = crypto.randomUUID();
      this.callbacks.set(id, { resolve, reject });
      this.queue.push({ id, messages, options });

      if (this.queue.length >= this.maxBatchSize) {
        this.flush();
      }
    });
  }

  async flush() {
    if (this.queue.length === 0) return;

    const batch = this.queue.splice(0, this.maxBatchSize);
    const batchId = await submitBatch(prepareBatchFile(batch));

    // Process results when ready
    const results = await waitForBatch(batchId);
    for (const result of results) {
      const callback = this.callbacks.get(result.custom_id);
      if (callback) {
        callback.resolve(result.response.body);
        this.callbacks.delete(result.custom_id);
      }
    }
  }
}

The 50% discount from batch processing is substantial. On a $10,000/month LLM bill, shifting even 40% of workloads to batch processing saves $2,000/month immediately.

Strategy 7: Output Token Control

Remember: output tokens are 3-5x more expensive than input tokens. Controlling output length is one of the most direct ways to reduce costs.

Setting max_tokens Aggressively

Always set max_tokens to the smallest value that accommodates your expected output. Most developers either leave it at the default (which can be 4,096 or more) or set it way too high "just in case." Be precise:

// Task: classify sentiment (output: 1 word)
const response = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'Classify sentiment: "Great product!"' }],
  max_tokens: 5,  // "positive" is 1-2 tokens
});

// Task: extract JSON with 3 fields
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages,
  max_tokens: 100,  // Enough for a small JSON object
});

// Task: generate a code review (moderate output)
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages,
  max_tokens: 800,  // Prevents runaway verbose reviews
});

Structured Output Enforcement

Both OpenAI and Anthropic now support structured output through JSON mode or function calling. Using structured output not only ensures parseable responses but also prevents the model from generating unnecessary prose:

// OpenAI structured output with response_format
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: `Analyze this code for issues: ${code}`
  }],
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'code_review',
      schema: {
        type: 'object',
        properties: {
          issues: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                severity: { type: 'string', enum: ['critical', 'warning', 'info'] },
                line: { type: 'number' },
                message: { type: 'string' },
                fix: { type: 'string' }
              },
              required: ['severity', 'line', 'message']
            }
          },
          overallScore: { type: 'number' },
          summary: { type: 'string' }
        },
        required: ['issues', 'overallScore', 'summary']
      }
    }
  },
  max_tokens: 1000,
});

Structured output typically reduces output tokens by 30-50% compared to free-form text responses because the model does not generate preamble, explanations, or formatting that you did not ask for.

Stop Sequences

Stop sequences tell the model to stop generating when it encounters a specific string. This is useful for preventing the model from appending unwanted content:

// Prevent the model from adding explanations after JSON
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Return only JSON: extract name and email from...' }],
  stop: ['\n\n', 'Note:', 'Explanation:'],
  max_tokens: 200,
});

Measuring and Monitoring Costs

You cannot optimize what you do not measure. Every production LLM integration needs cost monitoring from day one. Here is the monitoring setup I deploy on every project.

Per-Request Cost Tracking

// llm-cost-tracker.ts
interface UsageRecord {
  timestamp: Date;
  model: string;
  feature: string;
  inputTokens: number;
  outputTokens: number;
  costUSD: number;
  latencyMs: number;
  cached: boolean;
}

const PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o':        { input: 2.50,  output: 10.00 },
  'gpt-4o-mini':   { input: 0.15,  output: 0.60  },
  'claude-sonnet': { input: 3.00,  output: 15.00 },
  'claude-haiku':  { input: 0.80,  output: 4.00  },
};

function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
  const pricing = PRICING[model];
  if (!pricing) throw new Error(`Unknown model: ${model}`);

  return (inputTokens * pricing.input + outputTokens * pricing.output) / 1_000_000;
}

async function trackLLMCall(feature: string, llmCallFn: () => Promise<any>) {
  const start = Date.now();
  const response = await llmCallFn();
  const latencyMs = Date.now() - start;

  const record: UsageRecord = {
    timestamp: new Date(),
    model: response.model,
    feature,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    costUSD: calculateCost(
      response.model,
      response.usage.prompt_tokens,
      response.usage.completion_tokens
    ),
    latencyMs,
    cached: response._cached || false,
  };

  // Store for dashboarding
  await metricsDB.insert('llm_usage', record);

  // Alert if single request cost exceeds threshold
  if (record.costUSD > 0.50) {
    await alerting.warn(`High-cost LLM call: $${record.costUSD.toFixed(4)} for ${feature}`);
  }

  return response;
}

Building a Cost Dashboard

I aggregate usage records into a simple dashboard that shows cost broken down by feature, model, and time period. The key metrics I track are request count, total input and output tokens, total and average cost per request, and cache hit rate:

// Example: SQL query for daily cost breakdown by feature
SELECT
  DATE(timestamp) as day,
  feature,
  model,
  COUNT(*) as request_count,
  SUM(input_tokens) as total_input_tokens,
  SUM(output_tokens) as total_output_tokens,
  SUM(cost_usd) as total_cost,
  AVG(cost_usd) as avg_cost_per_request,
  SUM(CASE WHEN cached THEN 1 ELSE 0 END)::float / COUNT(*) as cache_hit_rate
FROM llm_usage
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY DATE(timestamp), feature, model
ORDER BY day DESC, total_cost DESC;

Setting Cost Budgets and Alerts

Implement hard and soft limits. A soft limit triggers an alert. A hard limit stops non-critical LLM features to prevent runaway costs:

class CostBudgetManager {
  constructor(dailyBudget, monthlyBudget) {
    this.dailyBudget = dailyBudget;
    this.monthlyBudget = monthlyBudget;
  }

  async checkBudget(feature) {
    const dailySpend = await this.getDailySpend();
    const monthlySpend = await this.getMonthlySpend();

    if (monthlySpend >= this.monthlyBudget) {
      // Hard stop for non-critical features
      if (!CRITICAL_FEATURES.includes(feature)) {
        throw new BudgetExceededError('Monthly LLM budget exceeded');
      }
    }

    if (dailySpend >= this.dailyBudget * 0.8) {
      // Soft alert at 80% daily budget
      await alerting.warn(`LLM daily spend at ${((dailySpend/this.dailyBudget)*100).toFixed(0)}%`);
    }

    return true;
  }
}

Real-World Cost Optimization Case Study

Let me walk you through a real optimization project I completed last year. The application was an internal developer productivity tool that used LLMs for code review, documentation generation, and bug triage. Here are the before-and-after numbers.

Before: The Unoptimized State

| Metric | Value |
|---|---|
| Monthly LLM spend | $18,400 |
| Model used | GPT-4o for everything |
| Avg input tokens per request | 6,200 |
| Avg output tokens per request | 1,800 |
| Monthly requests | 420,000 |
| Cache hit rate | 0% (no caching) |
| System prompt size | 890 tokens |

Optimization Steps Applied

Step 1: Prompt compression. Rewrote all system prompts using compression techniques. The main code review prompt went from 890 tokens to 340 tokens. Estimated savings: 15% of input token costs.

Step 2: Context stripping. Implemented an automated pipeline that stripped comments, collapsed whitespace, removed import boilerplate, and extracted only relevant functions before sending code to the LLM. Average input tokens dropped from 6,200 to 3,100. Savings: another 25% of input costs.

Step 3: Model routing. Classified requests into three tiers. Bug triage (simple classification) went to GPT-4o mini. Documentation generation and complex code review stayed on GPT-4o, with some code review later moved to Claude Sonnet for better results. Roughly 55% of requests moved to cheaper models.

Step 4: Exact match and semantic caching. Deployed Redis-based exact match cache and Pinecone-based semantic cache. Combined cache hit rate: 28%. That is 28% of all API calls eliminated entirely.

Step 5: Output control. Added strict max_tokens limits per task type. Switched bug triage to structured JSON output. Average output tokens dropped from 1,800 to 650.

Step 6: Batch processing. Moved nightly documentation generation runs (about 15% of total requests) to OpenAI's Batch API for the 50% discount.

After: The Optimized State

| Metric | Before | After | Change |
|---|---|---|---|
| Monthly LLM spend | $18,400 | $3,200 | -82.6% |
| Avg input tokens/request | 6,200 | 2,800 | -54.8% |
| Avg output tokens/request | 1,800 | 650 | -63.9% |
| Effective requests (non-cached) | 420,000 | 302,400 | -28.0% |
| Blended cost per 1M input tokens | $2.50 | $0.68 | -72.8% |
| Output quality (human eval score) | 4.1/5 | 4.0/5 | -2.4% |

The output quality barely moved. That 0.1-point drop on a 5-point scale was within the margin of error of our evaluation. We achieved an 82.6% cost reduction while maintaining effectively identical quality. The total optimization effort took about three weeks of focused engineering work.

The compounding effect: These strategies do not just add up — they compound. Caching eliminates requests before they hit the model router. The model router sends surviving requests to cheaper models. Prompt compression and context stripping reduce the token count of each remaining request. Output control reduces the most expensive tokens (output). Each layer multiplies the savings of the previous one.
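
A rough way to see the multiplication, using illustrative factors in the same ballpark as the case study (these are not exact measurements; each factor is the fraction of cost that survives that layer):

// Multiplicative savings: each layer applies to whatever the previous layers left over.
const survivingFractions = {
  caching: 0.72,          // ~28% of requests never reach the API
  modelRouting: 0.60,     // blended per-token price after routing, relative to before
  inputCompression: 0.55, // prompt compression + context stripping
  outputControl: 0.72,    // max_tokens limits and structured output
};

const remaining = Object.values(survivingFractions).reduce((acc, f) => acc * f, 1);
console.log(`${((1 - remaining) * 100).toFixed(0)}% total reduction`); // ~83%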

Conclusion

LLM API costs do not have to be the budget black hole that most teams experience. The seven strategies covered in this guide — prompt engineering, context window management, code context optimization, caching, model routing, batching, and output control — form a comprehensive optimization toolkit. You do not need to implement all of them at once. Start with the highest-impact, lowest-effort wins:

  1. Set max_tokens on every API call. This takes five minutes and immediately prevents runaway output costs.
  2. Compress your system prompts. An afternoon of work that typically saves 15-30% on input tokens.
  3. Add exact match caching. A day of work that eliminates 15-30% of all API calls.
  4. Implement model routing. Two to three days of work that can cut your blended model cost by 50-70%.
  5. Strip code context before sending to LLMs. Use a tool to automate this and save 30-50% on code-heavy prompts.

The remaining strategies (semantic caching, batch processing, advanced context management) are worth pursuing once you have the fundamentals in place and need to squeeze out the next 20-30% of savings.

One final piece of advice: always measure before and after. Set up cost tracking from the beginning of your LLM integration, not after you realize the bill is too high. The monitoring infrastructure pays for itself many times over by catching cost regressions early and giving you the data you need to make informed optimization decisions.

The LLM cost optimization game is not about being cheap. It is about being efficient. Every dollar you save on API costs is a dollar you can reinvest in building better features, running more experiments, or scaling to more users. The teams that master this will be the ones who can afford to ship AI features that their competitors cannot.

Start optimizing your LLM token usage today

Use our free AI Context Optimizer to strip comments, remove boilerplate, and compress code before sending it to any LLM API. Everything runs locally in your browser.

Open AI Context Optimizer