Last quarter, my AI inference costs hit $100,000 annualized. I started small. Six months earlier, I was spending $200 a month on Claude. Then I added three agent subscriptions : Codex, Gemini, & Claude Code. I was paying $600 a month. Next I started using AI to transform my todo list into my done list, increasing tasks to 31 per day. $92 daily inference invoices started arriving. Then $400 per month on browser agents. Within two quarters, my inference spend grew from $7,200 to $43,000 to over $100,000 run rate. So I migrated to an open source model. It took a weekend. The key was building the right testing loops : I had six months of historical task data, so I could replay requests through the new model & hill-climb to parity with AI agents working through the night. By Sunday evening, they performed identically. At 12% of the cost. I’m not the only one paying attention to this cost. Technology companies are adding a fourth component to engineering compensation : salary, bonus, options, & inference costs. Levels.fyi pegs the 75th percentile software engineer salary at $375k. Add $100k in inference & the fully loaded cost is $475k. That’s 21% in tokens. The question CFOs will pose : what am I getting for all this inference spend? Can I do it cheaper? If the metric for a new cloud is gross profit per GPU hour, the employee equivalent is : productive work per dollar of inference. For me, the answer is 31 tasks a day at $12k annually. The engineer still burning $100k? They’d better be 8x more productive! Will you be paid in tokens? In 2026, you likely will start to be.
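The arithmetic in the post can be checked in a few lines. This is a minimal sketch that uses only the figures quoted above; the variable names are illustrative, and the tasks-per-dollar line assumes tasks run 365 days a year, which the post does not state.

```python
# Figures quoted in the post; variable names are illustrative.
salary = 375_000            # 75th percentile engineer salary, per the post
inference_before = 100_000  # annualized inference run rate before migrating
cost_ratio = 0.12           # the open source model ran at 12% of the cost
tasks_per_day = 31

fully_loaded = salary + inference_before
token_share = inference_before / fully_loaded
inference_after = inference_before * cost_ratio
productivity_bar = inference_before / inference_after

# Assumption: tasks run 365 days a year (not stated in the post).
tasks_per_dollar = tasks_per_day * 365 / inference_after

print(f"fully loaded cost: ${fully_loaded:,}")
print(f"share paid in tokens: {token_share:.0%}")
print(f"inference after migration: ${inference_after:,.0f}")
print(f"required productivity multiple: {productivity_bar:.1f}x")
print(f"tasks per inference dollar: {tasks_per_dollar:.2f}")
```

The "productive work per dollar of inference" metric is the last line: tasks completed divided by inference spend over the same period.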
Optimizing Technology Spending
Explore top LinkedIn content from expert professionals.
-
-
You don't need a 2 trillion parameter model to tell you the capital of France is Paris. Be smart and route between a panel of models according to query difficulty and model specialty!

A new paper proposes a framework to train a router that sends each query to the appropriate LLM, optimizing the trade-off between cost and performance.

Overview:

Model inference cost varies significantly. Per one million output tokens: Llama-3-70b ($1) vs. GPT-4-0613 ($60), Haiku ($1.25) vs. Opus ($75).

The RouteLLM paper proposes a router training framework based on human preference data and augmentation techniques, demonstrating over 2x cost savings on widely used benchmarks.

They define the problem as choosing between two classes of models:
(1) strong models - produce high-quality responses but at a high cost (GPT-4o, Claude3.5)
(2) weak models - relatively lower quality and lower cost (Mixtral8x7B, Llama3-8b)

A good router requires a deep understanding of the question's complexity as well as the strengths and weaknesses of the available LLMs.

The paper explores different routing approaches:
- Similarity-weighted (SW) ranking
- Matrix factorization
- BERT query classifier
- Causal LLM query classifier

Neat ideas to build from:
- Users can collect a small amount of in-domain data to improve performance for their specific use cases via dataset augmentation.
- The problem can be expanded from routing between a strong and a weak LLM to multiclass routing across specialist models (vision-language model, function-calling model, etc.).
- Larger framework controlled by a router: imagine a system of 15-20 tuned small models, with the router as the n+1'th model responsible for picking the LLM that will handle a particular query at inference time.
- MoA architectures: routing to different architectures of a Mixture of Agents would be a cool idea as well. Depending on the query, you decide how many proposers there should be, how many layers in the mixture, what the aggregator models should be, etc.
- Route-based caching: if you get redundant queries that differ only slightly, route the query plus the previous answer to a small model for light rewriting instead of regenerating the answer.
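The strong/weak routing decision described above can be sketched as a score compared against a threshold. This is a toy illustration, not any of the paper's routers: `difficulty_score` and `HARD_HINTS` are hypothetical stand-ins for a trained router, and only the two prices come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_million_output_tokens: float


# Prices as quoted in the post.
WEAK = Model("Llama-3-70b", 1.0)
STRONG = Model("GPT-4-0613", 60.0)

# Hypothetical keyword list; a real router would be learned from preference data.
HARD_HINTS = ("prove", "derive", "debug", "refactor", "step by step")


def difficulty_score(query: str) -> float:
    """Toy estimate in [0, 1] of how much a query needs the strong model."""
    text = query.lower()
    hint_hits = sum(hint in text for hint in HARD_HINTS)
    length_term = min(len(text.split()) / 50, 1.0)
    return min(0.3 * hint_hits + 0.4 * length_term, 1.0)


def route(query: str, threshold: float = 0.3) -> Model:
    """Send the query to the strong model only when the score clears the threshold."""
    return STRONG if difficulty_score(query) >= threshold else WEAK


def blended_price(queries, threshold: float = 0.3) -> float:
    """Average output price per million tokens, assuming equal output length per query."""
    chosen = [route(q, threshold) for q in queries]
    return sum(m.usd_per_million_output_tokens for m in chosen) / len(chosen)


if __name__ == "__main__":
    for q in ("What is the capital of France?",
              "Prove that the sum of two even numbers is even, step by step."):
        print(f"{route(q).name:12s} <- {q}")
```

Raising `threshold` sends more traffic to the cheap model and lowers the blended price at the risk of weaker answers; that single knob is the cost versus performance trade-off the post describes.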
-
If you’re an AI engineer trying to optimize your LLMs for inference, here’s a quick guide for you 👇

Efficient inference isn’t just about faster hardware; it’s a multi-layered design problem. From how you compress prompts to how your memory is managed across GPUs, everything impacts latency, throughput, and cost.

Here’s a structured taxonomy of inference-time optimizations for LLMs:

1. Data-Level Optimization
Reduce redundant tokens and unnecessary output computation.
→ Input Compression:
- Prompt Pruning, remove irrelevant history or system tokens
- Prompt Summarization, use model-generated summaries as input
- Soft Prompt Compression, encode static context using embeddings
- RAG, replace long prompts with retrieved documents plus compact queries
→ Output Organization:
- Pre-structure output to reduce decoding time and minimize sampling steps

2. Model-Level Optimization
(a) Efficient Structure Design
→ Efficient FFN Design, use gated or sparsely-activated FFNs (e.g., SwiGLU)
→ Efficient Attention, FlashAttention, linear attention, or sliding window for long context
→ Transformer Alternatives, e.g., Mamba, Reformer for memory-efficient decoding
→ Multi/Group-Query Attention, share keys/values across heads to reduce KV cache size
→ Low-Complexity Attention, replace full softmax with approximations (e.g., Linformer)
(b) Model Compression
→ Quantization:
- Post-Training, no retraining needed
- Quantization-Aware Training, better accuracy, especially <8-bit
→ Sparsification:
- Weight Pruning, Sparse Attention
→ Structure Optimization:
- Neural Architecture Search, Structure Factorization
→ Knowledge Distillation:
- White-box, student learns from the teacher's internal states and logits
- Black-box, student learns only from the teacher's final outputs
→ Dynamic Inference, adaptive early exits or skipping blocks based on input complexity

3. System-Level Optimization
(a) Inference Engine
→ Graph & Operator Optimization, use ONNX, TensorRT, BetterTransformer for op fusion
→ Speculative Decoding, use a smaller model to draft tokens, validate with full model
→ Memory Management, KV cache reuse, paging strategies (e.g., PagedAttention in vLLM)
(b) Serving System
→ Batching, group requests with similar lengths for throughput gains
→ Scheduling, token-level preemption (e.g., TGI, vLLM schedulers)
→ Distributed Systems, use tensor, pipeline, or model parallelism to scale across GPUs

My Two Cents 🫰
→ Always benchmark end-to-end latency, not just token decode speed
→ For production, 8-bit or 4-bit quantized models with MQA and PagedAttention give the best price/performance
→ If using long context (>64k), consider sliding attention plus RAG, not full dense memory
→ Use speculative decoding and batching for chat applications with high concurrency
→ LLM inference is a systems problem. Optimizing it requires thinking holistically, from tokens to tensors to threads.

Image inspo: A Survey on Efficient Inference for Large Language Models

Follow me (Aishwarya Srinivasan) for more AI insights!
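To make the multi/group-query attention point concrete, the sketch below estimates KV cache size from the standard count: two tensors (keys and values) per layer, each of size KV heads × head dimension × sequence length. The model shape used here (32 layers, 32 query heads, head dimension 128, fp16) is a hypothetical example, not a figure from the post.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   batch: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    keys_and_values = 2  # one tensor for K, one for V
    return keys_and_values * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, fp16 cache, one 8192-token sequence.
    shape = dict(layers=32, head_dim=128, seq_len=8192)
    for label, kv_heads in (("multi-head (32 KV heads)", 32),
                            ("grouped-query (8 KV heads)", 8),
                            ("multi-query (1 KV head)", 1)):
        gib = kv_cache_bytes(kv_heads=kv_heads, **shape) / 2**30
        print(f"{label:28s} {gib:.3f} GiB")
```

Because cache size scales linearly with the number of KV heads, sharing keys and values across query heads shrinks the cache by the sharing factor, which in turn leaves more memory for batching.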
-
After optimizing costs for many AI systems, I've developed a systematic approach that consistently delivers cost reductions of 60-80%. Here's my playbook, in order of least to most effort: Step 1: Optimizing Inference Throughput Start here for the biggest wins with least effort. Enabling caching (LiteLLM (YC W23), Zilliz) and strategic batch processing can reduce costs by a lot with very little effort. I have seen teams cut costs by half simply by implementing caching and batching requests that don't require real-time results. Step 2: Maximizing Token Efficiency This can give you an additional 50% cost savings. Prompt engineering, automated compression (ScaleDown), and structured outputs can cut token usage without sacrificing quality. Small changes in how you craft prompts can lead to massive savings at scale. Step 3: Model Orchestration Use routers and cascades to send prompts to the cheapest and most effective model for that prompt (OpenRouter, Martian). Why use GPT-4 for simple classification when GPT-3.5 will do? Smart routing ensures you're not overpaying for intelligence you don't need. Step 4: Self-Hosting I only suggest self-hosting for teams at scale because of the complexities involved. This requires more technical investment upfront but pays dividends for high-volume applications. The key is tackling these layers systematically. Most teams jump straight to self-hosting or model switching, but the real savings come from optimizing throughput and token efficiency first. What's your experience with AI cost optimization?
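A quick way to see how the steps stack: savings compound multiplicatively, since each step reduces whatever bill is left after the previous one. The sketch below assumes the "additional 50%" in Step 2 applies to the remaining spend after Step 1, which the post does not spell out; the function names are illustrative.

```python
def remaining_cost(baseline: float, reductions) -> float:
    """Apply each fractional reduction to the cost left after the previous step."""
    cost = baseline
    for reduction in reductions:
        if not 0 <= reduction <= 1:
            raise ValueError("each reduction must be a fraction between 0 and 1")
        cost *= 1 - reduction
    return cost


def total_savings(reductions) -> float:
    """Overall fraction saved after applying all reductions in sequence."""
    return 1 - remaining_cost(1.0, reductions)


if __name__ == "__main__":
    # Step 1 (caching + batching) halves the bill; Step 2 (token efficiency)
    # halves what remains. Assumed interpretation of the post's figures.
    print(f"after steps 1-2: {total_savings([0.5, 0.5]):.0%} saved")
```

Under that reading, two 50% steps leave a quarter of the original bill, which lands inside the 60-80% range the post cites.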
-
Caching Architecture Is the New Backbone of LLM Systems

Performance, cost, and latency all depend on it.

If your LLM bill is rising every month, you’re not alone. More usage. More tokens. More cost.

But here’s the catch. Most of that compute is repeated work. Same prompts. Same context. Same patterns. And we recompute everything... every time.

This is where inference caching changes the game. Not a new model. Not a new architecture. Just smarter reuse.

There are three layers that matter:

1. KV Caching
- Happens inside the model
- Stores attention keys and values during generation
- Avoids recomputing them for earlier tokens within a request
You’re already using it. You just don’t see it.

2. Prefix Caching
- Extends this across requests
- If your system prompt or reference context is constant, process it once → reuse it
Simple rule: static content at the top, dynamic content at the end. High impact. Almost zero effort.

3. Semantic Caching
- This is where things get interesting
- Store past queries and responses
- Retrieve based on meaning, not exact match
In many cases, you can skip the LLM call entirely. Massive cost savings for support bots, FAQs, and repeated queries.

The real power comes from layering them:
- KV runs by default
- Prefix reduces repeated context cost
- Semantic avoids calls altogether

Most teams focus on model quality. But in production, efficiency is what scales. Because in real systems, the cheapest token is the one you never generate.
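The semantic layer can be sketched in a few dozen lines. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the 0.8 threshold is arbitrary, and `call_llm` is a hypothetical callable supplied by the caller.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # arbitrary cut-off, tune per workload
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query: str):
        """Return the stored response closest in meaning, or None on a miss."""
        vector = embed(query)
        best_score, best_response = 0.0, None
        for stored_vector, response in self.entries:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


def answer(query: str, cache: SemanticCache, call_llm) -> str:
    """Serve from the cache when a similar query was seen; otherwise call the model."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.put(query, response)
    return response
```

The linear scan is fine for a sketch; at scale the lookup would typically sit on a vector index, and the threshold decides how aggressively near-duplicates reuse an old answer.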
-
You are a CIO. You wonder why you've already spent your AI budget. It's the models, stupid. In our ongoing study of 100+ enterprises, one pattern keeps emerging as the single biggest cost lever in production AI. And it's not caching, not prompt engineering, not negotiating better rates. It's model routing. The problem: → Most teams default to the most powerful model for everything. It's the safe bet. It's also the expensive one. → 80% of enterprise workloads don't need a frontier model. They need a $0.001 call, not a $0.05 one. → Nobody builds the routing logic until the bill forces them to. What the cost-efficient ones do: => Cascade architecture. Cheapest model first. Escalate only when confidence is low. 90% of queries never escalate. => Fine-tuned smaller models. 95% of the quality at 5% of the cost. At scale, that's the difference between viable and bankrupt. => Complexity classification. Simple factual query? Tiny model. Multi-turn reasoning? Frontier. The routing decision outweighs every other optimization combined. => Automated routing. Real-time decisions on model, latency, cost, and regulatory requirements. One enterprise is 60% toward fully automated selection. The biggest Return on Inference doesn't come from spending less. It comes from spending smart. Invest in what actually matters: better user experiences and faster innovation. The question isn't whether you need model routing. It's how many millions you'll burn before you build it.
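A cascade of the kind described above can be sketched as a loop over tiers sorted by price. The tier callables and the confidence score are hypothetical: how confidence is obtained (self-report, a verifier, log-probabilities) is left open, and only the $0.001 and $0.05 per-call prices come from the post.

```python
from typing import Callable, NamedTuple


class Reply(NamedTuple):
    text: str
    confidence: float  # in [0, 1]; how it is scored is an assumption left to the caller


class Tier(NamedTuple):
    name: str
    usd_per_call: float
    call: Callable[[str], Reply]


def cascade(query: str, tiers, min_confidence: float = 0.8):
    """Try tiers from cheapest to priciest; return (reply, tier name, total spend)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    spent = 0.0
    for tier in sorted(tiers, key=lambda t: t.usd_per_call):
        reply = tier.call(query)
        spent += tier.usd_per_call  # an escalated query pays for every tier it touched
        if reply.confidence >= min_confidence:
            break
    # If no tier was confident, the most expensive tier's reply is returned.
    return reply, tier.name, spent


if __name__ == "__main__":
    # Stub models for illustration only.
    tiny = Tier("tiny", 0.001,
                lambda q: Reply("short answer", 0.95 if len(q.split()) < 10 else 0.3))
    frontier = Tier("frontier", 0.05, lambda q: Reply("detailed answer", 0.9))
    print(cascade("Capital of France?", [frontier, tiny]))
```

With the post's figures, if 90% of queries stop at the first tier, the expected cost per query is 0.001 + 0.10 × 0.05 = $0.006, compared with $0.05 when everything goes to the frontier model.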
-
Are you proactively managing your Source-to-Pay technology costs?

If not, why let savings from smart Procurement slip away due to outdated technology or suboptimal use?

S2P technology plays a central role in cost management, yet many companies lack a strategic approach to continuously assess and optimise their tech stack.

Companies can adopt Bain & Co’s "Reduce, Replace, and Rethink" model to continuously evaluate their technology infrastructure and costs, ensuring a more optimised and sustainable cost profile.

Here is the model in action for Source to Pay technology cost optimisation:

▪️ Reduce to recover 10 to 20% of costs through short-term actions such as
- adjusting licenses to match actual usage and adoption patterns
- discontinuing features or functionalities that add little value
- switching off modules where business capabilities have not yet caught up
Avoid over-licensing by matching user access to actual needs, ensuring modules align with Procurement’s readiness.

▪️ Replace to yield 20 to 30% of savings by
- transitioning to cost-optimal, flexible solutions and getting out of lock-ins
- switching subscription models when premium offerings are unnecessary
- consolidating overlapping tools that offer similar features
For example, merge multiple eSourcing tools into a primary platform and adopt tender-based pricing for niche auction needs. This helps align the cost profile of your Source to Pay technology with actual needs.

▪️ Rethink to realise up to 40% cost optimisation by
- reimagining the architecture with a modular, composable design
- automating and orchestrating processes and integrating new digital tools
- reevaluating the mix of best-of-breed solutions vs integrated suites
A new Procurement strategy requires a fresh look at the S2P tech stack to ensure it adapts and supports growth cost-effectively, while offering flexibility through additional digital levers like AI and automation.

Optimising S2P technology is a continuous journey, not a one-time effort, especially with contractual commitments, sunk costs, and change management challenges. Rather than following IT preferences and standards, it’s about keeping technology fresh and aligned with business needs as they evolve.

❓ How do you manage your S2P technology to adapt to changing business needs while maintaining cost efficiency?
-
Most engineers think model cost is about API tokens or inference time. In reality, it’s about how your requests compete for GPU scheduling and how effectively your data stays hot in cache.

Here’s the untold truth 👇

1. Every millisecond on a GPU is a war for priority.
Your model doesn’t just “run.” It waits its turn. Schedulers (like Kubernetes device plugins, Triton schedulers, or CUDA MPS) decide who gets compute time, and how often. If your jobs are fragmented or unbatched, you’re paying for idle silicon. That’s like renting a Ferrari to sit in traffic.

2. Caching layers quietly decide your burn rate.
Intermediate activations, embeddings, and KV caches live in high-bandwidth memory. If your model keeps reloading them between requests, you’re paying full price every time. That’s why serving infra (like vLLM, DeepSpeed, or FasterTransformer) focuses more on cache reuse than raw FLOPS. The real optimization isn’t in “faster models.” It’s in smarter scheduling and cache locality. Your cost per token can drop 50% with zero model changes, just better orchestration.

3. The hidden tax: fragmentation and eviction.
When too many models share the same GPU cluster, the scheduler starts slicing compute and evicting caches. This leads to context thrashing, where memory swaps cost more than inference. At scale, this kills both performance and margins.

So if you’re wondering why your inference bill doubled while latency stayed the same, don’t blame the model. Blame the infrastructure design. The real bottleneck isn’t model size; it’s architectural awareness. Understanding schedulers, memory hierarchies, and caching strategies is what separates AI engineers from AI architects.

And that’s exactly what we go deep into inside the Advanced System Design Cohort, a 3-month, high-intensity program for Senior, Staff, and Principal Engineers who want to master the systems that power modern AI infra.
You’ll learn to think beyond API calls — about how compute, caching, and scheduling interact to define scale and cost. If you’re ready to learn the architectures behind real AI systems — there’s a form in the comments. Apply, and we’ll check if you’re a great fit. We’re selective, because this is where future technical leaders are being built.
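The "idle silicon" point earlier in this post reduces to simple arithmetic: a GPU is billed by the hour whether or not it is producing tokens, so cost per token is inversely proportional to utilization. A minimal sketch, with a hypothetical hourly price and peak throughput chosen only for illustration:

```python
def usd_per_million_tokens(gpu_usd_per_hour: float,
                           peak_tokens_per_second: float,
                           utilization: float) -> float:
    """Cost per million generated tokens when the GPU is busy `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical figures: $4/hour GPU, 2,500 tokens/s when fully batched.
    for utilization in (0.4, 0.8):
        cost = usd_per_million_tokens(4.0, 2500, utilization)
        print(f"utilization {utilization:.0%}: ${cost:.2f} per million tokens")
```

Doubling utilization through batching and scheduling halves the cost per token without touching the model, which is the kind of 50% drop the post refers to.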
-
AI inference costs are killing your profit margins. Let me teach you how to reduce your inference overhead with compiler & graph execution.

Running an LLM under PyTorch or TensorFlow looks simple, but the framework issues thousands of separate GPU kernel calls for every forward pass. Each kernel executes a small unit of work (like normalization or matrix multiplication) and writes the result to global GPU memory (HBM) before reading it back. While HBM bandwidth reaches 2–3 TB/s on an H100, that is 10–50x slower than the GPU’s on-chip registers. Every unnecessary trip to HBM is wasted potential. Worse, each kernel launch requires the CPU to coordinate with the GPU, adding tens of microseconds of overhead. Across thousands of kernel launches, this becomes milliseconds of latency.

Three techniques (kernel fusion, CUDA graphs, and FlashAttention) target these bottlenecks.

Kernel Fusion: Combining Operations
Instead of launching separate kernels for LayerNorm and matrix multiplication, you fuse them into one. The compiler rewrites the computational graph to combine operations, ensuring intermediate results stay in the GPU’s fast on-chip registers instead of touching global HBM. This cuts memory traffic and eliminates redundant kernel launches. The tax: irregular shapes or dynamic padding can block fusion, leading to a mix of fused and unfused kernels.

CUDA Graphs: Bypassing the CPU
Inference involves repeating the same sequence of kernels for every generated token. Rather than the CPU re-issuing each command, CUDA graphs allow you to record the sequence once and replay it with a single launch. This removes per-kernel CPU launch overhead. The tax: graphs are tied to specific tensor shapes, so effective systems capture the "hot" shapes and fall back to standard execution for others.

FlashAttention: Avoiding the Quadratic Wall
Standard attention computes an N x N score matrix between queries and keys and writes it to HBM, which can create gigabytes of memory traffic at long sequence lengths. FlashAttention tiles this computation, loading small blocks of queries and keys into on-chip SRAM to compute partial attention scores incrementally. The result is mathematically identical, but the memory footprint is a fraction of the original. The tax: gains depend on sequence length, and for very short sequences, the overhead of tiling can outweigh benefits.

Summary: The Performance Compound
Kernel fusion ensures fewer writes and more work per cycle. CUDA graphs remove launch overhead, keeping the GPU in constant motion. FlashAttention prevents memory blowup, freeing bandwidth for compute.
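The claim that tiled attention is mathematically identical rests on a running (online) softmax: each block updates a running maximum, a running denominator, and a rescaled accumulator, so the full score row never has to exist at once. The sketch below shows that recurrence for a single query in plain Python. It is an illustration of the arithmetic only: it omits the 1/sqrt(d) scaling, batching, multiple heads, and anything about GPU memory, and the function names are illustrative.

```python
import math


def attention_reference(q, keys, values):
    """Standard attention for one query: every score is materialized before the softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Same output, but keys/values are consumed one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, v_block))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.7], [2.0, 0.1], [-0.5, 0.4]]
    values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.0, 3.0]]
    print(attention_reference(q, keys, values))
    print(attention_tiled(q, keys, values))
```

Only one block of scores is alive at any time, which is why the working set can fit in on-chip memory; the two functions agree up to floating-point rounding.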
-
Last week, a stakeholder asked me: “Can we track software license usage in real time… without buying another tool?”

It’s a question I hear a lot as a ServiceNow Architect. So I showed them what we could do using Software Asset Management (SAM) within ServiceNow. We connected to SCCM and Azure AD, normalized the software data using the ServiceNow Content Library, and enabled reclamation rules to automatically free up unused licenses.

🎯 The result? We identified $87K worth of underutilized licenses in 2 days, without any third-party tools.

It wasn't just about cost savings. It was about:
- Enforcing license compliance
- Automating reclaim processes
- Giving Procurement real-time insights

💬 Have you used SAM to reclaim unused licenses? What’s your biggest challenge with license optimization?

#ServiceNow #SAM #ITAM #DigitalTransformation #LicenseManagement #ITOperations #TechLeadership