Build an Inference Cache to Save Costs in High-Traffic LLM Apps


In this article, you will learn how to add both exact-match and semantic inference caching to large language model applications to reduce latency and API costs at scale.

Topics we will cover include:

  • Why repeated queries in high-traffic apps waste time and money.
  • How to build a minimal exact-match cache and measure the impact.
  • How to implement a semantic cache with embeddings and cosine similarity.

Alright, let’s get to it.


Introduction

Large language models (LLMs) are widely used in applications like chatbots, customer support, and code assistants, and these applications often serve millions of queries per day. In high-traffic apps, it’s very common for many users to ask the same or similar questions. So think about it: does it really make sense to call the LLM every single time, when each call costs money and adds latency to the response? Logically, no.

Take a customer service bot as an example. Thousands of users might ask questions every day, and many of those questions are repeated:

  • “What’s your refund policy?”
  • “How do I reset my password?”
  • “What’s the delivery time?”

If every single query is sent to the LLM, you’re burning through your API budget unnecessarily. Each repeated request costs the same, even though the model has already generated that answer before. That’s where inference caching comes in: think of it as a memory layer where you store answers to the most common questions and reuse them.

In this article, I’ll walk you through a high-level overview with code. We’ll start with a single LLM call, simulate what a high-traffic app looks like, build a simple exact-match cache, and then look at the more advanced semantic version you’d want in production. Let’s get started.

Setup

Install dependencies. I am using Google Colab for this demo. We’ll use the OpenAI Python client:
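!pip install openai

(numpy, which we’ll use later for the semantic cache, is preinstalled in Colab; add it to the install line if you’re running elsewhere.)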

Set your OpenAI API key:
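A minimal sketch using the v1-style OpenAI client; I’m reading the key from an environment variable, but Colab’s Secrets panel works just as well:

import os
from openai import OpenAI

# Avoid hard-coding real keys in shared notebooks
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key

# The client picks up OPENAI_API_KEY from the environment automatically
client = OpenAI()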

Step 1: A Simple LLM Call

This function sends a prompt to the model and prints how long it takes:
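Something like this works (gpt-4o-mini below is just a placeholder for whichever model you use):

import time

def ask_llm(prompt: str) -> str:
    """Send a prompt to the model and print how long the call took."""
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in your model of choice
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.time() - start
    print(f"LLM call took {elapsed:.2f} seconds")
    return response.choices[0].message.content

print(ask_llm("What is your refund policy?"))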

Output:

This works fine for one call. But what if the same question is asked over and over?

Step 2: Simulating Repeated Questions

Let’s create a small list of user queries. Some are repeated, some are new:
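Here I’ll reuse the customer-support questions from earlier:

queries = [
    "What is your refund policy?",
    "How do I reset my password?",
    "What is your refund policy?",  # exact repeat of the first query
    "What's the delivery time?",
]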

Let’s see what happens if we call the LLM for each:
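A plain loop over the list, timing the whole batch:

total_start = time.time()
for q in queries:
    print(f"Query: {q}")
    ask_llm(q)
print(f"Total time without caching: {time.time() - total_start:.2f} seconds")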

Output:

Every time, the LLM is called again. Even though two queries are identical, we’re paying for both. With thousands of users, these costs can skyrocket.

Step 3: Adding an Inference Cache (Exact Match)

As a naive first solution, we can fix this with a simple dictionary-based cache.
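Here’s a sketch that wraps the earlier ask_llm function and keys the cache on the exact query text:

cache = {}

def ask_llm_cached(prompt: str) -> str:
    """Return the cached answer for an exact prompt match; otherwise call the LLM."""
    if prompt in cache:
        print("Cache hit - no API call made")
        return cache[prompt]
    print("Cache miss - calling the LLM")
    answer = ask_llm(prompt)
    cache[prompt] = answer
    return answer

for q in queries:
    print(f"Query: {q}")
    ask_llm_cached(q)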

Output:

Now:

  • The first time “What is your refund policy?” is asked, it calls the LLM.
  • The second time, it instantly retrieves from cache.

This saves cost and reduces latency dramatically.

Step 4: The Problem with Exact Matching

Exact matching works only when the query text is identical. Let’s see an example:
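Suppose a second user words the refund question differently:

ask_llm_cached("What is your refund policy?")          # already cached -> hit
ask_llm_cached("Can you explain your refund policy?")  # same intent, different text -> miss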

Output:

Both queries ask about refunds, but since the text is slightly different, our cache misses. That means we still pay for the LLM. This is a big problem in the real world because users phrase questions differently.

Step 5: Semantic Caching with Embeddings

To fix this, we can use semantic caching. Instead of checking if text is identical, we check if queries are similar in meaning. We can use embeddings for this:
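Here’s a small in-memory sketch; the embedding model and the 0.85 similarity threshold are choices you’d tune for your own data:

import numpy as np

semantic_cache = []  # list of (embedding, answer) pairs

def embed(text: str) -> np.ndarray:
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # any embedding model works here
        input=text,
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ask_llm_semantic(prompt: str, threshold: float = 0.85) -> str:
    """Reuse a cached answer when a semantically similar prompt has been seen before."""
    query_emb = embed(prompt)
    for cached_emb, cached_answer in semantic_cache:
        if cosine_similarity(query_emb, cached_emb) >= threshold:
            print("Semantic cache hit - reusing a similar answer")
            return cached_answer
    print("Semantic cache miss - calling the LLM")
    answer = ask_llm(prompt)
    semantic_cache.append((query_emb, answer))
    return answer

ask_llm_semantic("What is your refund policy?")
ask_llm_semantic("Can you explain your refund policy?")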

Output:

Even though the second query is worded differently, the semantic cache recognizes its similarity and reuses the answer.

Conclusion

If you’re building customer support bots, AI agents, or any high-traffic LLM app, caching should be one of the first optimizations you put in place.

  • Exact cache saves cost for identical queries.
  • Semantic cache saves cost for meaningfully similar queries.
  • Together, they can massively reduce API calls in high-traffic apps.

In real-world production apps, you’d store embeddings in a vector database like FAISS, Pinecone, or Weaviate for fast similarity search. But even this small demo shows how much cost and time you can save.
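For a flavor of what that looks like, here is a rough FAISS version of the lookup step; it assumes faiss-cpu is installed, reuses the embed helper from above, and the 1536 dimension matches text-embedding-3-small:

import faiss
import numpy as np

dim = 1536  # embedding dimensionality for text-embedding-3-small
index = faiss.IndexFlatIP(dim)  # inner product on normalized vectors = cosine similarity
cached_answers = []

def _normalize(v: np.ndarray) -> np.ndarray:
    return (v / np.linalg.norm(v)).astype("float32").reshape(1, -1)

def add_to_cache(prompt: str, answer: str) -> None:
    index.add(_normalize(embed(prompt)))
    cached_answers.append(answer)

def lookup(prompt: str, threshold: float = 0.85):
    """Return a cached answer if the nearest neighbor is similar enough, else None."""
    if index.ntotal == 0:
        return None
    scores, ids = index.search(_normalize(embed(prompt)), 1)
    if scores[0][0] >= threshold:
        return cached_answers[ids[0][0]]
    return None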
