In this article, you will learn how to add both exact-match and semantic inference caching to large language model applications to reduce latency and API costs at scale.
Topics we will cover include:
- Why repeated queries in high-traffic apps waste time and money.
- How to build a minimal exact-match cache and measure the impact.
- How to implement a semantic cache with embeddings and cosine similarity.
Alright, let’s get to it.
Build an Inference Cache to Save Costs in High-Traffic LLM Apps
Introduction
Large language models (LLMs) are widely used in applications like chatbots, customer support, code assistants, and more. These applications often serve millions of queries per day. In high-traffic apps, it’s very common for many users to ask the same or similar questions. Now think about it: is it really smart to call the LLM every single time when these models aren’t free and add latency to responses? Logically, no.
Take a customer service bot as an example. Thousands of users might ask questions every day, and many of those questions are repeated:
- “What’s your refund policy?”
- “How do I reset my password?”
- “What’s the delivery time?”
If every single query is sent to the LLM, you’re just burning through your API budget unnecessarily. Each repeated request costs the same, even though the model has already generated that answer before. That’s where inference caching comes in. You can think of it as memory where you store the most common questions and reuse the results. In this article, I’ll walk you through a high-level overview with code. We’ll start with a single LLM call, simulate what high-traffic apps look like, build a simple cache, and then take a look at a more advanced version you’d want in production. Let’s get started.
Setup
Install dependencies. I am using Google Colab for this demo. We’ll use the OpenAI Python client:
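The only external packages we need are the OpenAI client and NumPy (used later for the embedding math). Both usually come preinstalled on Colab, but installing them explicitly is harmless:

pip install openai numpy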
Set your OpenAI API key:
import os
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = "sk-your_api_key_here"
client = OpenAI()
Step 1: A Simple LLM Call
This function sends a prompt to the model and prints how long it takes:
import time

def ask_llm(prompt):
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    end = time.time()
    print(f"Time: {end - start:.2f}s")
    return response.choices[0].message.content
print(ask_llm("What is your refund policy?"))
Output:
Time: 2.81s
As an AI language model, I don't have a refund policy since I don't...
This works fine for one call. But what if the same question is asked over and over?
Step 2: Simulating Repeated Questions
Let’s create a small list of user queries. Some are repeated, some are new:
queries = [
    "What is your refund policy?",
    "How do I reset my password?",
    "What is your refund policy?",   # repeated
    "What's the delivery time?",
    "How do I reset my password?",   # repeated
]
Let’s see what happens if we call the LLM for each:
start = time.time()
for q in queries:
    print(f"Q: {q}")
    ans = ask_llm(q)
    print("A:", ans)
    print("-" * 50)
end = time.time()

print(f"Total Time (no cache): {end - start:.2f}s")
Output:
Q: What is your refund policy?
Time: 2.02s
A: I don't handle transactions or have a refund policy...
--------------------------------------------------
Q: How do I reset my password?
Time: 10.22s
A: To reset your password, you typically need to follow...
--------------------------------------------------
Q: What is your refund policy?
Time: 4.66s
A: I don't handle transactions or refunds directly...
--------------------------------------------------
Q: What's the delivery time?
Time: 5.40s
A: The delivery time can vary significantly based on several factors...
--------------------------------------------------
Q: How do I reset my password?
Time: 6.34s
A: To reset your password, the process typically varies...
--------------------------------------------------
Total Time (no cache): 28.64s
Every query triggers a fresh LLM call. Even though two of the queries are exact repeats, we pay for them (and wait for them) again. With thousands of users, these costs can skyrocket.
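To get a feel for the scale, here is a quick back-of-envelope calculation. The traffic volume, duplicate rate, and per-call price below are made-up numbers for illustration, not real measurements or OpenAI pricing:

# Hypothetical figures for illustration only -- plug in your own traffic and pricing
queries_per_day = 1_000_000   # total queries served per day
duplicate_rate = 0.40         # assumed fraction of queries that are exact repeats
cost_per_call = 0.001         # assumed average cost of one LLM call, in dollars

wasted_calls = queries_per_day * duplicate_rate
print(f"Redundant calls per day: {wasted_calls:,.0f}")
print(f"Spent on repeats per day: ${wasted_calls * cost_per_call:,.2f}")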
Step 3: Adding an Inference Cache (Exact Match)
As a naive first solution, we can fix this with a cache that is just a dictionary keyed on the exact prompt text:
cache = {}

def ask_llm_cached(prompt):
    if prompt in cache:
        print("(from cache, ~0.00s)")
        return cache[prompt]

    ans = ask_llm(prompt)
    cache[prompt] = ans
    return ans

start = time.time()
for q in queries:
    print(f"Q: {q}")
    print("A:", ask_llm_cached(q))
    print("-" * 50)
end = time.time()

print(f"Total Time (exact cache): {end - start:.2f}s")
Output:
Q: What is your refund policy?
Time: 2.35s
A: I don't have a refund policy since...
--------------------------------------------------
Q: How do I reset my password?
Time: 6.42s
A: Resetting your password typically depends on...
--------------------------------------------------
Q: What is your refund policy?
(from cache, ~0.00s)
A: I don't have a refund policy since...
--------------------------------------------------
Q: What's the delivery time?
Time: 3.22s
A: Delivery times can vary depending on several factors...
--------------------------------------------------
Q: How do I reset my password?
(from cache, ~0.00s)
A: Resetting your password typically depends...
--------------------------------------------------
Total Time (exact cache): 12.00s
Now:
- The first time “What is your refund policy?” is asked, it calls the LLM.
- The second time, it instantly retrieves from cache.
This saves cost and reduces latency dramatically.
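The introduction promised we would measure the impact, and beyond wall-clock time the number you usually care about is the cache hit rate. Here is a small sketch that layers hit/miss counters on top of the `cache` dictionary above; the wrapper function and counter names are my own additions for illustration, not part of the original code:

cache_stats = {"hits": 0, "misses": 0}

def ask_llm_cached_tracked(prompt):
    # Same exact-match lookup as before, but with hit/miss bookkeeping
    if prompt in cache:
        cache_stats["hits"] += 1
        return cache[prompt]
    cache_stats["misses"] += 1
    ans = ask_llm(prompt)
    cache[prompt] = ans
    return ans

for q in queries:
    ask_llm_cached_tracked(q)

total = cache_stats["hits"] + cache_stats["misses"]
print(f"Hit rate: {cache_stats['hits'] / total:.0%}")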
Step 4: The Problem with Exact Matching
Exact matching works only when the query text is identical. Let’s see an example:
q1 = "What is your refund policy?"
q2 = "Can you explain the refund policy?"

print("First:", ask_llm_cached(q1))
print("Second:", ask_llm_cached(q2))  # Not cached, even though it means the same!
Output:
(from cache, ~0.00s)
First: I don't have a refund policy since...

Time: 7.93s
Second: Refund policies can vary widely depending on the company...
Both queries ask about refunds, but because the text is slightly different, the cache misses and we still pay for a full LLM call. This is a big problem in the real world, because users phrase the same question in many different ways.
Step 5: Semantic Caching with Embeddings
To fix this, we can use semantic caching. Instead of checking whether the text is identical, we check whether queries are similar in meaning. To do that, we embed each query as a vector and compare vectors with cosine similarity (the dot product of the two embeddings divided by the product of their norms); if the similarity exceeds a threshold, we reuse the cached answer:
import numpy as np

semantic_cache = {}

def embed(text):
    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(emb.data[0].embedding)

def ask_llm_semantic(prompt, threshold=0.85):
    prompt_emb = embed(prompt)

    for cached_q, (cached_emb, cached_ans) in semantic_cache.items():
        sim = np.dot(prompt_emb, cached_emb) / (
            np.linalg.norm(prompt_emb) * np.linalg.norm(cached_emb)
        )
        if sim > threshold:
            print(f"(from semantic cache, matched with '{cached_q}', ~0.00s)")
            return cached_ans

    start = time.time()
    ans = ask_llm(prompt)
    end = time.time()
    semantic_cache[prompt] = (prompt_emb, ans)
    print(f"Time (new LLM call): {end - start:.2f}s")
    return ans

print("First:", ask_llm_semantic("What is your refund policy?"))
print("Second:", ask_llm_semantic("Can you explain the refund policy?"))  # Should hit semantic cache
Output:
Time: 4.54s
Time (new LLM call): 4.54s
First: As an AI, I don't have a refund policy since I don't sell...

(from semantic cache, matched with 'What is your refund policy?', ~0.00s)
Second: As an AI, I don't have a refund policy since I don't sell...
Even though the second query is worded differently, the semantic cache recognizes its similarity and reuses the answer.
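One practical note: the 0.85 threshold is a knob you will likely want to tune for your own traffic. A quick way to sanity-check it is to print the raw cosine similarities for a few phrasings using the `embed` function above; the small `cosine` helper below is my own addition for illustration:

def cosine(a, b):
    # Plain cosine similarity between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

e1 = embed("What is your refund policy?")
e2 = embed("Can you explain the refund policy?")
e3 = embed("What's the delivery time?")

print(f"Refund vs. reworded refund: {cosine(e1, e2):.2f}")  # expected to be relatively high
print(f"Refund vs. delivery time:   {cosine(e1, e3):.2f}")  # expected to be noticeably lower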
Conclusion
If you’re building customer support bots, AI agents, or any high-traffic LLM app, caching should be one of the first optimizations you put in place.
- Exact cache saves cost for identical queries.
- Semantic cache saves cost for meaningfully similar queries.
- Together, they can massively reduce API calls in high-traffic apps.
In real-world production apps, you’d store embeddings in a vector database like FAISS, Pinecone, or Weaviate for fast similarity search. But even this small demo shows how much cost and time you can save.
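As a pointer in that direction, here is a minimal sketch of what the cache lookup could look like with FAISS (assuming `pip install faiss-cpu` and the `embed` function from earlier). It is illustrative only, not a drop-in replacement for the loop above:

import faiss
import numpy as np

dim = 1536                       # dimensionality of text-embedding-3-small vectors
index = faiss.IndexFlatIP(dim)   # inner product on normalized vectors == cosine similarity
answers = []                     # answers[i] belongs to the i-th vector added to the index

def add_to_cache(prompt, answer):
    vec = embed(prompt).astype("float32")
    vec /= np.linalg.norm(vec)       # normalize so inner product equals cosine similarity
    index.add(vec.reshape(1, -1))
    answers.append(answer)

def lookup(prompt, threshold=0.85):
    if index.ntotal == 0:
        return None
    vec = embed(prompt).astype("float32")
    vec /= np.linalg.norm(vec)
    scores, ids = index.search(vec.reshape(1, -1), 1)  # nearest cached query
    if scores[0][0] > threshold:
        return answers[ids[0][0]]
    return None  # cache miss: call the LLM, then add_to_cache()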