The LLM-ERS Pattern: Where LLMs Actually Belong in RecSys

Over the past year, I’ve had the opportunity to review over 50 different architectural proposals for Generative Recommendation Systems (RecSys). Despite the variance in domains—from e-commerce to streaming video—one extremely expensive mistake keeps appearing on whiteboards everywhere: engineering teams are trying to use massive 70-billion-parameter Large Language Models (LLMs) to perform real-time, online ranking.

It fails in production. Almost every single time.

In this post, I want to break down exactly why this approach is doomed from an infrastructure perspective, and introduce the scalable architectural pattern that actually works: LLM-ERS (LLM-Enhanced Recommender Systems).

The Physics of Latency

Let’s look at the reality of production ML. At scale-obsessed companies like Amazon and Twitter, engineering teams operate under ruthless latency budgets. A user expects their feed to populate in roughly 100 to 200 milliseconds.

If you attempt to pass a user’s entire interaction history and a candidate pool of hundreds of items into the context window of an LLM to autoregressively generate a ranked list, you are going to hit a wall. Generating output token-by-token through a massive Transformer network takes time: tens of milliseconds per output token for a 70B-class model, plus hundreds of milliseconds just to prefill a long context. You simply cannot run autoregressive inference over a massive catalog and stay within a 100ms budget.
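
To make "hitting a wall" concrete, here is a back-of-the-envelope calculation. The per-token latency, candidate count, and tokens-per-item below are illustrative assumptions, not benchmarks of any particular model or GPU:

    # Back-of-the-envelope: what autoregressive ranking would actually cost.
    # All numbers are assumptions for illustration, not measured benchmarks.
    ms_per_token = 30        # assumed decode latency for a 70B-class model
    candidates = 500         # candidate pool stuffed into the prompt
    tokens_per_item = 4      # assumed output tokens per ranked item (ID + score)

    total_ms = ms_per_token * candidates * tokens_per_item
    print(f"One autoregressive ranking pass: ~{total_ms / 1000:.0f} s")  # ~60 s

    latency_budget_ms = 100
    print(f"Over the serving budget by ~{total_ms / latency_budget_ms:.0f}x")  # ~600x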

That is not a modeling problem; that is physics. Autoregressive decoding has to stream the model’s weights out of memory for every single token it generates, so the bottleneck is the memory bandwidth of your GPUs.

The real issue isn’t that LLMs are weak or incapable of ranking. It’s that they are being deployed in the wrong layer of the stack.

The LLM-ERS Architecture: A Scalable Blueprint

The field is rapidly converging on a pattern that balances the semantic power of LLMs with the raw speed of traditional machine learning. We call this the LLM-Enhanced Recommender System (LLM-ERS) pattern. It divides the pipeline into three distinct layers.

Layer 1: Retrieval + Ranking (Keep it Boring, Keep it Fast)

The critical path—the actual real-time serving layer—should remain mathematically simple and blazing fast. Do not put an LLM here.

  • Rely on proven architectures like Two-Tower embeddings for dense retrieval.
  • Use heavily optimized Vector Databases (like Milvus or Pinecone) for Approximate Nearest Neighbor (ANN) search.
  • This layer’s job is to ruthlessly filter millions of candidates down to the top 1,000 items in sub-10ms. Boring architecture scales. (A minimal sketch of this layer follows below.)
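
For concreteness, here is a minimal sketch of this layer: a two-tower model producing user and item embeddings, with FAISS standing in for the ANN index (in production you would use an HNSW/IVF index or a hosted vector database such as Milvus or Pinecone). The layer sizes, feature dimensions, and catalog size are illustrative assumptions:

    # Minimal two-tower retrieval sketch (PyTorch + FAISS); sizes are illustrative.
    import torch
    import torch.nn as nn
    import faiss  # stand-in for a vector database like Milvus or Pinecone

    EMB_DIM = 64

    class Tower(nn.Module):
        def __init__(self, in_dim: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, EMB_DIM)
            )

        def forward(self, x):
            # L2-normalize so inner product equals cosine similarity
            return nn.functional.normalize(self.net(x), dim=-1)

    user_tower, item_tower = Tower(in_dim=32), Tower(in_dim=48)

    # Offline: embed the catalog once and build the index.
    item_features = torch.randn(100_000, 48)  # placeholder catalog features
    with torch.no_grad():
        item_vecs = item_tower(item_features).numpy().astype("float32")
    index = faiss.IndexFlatIP(EMB_DIM)  # exact search for simplicity; use ANN at scale
    index.add(item_vecs)

    # Online: one cheap forward pass for the user, then a fast index lookup.
    with torch.no_grad():
        user_vec = user_tower(torch.randn(1, 32)).numpy().astype("float32")
    scores, candidate_ids = index.search(user_vec, 1000)  # top-1,000 candidates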

Layer 2: Offline Data Augmentation (The Highest ROI)

This is where LLMs truly shine. Move the heavy, expensive LLM off the critical real-time path entirely. Instead, use it asynchronously in your data pipelines to dramatically improve the quality of the training data that feeds your Layer 1 models.

  • Feature Extraction: Use LLMs to extract rich semantic tags, sentiment, and latent intent from raw item descriptions and user reviews, saving these as dense features in your feature store.
  • Synthetic Data Generation: Use an LLM to generate “hard negatives” for triplet loss training. Prompt the LLM to create plausible-but-wrong item combinations that force your traditional ranker to learn tighter, more accurate decision boundaries. This approach gives you the semantic reasoning power of an LLM with zero added latency cost at inference time. (A sketch of these offline jobs follows below.)
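
As a rough sketch of what these offline jobs can look like, the snippet below tags items with LLM-derived features and builds a hard-negative prompt. The prompts, the hypothetical llm_complete() callable, and the feature names are illustrative assumptions, not any specific vendor’s API:

    # Offline batch augmentation: enrich catalog text with LLM-derived features
    # and generate hard-negative candidates for triplet-loss training.
    # `llm_complete` is a hypothetical wrapper around whatever LLM runs in your
    # data pipeline; nothing here touches the real-time serving path.
    import json

    TAGGING_PROMPT = (
        "Extract up to 5 semantic tags, an overall sentiment "
        "(positive/neutral/negative), and the likely buyer intent from this "
        "product description. Respond as JSON with keys: tags, sentiment, intent.\n"
        "Description: {description}"
    )

    def enrich_item(item: dict, llm_complete) -> dict:
        """Extract dense semantic features for one catalog item."""
        raw = llm_complete(TAGGING_PROMPT.format(description=item["description"]))
        features = json.loads(raw)  # validate / repair malformed JSON in practice
        return {
            "item_id": item["item_id"],
            "llm_tags": features["tags"],
            "llm_sentiment": features["sentiment"],
            "llm_intent": features["intent"],
        }

    def hard_negative_prompt(positive_item: dict) -> str:
        """Ask the LLM for plausible-but-wrong alternatives to use as hard negatives."""
        return (
            "List 3 products a shopper might plausibly confuse with, but would not "
            f"actually want instead of: {positive_item['title']}. "
            "Respond as a JSON list of product titles."
        )

    # The enriched rows and generated negatives are written to the feature store
    # and training corpus asynchronously, so the LLM adds zero serving latency.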

Layer 3: Post-Ranking Personalization (Controlled LLM Use)

If you absolutely must use an LLM in the real-time path, you must tightly control its blast radius.

  • Wait until Layer 1 has narrowed the catalog down to the absolute best 3 to 5 items.
  • Pass only those top items and the user’s context to the LLM.
  • Ask the LLM to perform a lightweight, highly specific task: e.g., generating a one-sentence, personalized explanation (“Why you might like this: Because you recently bought a tent, this sleeping bag is a perfect lightweight addition”). By restricting the LLM to post-ranking presentation tweaks on a tiny context window, you get the UX benefits of GenAI without breaking your P99 latency SLA. (A sketch of such a guarded call follows below.)
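
A minimal sketch of that guarded call is below: the ranking is already final, the LLM only decorates the top few items, and a hard timeout with a fallback ensures it can never block the response. The llm_complete() callable, the 80 ms timeout, and the field names are illustrative assumptions:

    # Post-ranking personalization with a tightly bounded blast radius.
    import concurrent.futures as cf

    EXPLAIN_PROMPT = (
        "In one sentence, explain to the user why '{title}' fits their recent "
        "activity: {recent_activity}. Be specific and friendly."
    )

    def explain_top_items(top_items, user_context, llm_complete,
                          timeout_s=0.08, max_items=3):
        """Annotate at most `max_items` already-ranked items; never block the response."""
        def _explain(item):
            return llm_complete(EXPLAIN_PROMPT.format(
                title=item["title"],
                recent_activity=user_context["recent_activity"]))

        pool = cf.ThreadPoolExecutor(max_workers=max_items)
        futures = {pool.submit(_explain, item): item for item in top_items[:max_items]}
        for future, item in futures.items():
            try:
                # In practice, share one deadline across all calls rather than
                # giving each future its own timeout.
                item["explanation"] = future.result(timeout=timeout_s)
            except Exception:  # timeout or LLM error: degrade gracefully
                item["explanation"] = None  # fall back to the plain card; ranking unchanged
        pool.shutdown(wait=False)  # never let a slow LLM call hold up the response
        return top_items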

Conclusion

The golden rule that separates robust production systems from expensive science projects is simple: Traditional ML decides what to show; LLMs enhance how your content is understood and how that traditional ML learns.

Stop trying to force LLMs to do the heavy lifting of real-time ranking. Embrace the LLM-ERS pattern: build boring, scalable architectures for the real-time path, and reserve your LLM compute for offline augmentation and the final, personalized mile.

Written on May 15, 2025