Jeevachaithanyan Sivanandan

Building a Production-Shaped Multi-Model RAG Platform


From Local RAG to Production-Grade: My Next Step in Building AI Systems

TL;DR: I evolved a local Go/Genkit RAG prototype into a cloud-native, multi-model GCP platform with abstraction layers, evaluation and regression safety, grounded responses, and production-aware observability and cost control.

Six months ago, I wrote about building a local RAG chatbot using Genkit, Go, and Ollama.

That project was intentionally simple:

  • Local LLM
  • Local embeddings
  • Local vector store
  • Developer UI
  • Fully self-contained

It was a great way to understand retrieval-augmented generation deeply — without cloud complexity.

But after that project, one question stayed with me:

What changes when you move from a local RAG experiment to a production-shaped AI system?

Over the past few weeks, I built a second project to answer that question — a cloud-native, multi-model RAG platform designed with production realities in mind. This wasn’t about building a better chatbot. It was about engineering discipline.


What Changes Between “Local” and “Production”?

In my earlier local setup:

  • Models ran via Ollama
  • Embeddings were local
  • Vector storage was filesystem-based
  • Debugging happened in Dev UI
  • Everything was under my control

In production, things are different.

I now need to think about:

  • Model abstraction
  • Provider switching
  • Regression safety
  • Observability
  • Cloud networking
  • Secrets management
  • Cost control
  • Deployment repeatability

That shift in mindset changes the architecture significantly.


The Production-Shaped Architecture

The new system is built on Google Cloud Platform:

  • FastAPI backend
  • Postgres + pgvector (Cloud SQL)
  • Cloud Run deployment
  • React frontend (Firebase Hosting)
  • LLM provider abstraction layer
  • Evaluation harness with regression gating
  • Structured logging and tracing

Embeddings are persisted in Postgres rather than in a local vector store. Providers are abstracted rather than tied to a single model. Outputs are evaluated and scored rather than inspected by hand.

For demonstration purposes, the embedding dimension is currently set to 8, which keeps the demo lightweight and cheap to run. In future iterations, I plan to experiment with higher-dimensional embeddings to measure the effect on retrieval quality.
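To make the retrieval step concrete, here is a minimal pure-Python sketch of nearest-chunk lookup by cosine distance over 8-dimensional vectors. It is not the project's code: the function names and sample data are assumptions for illustration. In Postgres with pgvector, the same ordering is typically expressed with the cosine distance operator `<=>` in an `ORDER BY ... LIMIT k` query.

```python
import math

EMBEDDING_DIM = 8  # demo dimension stated in the article


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != EMBEDDING_DIM or len(b) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM}-dimensional vectors")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def top_k(query, chunks, k=2):
    """Return the texts of the k chunks closest to the query.

    `chunks` is a list of (text, embedding) pairs (hypothetical shape).
    """
    ranked = sorted(chunks, key=lambda chunk: cosine_distance(query, chunk[1]))
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    query = [1, 0, 0, 0, 0, 0, 0, 0]
    chunks = [
        ("chunk a", [1, 0, 0, 0, 0, 0, 0, 0]),
        ("chunk b", [0, 1, 0, 0, 0, 0, 0, 0]),
        ("chunk c", [0.9, 0.1, 0, 0, 0, 0, 0, 0]),
    ]
    print(top_k(query, chunks, k=2))
```

With only 8 dimensions, unrelated chunks can end up close together, which is one reason to compare against higher-dimensional embeddings later.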


Multi-Model Abstraction

In the local project, the model was fixed. In this version, I introduced a provider abstraction layer so that:

  • Generation calls are routed dynamically
  • Embeddings are provider-agnostic
  • Providers can be swapped without rewriting orchestration logic

Currently:

  • Gemini is validated in the cloud (generation + embeddings)
  • An OpenAI adapter is implemented
  • Grok provider support is implemented (runtime validation depends on API key and quota)
  • Additional providers are planned

Designing this separation early makes future experimentation safer.
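A provider abstraction of this kind can be sketched as a small interface plus a registry. The names below (`LLMProvider`, `EchoProvider`, `ProviderRegistry`, `answer`) are hypothetical and not taken from the project. The stand-in provider simply echoes its input, where a real adapter would call the vendor's SDK.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Interface every provider adapter is assumed to satisfy."""

    name: str

    def generate(self, prompt: str) -> str: ...

    def embed(self, text: str) -> list[float]: ...


class EchoProvider:
    """Stand-in adapter for the sketch; it makes no network calls."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"

    def embed(self, text: str) -> list[float]:
        # Fake 8-dimensional embedding, only to keep the sketch runnable.
        return [float(len(text))] * 8


class ProviderRegistry:
    """Maps provider names to adapters so callers never import a vendor SDK."""

    def __init__(self) -> None:
        self._providers: dict[str, LLMProvider] = {}

    def register(self, provider: LLMProvider) -> None:
        self._providers[provider.name] = provider

    def get(self, name: str) -> LLMProvider:
        try:
            return self._providers[name]
        except KeyError:
            raise ValueError(f"unknown provider: {name}") from None


def answer(registry: ProviderRegistry, provider_name: str, question: str) -> str:
    """Orchestration code depends only on the interface, not on a vendor."""
    return registry.get(provider_name).generate(question)


if __name__ == "__main__":
    registry = ProviderRegistry()
    registry.register(EchoProvider("gemini"))
    registry.register(EchoProvider("openai"))
    print(answer(registry, "gemini", "What is the capital of France?"))
```

Because `answer` only ever sees the interface, switching providers becomes a matter of changing the name passed in rather than editing orchestration logic.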


Evaluation & Regression Gating

One thing I realised while working locally: changing prompts or embeddings can subtly degrade output quality. In production, that’s risky.

So this system includes:

  • JSONL-based evaluation datasets
  • Heuristic scoring (correctness, groundedness, hallucination)
  • Baseline comparison
  • Regression gating
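The pieces listed above can be sketched end to end in a few lines. This is a simplified illustration, not the project's harness: the JSONL field names (`question`, `expected`), the substring-based correctness heuristic, and the tolerance value are all assumptions, and groundedness and hallucination scoring are not shown.

```python
import json

# Hypothetical dataset: one JSON object per line.
DATASET = """\
{"question": "What is the capital of France?", "expected": "Paris"}
{"question": "What is the capital of Japan?", "expected": "Tokyo"}
"""


def load_cases(jsonl_text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def score_case(case, answer):
    """Correctness heuristic: 1.0 if the expected string appears in the answer."""
    return 1.0 if case["expected"].lower() in answer.lower() else 0.0


def mean_score(cases, answer_fn):
    """Average the per-case scores for a given answering function."""
    scores = [score_case(case, answer_fn(case["question"])) for case in cases]
    return sum(scores) / len(scores)


def passes_gate(current, baseline, tolerance=0.05):
    """Fail the gate when the score drops more than `tolerance` below baseline."""
    return current >= baseline - tolerance


if __name__ == "__main__":
    cases = load_cases(DATASET)
    # Stand-in for the real pipeline: always answers "Paris".
    current = mean_score(cases, lambda question: "The capital is Paris.")
    print(current, passes_gate(current, baseline=1.0))
```

In a CI job, a failing gate would typically translate into a non-zero exit code so that a prompt or embedding change cannot land unnoticed.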

This turns RAG experimentation into something measurable.

In this demo, for out-of-scope questions such as “What is 2+2?” against a capitals-only document, the system still retrieves the nearest available chunks, but the model is instructed to answer only from the provided context. As a result, it abstains and responds that the provided context does not contain the required information. This is exactly the kind of grounded behaviour we want from a RAG system, even when retrieval is imperfect.
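The context-only instruction described above could be assembled along these lines. The article does not give the actual prompt wording, so the template, the abstention message, and the helper names are assumptions.

```python
# Assumed abstention wording; the project's real message may differ.
ABSTAIN_MESSAGE = "The provided context does not contain the required information."


def build_grounded_prompt(question, chunks):
    """Build a prompt that restricts the model to the retrieved chunks."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        f"If the context does not contain the answer, reply exactly: {ABSTAIN_MESSAGE}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def is_abstention(answer):
    """Detect the abstention reply so an evaluation can count it separately."""
    return answer.strip() == ABSTAIN_MESSAGE


if __name__ == "__main__":
    prompt = build_grounded_prompt(
        "What is 2+2?",
        ["Paris is the capital of France.", "Tokyo is the capital of Japan."],
    )
    print(prompt)
```

Asking for an exact abstention string makes the behaviour easy to check in an evaluation run, at the cost of relying on the model to follow the instruction literally.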

Evaluation Screenshot


Cloud Realities

Deploying to Cloud Run + Cloud SQL introduced practical constraints:

  • Secret management (via Secret Manager)
  • Database connectivity configuration
  • Cold start behaviour
  • Logging for remote debugging
  • Billing-aware infrastructure decisions

For example, since this runs on a personal GCP account, I designed the system to fail predictably when Cloud SQL is paused outside testing windows.

The frontend is hosted on Firebase Hosting. For this kind of production-style frontend deployment, Firebase provides a very straightforward and reliable workflow, especially when combined with Cloud Run backends.
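One way the "fail predictably" behaviour for a paused database might look is to translate low-level connection errors into a single, well-defined response. The sketch below is framework-free and hypothetical: `DatabaseUnavailable`, `run_query`, `handle_request`, and the injected `connect` callable are illustrative names, not the project's code.

```python
class DatabaseUnavailable(Exception):
    """Raised when the database cannot be reached (for example, when paused)."""


def run_query(connect, sql):
    """Open a connection via the injected `connect` callable and run `sql`."""
    try:
        execute = connect()
    except OSError as exc:
        # ConnectionRefusedError and TimeoutError are both OSError subclasses.
        raise DatabaseUnavailable("database is not reachable") from exc
    return execute(sql)


def handle_request(connect, sql):
    """Return (status_code, body) with a stable error shape for the frontend."""
    try:
        rows = run_query(connect, sql)
    except DatabaseUnavailable as exc:
        return 503, {"error": "database_unavailable", "detail": str(exc)}
    return 200, {"rows": rows}


if __name__ == "__main__":
    def paused_database():
        raise ConnectionRefusedError("connection refused")

    print(handle_request(paused_database, "SELECT 1"))
```

The frontend can then show a clear "demo is offline" message on a 503 instead of surfacing a timeout or stack trace.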


Observability

In the local Genkit setup, debugging was visual and immediate.

In the cloud, you need:

  • Structured logs
  • Request-level tracing
  • Retrieval inspection
  • Eval dashboards

Without visibility, a RAG pipeline becomes a black box. Incorporating observability early on made troubleshooting significantly easier.
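As an illustration of structured, request-level logging with retrieval details, here is a minimal sketch using only the Python standard library. The field names (`severity`, `request_id`, `retrieved_chunk_ids`) and helper names are assumptions rather than the project's actual schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            # Values passed through `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "retrieved_chunk_ids": getattr(record, "retrieved_chunk_ids", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("rag")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def log_retrieval(logger, request_id, chunk_ids):
    """Record which chunks were retrieved for a given request."""
    logger.info(
        "retrieval complete",
        extra={"request_id": request_id, "retrieved_chunk_ids": chunk_ids},
    )


if __name__ == "__main__":
    log_retrieval(build_logger(sys.stdout), "req-1", [3, 7])
```

Carrying the same `request_id` through retrieval, generation, and evaluation logs is what allows a single request to be traced across the pipeline.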


Takeaways From This Project

Moving from local RAG → production-shaped RAG reinforced a few lessons:

  • Model abstraction is easier to design upfront than retrofit later.
  • Evaluation must be a first-class concern.
  • Retrieval transparency is essential for debugging.
  • Cloud environments expose architectural weaknesses quickly.
  • Production constraints influence design decisions more than model quality does.

The biggest shift wasn’t technical — it was mental.

Next Iterations

There are several areas I plan to expand:

  • Adding more providers for broader benchmarking
  • Introducing rate limiting and gateway restrictions to protect the public demo
  • Implementing billing safeguards to prevent uncontrolled usage
  • Moving to full CD (CI is already in place; deployments are currently manual via repeatable scripts)


#AIEngineering #RAG #LLM #CloudArchitecture #SoftwareEngineering #GCP