From Local RAG to Production-Grade: My Next Step in Building AI Systems

TL;DR: I evolved a local Go/Genkit RAG prototype into a cloud-native, multi-model GCP platform with abstraction layers, evaluation and regression safety, grounded responses, and production-aware observability and cost control.

Six months ago, I wrote about building a local RAG chatbot using Genkit, Go, and Ollama.

That project was intentionally simple:

Local LLM
Local embeddings
Local vector store
Developer UI
Fully self-contained

It was a great way to understand retrieval-augmented generation deeply — without cloud complexity.

But after that project, one question stayed with me:

What changes when you move from a local RAG experiment to a production-shaped AI system?

Over the past few weeks, I built a second project to answer that question — a cloud-native, multi-model RAG platform designed with production realities in mind. This wasn’t about building a better chatbot. It was about engineering discipline.

What Changes Between “Local” and “Production”?

In my earlier local setup:

Models ran via Ollama
Embeddings were local
Vector storage was filesystem-based
Debugging happened in Dev UI
Everything was under my control

In production, things are different.

I now need to think about:

Model abstraction
Provider switching
Regression safety
Observability
Cloud networking
Secrets management
Cost control
Deployment repeatability

That shift in mindset changes the architecture significantly.

The Production-Shaped Architecture

The new system runs on Google Cloud Platform:

FastAPI backend
Postgres + pgvector (Cloud SQL)
Cloud Run deployment
React frontend (Firebase Hosting)
LLM provider abstraction layer
Evaluation harness with regression gating
Structured logging and tracing

Rather than using a local vector store, embeddings are persisted in Postgres. Instead of relying on a single model, providers are abstracted. Rather than manually inspecting outputs, results are evaluated and scored.

The embedding dimension is currently set to 8 to keep the demo lightweight and cost-efficient. In future iterations, I plan to experiment with higher-dimensional embeddings to evaluate the retrieval-quality trade-offs.
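To make the storage change concrete, here is a minimal sketch of what top-k retrieval over 8-dimensional embeddings amounts to. It is pure Python so it runs anywhere; in a Postgres-backed setup the ranking would happen inside the database. The SQL in the comment uses pgvector's `<=>` cosine-distance operator, but the table and column names (`chunks`, `content`, `embedding`), the helper names, and the toy vectors are all assumptions for illustration, not the project's actual schema or data.

```python
import math

EMBEDDING_DIM = 8  # matches the demo's lightweight setting

# Rough pgvector equivalent (table and column names are hypothetical):
#   SELECT content FROM chunks ORDER BY embedding <=> %(query)s LIMIT %(k)s;


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), the metric behind pgvector's <=>."""
    if len(a) != EMBEDDING_DIM or len(b) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM}-dimensional vectors")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / norm


def top_k(query, chunks, k=2):
    """Return the texts of the k chunks closest to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(query, c["embedding"]))
    return [c["content"] for c in ranked[:k]]


# Toy data: hand-written vectors, not real model embeddings.
CHUNKS = [
    {"content": "Paris is the capital of France.", "embedding": [1, 0, 0, 0, 0, 0, 0, 0]},
    {"content": "Tokyo is the capital of Japan.", "embedding": [0, 1, 0, 0, 0, 0, 0, 0]},
    {"content": "Ottawa is the capital of Canada.", "embedding": [0, 0, 1, 0, 0, 0, 0, 0]},
]
```

Note that `top_k` always returns the nearest chunks, however far away they are. That is why an out-of-scope question still gets context back, and why the grounding instruction in the prompt matters.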

Multi-Model Abstraction

In the local project, the model was fixed. In this version, I introduced a provider abstraction layer so that:

Generation calls are routed dynamically
Embeddings are provider-agnostic
Providers can be swapped without rewriting orchestration logic

Currently:

Gemini is validated in the cloud (generation + embeddings)
An OpenAI adapter is implemented
Grok provider support is implemented (runtime validation depends on API key/quota)
Additional providers are planned

Designing this separation early makes future experimentation safer.
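A provider abstraction like this can be sketched in a few lines. The interface, class names, and the `EchoProvider` stand-in below are hypothetical and only illustrate the shape of the idea; a real adapter would wrap the vendor SDK behind the same two methods.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """What the orchestration layer needs from any provider (hypothetical interface)."""

    name: str

    def generate(self, prompt: str) -> str: ...

    def embed(self, text: str) -> list[float]: ...


class EchoProvider:
    """Stand-in adapter for illustration; it makes no network calls."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"

    def embed(self, text: str) -> list[float]:
        # Fake 8-dimensional vector: NOT semantically meaningful.
        return [float(len(text) % (i + 2)) for i in range(8)]


class ProviderRegistry:
    """Maps provider names to adapter instances."""

    def __init__(self) -> None:
        self._providers: dict[str, LLMProvider] = {}

    def register(self, provider: LLMProvider) -> None:
        self._providers[provider.name] = provider

    def get(self, name: str) -> LLMProvider:
        try:
            return self._providers[name]
        except KeyError:
            raise ValueError(f"unknown provider: {name!r}") from None


def answer(registry: ProviderRegistry, provider_name: str, question: str) -> str:
    """Orchestration depends only on the interface, never on a concrete SDK."""
    return registry.get(provider_name).generate(question)
```

With this shape, switching providers is a change to the `provider_name` argument (or a config value that feeds it), and the orchestration function stays untouched.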

Evaluation & Regression Gating

One thing I realised while working locally: changing prompts or embeddings can subtly degrade output quality. In production, that’s risky.

So this system includes:

JSONL-based evaluation datasets
Heuristic scoring (correctness, groundedness, hallucination)
Baseline comparison
Regression gating
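As a rough sketch of how these four pieces could fit together: the field names (`question`, `expected`), the scoring heuristics, and the 0.05 tolerance below are assumptions for illustration, not the project's actual scorers. A hallucination score is left out to keep the example short.

```python
import json


def load_cases(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def _words(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    stripped = (w.strip(".,!?") for w in text.lower().split())
    return [w for w in stripped if w]


def score_case(case, answer, context):
    """Crude heuristics: substring match for correctness, word overlap for groundedness."""
    correctness = 1.0 if case["expected"].lower() in answer.lower() else 0.0
    answer_words = _words(answer)
    context_words = set(_words(context))
    grounded = (
        sum(w in context_words for w in answer_words) / len(answer_words)
        if answer_words
        else 0.0
    )
    return {"correctness": correctness, "groundedness": grounded}


def aggregate(scores):
    """Average each metric across all evaluated cases."""
    return {k: sum(s[k] for s in scores) / len(scores) for k in scores[0]}


def regression_gate(current, baseline, tolerance=0.05):
    """Return the metrics that dropped more than `tolerance` below the baseline."""
    return [m for m, base in baseline.items() if current.get(m, 0.0) < base - tolerance]
```

In a CI job, a non-empty list from `regression_gate` would be the signal to fail the build, so a prompt or embedding change cannot silently lower quality.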

This turns RAG experimentation into something measurable.

In this demo, an out-of-scope question such as “What is 2+2?” asked against a capitals-only document still retrieves the nearest available chunks, but the model is instructed to answer only from the provided context. As a result, it abstains and responds that the provided context does not contain the required information. This is exactly the grounded behaviour we want from a RAG system, even when retrieval is imperfect.
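The "answer only from the provided context" instruction can be sketched as a prompt builder plus a check for the abstention reply. The prompt wording, the abstention message, and the function names here are hypothetical; the real prompt may differ.

```python
# Hypothetical abstention text; the real system's wording may differ.
ABSTAIN_MESSAGE = "The provided context does not contain the required information."


def build_grounded_prompt(question, chunks):
    """Assemble a prompt that restricts the model to the retrieved chunks."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using ONLY the context below.\n"
        f'If the context does not contain the answer, reply exactly: "{ABSTAIN_MESSAGE}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def is_abstention(answer):
    """Check whether a model reply is the agreed abstention message."""
    return answer.strip().strip('"') == ABSTAIN_MESSAGE
```

Fixing the abstention text up front also helps evaluation: an out-of-scope test case can assert `is_abstention(reply)` instead of trying to interpret free-form refusals.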

[Evaluation screenshot]

Cloud Realities

Deploying to Cloud Run + Cloud SQL introduced practical constraints:

Secret management (via Secret Manager)
Database connectivity configuration
Cold start behaviour
Logging for remote debugging
Billing-aware infrastructure decisions

For example, since this runs on a personal GCP account, Cloud SQL is paused outside testing windows, and I designed the system to fail predictably when that happens.

The frontend is hosted on Firebase Hosting, which provides a straightforward and reliable deployment workflow, especially when combined with Cloud Run backends.
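One way to get predictable failure when Cloud SQL is paused is to translate low-level connection errors into a single known error, then map that to a clear 503 response. The sketch below is framework-free so it runs on its own; the names, the response body, and the use of `OSError` are assumptions. A real database driver raises its own error type (psycopg's `OperationalError`, for example), which you would catch instead.

```python
class DatabaseUnavailable(Exception):
    """Raised when the database cannot be reached (for example, the instance is stopped)."""


def run_query(connect, sql):
    """Run a query, translating connection failures into one known error."""
    try:
        conn = connect()
    except OSError as exc:  # covers ConnectionError and TimeoutError; assumed for this sketch
        raise DatabaseUnavailable("database is not reachable") from exc
    return conn.execute(sql)


def handle_request(connect, sql):
    """Return (status_code, body): a paused database yields a clear 503, not a stack trace."""
    try:
        return 200, {"result": run_query(connect, sql)}
    except DatabaseUnavailable:
        return 503, {
            "error": "database_unavailable",
            "detail": "The demo database is paused outside testing windows.",
        }
```

In a FastAPI app, the same idea would typically live in an exception handler, so every route gets the same 503 behaviour without repeating the try/except.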

Observability

In the local Genkit setup, debugging was visual and immediate.

In the cloud, you need:

Structured logs
Request-level tracing
Retrieval inspection
Eval dashboards

Without visibility, a RAG pipeline becomes a black box. Incorporating observability early on made troubleshooting significantly easier.
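Structured logs with a request ID and retrieval details can be produced with the standard library alone. This is a minimal sketch: the logger name and the field names (`request_id`, `provider`, `retrieved_chunk_ids`, `latency_ms`) are my assumptions, not the project's actual log schema. Cloud Logging treats JSON lines written to stdout or stderr on Cloud Run as structured entries and recognises the `severity` and `message` fields.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    FIELDS = ("request_id", "provider", "retrieved_chunk_ids", "latency_ms")

    def format(self, record):
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def make_logger(stream=None):
    """Build a JSON logger; `stream=None` means stderr (StreamHandler's default)."""
    logger = logging.getLogger("rag")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_retrieval(logger, provider, chunk_ids, latency_ms, request_id=None):
    """Log which chunks were retrieved, tagged with a request ID for tracing."""
    request_id = request_id or uuid.uuid4().hex
    logger.info(
        "retrieval complete",
        extra={
            "request_id": request_id,
            "provider": provider,
            "retrieved_chunk_ids": chunk_ids,
            "latency_ms": latency_ms,
        },
    )
    return request_id
```

Logging the retrieved chunk IDs alongside the request ID is what makes retrieval inspection possible after the fact: a bad answer can be traced back to the exact context the model saw.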

Takeaways from this project

Moving from local RAG → production-shaped RAG reinforced a few lessons:

Model abstraction is easier to design upfront than retrofit later.
Evaluation must be a first-class concern.
Retrieval transparency is essential for debugging.
Cloud environments expose architectural weaknesses quickly.
Production constraints influence design decisions more than model quality does.

The biggest shift wasn’t technical — it was mental.

Next Iterations

There are several areas I plan to expand:

Adding additional providers for broader benchmarking
Introducing rate limiting and gateway restrictions for public demo protection
Implementing billing safeguards to prevent uncontrolled usage
Moving from manual, script-based deployments to full CD (CI is already in place)

#AIEngineering #RAG #LLM #CloudArchitecture #SoftwareEngineering #GCP