Building a local RAG chatbot is a great way to learn about AI. RAG stands for Retrieval-Augmented Generation: a technique that improves the answers of Large Language Models (LLMs) by retrieving relevant external, up-to-date information and adding it to the prompt before the model generates a response. We’ll use Genkit from Google, the Go programming language, and Ollama to run our models locally. This practical, end-to-end guide walks through the process. First, we’ll look at what Genkit is and why it’s useful. Then, we’ll see why Go is a good fit for this kind of project. Finally, we’ll build a minimal RAG application that uses a local LLM served by Ollama, with a simple HTML chat interface that calls our Go endpoint. Let’s get to it!
What is Genkit?
Genkit is Google’s open-source framework for building full-stack, production-grade AI features. Think of it as a toolkit for AI-powered apps: it gives us a single, unified API to work with various model providers, so we’re not tied to just one.
Here are the core concepts we’ll use:
- Flows: These are like building blocks for our AI logic. They’re type-safe and help us organize our work, like retrieving context, calling a model, and post-processing the result. We can run these flows in the Developer UI or deploy them as web endpoints later.
- RAG (Retrieval-Augmented Generation): Genkit provides all the helpers and plugins we need to bring our own data to an LLM. It makes it easy to set up embedders, indexers, and retrievers so our bot can find and use relevant information.
- Dev UI + CLI: This is where the magic happens for fast development. We can start a local web app that connects to our running process. From there, we can inspect our models, flows, and traces to quickly debug and iterate.
Genkit is great for several key use cases:
- Chatbots and assistants. We can build smart chatbots that are grounded in our own documents, knowledge bases, or even data from our SaaS applications using RAG. This means the bot will give us answers based on our own information, not just what it was trained on.
- Automations and agents. We can create intelligent agents that call external tools or APIs. This allows us to automate complex tasks, like an agent that books a flight or sends an email. We can also use it for multi-agent patterns, where different agents work together to solve a problem.
- Structured and multimodal generation. We get a consistent API for generating content with different providers. This is especially useful when we want to handle more than just text, like images or audio. Genkit also helps us manage our prompts efficiently with Dotprompt.
- Rapid development to production. We can iterate incredibly fast with the local Dev UI, testing and refining our flows on the fly. Once a flow is ready, it’s designed to be deployed as a web endpoint, so we have a clear path from a local prototype to a production-ready application.
Why Go is an excellent choice for our Genkit applications:
- Reliability: Genkit’s Go SDK gives us type-safe flows and JSON schema validation on top of Go’s static typing. This means we get strong contracts for our inputs and outputs, which makes our code more reliable and easier to maintain.
- Performance: Go is all about performance and simplicity. It compiles into small binaries with low overhead, making it well suited to fast, straightforward HTTP servers. This is ideal whether we’re self-hosting our AI service or running it in a container. Genkit for Go has also reached its 1.0 release, so we can rely on a stable, officially supported API.
- Flexibility: Genkit exposes a unified model interface in Go. This is a huge benefit because we can switch between different model providers like Google AI, Vertex AI, OpenAI, or Ollama without having to rewrite our core orchestration code. It keeps our application flexible and ready for future changes.
A minimal Go + Genkit + Ollama RAG example
Here’s what we’re going to build and what it will do for us:
First, we’ll index all our local HTML documents from a ./docs/ folder. We’ll convert these files into numerical representations (embeddings) and store them in a vector database. To do this, we’ll use an Ollama embedding model like nomic-embed-text.
Next, we’ll define a retriever and a flow. This flow will take a question, use our retriever to find the most relevant document chunks from our vector store, and then use that retrieved context to generate an answer with a local Ollama LLM like phi3:mini.
Finally, we’ll set up a tiny REST endpoint at POST /ask. This will allow us to call our new RAG application from a simple HTML and JavaScript page. At the same time, the Genkit Developer UI will still run, so we can see what’s happening behind the scenes.
A quick note on the models: Ollama is the tool that lets us run these powerful models right on our own machine. It exposes a simple HTTP API. The nomic-embed-text model is a good choice here because it’s specifically designed for creating high-quality embeddings from text. Ollama’s API also accepts a keep_alive parameter that controls how long a model stays loaded in memory after a request; keeping models warm helps us avoid slow responses on the first request. Note that the code below does not set keep_alive explicitly.
Go: main.go (core app with Genkit + flow)
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"path/filepath"
	"strings"
	"syscall"

	"github.com/firebase/genkit/go/ai"
	"github.com/firebase/genkit/go/genkit"
	"github.com/firebase/genkit/go/plugins/localvec"
	"github.com/firebase/genkit/go/plugins/ollama"

	localapi "example/my-genkit-app/internal" // your local REST wrapper
)

var htmlRetriever ai.Retriever

func main() {
	ctx := context.Background()

	// Ollama plugin (local LLMs via http://localhost:11434)
	ollamaPlugin := &ollama.Ollama{
		ServerAddress: "http://localhost:11434",
		Timeout:       120, // seconds; avoid short timeouts on first token
	}
	genk := genkit.Init(ctx, genkit.WithPlugins(ollamaPlugin))

	// Register the LLM and embedder you’ll use
	if ollamaPlugin.DefineModel(genk, ollama.ModelDefinition{
		Name: "phi3:mini",
		Type: "chat", // "chat" or "generate"
	}, nil) == nil {
		log.Fatal("failed to define model phi3:mini")
	}
	emb := ollamaPlugin.DefineEmbedder(genk, "http://localhost:11434", "nomic-embed-text", nil)
	if emb == nil {
		log.Fatal("failed to define embedder nomic-embed-text")
	}

	// Vector store + retriever (local directory)
	indexer, retriever, err := localvec.DefineRetriever(genk, "html-rag-store", localvec.Config{
		Dir:      "./docs_vec",
		Embedder: emb,
	}, nil)
	if err != nil || indexer == nil {
		log.Fatalf("DefineRetriever error: %v", err)
	}
	htmlRetriever = retriever

	fmt.Println("...Loading and indexing HTML docs")
	if err := indexHTMLDocuments(ctx, indexer, "./docs"); err != nil {
		log.Fatal("Failed to index documents: ", err)
	}

	// Flow: retrieve → augment prompt → generate
	genkit.DefineFlow(genk, "ragHTMLDoc", func(ctx context.Context, question string) (string, error) {
		return ragHTMLDocFlow(ctx, genk, question)
	})

	// Start a tiny REST API: POST /ask {"question":"..."}
	localapi.Start(genk, ragHTMLDocFlow)
	fmt.Println("App is running. REST API: http://localhost:8080/ask (Dev UI: http://localhost:4000 when started with genkit start)")

	// Block until Ctrl+C or SIGTERM
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
	<-sig
	fmt.Println("Shutting down…")
}

func ragHTMLDocFlow(ctx context.Context, genk *genkit.Genkit, question string) (string, error) {
	// 1) Retrieve top-K chunks
	retrieved, err := htmlRetriever.Retrieve(ctx, &ai.RetrieverRequest{
		Query:   &ai.Document{Content: []*ai.Part{{Text: question}}},
		Options: &localvec.RetrieverOptions{K: 3},
	})
	if err != nil {
		return "", fmt.Errorf("retrieval failed: %w", err)
	}

	// 2) Build context
	var b strings.Builder
	for i, d := range retrieved.Documents {
		if len(d.Content) > 0 && d.Content[0] != nil && d.Content[0].Text != "" {
			fmt.Fprintf(&b, "Chunk %d:\n%s\n\n---\n\n", i+1, d.Content[0].Text)
		}
	}
	prompt := fmt.Sprintf(`Answer the question based ONLY on the context.
If the answer is not in the context, say "I don't know."
Context:
%s
Question: %s
Answer:`, b.String(), question)

	// 3) Generate with the local model
	out, err := genkit.Generate(ctx, genk,
		ai.WithPrompt(prompt),
		ai.WithModel(ollama.Model(genk, "phi3:mini")),
	)
	if err != nil {
		return "", fmt.Errorf("LLM generation failed: %w", err)
	}
	return out.Text(), nil
}

func indexHTMLDocuments(ctx context.Context, ds *localvec.DocStore, dir string) error {
	var files []string
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && strings.HasSuffix(strings.ToLower(info.Name()), ".html") {
			files = append(files, path)
		}
		return nil
	})
	if err != nil {
		return err
	}
	if len(files) == 0 {
		return fmt.Errorf("no HTML files found in %s", dir)
	}

	var docs []*ai.Document
	for _, p := range files {
		body, err := os.ReadFile(p)
		if err != nil {
			return fmt.Errorf("read %s: %w", p, err)
		}
		docs = append(docs, &ai.Document{
			Content:  []*ai.Part{{Text: string(body)}},
			Metadata: map[string]any{"source": p},
		})
	}
	return localvec.Index(ctx, docs, ds)
}
This structure works well for a few reasons. Because the flow is registered with genkit.DefineFlow, the Genkit Dev UI can discover it and give us its debugging and tracing tools. At the same time, the same ragHTMLDocFlow function backs a clean /ask endpoint, which is what our own front-end UI needs. We get the Genkit features during development while building a clear, standard web API that we can deploy later as a web server or in a serverless environment like Cloud Functions.
Go: internal/api.go (a tiny REST wrapper)
package localapi

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"time"

	"github.com/firebase/genkit/go/genkit"
)

type Answerer func(ctx context.Context, g *genkit.Genkit, q string) (string, error)

func Start(g *genkit.Genkit, answer Answerer) {
	mux := http.NewServeMux()
	mux.HandleFunc("/ask", func(w http.ResponseWriter, r *http.Request) {
		// CORS for local dev
		w.Header().Set("Access-Control-Allow-Origin", "*")
		w.Header().Set("Access-Control-Allow-Headers", "Content-Type")
		w.Header().Set("Access-Control-Allow-Methods", "POST, OPTIONS")
		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		if r.Method != http.MethodPost {
			w.Header().Set("Allow", "POST, OPTIONS")
			http.Error(w, "use POST", http.StatusMethodNotAllowed)
			return
		}

		var req struct {
			Question string `json:"question"`
		}
		if err := json.NewDecoder(http.MaxBytesReader(w, r.Body, 1<<20)).Decode(&req); err != nil || req.Question == "" {
			http.Error(w, `invalid JSON: {"question":"..."}`, http.StatusBadRequest)
			return
		}

		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Minute)
		defer cancel()
		ans, err := answer(ctx, g, req.Question)
		if err != nil {
			http.Error(w, "flow error: "+err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]any{"answer": ans})
	})

	go func() {
		log.Println("REST API listening on http://localhost:8080 (POST /ask)")
		if err := http.ListenAndServe(":8080", mux); err != nil {
			log.Println("http server error:", err)
		}
	}()
}
Our simple user interface is a static HTML page. All it needs to do is send a fetch() request to our POST /ask endpoint and then display the answer we get back from the assistant. For local development, this is all we need to get started.
Configure & run it
Prerequisites
- Install Go (1.21+): https://go.dev/dl/
- Install Ollama: https://ollama.com
- Install the Genkit CLI (Dev UI & tooling):
curl -sL cli.genkit.dev | bash
We’ll use the CLI later to run our Genkit code with the Dev UI attached.
Install Ollama (Linux install script) and pull models:
curl -fsSL https://ollama.com/install.sh | sh
Run the Ollama server (in a separate terminal; skip this if Ollama is already running as a background service)
ollama serve
Check Status (on Linux installs where Ollama runs as a systemd service)
systemctl status ollama
Pull Required Models
ollama pull phi3:mini
ollama pull nomic-embed-text
Prepare Your Documents
- Place HTML files in the ./docs directory.
- The system will automatically index them on startup.
Install Go Dependencies
go mod tidy
Check the local Ollama installation and the models we have pulled
ollama --version
ollama list
Start the Go/Genkit app with the Dev UI:
Either of the commands below works:
genkit start -- go run .
genkit start -- go run main.go
The CLI prints the Dev UI URL; open it to run flows and inspect traces. By default it is http://localhost:4000/
In another terminal, we can test the endpoint with curl:
curl -X POST http://127.0.0.1:8080/ask \
-H 'Content-Type: application/json' \
-d '{"question":"What is in our HTML docs?"}'
We can also serve the chat page with a simple development server:
python3 -m http.server 8000
The chat app is then available at http://localhost:8000/web/
Takeaways
First, we’ve seen how Genkit provides us with flows, RAG tooling, and a Developer UI. This setup lets us move quickly during development while ensuring our AI code remains organised and maintainable.
Second, Go is an excellent language for this kind of work. It’s a great runtime for building small, fast, and self-hosted AI back-ends.
Finally, by using Ollama, we can run both our LLM and our embedding model completely locally. This keeps our data on our machine and helps us avoid cloud latency, which is a big advantage.
References:
https://developers.googleblog.com/en/announcing-genkit-go-10-and-enhanced-ai-assisted-development/
https://docs.ollama.com/quickstart
https://ollama.com/library/phi3
https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/