Orchestrating AI in production with Firebase Genkit
The AI tutorial that looks trivial hides a tangle of responsibilities that only emerges when code meets real traffic.

Integrating AI seems simple: call an LLM API and read the response. That simplicity is an illusion. In production, that single call becomes a tangled mess of responsibilities, and that is exactly the problem Firebase Genkit sets out to solve by acting as an orchestration layer that decouples business logic from AI infrastructure.
The central thesis is straightforward: an orchestration framework turns fragile LLM calls into predictable, typed, and observable architecture. The gain isn't in magical prompts, but in rigid contracts, mature abstractions, and infrastructure isolation.
Part 1 — Why isolated calls fail
The linear flow we imagine (App → LLM → Response) unfolds in practice into a set of responsibilities that nobody asked for but that all show up in production: context management, retry with backoff, parsing broken JSON, key rotation, document chunking, and vector store integration. Four structural problems emerge from this coupling:
Lack of typing — without rigid contracts, the model's response is unpredictable JSON that silently breaks the consumer.
Hardcoded prompts — the LLM's intent lives embedded in the application code, impossible to version or test in isolation.
Vendor lock-in — business logic gets tied to a specific proprietary model.
Zero observability — without visibility into latency and cost, every call is a financial black box.
The professional response is to decompose the integration before it turns into technical debt: separate each concern into a module with its own contract. That's what Genkit organizes into three layers.
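As a framework-free illustration of one such isolated concern, the sketch below wraps a call in retry with exponential backoff and keeps JSON parsing inside the retried operation. It is not Genkit's implementation: withRetry, makeFlakyModel, and generateParsed are hypothetical names, and the fake model exists only to simulate truncated JSON.

```typescript
// Retry an async operation with exponential backoff: baseDelayMs, then 2x, 4x, ...
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts - 1) {
        const delayMs = baseDelayMs * Math.pow(2, attempt);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Hypothetical stand-in for an LLM call: returns truncated JSON for the first
// `failures` calls, then a valid payload.
function makeFlakyModel(failures: number): () => Promise<string> {
  let calls = 0;
  return async () => (++calls <= failures ? '{"titulo": "trunc' : '{"titulo": "ok"}');
}

// Parsing happens inside the retried operation, so broken JSON triggers a
// retry instead of leaking to the consumer.
async function generateParsed(model: () => Promise<string>): Promise<{ titulo: string }> {
  return withRetry(async () => JSON.parse(await model()) as { titulo: string }, 3, 10);
}
```

Calling generateParsed(makeFlakyModel(2)) succeeds on the third attempt. With more failures than attempts, the last parse error is rethrown to the caller, which is the point where a real system would log or fall back.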
Part 2 — The three-layer operational model
Genkit is organized into three stacked layers. The bottom one is provider-agnostic; the middle one concentrates reusable abstractions; the top one delivers the development experience.
The key idea of this architecture is to write the logic once and swap the infrastructure with one line of code. The first-class runtime is JS/TS, but the framework also has SDKs in Go, Python, and Dart.
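The "write once, swap with one line" idea can be sketched in plain TypeScript. This is only an illustration of the dependency direction, not Genkit's plugin mechanism: TextModel, cloudModel, localModel, and makeSummarizer are hypothetical names, and both providers are fakes that echo the prompt.

```typescript
// The business logic depends on this interface, never on a concrete provider.
interface TextModel {
  generate(prompt: string): Promise<string>;
}

// Two fake providers standing in for a hosted model and a local one.
const cloudModel: TextModel = {
  generate: async (prompt) => `[cloud] ${prompt}`,
};
const localModel: TextModel = {
  generate: async (prompt) => `[local] ${prompt}`,
};

// Business logic: written once against the interface.
function makeSummarizer(model: TextModel): (text: string) => Promise<string> {
  return (text) => model.generate(`Summarize: ${text}`);
}

// Swapping infrastructure means changing only this line.
const summarize = makeSummarizer(cloudModel);
```

Replacing cloudModel with localModel on the last line changes the provider without touching makeSummarizer.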
Typed Flows and rigid contracts
A Flow is an observable operation with input and output contracts defined by Zod schemas. The key is the output: { schema } field. Genkit injects the schema rules into the model instruction (constrained generation) and validates the response when it comes back.

    import { genkit, z } from 'genkit';
    import { googleAI } from '@genkit-ai/googleai';

    const ai = genkit({ plugins: [googleAI()] });

    // Output contract: no loose JSON.
    const Relatorio = z.object({
      titulo: z.string(),
      risco: z.enum(['baixo', 'medio', 'alto']),
      pontos: z.array(z.string()),
    });

    export const gerarRelatorio = ai.defineFlow(
      {
        name: 'gerarRelatorio',
        inputSchema: z.object({ texto: z.string() }),
        outputSchema: Relatorio,
      },
      async ({ texto }) => {
        const { output } = await ai.generate({
          model: 'googleai/gemini-1.5-flash',
          prompt: `Analyze and summarize: ${texto}`,
          // Passing the schema here requests structured output that matches it.
          output: { schema: Relatorio },
        });
        // Fail loudly if there is no structured output instead of asserting non-null.
        if (!output) throw new Error('Model returned no structured output');
        return output;
      },
    );

The enum on the risco field isn't decoration: it eliminates an entire class of bugs where the model would invent "moderate-high risk" and break the consumer's switch. Treat the LLM's output schema with the same rigor as a REST API contract.
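To make the contract idea concrete without Genkit or Zod, here is a hand-rolled sketch of the validation step for the same report shape. It is an assumption-laden stand-in, not what the library does internally: parseRiskReport and RISK_LEVELS are hypothetical names.

```typescript
const RISK_LEVELS = ["baixo", "medio", "alto"] as const;
type Risk = (typeof RISK_LEVELS)[number];
type RiskReport = { titulo: string; risco: Risk; pontos: string[] };

// Parse raw model text and reject anything that violates the contract.
function parseRiskReport(raw: string): RiskReport {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Report must be a JSON object");
  }
  const { titulo, risco, pontos } = data as Record<string, unknown>;
  if (typeof titulo !== "string") {
    throw new Error("titulo must be a string");
  }
  // The enum check: an invented value such as "moderate-high" stops here.
  if (typeof risco !== "string" || !RISK_LEVELS.some((level) => level === risco)) {
    throw new Error(`risco must be one of: ${RISK_LEVELS.join(", ")}`);
  }
  if (!Array.isArray(pontos) || !pontos.every((p) => typeof p === "string")) {
    throw new Error("pontos must be an array of strings");
  }
  return { titulo, risco: risco as Risk, pontos };
}
```

A payload with risco set to "alto" passes, while one with "moderate-high" throws before it can reach a consumer's switch statement.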
Dotprompt: prompts as code
To get rid of hardcoded prompts, Genkit uses versionable, testable .prompt files. A YAML frontmatter block declares the configuration (model, temperature, schemas, tools), and a Handlebars template defines the messages, with conditional logic embedded.

    ---
    model: googleai/gemini-1.5-flash
    config:
      temperature: 0.4
    input:
      schema: UserSchema
    output:
      schema: ReportSchema
    tools: [buscarClima]
    ---
    {{role "system"}}
    You are a specialized technical assistant.
    {{role "user"}}
    Analyze the data for {{usuario.nome}} and the image:
    {{media url=imagemPerfil}}
    {{#if isPremium}}Provide a detailed response.{{/if}}

Adjusting a prompt becomes a text diff, reviewable in a pull request, without recompiling the application.
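To show why a template kept as data is easy to test in isolation, the sketch below renders a tiny subset of this syntax with two regular expressions. It is a toy, not Handlebars or Dotprompt: renderTemplate is a hypothetical helper that supports only flat {{name}} placeholders and non-nested {{#if flag}} blocks.

```typescript
type TemplateVars = Record<string, string | boolean>;

function renderTemplate(template: string, vars: TemplateVars): string {
  // Resolve {{#if flag}}...{{/if}} blocks first (no nesting, no else branch).
  const withConditionals = template.replace(
    /\{\{#if (\w+)\}\}([\s\S]*?)\{\{\/if\}\}/g,
    (_match: string, flag: string, body: string) => (vars[flag] ? body : ""),
  );
  // Then substitute plain {{name}} placeholders; unknown names render as "".
  return withConditionals.replace(
    /\{\{(\w+)\}\}/g,
    (_match: string, key: string) => {
      const value = vars[key];
      return value === undefined ? "" : String(value);
    },
  );
}

const promptTemplate =
  "Analyze the data for {{nome}}.{{#if isPremium}} Provide a detailed response.{{/if}}";
```

Rendering promptTemplate with isPremium set to true keeps the extra instruction, and with false drops it, so a unit test can pin the exact text sent to the model.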
Part 3 — RAG, reranking and tools
Ask an LLM about a term internal to your company and it hallucinates, because that data isn't in its training set. Injecting the entire document fixes the hallucination but consumes ~4,000 tokens per question. The architectural answer is RAG (Retrieval-Augmented Generation): retrieve only the relevant snippets. The native pipeline has two paths, ingestion and query, that share the same vector store.
The concrete gain: consumption drops from ~4,000 to ~830 tokens per query, while maintaining accuracy. You pay the cost of vectorizing once, at ingestion, and reap cheap, precise responses on every query.
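The ingestion and query paths can be sketched with an in-memory store. Everything here is an assumption for illustration: a real pipeline would use an embedding model and a vector database, whereas embed is a toy term counter over a hypothetical four-word vocabulary and InMemoryVectorStore is not a Genkit class.

```typescript
type IndexedChunk = { text: string; vector: number[] };

// Toy vocabulary standing in for a learned embedding space.
const VOCAB = ["refund", "invoice", "shipping", "password"];

// Toy embedding: how many times each vocabulary term occurs in the text.
function embed(text: string): number[] {
  const words = text.toLowerCase().split(/\W+/);
  return VOCAB.map((term) => words.filter((w) => w === term).length);
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

class InMemoryVectorStore {
  private chunks: IndexedChunk[] = [];

  // Ingestion path: vectorize once and store.
  index(texts: string[]): void {
    for (const text of texts) {
      this.chunks.push({ text, vector: embed(text) });
    }
  }

  // Query path: vectorize the question and return the k most similar chunks.
  retrieve(query: string, k: number): string[] {
    const queryVector = embed(query);
    return this.chunks
      .map((chunk) => ({ text: chunk.text, score: cosine(queryVector, chunk.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map((entry) => entry.text);
  }
}
```

Only the chunks returned by retrieve go into the prompt, which is where the token savings come from: the full document is paid for once at indexing, not on every question.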
Reranking: two-stage retrieval
A vector retriever is fast but noisy. The two-stage retrieval pattern adds a reranker (cross-encoder) that reorders the candidates by relevance and narrows the output to the top results, so that only the highest-scoring snippets occupy the context window. The first stage prioritizes recall; the second, precision.
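The recall-then-precision pattern can be sketched with two scoring functions of different strictness. Both scorers are toy assumptions: cheapScore stands in for a vector retriever and preciseScore for a cross-encoder, and neither resembles a real model. All function names are hypothetical.

```typescript
function terms(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
}

// Stage 1 stand-in: cheap and permissive (1 if any query term matches).
function cheapScore(query: string, text: string): number {
  const docTerms = terms(text);
  return terms(query).some((t) => docTerms.indexOf(t) !== -1) ? 1 : 0;
}

// Stage 2 stand-in: stricter, the fraction of query terms found in the text.
function preciseScore(query: string, text: string): number {
  const docTerms = terms(text);
  const queryTerms = terms(query);
  if (queryTerms.length === 0) return 0;
  const hits = queryTerms.filter((t) => docTerms.indexOf(t) !== -1).length;
  return hits / queryTerms.length;
}

// Score every text, sort descending, keep the first `limit`.
function rankBy(texts: string[], score: (text: string) => number, limit: number): string[] {
  return texts
    .map((text) => ({ text, score: score(text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.text);
}

function twoStageRetrieve(
  corpus: string[],
  query: string,
  candidates: number,
  finalCount: number,
): string[] {
  // Stage 1: wide net over the whole corpus (recall).
  const recalled = rankBy(corpus, (t) => cheapScore(query, t), candidates);
  // Stage 2: expensive scoring over the few candidates only (precision).
  return rankBy(recalled, (t) => preciseScore(query, t), finalCount);
}

const rerankCorpus = [
  "Password policy: minimum length and rotation",
  "Reset the router to factory settings",
  "How to reset your password from the login screen",
  "Shipping rates by region",
];
```

For the query "reset password", stage 1 lets three documents through because each shares at least one term. Stage 2 then promotes the only document that contains both terms. The costly scorer never sees the full corpus, which is the reason for splitting the work.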
Tools: from passive LLMs to agents
Tools allow the model to request external data. You send the catalog of tools; when the model needs something it doesn't have, it pauses generation, Genkit intercepts the request and executes the tool locally, returns the result, and the model synthesizes the response. A tool is just a Zod-typed function that the model learns to invoke.
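The request, execute, and return cycle described above can be sketched as a small loop. This is a conceptual model under stated assumptions, not Genkit's runtime: ModelTurn, runWithTools, and the fake model are hypothetical, and the fake tool returns a fixed temperature instead of calling a weather API.

```typescript
type ModelTurn =
  | { kind: "tool_request"; tool: string; input: Record<string, unknown> }
  | { kind: "final"; text: string };
type ToolFn = (input: Record<string, unknown>) => Promise<unknown>;
type ModelFn = (transcript: string[]) => Promise<ModelTurn>;

async function runWithTools(
  model: ModelFn,
  tools: Record<string, ToolFn>,
  prompt: string,
  maxSteps = 5,
): Promise<string> {
  const transcript = [`user: ${prompt}`];
  for (let step = 0; step < maxSteps; step++) {
    const turn = await model(transcript);
    if (turn.kind === "final") return turn.text;
    // The model paused to ask for data: run the tool locally and feed the result back.
    const tool = tools[turn.tool];
    if (!tool) throw new Error(`Unknown tool: ${turn.tool}`);
    const result = await tool(turn.input);
    transcript.push(`tool ${turn.tool}: ${JSON.stringify(result)}`);
  }
  throw new Error("Tool loop exceeded maxSteps");
}

// Fake model: asks for the weather once, then answers using the tool result.
const fakeModel: ModelFn = async (transcript) => {
  const prefix = "tool buscarClima: ";
  const toolLine = transcript.filter((line) => line.indexOf(prefix) === 0)[0];
  if (toolLine === undefined) {
    return { kind: "tool_request", tool: "buscarClima", input: { cidade: "Lisboa" } };
  }
  const { tempC } = JSON.parse(toolLine.slice(prefix.length));
  return { kind: "final", text: `Lisboa: ${tempC} C` };
};

// Fake tool with a fixed reading, for illustration only.
const fakeTools: Record<string, ToolFn> = {
  buscarClima: async () => ({ tempC: 21 }),
};
```

The maxSteps guard is a design choice worth keeping in any real loop, since a model that keeps requesting tools would otherwise never terminate. In Genkit itself the tool is declared with ai.defineTool, as in the definition that follows.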

    export const buscarClima = ai.defineTool(
      {
        name: 'buscarClima',
        description: 'Returns the current weather for a city',
        inputSchema: z.object({ cidade: z.string() }),
        outputSchema: z.object({ tempC: z.number() }),
      },
      async ({ cidade }) => {
        // Real code: calls a weather API. Encode the city so spaces and accents survive the query string.
        const r = await fetch(`/api/clima?c=${encodeURIComponent(cidade)}`);
        if (!r.ok) throw new Error(`Weather API returned ${r.status}`);
        return { tempC: (await r.json()).temp };
      },
    );

Part 4 — Production: multimodality, security and lifecycle
Raw text, media via Handlebars, HTTPS URLs, and Base64 PDFs enter through uniform abstractions: the framework handles conversion and encoding and assembles a homogeneous multimodal array. Before deployment, the genkit start command spins up a local Dev UI with a trace inspector (waterfall and timing of each step), a model runner, and token counting.
At the edge, exposing generative endpoints without protection is a severe financial risk. Keys go to Secret Manager (outside the code), authentication is native via Cloud Functions, and App Check blocks fraudulent requests at the perimeter (DeviceCheck / Play Integrity), ensuring that only genuine app instances incur charges.
Structural isolation also defines the deploy targets by language — and here is the direct bridge to the frontend: Next.js and Angular enter via Firebase App Hosting.
| Language | Model (e.g.) | Vector store | Deploy target |
|---|---|---|---|
| JS / TS | Gemini | Cloud Firestore | Firebase App Hosting (Next.js / Angular) |
| Go | Claude | Pinecone | Cloud Run (binary) |
| Python | GPT-4o | pgvector | Cloud Functions |
| Dart | Ollama (local) | Redis | Local telemetry |
Everything ties together in a four-phase cycle around Genkit.
Conclusion: orchestration instead of magic prompts
The arc of this material describes a maturation that other computing disciplines have already gone through: AI integration is leaving the artisanal phase — in which the result depended on nailing the prompt — and entering the engineering phase, in which the result depends on the quality of the system.
Each abstraction presented — typed Flows, Dotprompt, RAG, reranking, Tools, edge security — is an engineering response to a specific and reproducible failure of isolated LLM calls. Reliability in production is no longer a stroke of luck; it is an orchestrated and observable process.


