Orchestrating AI in production with Firebase Genkit

Integrating AI seems simple: call an LLM API and read the response. That simplicity is an illusion. In production, that single arrow becomes a tangled mess of responsibilities, and that is exactly the problem Firebase Genkit sets out to solve by acting as an orchestration layer that decouples business logic from AI infrastructure.

The central thesis is straightforward: an orchestration framework turns fragile LLM calls into predictable, typed, and observable architecture. The gain isn't in magical prompts, but in rigid contracts, mature abstractions, and infrastructure isolation.

Part 1 — Why isolated calls fail

The linear flow we imagine (App → LLM → Response) unfolds in practice into a set of responsibilities that nobody asked for but that all show up in production: context management, retry with backoff, parsing broken JSON, key rotation, document chunking, and vector store integration. Four structural problems emerge from this coupling:

Lack of typing — without rigid contracts, the model's response is unpredictable JSON that silently breaks the consumer.
Hardcoded prompts — the LLM's intent lives embedded in the application code, impossible to version or test in isolation.
Vendor lock-in — business logic gets tied to a specific proprietary model.
Zero observability — without visibility into latency and cost, every call is a financial black box.

The professional response is to decompose the integration before it turns into technical debt: separate each concern into a module with its own contract. That is what Genkit organizes into three layers.
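As an illustration of one concern isolated behind its own contract, here is a minimal sketch of retry with exponential backoff. It is plain TypeScript rather than Genkit's API; the names `backoffDelays` and `withRetry` and the default delay values are hypothetical, chosen only for this example.

```typescript
// Hypothetical helper: exponential backoff schedule in milliseconds, capped at maxMs.
function backoffDelays(retries: number, baseMs = 200, maxMs = 5000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(baseMs * Math.pow(2, i), maxMs));
  }
  return delays;
}

type Sleep = (ms: number) => Promise<void>;

// Default sleeper; injectable so the retry logic can be exercised without waiting.
const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

// Runs `call`; after each failure waits for the next delay in the schedule.
// When the schedule is exhausted, the last error is rethrown to the caller.
async function withRetry<T>(
  call: () => Promise<T>,
  delays: number[] = backoffDelays(3),
  sleep: Sleep = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const delay = delays[attempt];
      if (delay === undefined) throw err; // no retries left
      await sleep(delay);
    }
  }
}
```

The business code then calls `withRetry(() => callModel(input))`, where `callModel` stands for whatever function actually reaches the provider, and never deals with timing or attempt counting itself.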

Part 2 — The three-layer operational model

Genkit organizes itself into three overlapping layers. The bottom one is provider-agnostic; the middle one concentrates reusable abstractions; the top one delivers the development experience.

The three layers of Genkit

The key idea of this architecture is to write the logic once and swap the infrastructure with one line of code. JS/TS is the first-class runtime, but the framework also has SDKs in Go, Python, and Dart.
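The "write once, swap with one line" idea can be sketched independently of any framework. The example below is an assumption-laden illustration, not Genkit's plugin API: `TextModel`, `providerA`, `providerB`, and `summarize` are hypothetical names, and the providers are synchronous stubs standing in for real vendor SDKs.

```typescript
// Hypothetical provider contract: business logic only ever sees this interface.
interface TextModel {
  readonly id: string;
  generate(prompt: string): string;
}

// Two stub providers standing in for real vendor SDKs.
const providerA: TextModel = {
  id: 'provider-a/stub',
  generate: (prompt) => `[A] ${prompt.length} chars received`,
};

const providerB: TextModel = {
  id: 'provider-b/stub',
  generate: (prompt) => `[B] ${prompt.length} chars received`,
};

// Business logic written once against the contract, not against a vendor.
function summarize(model: TextModel, text: string): string {
  return model.generate(`Summarize: ${text}`);
}

// Swapping infrastructure is a one-line change at the composition root.
const activeModel: TextModel = providerA; // change to providerB to switch vendors

console.log(summarize(activeModel, 'quarterly report'));
```

Because `summarize` depends only on the interface, moving to another provider touches the single line that assigns `activeModel`.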

Typed Flows and rigid contracts

A Flow is an observable operation with input and output contracts defined by Zod schemas. The key is the output: { schema } field: Genkit injects the schema rules into the model instruction (constrained generation) and validates the response on the way back.

TYPESCRIPT

import { genkit, z } from 'genkit';
import { googleAI } from '@genkit-ai/googleai';

const ai = genkit({ plugins: [googleAI()] });

// Output contract: no loose JSON
const Relatorio = z.object({
  titulo: z.string(),
  risco: z.enum(['baixo', 'medio', 'alto']),
  pontos: z.array(z.string()),
});

export const gerarRelatorio = ai.defineFlow(
  {
    name: 'gerarRelatorio',
    inputSchema: z.object({ texto: z.string() }),
    outputSchema: Relatorio,
  },
  async ({ texto }) => {
    const { output } = await ai.generate({
      model: 'googleai/gemini-1.5-flash',
      prompt: `Analyze and summarize: ${texto}`,
      output: { schema: Relatorio },
    });
    // Fail loudly instead of asserting non-null with `output!`
    if (!output) throw new Error('Model returned no structured output');
    return output;
  },
);

The enum on the risco (risk) field isn't decoration: it eliminates an entire class of bugs where the model would invent "moderate-high risk" and break the consumer's switch. Treat the LLM's output schema with the same rigor as a REST API contract.
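To make the consumer side of that contract concrete, the sketch below shows a guard and an exhaustive switch over the same three `risco` values. It is plain TypeScript without Zod; `parseRisk`, `escalationHours`, and the hour values are hypothetical and exist only to illustrate the point.

```typescript
// Mirrors the three values of the `risco` enum in the flow above.
const RISK_LEVELS = ['baixo', 'medio', 'alto'] as const;
type RiskLevel = (typeof RISK_LEVELS)[number];

// Hypothetical guard: rejects anything outside the contract at the boundary.
function parseRisk(value: unknown): RiskLevel {
  const allowed = RISK_LEVELS as readonly string[];
  if (typeof value === 'string' && allowed.indexOf(value) !== -1) {
    return value as RiskLevel;
  }
  throw new Error(`Unexpected risk level: ${JSON.stringify(value)}`);
}

// Exhaustive switch: the compiler flags any enum value left unhandled.
// The hour values are illustrative assumptions, not from the article.
function escalationHours(risk: RiskLevel): number {
  switch (risk) {
    case 'baixo':
      return 72;
    case 'medio':
      return 24;
    case 'alto':
      return 4;
    default: {
      const unreachable: never = risk;
      throw new Error(`Unhandled risk level: ${String(unreachable)}`);
    }
  }
}
```

An invented value such as "moderate-high" is rejected by `parseRisk` before it can reach the switch, which is the same failure mode the schema validation is meant to catch earlier in the pipeline.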

Dotprompt: prompts as code

To get rid of hardcoded prompts, Genkit uses versionable, testable .prompt files. YAML front matter declares the configuration (model, temperature, schemas, tools), and a Handlebars template defines the messages, with embedded conditional logic.

YAML

---
model: googleai/gemini-1.5-flash
config:
  temperature: 0.4
input:
  schema: UserSchema
output:
  schema: ReportSchema
tools: [buscarClima]
---
{{role "system"}}
You are a specialized technical assistant.
{{role "user"}}
Analyze the data for {{usuario.nome}} and the image:
{{media url=imagemPerfil}}
{{#if isPremium}}
Provide a detailed answer.
{{/if}}

Adjusting a prompt becomes a text diff, reviewable in a pull request, without recompiling the application.
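To show why a prompt file can be treated as data rather than code, here is a deliberately small sketch that splits a file into front matter and template and fills flat placeholders. It is not Dotprompt's loader and not Handlebars: `splitPromptFile` and `renderFlat` are hypothetical helpers, and the renderer handles only simple `{{key}}` substitutions.

```typescript
interface PromptFile {
  frontMatter: string; // raw YAML text, left unparsed in this sketch
  template: string;
}

// Hypothetical splitter: expects the file to open with a `---` delimited header.
function splitPromptFile(source: string): PromptFile {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(source);
  if (!match) throw new Error('Missing front matter delimiters');
  return { frontMatter: match[1] ?? '', template: match[2] ?? '' };
}

// Replaces flat {{key}} placeholders only; real Handlebars supports far more.
function renderFlat(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_whole: string, key: string) => {
    const value = vars[key];
    if (value === undefined) throw new Error(`Missing variable: ${key}`);
    return value;
  });
}

// The prompt lives in text, so changing it never touches compiled code.
const promptSource = [
  '---',
  'model: example/model',
  'config:',
  '  temperature: 0.4',
  '---',
  'Analyze the data for {{nome}}. Plan: {{plano}}.',
].join('\n');

const parsedPrompt = splitPromptFile(promptSource);
console.log(renderFlat(parsedPrompt.template, { nome: 'Ana', plano: 'premium' }));
```

A unit test can load the same file, render it with fixture variables, and assert on the resulting text, which is what "testable in isolation" amounts to in practice.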

Part 3 — RAG, reranking and tools

Ask an LLM about an internal term at your company and it hallucinates, because that information is not in its training data. Injecting the entire document fixes the hallucination but consumes ~4,000 tokens per question. The architectural answer is RAG (Retrieval-Augmented Generation): retrieve only the relevant snippets. The native pipeline has two paths, ingestion and query, that share the same vector store.
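The two paths can be sketched in a few lines. Everything here is a toy under stated assumptions: `toyEmbed` counts words from a fixed vocabulary in place of a real embedding model, an in-memory array plays the vector store, and none of the function names come from Genkit.

```typescript
// Toy vocabulary and embedder: stand-ins for a real embedding model.
const VOCAB = ['refund', 'invoice', 'login', 'password'];

function toyEmbed(text: string): number[] {
  const tokens = text.toLowerCase().split(/\W+/);
  return VOCAB.map((term) => tokens.filter((t) => t === term).length);
}

// Splits text into chunks of at most `size` words (real chunkers are smarter).
function chunkByWords(text: string, size: number): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += size) {
    chunks.push(words.slice(i, i + size).join(' '));
  }
  return chunks;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface IndexedChunk {
  text: string;
  vector: number[];
}

// Path 1, ingestion: chunk, embed, store.
function ingest(text: string, chunkSize: number): IndexedChunk[] {
  return chunkByWords(text, chunkSize).map((chunk) => ({
    text: chunk,
    vector: toyEmbed(chunk),
  }));
}

// Path 2, query: embed the question and return the k closest chunks.
function retrieve(store: IndexedChunk[], question: string, k: number): IndexedChunk[] {
  const q = toyEmbed(question);
  return store
    .slice()
    .sort((a, b) => cosine(b.vector, q) - cosine(a.vector, q))
    .slice(0, k);
}

const chunkStore = ingest(
  'Refund requests need the invoice number. Login issues often mean expired password.',
  6,
);
const topChunks = retrieve(chunkStore, 'I forgot my password', 1);
console.log(topChunks.map((c) => c.text));
```

Only the retrieved chunk is placed in the prompt, so the model sees the password sentence and not the refund policy.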

Genkit's native RAG pipeline

The concrete gain: consumption drops from ~4,000 to ~830 tokens per query, while maintaining accuracy. You pay the cost of vectorizing once, at ingestion, and reap cheap, precise responses on every query.
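Worked out with the two figures above, the reduction is roughly 79% of the tokens per query, or prompts about 4.8 times smaller. The short calculation below only restates that arithmetic; the 1,000-query batch is an arbitrary illustration and no pricing is assumed.

```typescript
const fullDocumentTokens = 4000; // per question, figure from the text
const ragTokens = 830; // per question, figure from the text

const savedTokens = fullDocumentTokens - ragTokens;
const savedPercent = (savedTokens / fullDocumentTokens) * 100;
const reductionFactor = fullDocumentTokens / ragTokens;

// Tokens no longer sent over an illustrative batch of queries.
function tokensSavedOver(queries: number): number {
  return savedTokens * queries;
}

console.log(
  `${savedTokens} tokens saved per query, about ${Math.round(savedPercent)}% less, ` +
    `${tokensSavedOver(1000)} tokens over 1000 queries`,
);
```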

Reranking: two-stage retrieval

A vector retriever is fast but noisy. The two-stage retrieval pattern adds a reranker (cross-encoder) that reorders the candidates by relevance to the query and keeps only the top-ranked ones, so that only the most relevant snippets occupy the context window. The first stage prioritizes recall; the second, precision.
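A minimal sketch of the pattern follows. The assumptions are significant: `vectorScore` is taken as already computed by the first-stage retriever, and `toyRerankScore` is a simple term-overlap measure standing in for a real cross-encoder; both names and the sample data are hypothetical.

```typescript
interface Candidate {
  id: string;
  text: string;
  vectorScore: number; // assumed to come from the first-stage vector search
}

// Stand-in for a cross-encoder: fraction of query terms present in the text.
function toyRerankScore(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
  if (terms.length === 0) return 0;
  const haystack = text.toLowerCase().split(/\W+/);
  const matched = terms.filter((t) => haystack.indexOf(t) !== -1).length;
  return matched / terms.length;
}

function twoStageRetrieve(
  query: string,
  candidates: Candidate[],
  recallK: number,
  finalN: number,
): Candidate[] {
  // Stage 1: wide, cheap cut by vector score (recall).
  const shortlist = candidates
    .slice()
    .sort((a, b) => b.vectorScore - a.vectorScore)
    .slice(0, recallK);

  // Stage 2: costlier scoring applied to the shortlist only (precision).
  return shortlist
    .map((candidate) => ({ candidate, score: toyRerankScore(query, candidate.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, finalN)
    .map((scored) => scored.candidate);
}

const sampleCandidates: Candidate[] = [
  { id: 'a', text: 'Password policy for expired accounts', vectorScore: 0.7 },
  { id: 'b', text: 'Account billing overview', vectorScore: 0.82 },
  { id: 'c', text: 'Office holiday schedule', vectorScore: 0.1 },
  { id: 'd', text: 'How to reset a password', vectorScore: 0.75 },
];

const reranked = twoStageRetrieve('reset password', sampleCandidates, 3, 2);
console.log(reranked.map((c) => c.id));
```

In the sample data, the billing document has the highest vector score yet shares no term with the query, so the second stage drops it from the final context.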

Two-stage retrieval with reranking

Tools: from passive LLMs to agents

Tools allow the model to request external data. You send the catalog of tools; when the model needs something it doesn't have, it pauses generation, Genkit intercepts the request and executes the tool locally, returns the result, and the model synthesizes the response. A tool is just a Zod-typed function that the model learns to invoke.

TYPESCRIPT

export const buscarClima = ai.defineTool(
  {
    name: 'buscarClima',
    description: 'Returns the current weather for a city',
    inputSchema: z.object({ cidade: z.string() }),
    outputSchema: z.object({ tempC: z.number() }),
  },
  async ({ cidade }) => {
    // Real code would call a weather API; /api/clima is a placeholder endpoint
    const r = await fetch(`/api/clima?c=${encodeURIComponent(cidade)}`);
    if (!r.ok) throw new Error(`Weather lookup failed: ${r.status}`);
    return { tempC: (await r.json()).temp };
  },
);
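The request, execute, and resume cycle described above can be sketched without any framework. This is an illustration under assumptions, not Genkit's internals: `fakeModel` is a scripted stand-in for an LLM, the loop is synchronous for brevity, and `runWithTools` and the message shapes are hypothetical.

```typescript
type ToolFn = (args: Record<string, unknown>) => unknown;

type ModelTurn =
  | { kind: 'toolRequest'; tool: string; args: Record<string, unknown> }
  | { kind: 'final'; text: string };

interface Message {
  role: 'user' | 'tool';
  content: string;
}

// Scripted stand-in for an LLM: asks for a tool first, then answers with its result.
function fakeModel(history: Message[]): ModelTurn {
  const toolResult = history.filter((m) => m.role === 'tool').pop();
  if (!toolResult) {
    return { kind: 'toolRequest', tool: 'buscarClima', args: { cidade: 'Lisboa' } };
  }
  const parsed = JSON.parse(toolResult.content) as { tempC: number };
  return { kind: 'final', text: `It is ${parsed.tempC} degrees Celsius in Lisboa.` };
}

// Orchestrator loop: intercept tool requests, run them locally, feed results back.
function runWithTools(
  question: string,
  model: (history: Message[]) => ModelTurn,
  tools: Record<string, ToolFn>,
  maxSteps = 5,
): string {
  const history: Message[] = [{ role: 'user', content: question }];
  for (let step = 0; step < maxSteps; step++) {
    const turn = model(history);
    if (turn.kind === 'final') return turn.text;
    const tool = tools[turn.tool];
    if (!tool) throw new Error(`Unknown tool: ${turn.tool}`);
    history.push({ role: 'tool', content: JSON.stringify(tool(turn.args)) });
  }
  throw new Error('Tool loop did not converge');
}

const weatherAnswer = runWithTools('What is the weather in Lisboa?', fakeModel, {
  buscarClima: () => ({ tempC: 21 }), // canned result instead of a real API call
});
console.log(weatherAnswer);
```

The `maxSteps` bound is a design choice worth keeping in any real loop: a model that keeps requesting tools would otherwise run, and bill, indefinitely.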

Part 4 — Production: multimodality, security and lifecycle

Raw text, media via Handlebars, HTTPS URLs, and Base64 PDFs enter through uniform abstractions: the framework handles conversion and encoding and assembles a homogeneous multimodal array. Before deployment, the genkit start command spins up a local Dev UI with a trace inspector (waterfall and timing of each step), a model runner, and token counting.
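To give a sense of the data a trace inspector works from, the sketch below records a start offset and a duration for each step. It is an illustrative toy, not Genkit's tracing implementation: the `Trace` class and `Span` shape are hypothetical, and steps are synchronous for simplicity.

```typescript
interface Span {
  name: string;
  startMs: number; // offset from the start of the trace
  durationMs: number;
}

class Trace {
  readonly spans: Span[] = [];
  private readonly origin = Date.now();

  // Runs `fn` and records its timing, even when it throws.
  step<T>(name: string, fn: () => T): T {
    const start = Date.now();
    try {
      return fn();
    } finally {
      this.spans.push({
        name,
        startMs: start - this.origin,
        durationMs: Date.now() - start,
      });
    }
  }

  // One line per step: the raw material of a waterfall view.
  report(): string[] {
    return this.spans.map((s) => `${s.name}: +${s.startMs}ms, took ${s.durationMs}ms`);
  }
}

const demoTrace = new Trace();
const retrievedCount = demoTrace.step('retrieve', () => 3);
demoTrace.step('generate', () => `answer built from ${retrievedCount} chunks`);
console.log(demoTrace.report().join('\n'));
```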

At the edge, exposing generative endpoints without protection is a severe financial risk. Keys go to Secret Manager (outside the code), authentication is native via Cloud Functions, and App Check blocks fraudulent requests at the perimeter (DeviceCheck / Play Integrity), so that only genuine app instances incur charges.

Structural isolation also defines the deploy targets by language — and here is the direct bridge to the frontend: Next.js and Angular enter via Firebase App Hosting.

Language	Model (example)	Vector store	Deploy target
JS / TS	Gemini	Cloud Firestore	Firebase App Hosting (Next.js / Angular)
Go	Claude	Pinecone	Cloud Run (binary)
Python	GPT-4o	pgvector	Cloud Functions
Dart	Ollama (local)	Redis	Local telemetry

Everything ties together in a four-phase cycle around Genkit:

Genkit operational cycle

Conclusion: orchestration instead of magic prompts

The arc of this material describes a maturation that other computing disciplines have already gone through: AI integration is leaving the artisanal phase — in which the result depended on nailing the prompt — and entering the engineering phase, in which the result depends on the quality of the system.

Each abstraction presented — typed Flows, Dotprompt, RAG, reranking, Tools, edge security — is an engineering response to a specific and reproducible failure of isolated LLM calls. Reliability in production is no longer a stroke of luck; it is an orchestrated and observable process.
