Build a Production-Ready RAG Pipeline with Vertex AI and Gemini 1.5



In the enterprise AI landscape, the "knowledge cutoff" is the enemy. Large Language Models (LLMs) like Gemini are incredible reasoning engines, but they are poor databases. If you ask a standard model about your company's private Q3 financial data or a policy that changed yesterday, it will either hallucinate an answer or fail to answer at all.

Retrieval-Augmented Generation (RAG) bridges this gap by retrieving relevant data from your external sources and inserting it into the model's context window before it generates an answer.

This guide explores the architecture and provides a complete Python implementation using Vertex AI Vector Search and Gemini 1.5 Pro.

The Architecture: How RAG Works

A standard Google Cloud RAG pipeline follows four distinct stages:

  1. Ingestion: Breaking down documents (PDFs, wikis, databases) into chunks (a minimal chunking sketch follows this list).

  2. Embedding: Using models like text-embedding-004 to convert text into vector representations.

  3. Vector Retrieval: Storing vectors in Vertex AI Vector Search (formerly Matching Engine) for millisecond-latency similarity search.

  4. Grounded Generation: Combining the retrieved context with the user prompt and sending it to Gemini.
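
Before looking at the stack, it helps to see how small stage 1 really is. The sketch below is a minimal, illustrative chunker (the chunk_text helper and its character-based sizes are assumptions, not part of the Vertex AI SDK); production pipelines typically split on document structure or token counts instead.

Python
from typing import List

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Splits raw text into fixed-size, overlapping character windows."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # overlap preserves context across chunk boundaries
    return chunks

# ~2,400 characters of extracted document text -> 4 overlapping chunks
document_text = "The new API must respond within 200 ms. " * 60
print(len(chunk_text(document_text)))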


The Tech Stack

  • Orchestration: Python (Google Cloud SDK)

  • Embeddings: text-embedding-004 (optimized for semantic retrieval)

  • Vector Database: Vertex AI Vector Search (built on Google's ScaNN algorithm)

  • LLM: Gemini 1.5 Pro


The Implementation

Below is the complete, consolidated workflow. This script assumes you have already created a Vector Search Index and deployed it to an Endpoint in the Google Cloud Console.
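
If you prefer to provision those resources from code instead of the Console, the same aiplatform SDK (installed in the next step) can create and deploy them. The snippet below is a minimal sketch, not a definitive recipe: the display names, bucket path, and deployed-index ID are placeholders, and both index creation and deployment are long-running operations that can take a significant amount of time.

Python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# 1. Create an index sized for 768-dimensional text-embedding-004 vectors.
#    contents_delta_uri points at a Cloud Storage folder of JSON records
#    shaped like {"id": "...", "embedding": [...]}.
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="rag-demo-index",
    contents_delta_uri="gs://your-bucket/embeddings/",
    dimensions=768,
    approximate_neighbors_count=150,
    index_update_method="STREAM_UPDATE",  # enables the streaming upserts discussed later
)

# 2. Create a public endpoint and deploy the index to it.
endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="rag-demo-endpoint",
    public_endpoint_enabled=True,
)
endpoint.deploy_index(index=index, deployed_index_id="rag_demo_deployed")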

Prerequisites

You will need the google-cloud-aiplatform library:

Bash
pip install google-cloud-aiplatform

The Full Python Script

This script demonstrates how to embed a query, retrieve neighbors from your index, and pass them to Gemini to generate an answer.

Python
from typing import List

import vertexai
from vertexai.language_models import TextEmbeddingModel
from vertexai.generative_models import GenerativeModel
from google.cloud import aiplatform

# --- Configuration ---
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
INDEX_ENDPOINT_ID = "your-index-endpoint-id" # From Vertex AI Console
DEPLOYED_INDEX_ID = "your-deployed-index-id" # User-defined ID given during deployment

# Initialize Vertex AI SDK
vertexai.init(project=PROJECT_ID, location=LOCATION)

def get_embedding(text: str) -> List[float]:
    """Generates a vector embedding for a given string."""
    model = TextEmbeddingModel.from_pretrained("text-embedding-004")
    embeddings = model.get_embeddings([text])
    # Returns the 768-dimensional vector
    return embeddings[0].values

def retrieve_context(query: str) -> List[str]:
    """
    Embeds the query and searches the Vector Search Index 
    for the nearest neighbors.
    """
    # 1. Embed the query
    query_vector = get_embedding(query)

    # 2. Initialize the Vector Search Endpoint
    index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=f"projects/{PROJECT_ID}/locations/{LOCATION}/indexEndpoints/{INDEX_ENDPOINT_ID}"
    )

    # 3. Query the index
    # We request the closest 5 vectors (neighbors)
    response = index_endpoint.find_neighbors(
        deployed_index_id=DEPLOYED_INDEX_ID,
        queries=[query_vector],
        num_neighbors=5
    )

    # 4. Extract context (Simulated for this script)
    # In a real app, 'neighbor.id' is a key to a distinct database (like BigQuery or Firestore)
    # where the actual text is stored. Vector Search stores VECTORS, not TEXT.
    retrieved_text = []
    
    print(f"Found {len(response[0])} relevant chunks.")
    
    for neighbor in response[0]:
        # SIMULATION: We map the ID back to text. 
        # Replace this with: db.get(neighbor.id)
        retrieved_text.append(f"Content for document ID {neighbor.id}")

    return retrieved_text

def generate_grounded_response(query: str, context_chunks: List[str]) -> str:
    """
    Feeds the context and query into Gemini.
    """
    model = GenerativeModel("gemini-1.5-pro")

    # Construct the grounded prompt
    context_str = "\n".join(context_chunks)
    
    prompt = f"""
    You are an expert assistant. Use the provided context to answer the user's question.
    If the answer is not known based on the context, state that you do not know.

    CONTEXT:
    {context_str}

    USER QUESTION:
    {query}
    """

    print("--- Sending Prompt to Gemini ---")
    response = model.generate_content(prompt)
    return response.text

# --- Main Execution Flow ---
if __name__ == "__main__":
    user_query = "What are the latency requirements for the new API?"

    print(f"Processing Query: {user_query}...")
    
    # Step 1 & 2: Retrieve relevant information
    context = retrieve_context(user_query)
    
    # Step 3: Generate Answer
    answer = generate_grounded_response(user_query, context)
    
    print("\n--- Gemini Response ---")
    print(answer)

Technical Deep Dive: Why This Matters

1. The Separation of Concerns

You might notice a comment in the retrieve_context function regarding fetching text. Vertex AI Vector Search is purely for math. It stores vectors and IDs, not the raw text.

This is a feature, not a bug. It allows the search engine to be incredibly lightweight and fast. A common pattern is:

  • Vector Search: Finds ID: 123 is relevant.

  • Firestore/Redis: Looks up ID: 123 to get the actual paragraph of text (this lookup is sketched below).
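
In Firestore, that lookup can look like the sketch below. It relies on assumptions: the doc_chunks collection and its text field reflect whatever schema you chose at ingestion time, and the Vector Search datapoint IDs must match the Firestore document IDs.

Python
from typing import List
from google.cloud import firestore  # requires: pip install google-cloud-firestore

db = firestore.Client(project="your-project-id")

def fetch_chunk_texts(neighbor_ids: List[str]) -> List[str]:
    """Maps Vector Search datapoint IDs back to the original text chunks."""
    texts = []
    for chunk_id in neighbor_ids:
        doc = db.collection("doc_chunks").document(chunk_id).get()
        if doc.exists:
            texts.append(doc.to_dict()["text"])
    return texts

# Inside retrieve_context(), the simulated mapping would become:
# retrieved_text = fetch_chunk_texts([n.id for n in response[0]])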

2. "Grounded" Generation

By strictly instructing the model in the prompt ("Use the provided context..."), we reduce hallucinations. If the vector search returns irrelevant results, a well-tuned prompt will cause Gemini to reply "I don't know," which is safer for enterprise use cases than a made-up answer. Gemini 1.5 also accepts a dedicated system instruction, which keeps these grounding rules separate from the user-facing prompt; a sketch of that follows.
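
The sketch below shows that variant. The context string is a placeholder standing in for the chunks returned by retrieve_context(), and vertexai.init() is assumed to have run as in the main script.

Python
from vertexai.generative_models import GenerativeModel

# Grounding rules live in the system instruction; each request then only
# carries the retrieved context and the user's question.
model = GenerativeModel(
    "gemini-1.5-pro",
    system_instruction=(
        "You are an expert assistant. Answer strictly from the provided context. "
        "If the context does not contain the answer, say that you do not know."
    ),
)

context_str = "The new API must respond within 200 ms at the 95th percentile."  # placeholder
query = "What are the latency requirements for the new API?"

response = model.generate_content(f"CONTEXT:\n{context_str}\n\nQUESTION:\n{query}")
print(response.text)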

3. Latency Optimization

Using text-embedding-004 means we are on Google's latest text embedding model, optimized for retrieval workloads (a separate multilingual variant, text-multilingual-embedding-002, is available if you need multilingual support). Furthermore, Vertex AI Vector Search offers "Streaming Updates," meaning you can push new document vectors into a live index and have them searchable within seconds, rather than waiting for a periodic batch re-index; the upsert call is sketched below.
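
The sketch below shows a streaming upsert. It assumes the index was created with index_update_method="STREAM_UPDATE", reuses the get_embedding() helper from the script above, and uses placeholder IDs; keep the datapoint ID in sync with the key used by your document store so the lookup pattern from section 1 keeps working.

Python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Note: upserts go to the Index itself, not the Index Endpoint.
index = aiplatform.MatchingEngineIndex(index_name="your-index-id")

new_vector = get_embedding("The new API must respond within 200 ms at the 95th percentile.")

# On a STREAM_UPDATE index the datapoint becomes searchable within seconds,
# with no batch re-indexing job required.
index.upsert_datapoints(
    datapoints=[{"datapoint_id": "doc-latency-001", "feature_vector": new_vector}]
)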


Conclusion

Building a RAG pipeline transforms Gemini from a creative writer into a knowledgeable subject matter expert. By combining the reasoning power of Gemini 1.5 Pro with the retrieval speed of Vertex AI, you can build applications that are accurate, scalable, and secure.

