Retrieval

Source: https://docs.langchain.com/oss/python/langchain/retrieval.

Large Language Models (LLMs) are powerful, but they have two key limitations:

  1. Static knowledge: their training data is frozen at a point in time, so they know nothing about anything that happened after training.
  2. Finite context: they can only process a limited number of tokens at once, so they cannot read an entire knowledge base in a single prompt.

Retrieval addresses these problems by fetching relevant external knowledge at query time. This is the foundation of Retrieval-Augmented Generation (RAG): enhancing an LLM’s answers with context-specific information.

Building a knowledge base

A knowledge base is a repository of documents or structured data used during retrieval.

If you already have a knowledge base (e.g., a SQL database, CRM, or internal documentation system), you do not need to rebuild it. You can use it in either of two ways:

  1. Agentic RAG: expose the knowledge base as a tool, so the agent decides whether (and how) to retrieve.
  2. 2-Step RAG: always query the knowledge base first and supply the retrieved content as context to the LLM (the agent doesn’t get to decide).

If you need a custom knowledge base, you can use LangChain’s document loaders and vector stores to build one from your own data.

Retrieval pipeline

A typical retrieval workflow looks like this:

graph TD
    %% Input processing
    subgraph "📥 Input processing"
        A[/Raw Data/] --> B[Loaders]
        B --> B2[/Documents/]
        B2 --> C[Splitters]
        C --> D[/Chunks/]
    end

    %% Embedding & storage
    subgraph "🔢 Embedding & storage"
        D --> E[Embedding models]
        E --> F[/Vectors/]
        F --> G[(Vector stores)]
    end

    %% Retrieval
    subgraph "🔍 Retrieval"
        H[/User Query/] --> I["Embedding models (same)"]
        I --> J[/Vector/]
        J --> K[Retrievers]
        K -- Vector --> G
        G -- Chunks --> K
        K --> L[/Relevant context/]
    end

    %% Generation
    subgraph "Generation"
        L -- Augment --> O[LLM]
        O --> P[/Answer/]
    end

Each component is modular: you can swap loaders, splitters, embeddings, or vector stores without rewriting the app’s logic.

Building blocks

1. Load

Document loaders ingest data from external sources (Google Drive, Slack, Notion, etc.), returning standardized Document objects.

flowchart LR
  S[/"Sources<br>(Google Drive, Slack, Notion, etc.)"/] --> L[Document Loaders]
  L --> A[/Documents/]

2. Split

Text splitters break large documents into smaller chunks that can be retrieved individually.

flowchart LR
  A[/Documents/] --> B[Splitter]
  B --> C1[/Chunk 1/]
  B --> C2[/Chunk 2/]
  B --> C3[/Chunk 3/]

Why do we split?

A. Navigating Context Window Limits

Even the most advanced LLMs have a context window: a maximum number of tokens they can process at one time.

  • The Problem: Many documents (legal contracts, technical manuals, books) are significantly larger than an LLM’s context window.
  • The Solution: By splitting a 500-page manual into 500-word chunks, an agent can “pull” only the relevant pieces into its memory to answer a specific query without hitting a technical ceiling.

B. Reducing “Lost in the Middle” Phenomena

Research shows that LLMs are best at recalling information located at the very beginning or the very end of a prompt. Information buried in the middle of a massive block of text is often ignored.

By providing the agent with several small, concise chunks, you ensure the relevant information stays at the “top of mind” for the model, leading to higher reasoning accuracy.

C. Cost and Latency

Every token sent to an LLM costs money and takes time to process.

  • Efficiency: If an agent needs to know a specific date in a 100MB file, sending the whole file is expensive and slow.
  • Speed: Sending three 500-token chunks is nearly instantaneous and costs a fraction of the price, allowing the agent to move through its task pipeline much faster.

3. Embed

Embedding models turn text into a vector of numbers so that texts with similar meaning land close together in that vector space.

flowchart LR
  D1[Embedding Model]
  D2[Embedding Model]
  D3[Embedding Model]
  C1[/Chunk 1/] --> D1 --> E1[/Vector 1/]
  C2[/Chunk 2/] --> D2 --> E2[/Vector 2/]
  C3[/Chunk 3/] --> D3 --> E3[/Vector 3/]

4. Store

Vector stores are specialized databases for storing and searching embeddings.

flowchart LR
    S[(Vector Store)]
    E1[/Vector 1/] --> S
    E2[/Vector 2/] --> S
    E3[/Vector 3/] --> S

5. Retrieve

Retrievers are interfaces that return relevant documents given an unstructured query.

flowchart LR
    R[Retriever] --> S[(Vector Store)]
    S --> R
    R --> C1[/Chunk 14/]
    R --> C2[/Chunk 8/]
    R --> C3[/Chunk 62/]

RAG architectures

RAG can be implemented in multiple ways, depending on your system’s needs. We outline each type in the sections below.

| Architecture | Description | Control | Flexibility | Latency | Example Use Case |
|---|---|---|---|---|---|
| 2-Step RAG | Retrieval always happens before generation; simple and predictable | ✅ High | ❌ Low | ⚡ Fast | FAQs, documentation bots |
| Agentic RAG | An LLM-powered agent decides when and how to retrieve during reasoning | ❌ Low | ✅ High | ⏳ Variable | Research assistants with access to multiple tools |
| Hybrid | Combines characteristics of both approaches with validation steps | ⚖️ Medium | ⚖️ Medium | ⏳ Variable | Domain-specific Q&A with quality validation |

Note 1: Latency is generally more predictable in 2-Step RAG, since the maximum number of LLM calls is known and capped.

Note 2: Real-world latency may also be affected by the performance of retrieval steps, such as API response times, network delays, or database queries, which can vary with the tools and infrastructure in use.

A. 2-step RAG

In 2-Step RAG, the retrieval step is always executed before the generation step. This architecture is straightforward and predictable, making it suitable for many applications where the retrieval of relevant documents is a clear prerequisite for generating an answer.

graph LR
    A[User Question] --> B["Retrieve Relevant Documents"]
    B --> C["Generate Answer"]
    C --> D[Return Answer to User]

    %% Styling
    classDef startend fill:#2e7d32,stroke:#1b5e20,stroke-width:2px,color:#fff
    classDef process fill:#1976d2,stroke:#0d47a1,stroke-width:1.5px,color:#fff

    class A,D startend
    class B,C process

B. Agentic RAG

Agentic RAG combines retrieval with agent-based reasoning. Instead of always retrieving documents before answering, an agent (powered by an LLM) reasons step by step and decides when and how to retrieve information during the interaction.

graph TD
    A[User Input / Question] --> B["Agent (LLM)"]
    B --> C{Need external info?}
    C -- Yes --> D["Search using tool(s)"]
    D --> H{Enough to answer?}
    H -- No --> B
    H -- Yes --> I[Generate final answer]
    C -- No --> I
    I --> J[Return to user]

    %% Dark-mode friendly styling
    classDef startend fill:#2e7d32,stroke:#1b5e20,stroke-width:2px,color:#fff
    classDef decision fill:#f9a825,stroke:#f57f17,stroke-width:2px,color:#000
    classDef process fill:#1976d2,stroke:#0d47a1,stroke-width:1.5px,color:#fff

    class A,J startend
    class B,D,I process
    class C,H decision

The only thing an agent needs to enable RAG behavior is access to one or more tools that can fetch external knowledge — such as documentation loaders, web APIs, or database queries.

import requests
from langchain.tools import tool
from langchain.chat_models import init_chat_model
from langchain.agents import create_agent


@tool
def fetch_url(url: str) -> str:
    """Fetch text content from a URL"""
    response = requests.get(url, timeout=10.0)
    response.raise_for_status()
    return response.text

system_prompt = """\
Use fetch_url when you need to fetch information from a web-page; quote relevant snippets.
"""

agent = create_agent(
    model="claude-sonnet-4-5-20250929",
    tools=[fetch_url], # A tool for retrieval
    system_prompt=system_prompt,
)

C. Hybrid RAG

Hybrid RAG combines characteristics of both 2-Step and Agentic RAG. It introduces intermediate steps such as query preprocessing, retrieval validation, and post-generation checks. These systems offer more flexibility than fixed pipelines while maintaining some control over execution.

Typical components include:

  • Query enhancement: Modify the input question to improve retrieval quality.
    • Rewrite unclear queries, generate multiple variations, or expand queries with additional context.
  • Retrieval validation: Evaluate whether retrieved documents are relevant and sufficient.
    • If not, the system may refine the query and retrieve again.
  • Answer validation: Check the generated answer for accuracy, completeness, and alignment with source content.
    • If needed, the system can regenerate or revise the answer.

The architecture often supports multiple iterations between these steps:

graph TD
    A[User Question] --> B[Query Enhancement]
    B --> C[Retrieve Documents]
    C --> D{Sufficient Info?}
    D -- No --> E[Refine Query]
    E --> C
    D -- Yes --> F[Generate Answer]
    F --> G{Answer Quality OK?}
    G -- No --> H{Try Different Approach?}
    H -- Yes --> E
    H -- No --> I[Return Best Answer]
    G -- Yes --> I
    I --> J[Return to User]

    classDef startend fill:#2e7d32,stroke:#1b5e20,stroke-width:2px,color:#fff
    classDef decision fill:#f9a825,stroke:#f57f17,stroke-width:2px,color:#000
    classDef process fill:#1976d2,stroke:#0d47a1,stroke-width:1.5px,color:#fff

    class A,J startend
    class B,C,E,F,I process
    class D,G,H decision

This architecture is suitable for:

  • Applications with ambiguous or underspecified queries
  • Systems that require validation or quality control steps
  • Workflows involving multiple sources or iterative refinement
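
The control loop from the diagram can be sketched in plain Python, with every component stubbed out; in a real system each stub (query enhancement, retrieval, the two validation checks, generation) would be backed by an LLM call or a retriever, and the iteration cap bounds cost:

```python
# Hypothetical stubs standing in for real LLM-backed components.
MAX_ITERATIONS = 3

def enhance_query(query: str, attempt: int) -> str:
    # Query enhancement: rewrite or expand the query on retries.
    return query if attempt == 0 else f"{query} (rephrased, attempt {attempt})"

def retrieve(query: str) -> list[str]:
    return [f"doc for: {query}"]

def is_sufficient(docs: list[str]) -> bool:
    # Retrieval validation: are the documents relevant and sufficient?
    return len(docs) > 0

def generate(query: str, docs: list[str]) -> str:
    return f"answer based on {len(docs)} document(s)"

def answer_quality_ok(answer: str) -> bool:
    # Answer validation: accuracy, completeness, grounding in sources.
    return "answer" in answer

def hybrid_rag(question: str) -> str:
    best_answer = ""
    for attempt in range(MAX_ITERATIONS):
        query = enhance_query(question, attempt)
        docs = retrieve(query)
        if not is_sufficient(docs):
            continue                      # refine the query and retry
        best_answer = generate(query, docs)
        if answer_quality_ok(best_answer):
            return best_answer
    return best_answer                    # best effort once the cap is hit

print(hybrid_rag("What changed in the latest release?"))
```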

Key Takeaways

  • Retrieval solves two LLM limits:
    • Static (frozen) knowledge and finite context.
    • Fetching relevant external information at query time is the foundation of RAG (Retrieval-Augmented Generation).
  • You don’t always need a new knowledge base:
    • If you already have one (SQL, CRM, docs), use it as a tool (Agentic RAG) or as fixed context (2-Step RAG).
    • For custom data, use loaders and vector stores.
  • The retrieval pipeline is modular:
    • Load (document loaders) → Split (text splitters) → Embed (embedding models) → Store (vector stores) → Retrieve (retrievers).
    • You can swap any component without changing the rest of the app.
  • Splitting matters for:
    • Staying within context windows.
    • Avoiding “lost in the middle” (LLMs recall start/end of input better).
    • Lowering cost and latency.
  • RAG comes in three main forms:
    • 2-Step RAG: always retrieve then generate; high control, fast, predictable.
    • Agentic RAG: the agent decides when and how to retrieve; high flexibility, variable latency.
    • Hybrid RAG: query enhancement, retrieval validation, answer validation; balanced control and flexibility.