graph TD
%% Input processing
subgraph "📥 Input processing"
A[/Raw Data/] --> B[Loaders]
B --> B2[/Documents/]
B2 --> C[Splitters]
C --> D[/Chunks/]
end
%% Embedding & storage
subgraph "🔢 Embedding & storage"
D --> E[Embedding models]
E --> F[/Vectors/]
F --> G[(Vector stores)]
end
%% Retrieval
subgraph "🔍 Retrieval"
H[/User Query/] --> I["Embedding models (same)"]
I --> J[/Vector/]
J --> K[Retrievers]
K -- Vector --> G
G -- Chunks --> K
K --> L[/Relevant context/]
end
%% Generation
subgraph "✍️ Generation"
H -- Query --> O[LLM]
L -- Augment --> O
O --> P[/Answer/]
end
Retrieval
Source: https://docs.langchain.com/oss/python/langchain/retrieval
Large Language Models (LLMs) are powerful, but they have two key limitations:
- Static knowledge — their training data is frozen at a point in time.
- Finite context — they can’t ingest entire corpora at once.
Retrieval addresses these problems by fetching relevant external knowledge at query time. This is the foundation of Retrieval-Augmented Generation (RAG): enhancing an LLM’s answers with context-specific information.
Building a knowledge base
A knowledge base is a repository of documents or structured data used during retrieval.
If you already have a knowledge base (e.g., a SQL database, CRM, or internal documentation system), you do not need to rebuild it. You can connect it in either of two ways:
- Agentic RAG: expose the knowledge base as a tool, so the agent decides whether and when to retrieve.
- 2-Step RAG: always query the knowledge base and supply the retrieved content as context to the LLM (no agent decision involved).
If you need a custom knowledge base, you can use LangChain’s document loaders and vector stores to build one from your own data.
Retrieval pipeline
A typical retrieval workflow follows the pipeline shown in the diagram at the top of this page: load, split, embed, store, retrieve.
Each component is modular: you can swap loaders, splitters, embedding models, or vector stores without rewriting the app's logic.
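To make the moving parts concrete, here is a toy end-to-end sketch in plain Python. Nothing here is LangChain's API: the hashing-based `embed` function and the list-based store are deliberately naive stand-ins for real embedding models and vector databases.

```python
import hashlib
import math

def split(text: str, chunk_size: int = 40) -> list[str]:
    # Naive fixed-size splitter (real splitters respect sentence boundaries).
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(text: str, dims: int = 8) -> list[float]:
    # Toy hashing "embedding"; a stand-in for a real embedding model.
    vec = [0.0] * dims
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Store: a list of (vector, chunk) pairs — a stand-in for a vector database.
store = []
for chunk in split("LangChain supports many vector stores. "
                   "Retrieval fetches relevant chunks at query time."):
    store.append((embed(chunk), chunk))

def retrieve(query: str, k: int = 2) -> list[str]:
    # Embed the query with the SAME model, then rank chunks by similarity.
    qv = embed(query)
    return [c for _, c in sorted(store, key=lambda p: -cosine(p[0], qv))[:k]]
```

Swapping the splitter, the embedding function, or the store leaves the rest of the sketch untouched, which is the modularity point above.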
Building blocks
1. Load
Document loaders
Ingest data from external sources (Google Drive, Slack, Notion, etc.), returning standardized Document objects.
flowchart LR
S[/"Sources<br>(Google Drive, Slack, Notion, etc.)"/] --> L[Document Loaders]
L --> A[/Documents/]
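A minimal loader can be sketched in a few lines. The `load_text_files` helper below is hypothetical (real loaders for Google Drive, Slack, and similar sources handle authentication and APIs), but the dicts it returns mirror the `page_content`/`metadata` shape of Document objects.

```python
from pathlib import Path

def load_text_files(directory: str) -> list[dict]:
    """Minimal loader: read every .txt file into a Document-like dict.

    Real document loaders differ in how they fetch content, but they all
    return the same standardized shape, so downstream steps don't care
    where the text came from.
    """
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        docs.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return docs
```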
2. Split
Text splitters
Break large documents into smaller chunks that can be retrieved individually.
flowchart LR
A[/Documents/] --> B[Splitter]
B --> C1[/Chunk 1/]
B --> C2[/Chunk 2/]
B --> C3[/Chunk 3/]
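A fixed-size splitter with overlap is easy to sketch. This is a naive character-based version; production splitters (e.g. recursive character splitters) also try to break on paragraph and sentence boundaries.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Fixed-size character splitter with overlap.

    Overlap keeps a sentence that straddles a chunk boundary retrievable
    from at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, leaving the overlap
    return chunks
```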
Why do we split?
A. Fitting the Context Window
LLMs can only ingest a limited number of tokens at once. Splitting ensures each retrievable unit is small enough to fit into the prompt alongside the question and any other context.
B. Reducing the “Lost in the Middle” Phenomenon
Research shows that LLMs are best at recalling information located at the very beginning or the very end of a prompt. Information buried in the middle of a massive block of text is often ignored.
By providing the agent with several small, concise chunks, you ensure the relevant information stays at the “top of mind” for the model, leading to higher reasoning accuracy.
C. Cost and Latency
Every token sent to an LLM costs money and takes time to process.
- Efficiency: If an agent needs to know a specific date in a 100MB file, sending the whole file is expensive and slow.
- Speed: Sending three 500-token chunks is nearly instantaneous and costs a fraction of the price, allowing the agent to move through its task pipeline much faster.
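A back-of-envelope calculation makes the difference vivid. The price and the chars-per-token ratio below are illustrative assumptions, not any provider's actual rates.

```python
# Illustrative assumptions: $3 per million input tokens, ~4 chars per token.
PRICE_PER_TOKEN = 3 / 1_000_000
CHARS_PER_TOKEN = 4

def input_cost(num_chars: int) -> float:
    """Rough input cost in dollars for sending num_chars to the model."""
    return (num_chars / CHARS_PER_TOKEN) * PRICE_PER_TOKEN

whole_file = input_cost(100 * 1024 * 1024)            # the whole 100 MB file
three_chunks = input_cost(3 * 500 * CHARS_PER_TOKEN)  # three 500-token chunks

print(f"whole file: ${whole_file:.2f}, three chunks: ${three_chunks:.5f}")
```

Under these assumptions the whole file costs tens of dollars per query (and would not fit any real context window anyway), while three chunks cost a fraction of a cent.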
3. Embed
Embedding models
An embedding model turns text into a vector of numbers so that texts with similar meaning land close together in that vector space.
flowchart LR
C1[/Chunk 1/] --> D1[Embedding Model] --> E1[/Vector 1/]
C2[/Chunk 2/] --> D2[Embedding Model] --> E2[/Vector 2/]
C3[/Chunk 3/] --> D3[Embedding Model] --> E3[/Vector 3/]
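"Close together" usually means high cosine similarity. A toy illustration with hand-written 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, produced by a model rather than by hand):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hand-made "embeddings": semantically close texts get similar vectors.
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.2, 0.05]
invoice = [0.0, 0.1, 0.95]

assert cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice)
```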
4. Store
Vector stores
Specialized databases for storing and searching embeddings.
flowchart LR
S[(Vector Store)]
E1[/Vector 1/] --> S
E2[/Vector 2/] --> S
E3[/Vector 3/] --> S
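A vector store can be sketched as a list plus a similarity search. The `ToyVectorStore` below does exact search over a Python list; real vector stores add persistence and approximate-nearest-neighbor indexes (HNSW, IVF, etc.) to stay fast at millions of vectors.

```python
import math

class ToyVectorStore:
    """Toy vector store: exact nearest-neighbor search over a Python list."""

    def __init__(self):
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], chunk: str) -> None:
        self._items.append((vector, chunk))

    def search(self, query_vector: list[float], k: int = 3) -> list[str]:
        def sim(v: list[float]) -> float:
            dot = sum(x * y for x, y in zip(v, query_vector))
            norm = (math.sqrt(sum(x * x for x in v))
                    * math.sqrt(sum(x * x for x in query_vector)))
            return dot / norm if norm else 0.0
        # Rank every stored vector by similarity; return the top-k chunks.
        ranked = sorted(self._items, key=lambda item: -sim(item[0]))
        return [chunk for _, chunk in ranked[:k]]
```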
5. Retrieve
Retrievers
A retriever is an interface that returns documents given an unstructured query.
flowchart LR
Q[/User Query/] --> R[Retriever]
R -- Search --> S[(Vector Store)]
S -- Results --> R
R --> C1[/Chunk 14/]
R --> C2[/Chunk 8/]
R --> C3[/Chunk 62/]
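As a sketch, a retriever is just this interface: query text in, relevant chunks out. The `embed_fn` and `store` arguments below are placeholders for a real embedding model and vector store.

```python
class Retriever:
    """Toy retriever: embeds the query and delegates to a vector store."""

    def __init__(self, embed_fn, store, k: int = 3):
        self.embed_fn = embed_fn  # callable: text -> vector
        self.store = store        # must expose search(vector, k=...)
        self.k = k

    def invoke(self, query: str) -> list[str]:
        # Embed the query with the same model used for the chunks,
        # then let the store do the similarity search.
        return self.store.search(self.embed_fn(query), k=self.k)
```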
RAG architectures
RAG can be implemented in multiple ways, depending on your system’s needs. We outline each type in the sections below.
| Architecture | Description | Control | Flexibility | Latency | Example Use Case |
|---|---|---|---|---|---|
| 2-Step RAG | Retrieval always happens before generation. Simple and predictable | ✅ High | ❌ Low | ⚡ Fast | FAQs, documentation bots |
| Agentic RAG | An LLM-powered agent decides when and how to retrieve during reasoning | ❌ Low | ✅ High | ⏳ Variable | Research assistants with access to multiple tools |
| Hybrid | Combines characteristics of both approaches with validation steps | ⚖️ Medium | ⚖️ Medium | ⏳ Variable | Domain-specific Q&A with quality validation |
Note 1: Latency is generally more predictable in 2-Step RAG, since the maximum number of LLM calls is known and capped.
Note 2: Real-world latency is also affected by the retrieval steps themselves (API response times, network delays, database queries), which vary with the tools and infrastructure in use.
A. 2-Step RAG
In 2-Step RAG, the retrieval step is always executed before the generation step. This architecture is straightforward and predictable, making it suitable for many applications where the retrieval of relevant documents is a clear prerequisite for generating an answer.
graph LR
A[User Question] --> B["Retrieve Relevant Documents"]
B --> C["Generate Answer"]
C --> D[Return Answer to User]
%% Styling
classDef startend fill:#2e7d32,stroke:#1b5e20,stroke-width:2px,color:#fff
classDef process fill:#1976d2,stroke:#0d47a1,stroke-width:1.5px,color:#fff
class A,D startend
class B,C process
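The fixed two-step flow can be sketched as a plain function. The `retriever` and `llm` arguments are placeholders: any callable that returns chunks for a question, and any callable that completes a prompt.

```python
def two_step_rag(question: str, retriever, llm) -> str:
    """2-Step RAG: retrieval ALWAYS runs first, then generation.

    retriever(question) -> list of chunk strings (placeholder)
    llm(prompt) -> answer text (placeholder)
    """
    chunks = retriever(question)        # Step 1: always retrieve
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)                  # Step 2: generate
```

Because the LLM is called exactly once per question, both cost and latency are bounded and predictable, which is the main appeal of this architecture.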
B. Agentic RAG
Agentic Retrieval-Augmented Generation (RAG) combines the strengths of Retrieval-Augmented Generation with agent-based reasoning. Instead of retrieving documents before answering, an agent (powered by an LLM) reasons step-by-step and decides when and how to retrieve information during the interaction.
graph TD
A[User Input / Question] --> B["Agent (LLM)"]
B --> C{Need external info?}
C -- Yes --> D["Search using tool(s)"]
D --> H{Enough to answer?}
H -- No --> B
H -- Yes --> I[Generate final answer]
C -- No --> I
I --> J[Return to user]
%% Dark-mode friendly styling
classDef startend fill:#2e7d32,stroke:#1b5e20,stroke-width:2px,color:#fff
classDef decision fill:#f9a825,stroke:#f57f17,stroke-width:2px,color:#000
classDef process fill:#1976d2,stroke:#0d47a1,stroke-width:1.5px,color:#fff
class A,J startend
class B,D,I process
class C,H decision
The only thing an agent needs to enable RAG behavior is access to one or more tools that can fetch external knowledge — such as documentation loaders, web APIs, or database queries.
import requests
from langchain.tools import tool
from langchain.agents import create_agent

@tool
def fetch_url(url: str) -> str:
    """Fetch text content from a URL."""
    response = requests.get(url, timeout=10.0)
    response.raise_for_status()
    return response.text

system_prompt = """\
Use fetch_url when you need to fetch information from a web page; quote relevant snippets.
"""

agent = create_agent(
    model="claude-sonnet-4-5-20250929",
    tools=[fetch_url],  # A tool for retrieval
    system_prompt=system_prompt,
)

C. Hybrid RAG
Hybrid RAG combines characteristics of both 2-Step and Agentic RAG. It introduces intermediate steps such as query preprocessing, retrieval validation, and post-generation checks. These systems offer more flexibility than fixed pipelines while maintaining some control over execution.
Typical components include:
- Query enhancement: Modify the input question to improve retrieval quality.
  - Rewrite unclear queries, generate multiple variations, or expand queries with additional context.
- Retrieval validation: Evaluate whether retrieved documents are relevant and sufficient.
  - If not, the system may refine the query and retrieve again.
- Answer validation: Check the generated answer for accuracy, completeness, and alignment with source content.
  - If needed, the system can regenerate or revise the answer.
The architecture often supports multiple iterations between these steps:
graph TD
A[User Question] --> B[Query Enhancement]
B --> C[Retrieve Documents]
C --> D{Sufficient Info?}
D -- No --> E[Refine Query]
E --> C
D -- Yes --> F[Generate Answer]
F --> G{Answer Quality OK?}
G -- No --> H{Try Different Approach?}
H -- Yes --> E
H -- No --> I[Return Best Answer]
G -- Yes --> I
I --> J[Return to User]
classDef startend fill:#2e7d32,stroke:#1b5e20,stroke-width:2px,color:#fff
classDef decision fill:#f9a825,stroke:#f57f17,stroke-width:2px,color:#000
classDef process fill:#1976d2,stroke:#0d47a1,stroke-width:1.5px,color:#fff
class A,J startend
class B,C,E,F,I process
class D,G,H decision
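The loop in the diagram can be sketched as a plain function. All five callables are placeholders for real components (an LLM call, a vector store query, a grader model); the iteration cap keeps latency bounded even when validation keeps failing.

```python
def hybrid_rag(question, enhance, retrieve, generate,
               validate_docs, validate_answer, max_iters: int = 3) -> str:
    """Hybrid RAG: enhance -> retrieve -> validate -> generate -> validate.

    All callables are placeholders:
      enhance(query) -> improved query
      retrieve(query) -> list of docs
      generate(question, docs) -> answer
      validate_docs(docs) / validate_answer(answer, docs) -> bool
    """
    query = enhance(question)
    best_answer = ""
    for _ in range(max_iters):
        docs = retrieve(query)
        if not validate_docs(docs):
            query = enhance(query)       # refine the query and retrieve again
            continue
        answer = generate(question, docs)
        best_answer = answer
        if validate_answer(answer, docs):
            return answer
        query = enhance(query)           # try a different approach
    return best_answer                   # fall back to the best attempt so far
```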
This architecture is suitable for:
- Applications with ambiguous or underspecified queries
- Systems that require validation or quality control steps
- Workflows involving multiple sources or iterative refinement
Key Takeaways
- Retrieval solves two LLM limits:
  - Static (frozen) knowledge and finite context.
  - Fetching relevant external information at query time is the foundation of RAG (Retrieval-Augmented Generation).
- You don’t always need a new knowledge base:
  - If you already have one (SQL, CRM, docs), use it as a tool (Agentic RAG) or as fixed context (2-Step RAG).
  - For custom data, use loaders and vector stores.
- The retrieval pipeline is modular:
  - Load (document loaders) → Split (text splitters) → Embed (embedding models) → Store (vector stores) → Retrieve (retrievers).
  - You can swap any component without changing the rest of the app.
- Splitting matters for:
  - Staying within context windows.
  - Avoiding “lost in the middle” (LLMs recall the start and end of the input better).
  - Lowering cost and latency.
- RAG comes in three main forms:
  - 2-Step RAG: always retrieve, then generate; high control, fast, predictable.
  - Agentic RAG: the agent decides when and how to retrieve; high flexibility, variable latency.
  - Hybrid RAG: query enhancement, retrieval validation, answer validation; balanced control and flexibility.