Open In Colab

Semantic Search: Indexing and Retrieval

Source: Build a semantic search engine with LangChain | LangChain.

Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.

This guide focuses on LangChain’s abstractions related to indexing and retrieval of text data:

  1. Indexing
    1. Input processing – Transform raw data into structured documents
    2. Embedding & storage – Convert text into searchable vector representations
  2. Retrieval – Find relevant information based on user queries

This Relevant context would later be augmented (added) into an LLM’s context window, allowing the LLM to answer questions based on data. The term for this kind of interaction is called: Retrieval Augmented Generation (RAG).

Setup

Installation

This tutorial requires the langchain-community and pypdf packages. Using uv:

%pip install langchain-community pypdf

For more details, see our Installation guide.

1. Documents and document loaders

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

  • page_content: a string representing the content;
  • metadata: a dict containing arbitrary metadata;
  • id: (optional) a string identifier for the document.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.

We can generate sample documents when desired:

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

However, the LangChain ecosystem implements document loaders that integrate with hundreds of common sources. This makes it easy to incorporate data from these sources into your AI application.

Loading documents

Let’s load a PDF into a sequence of Document objects. First, fetch a PDF from arXiv (e.g. the “Attention Is All You Need” paper) using curl:

curl -L -o paper.pdf "https://arxiv.org/pdf/1706.03762.pdf"

Then load it with the PyPDFLoader:

from langchain_community.document_loaders import PyPDFLoader

file_path = "paper.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))
15

PyPDFLoader loads one Document object per PDF page. For each, we can easily access:

  • The string content of the page;
  • Metadata containing the file name and page number.
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need


{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}

Splitting

For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not “washed out” by surrounding text.

We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set add_start_index=True so that the character index where each split Document starts within the initial Document is preserved as metadata attribute start_index.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

print(len(all_splits))
52

2. Embeddings

Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.

LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. We use HuggingFace with the all-mpnet-base-v2 sentence transformer:

%pip install langchain-huggingface sentence-transformers
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
/home/halgoz/work/ai-agents/content/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])
Generated vectors of length 768

[0.00369695364497602, 0.017323777079582214, -0.01369203720241785, 0.00036987385828979313, -0.05261596664786339, -0.0010653170756995678, 0.002715452341362834, -0.02163930982351303, -0.06304091215133667, -0.0036578048020601273]

Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

3. Vector stores

LangChain VectorStore objects contain methods for adding text and Document objects to the store, and querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.

LangChain includes a suite of integrations with different vector store technologies. For this tutorial we use the in-memory vector store, which is lightweight and requires no extra infrastructure:

from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

Having instantiated our vector store, we can now index the documents.

ids = vector_store.add_documents(documents=all_splits)

Note that most vector store implementations will allow you to connect to an existing vector store– e.g., by providing a client, index name, or other information. See the documentation for a specific integration for more detail.

Once we’ve instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

  • Synchronously and asynchronously;
  • By string query and by vector;
  • With and without returning similarity scores;
  • By similarity and maximum marginal relevance (to balance similarity with query to diversity in retrieved results).

The methods will generally include a list of Document objects in their outputs.

Usage

Embeddings typically represent text as a “dense” vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific key-terms used in the document.

Return documents based on similarity to a string query:

results = vector_store.similarity_search(
    "What is the main architecture proposed in the paper?"
)

print(results[0])
page_content='itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two
sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head
attention over the output of the encoder stack. Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This
masking, combined with fact that the output embeddings are offset by one position, ensures that the
predictions for position i can depend only on the known outputs at positions less than i.
3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output,' metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 770}

Async query:

results = await vector_store.asimilarity_search("What is self-attention?")

print(results[0])
page_content='3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output,
where the query, keys, values, and output are all vectors. The output is computed as a weighted sum
3' metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 1611}

Return scores:

# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("How does the Transformer encoder work?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)
Score: 0.668251152922073

page_content='Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512.' metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 0}

Return documents based on similarity to an embedded query:

embedding = embeddings.embed_query("What are the advantages of the Transformer over RNNs?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])
page_content='Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base
model. All metrics are on the English-to-German translation development set, newstest2013. Listed
perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to
per-word perplexities.
N d model dff h d k dv Pdrop ϵls
train PPL BLEU params
steps (dev) (dev) ×106
base 6 512 2048 8 64 64 0.1 0.1 100K 4.92 25.8 65
(A)
1 512 512 5.29 24.9
4 128 128 5.00 25.5
16 32 32 4.91 25.8
32 16 16 5.01 25.4
(B) 16 5.16 25.1 58
32 5.01 25.4 60
(C)
2 6.11 23.7 36
4 5.19 25.3 50
8 4.88 25.5 80
256 32 32 5.75 24.5 28
1024 128 128 4.66 26.0 168
1024 5.12 25.4 53
4096 4.75 26.2 90
(D)
0.0 5.77 24.6
0.2 4.95 25.5
0.0 4.67 25.3
0.2 5.47 25.7
(E) positional embedding instead of sinusoids 4.92 25.7
big 6 1024 4096 16 0.3 300K 4.33 26.4 213
development set, newstest2013. We used beam search as described in the previous section, but no' metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 8, 'page_label': '9', 'start_index': 0}

Learn more:

4. Retrievers

Vectorstores implement an as_retriever method that will generate a Retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify what methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following:

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "What is the main architecture proposed in the paper?",
        "What is self-attention?",
    ],
)
[[Document(id='3a9a01f8-683c-471c-9dda-ba075d260afd', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 770}, page_content='itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\nlayers, produce outputs of dimension dmodel = 512.\nDecoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two\nsub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head\nattention over the output of the encoder stack. Similar to the encoder, we employ residual connections\naround each of the sub-layers, followed by layer normalization. We also modify the self-attention\nsub-layer in the decoder stack to prevent positions from attending to subsequent positions. This\nmasking, combined with fact that the output embeddings are offset by one position, ensures that the\npredictions for position i can depend only on the known outputs at positions less than i.\n3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,')],
 [Document(id='417ae00c-a9d5-49ee-b5f3-2a8ecb5cd8ed', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 1611}, page_content='3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,\nwhere the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n3')]]

VectorStoreRetriever supports search types of "similarity" (default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to threshold documents output by the retriever by similarity score.

Key Takeaways

  • Indexing vs. retrieval – Indexing turns raw data into structured documents and embeds them for search; retrieval finds relevant passages for a query. Together they form the basis of RAG (Retrieval Augmented Generation).
  • Documents – LangChain’s Document has page_content, metadata, and optional id. Use document loaders (e.g. PyPDFLoader) to load from PDFs and other sources.
  • Chunking – Split documents into smaller chunks (e.g. with RecursiveCharacterTextSplitter) so retrieval returns focused passages; use overlap to avoid cutting off context.
  • Embeddings – Text is turned into dense vectors (e.g. via HuggingFace all-mpnet-base-v2); similar meaning → similar vectors, enabling semantic (meaning-based) search instead of keyword match.
  • Vector stores – Store and query embeddings (e.g. InMemoryVectorStore); support sync/async, by query or by vector, with or without scores, and MMR for diversity.
  • Retrievers – Use vector_store.as_retriever() with search_type ("similarity", "mmr", "similarity_score_threshold") and search_kwargs (e.g. k) for easy integration into chains and RAG pipelines.