Text Embeddings Overview
From words to documents
- Word embeddings: Individual words → vectors
- Document embeddings: Entire documents → vectors
- Context-aware: Meaning depends on surrounding text
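Whatever the model, the common currency is a vector whose angle to other vectors encodes semantic similarity. A minimal sketch with hypothetical, made-up 4-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; real models produce hundreds of dimensions.
king = [0.9, 0.8, 0.1, 0.3]
queen = [0.8, 0.9, 0.2, 0.3]
banana = [0.1, 0.2, 0.9, 0.7]

# Semantically related words end up with higher cosine similarity.
assert cosine_similarity(king, queen) > cosine_similarity(king, banana)
```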
Word Embeddings
Individual words as vectors
- Word2Vec: Predicts a word from its context (CBOW) or context from a word (skip-gram)
- GloVe: Global word-word co-occurrence statistics
- FastText: Handles out-of-vocabulary words
![Word embeddings showing similar words clustered together in vector space]()
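The signal all three methods exploit is which words appear near which others. A toy sketch of the window-based co-occurrence counts that GloVe, for example, factorizes into word vectors:

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each word pair appears within `window` tokens
    of each other; GloVe builds its embeddings from these statistics."""
    counts = Counter()
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(center, tokens[j])] += 1
    return counts

corpus = "the cat sat on the mat the cat ran".split()
counts = cooccurrence_counts(corpus)
# Words that keep appearing together get high counts.
assert counts[("the", "cat")] > counts.get(("sat", "ran"), 0)
```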
Document Embeddings: Shallow Models
Bag-of-words approaches
- TF-IDF: Term frequency × inverse document frequency
- BM25: Probabilistic ranking function
- Limitations: No word order, no context
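TF-IDF is simple enough to sketch directly: a term scores high in a document when it is frequent there but rare across the collection. A minimal stdlib implementation using the classic idf = log(N / df):

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each term in each document by term frequency times
    inverse document frequency."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                       for t in tf})
    return scores

docs = [["cat", "sat", "mat"],
        ["cat", "ran", "far"],
        ["dog", "sat", "mat"]]
scores = tfidf(docs)
# "cat" appears in 2 of 3 docs, "ran" in only 1, so "ran" is more
# distinctive for the second document.
assert scores[1]["ran"] > scores[1]["cat"]
```

Note the limitation the bullets name: the scores ignore word order entirely, so "dog bites man" and "man bites dog" embed identically.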
Document Embeddings: Deep Models
Context-aware neural embeddings
- BERT: Bidirectional encoder representations
- Sentence-BERT: Efficient sentence embeddings
- Modern LLMs: GPT, T5, Gemini embeddings
![Single-vector vs. multi-vector document encoders]()
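The single-vector vs. multi-vector distinction comes down to pooling: a BERT-style encoder emits one contextual vector per token, and single-vector models such as Sentence-BERT commonly mean-pool them into one document vector, while multi-vector models keep the per-token vectors. A sketch with hypothetical 3-dimensional token embeddings:

```python
def mean_pool(token_vectors):
    """Collapse per-token contextual embeddings into one document
    vector by averaging each dimension (mean pooling)."""
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors)
            for i in range(dim)]

# Hypothetical contextual token embeddings for a three-token sentence.
tokens = [[0.2, 0.4, 0.0],
          [0.6, 0.0, 0.2],
          [0.1, 0.2, 0.4]]

doc_vector = mean_pool(tokens)   # single-vector: one vector per document
multi_vector = tokens            # multi-vector: one vector per token
assert len(doc_vector) == 3
```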
Image & Multimodal Embeddings
Visual understanding in vector space
- CNN-based: Convolutional neural networks for images
- Vision transformers: ViT, CLIP for joint text-image space
- Multimodal: Same space for text, images, videos
![Images and text projected into a joint embedding space (OpenAI’s CLIP model)]()
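Because CLIP places images and captions in the same space, cross-modal retrieval reduces to comparing normalized vectors by dot product. A sketch with hypothetical vectors standing in for real CLIP outputs:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def best_match(image_vec, text_vecs):
    """Index of the caption embedding closest to the image embedding;
    normalized dot product is how CLIP-style models score pairs."""
    img = normalize(image_vec)
    sims = [sum(a * b for a, b in zip(img, normalize(t)))
            for t in text_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

# Hypothetical embeddings already projected into a shared space.
photo_of_dog = [0.9, 0.1, 0.2]
captions = [[0.1, 0.9, 0.3],    # e.g. "a red bicycle"
            [0.8, 0.2, 0.1]]    # e.g. "a dog in the park"
assert best_match(photo_of_dog, captions) == 1
```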
Structured Data Embeddings
Tables, graphs, and relational data
- General structured data: Feature engineering + ML
- User-item data: Collaborative filtering embeddings
- Graph embeddings: Node and edge representations
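Collaborative-filtering embeddings can be sketched as plain matrix factorization: learn a vector per user and per item so that their dot product approximates the observed ratings. A toy SGD version on made-up data (hypothetical users and items):

```python
import random

def factorize(ratings, k=2, steps=2000, lr=0.02, reg=0.01):
    """Learn k-dimensional user and item embeddings whose dot products
    approximate observed ratings (matrix factorization via SGD)."""
    random.seed(0)
    U = {u: [random.uniform(-0.1, 0.1) for _ in range(k)]
         for u, _, _ in ratings}
    V = {i: [random.uniform(-0.1, 0.1) for _ in range(k)]
         for _, i, _ in ratings}
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(k):
                U[u][f] += lr * (err * V[i][f] - reg * U[u][f])
                V[i][f] += lr * (err * U[u][f] - reg * V[i][f])
    return U, V

# Toy (user, item, rating) triples on a 1-5 scale.
ratings = [("ann", "film_a", 5), ("ann", "film_b", 1),
           ("bob", "film_a", 4), ("bob", "film_b", 2)]
U, V = factorize(ratings)
pred_a = sum(a * b for a, b in zip(U["ann"], V["film_a"]))
pred_b = sum(a * b for a, b in zip(U["ann"], V["film_b"]))
assert pred_a > pred_b   # learned embeddings recover the preference
```

The learned `U` and `V` vectors are exactly the user-item embeddings the bullet refers to; nearby item vectors make good recommendation candidates.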