
Bi-encoder and Cross-encoder Retrieval

  • Bi-encoders process queries and documents independently, enabling fast vector-based similarity search at scale.
  • Cross-encoders process query-document pairs simultaneously, allowing for deep interaction and higher accuracy at the cost of speed.
  • Modern retrieval systems typically use a two-stage pipeline: a Bi-encoder for initial candidate retrieval and a Cross-encoder for re-ranking.
  • The trade-off between latency and precision is the primary architectural consideration when designing search engines or RAG systems.

Why It Matters

01
E-commerce Search

Major retailers use Bi-encoders to index millions of product descriptions, allowing users to search using natural language queries like "comfortable running shoes for flat feet." Once the Bi-encoder retrieves the top 50 candidate items, a Cross-encoder re-ranks them so that the most relevant products appear at the top. This combination ensures that the search is both fast enough to handle high traffic and accurate enough to drive conversions.

02
Legal Document Discovery

Law firms utilize these retrieval systems to search through thousands of case files and legal precedents. Because legal language is highly nuanced, the Cross-encoder is essential for identifying subtle semantic relationships between a specific legal argument and relevant past rulings. By using a two-stage pipeline, firms can quickly narrow down the relevant documents from a massive database and then perform a deep, accurate analysis on the most promising candidates.

03
Enterprise Knowledge Management

Large corporations often have internal wikis and documentation repositories that are difficult to search using traditional keyword methods. By deploying a Bi-encoder/Cross-encoder pipeline, employees can ask questions in plain English and receive precise answers extracted from internal PDFs and wikis. This improves productivity by reducing the time spent searching for internal policies or technical specifications.

How it Works

The Retrieval Challenge

In the era of Generative AI and Retrieval-Augmented Generation (RAG), the ability to find relevant information from a massive corpus is paramount. Imagine you are a librarian in a library with millions of books. If a user asks for a book about "the history of quantum computing," you cannot read every single book in the library to find the best match. You need a fast way to narrow down the search (retrieval) and a careful way to verify the quality of the top candidates (re-ranking). This is the fundamental problem that Bi-encoders and Cross-encoders solve.


Bi-encoders: The Speed Specialists

A Bi-encoder architecture treats the query and the document as two separate entities. Each is passed through a neural network (usually a BERT-based model) independently to produce a fixed-size vector. Because the document vectors can be pre-computed and stored in a vector database, the retrieval process becomes a simple distance calculation (like Cosine Similarity) between the query vector and the document vectors. This is incredibly fast, allowing us to search through millions of documents in milliseconds. However, because the query and document never "see" each other during the encoding process, the Bi-encoder misses out on the fine-grained interactions between specific words in the query and the document.
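
The pre-compute pattern is what makes this speed possible. Below is a minimal sketch of the idea; the corpus and query are invented, and the model checkpoint is the same one used in the Sample Code section later on.

Python
import numpy as np
from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')

# Offline step: encode every document once and store the vectors.
corpus = ["Quantum computing milestones", "A history of jazz", "Intro to RAG pipelines"]
doc_vecs = bi_encoder.encode(corpus, normalize_embeddings=True)  # shape: (3, 384)

# Online step: encode only the query; one matrix-vector product scores all documents.
query_vec = bi_encoder.encode("the history of quantum computing", normalize_embeddings=True)
similarities = doc_vecs @ query_vec  # dot product equals cosine similarity here
print(corpus[int(np.argmax(similarities))])  # -> "Quantum computing milestones"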


Cross-encoders: The Precision Specialists

A Cross-encoder takes a different approach. It concatenates the query and the document into a single input sequence and feeds them into a Transformer model together. This allows the model's self-attention mechanism to compare every word in the query against every word in the document. The model can identify subtle nuances, such as whether a "not" in the query negates a key term in the document. While this produces highly accurate relevance scores, it is computationally expensive. You cannot pre-compute these scores because they depend on the specific query. If you have 1,000,000 documents, you would have to run the Cross-encoder 1,000,000 times for every single query, which is infeasible in real time.
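
To make "a single input sequence" concrete, the sketch below uses the Hugging Face tokenizer for the same re-ranker checkpoint that appears in the Sample Code section; the example sentences are invented and the printed token list is abridged.

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Passing two strings yields ONE sequence: [CLS] query [SEP] document [SEP],
# so self-attention can compare every query token with every document token.
encoded = tokenizer("is the sky not blue?", "The sky is blue on clear days.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# -> ['[CLS]', 'is', 'the', 'sky', 'not', 'blue', '?', '[SEP]', 'the', 'sky', ...]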


The Hybrid Pipeline

The industry standard is to combine these two approaches. First, a Bi-encoder retrieves the top 100 or 1,000 most relevant documents from a vast index. This is the "retrieval" phase. Second, a Cross-encoder takes those candidates and re-ranks them to find the absolute best matches. This hybrid approach gives us the best of both worlds: the speed of Bi-encoders to handle scale, and the accuracy of Cross-encoders to ensure the final results are high-quality. This architecture is the backbone of modern search engines and RAG systems, ensuring that LLMs receive the most relevant context possible.
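
A minimal end-to-end sketch of this retrieve-then-re-rank flow is shown below, assuming the same two checkpoints as the Sample Code section; the corpus, query, and k values are placeholders.

Python
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def search(query, corpus, doc_vecs, retrieve_k=100, final_k=5):
    # Stage 1: cheap vector retrieval over the full index.
    query_vec = bi_encoder.encode(query, normalize_embeddings=True)
    candidates = np.argsort(-(doc_vecs @ query_vec))[:retrieve_k]
    # Stage 2: expensive re-ranking, restricted to the small candidate set.
    scores = cross_encoder.predict([(query, corpus[i]) for i in candidates])
    order = np.argsort(-scores)[:final_k]
    return [(corpus[candidates[i]], float(scores[i])) for i in order]

corpus = ["The cat sits outside", "A man is playing guitar", "I love pasta"]
doc_vecs = bi_encoder.encode(corpus, normalize_embeddings=True)
print(search("How to play musical instruments?", corpus, doc_vecs, retrieve_k=3, final_k=1))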

Common Pitfalls

  • "Cross-encoders are always better." While Cross-encoders are more accurate, they are not "better" in all contexts because they are too slow for large-scale retrieval. The correct approach is to use them only for re-ranking a small subset of results.
  • "Bi-encoders can be used for re-ranking." Using a Bi-encoder for re-ranking is inefficient because it doesn't capture the deep token-level interactions required for precise relevance scoring. Re-ranking should be reserved for Cross-encoders or other interaction-based models.
  • "Embeddings are static." Some learners believe that embeddings never change, but they are dependent on the model used to create them. If you switch your Bi-encoder model, you must re-index your entire document database.
  • "Retrieval is the same as generation." Retrieval is about finding existing information, while generation is about synthesizing new content. In RAG systems, retrieval is the critical first step that provides the "ground truth" for the generative model.

Sample Code

Python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

# 1. Bi-encoder (for fast retrieval)
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
docs = ["The cat sits outside", "A man is playing guitar", "I love pasta"]
doc_emb = bi_encoder.encode(docs)
query = "How to play musical instruments?"
query_emb = bi_encoder.encode(query)

# Compute similarity (Bi-encoder)
scores = util.cos_sim(query_emb, doc_emb)
# Illustrative output: tensor([[0.12, 0.78, 0.05]]) -> "A man is playing guitar" ranks first

# 2. Cross-encoder (for precise re-ranking)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [(query, doc) for doc in docs]
rerank_scores = cross_encoder.predict(pairs)
# Illustrative output: array([-0.5, 0.95, -1.2]) -> the guitar document again scores highest

Key Terms

Embedding
A dense vector representation of data, such as text or images, that maps semantic meaning into a continuous numerical space. By placing similar concepts closer together in this space, models can perform mathematical operations to determine semantic similarity.
Vector Database
A specialized storage system designed to index and query high-dimensional vectors efficiently using algorithms like HNSW (Hierarchical Navigable Small World). These databases are essential for scaling Bi-encoder retrieval to millions or billions of documents.
Semantic Search
A search technique that focuses on the intent and contextual meaning of a query rather than simple keyword matching. It relies on the ability of models to understand synonyms, polysemy, and the relationships between words in a sentence.
Re-ranking
A post-processing step in information retrieval where a smaller subset of top-ranked results is evaluated by a more computationally expensive model. This ensures that the final output is highly relevant while maintaining system-wide performance.
Attention Mechanism
A component of Transformer models that allows the network to weigh the importance of different words in a sequence relative to one another. In Cross-encoders, this mechanism allows the model to look at the query and the document simultaneously to find specific matching tokens.
Latency
The time delay between a user submitting a query and the system returning the results. In retrieval systems, minimizing latency is critical for user experience, often requiring a balance between model complexity and response speed.
Inference
The process of using a trained machine learning model to make predictions on new, unseen data. In the context of retrieval, this involves converting text into vectors or calculating relevance scores for query-document pairs.