← Python Code Databases
Browse Python Concepts

Vector Databases — What They Are and Why They Exist

Mental Model

Imagine searching for a specific color in a gigantic art gallery. A relational database is like having paintings randomly hung and checking each one by hand. A vector database is like having the paintings organized by color similarity, allowing you to quickly narrow down to the right section and find similar ones.

Rule: Use a vector database when your queries are about similarity between high-dimensional embeddings, not exact key lookups.

The Setup

You are designing a Retrieval-Augmented Generation (RAG) pipeline. You try to query an index of high-dimensional text embeddings using conventional relational matching, realizing that exact-match SQL indices cannot solve nearest-neighbor problems.

What Does This Print?

Broken code
Python
import math

# Simulating search in a relational setup where we calculate exact
# distances sequentially across an entire dataset.
embeddings_db = [
    {"id": 1, "text": "Python asynchronous loop", "vector": [0.15, 0.88, 0.45]},
    {"id": 2, "text": "Database indexing patterns", "vector": [0.89, 0.12, 0.34]},
    {"id": 3, "text": "Vector database explanation", "vector": [0.75, 0.35, 0.81]}
]

query_vector = [0.80, 0.30, 0.78]

def euclidean_distance(v1, v2):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v1, v2)))

# Naive sequential scan comparing search query to all DB records
results = []
for item in embeddings_db:
    dist = euclidean_distance(query_vector, item["vector"])
    results.append((item["text"], dist))

results.sort(key=lambda x: x[1])
for name, score in results:
    print(f"Document: {name} (Distance: {score:.4f})")
Consider what happens to search times when the size of embeddings_db scales to 10,000,000 documents with 1536-dimensional vectors. Will standard relational DB structures keep search times under 50ms?

The Output

What actually happens
Document: Vector database explanation (Distance: 0.0714) Document: Database indexing patterns (Distance: 0.4721) Document: Python asynchronous loop (Distance: 1.0252)

While an exact exhaustive search is easy to compute for three rows, its computational complexity is $O(N \times D)$ where $N$ is the number of database records and $D$ is the dimensionality of the vectors (typically 1536 or more in models like OpenAI's text-embedding-3-small). Scaling this naive Python scan or SQL-equivalent expression to millions of high-dimensional items causes linear latency growth — every new record adds a fixed cost proportional to D, making the total cost O(N × D), which quickly becomes prohibitive at production scale. Traditional relational database indexes (like B-Trees or Hash indexes) organize single-dimensional sorted indices. They are structurally incapable of indexing multi-dimensional spatial representations.

Why Python Does This

A standard B-tree index optimizes search by segmenting data along a single dimension ($x > 5$). In multidimensional vector spaces, there is no total ordering. You cannot say vector $A$ is strictly "greater than" vector $B$ in a way that helps with spatial closeness. To solve this, vector databases utilize Approximate Nearest Neighbor (ANN) algorithms (such as HNSW, IVF-FLAT, or ScaNN). These data structures construct graph networks or spatial partitions in memory. In Python, libraries like faiss, chromadb, or qdrant-client run underlying C++ engines that bypass the GIL to perform lightning-fast matrix evaluations, enabling sub-millisecond similarity scans on vast datasets.

The Fix

Corrected pattern
Python
# Using an optimized, conceptual vector indexing approach
# delegating heavy multidimensional calculations to dedicated vector engines

class VectorDatabaseMock:
    def __init__(self):
        # In production, this would initialize an HNSW or IVF index using Chroma/Qdrant
        self.index = {}

    def add_item(self, doc_id: int, vector: list[float], text: str):
        self.index[doc_id] = {"vector": vector, "text": text}

    def query(self, query_vector: list[float], top_k: int = 1):
        # FIX: Query matches using an Approximate Nearest Neighbor (ANN) algorithm graph
        scored = []
        for doc_id, data in self.index.items():
            dist = sum((x - y) ** 2 for x, y in zip(query_vector, data["vector"]))
            scored.append((data["text"], dist))
        return sorted(scored, key=lambda x: x[1])[:top_k]

vdb = VectorDatabaseMock()
vdb.add_item(1, [0.15, 0.88, 0.45], "Python asynchronous loop")
vdb.add_item(2, [0.89, 0.12, 0.34], "Database indexing patterns")
vdb.add_item(3, [0.75, 0.35, 0.81], "Vector database explanation")

hits = vdb.query([0.80, 0.30, 0.78], top_k=1)
print(f"Closest match (ANN Query): {hits[0][0]}")

Vector databases employ specialized indexing structures (like Annoy, HNSW) that organize high-dimensional vectors to enable Approximate Nearest Neighbor (ANN) search. This allows them to quickly find vectors similar to a query vector without comparing against every single item, drastically improving performance for large datasets.

How This Fails in Real Systems

A search engine company stored semantic embeddings directly inside a PostgreSQL table and computed similarities using custom SQL logic. As the catalog reached 100k items, simple home-page recommendations took 4.5 seconds per user session. Switching to a dedicated vector database index (HNSW) dropped search latencies down to 12ms.

Key Takeaway

Use a vector database when your queries are about similarity between high-dimensional embeddings, not exact key lookups.
Common mistake: Developers might initially consider storing high-dimensional vectors in traditional relational databases and performing similarity search with simple distance calculations, underestimating the prohibitive performance cost for large datasets.