Vector databases
Vector databases are a key part of building scalable AI-powered applications. They provide a form of “long-term memory” on top of an existing ML model.
Without a vector database, you would need to train your own model(s) or re-run your dataset through a model before making a query, which would be both slow and expensive.
Why is a vector database useful?
A vector database determines what other data (represented as vectors) is “close to” your input query. This allows you to build many different use-cases on top of it, including:
- Semantic search - “find sentences that are similar to the one I input”
- Classification - “classify this input - e.g. tell me what groupings in my data it is closest to”
- Recommendation engines - “return content that is similar to these inputs based on my own product sales or user history”
- Anomaly detection - “is this data point similar to existing data, or different?”
Vector databases can also power Retrieval Augmented Generation (RAG) tasks, which allow you to bring additional context to LLMs (Large Language Models) by using the context from a vector search to augment the user prompt.
Vector search
In a traditional vector search use-case, queries are made against a vector database by passing it a query vector, and having the vector database return a configurable list of vectors with the shortest distance (“most similar”) to the query vector.
The step-by-step workflow resembles the following:
- A developer turns their existing dataset (docs, images, logs stored in R2) into a set of vector embeddings (a one-way representation) by passing them through a machine learning model that is trained for that data type.
- The output embeddings are inserted into a Vectorize database index.
- A search query, classification request or anomaly detection query is also passed through the same ML model, returning a vector embedding representation of the query.
- Vectorize is queried with this embedding, and returns a set of the most similar vector embeddings to the provided query.
- The returned embeddings are used to retrieve the original source objects from dedicated storage (e.g. R2, KV, D1), which are then returned to the user.
In a workflow without a vector database, you would need to pass your entire dataset alongside your query each time, which is not practical (models have limits on input size) and would consume significant resources and time.
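As an illustration only, the query step of the workflow above can be sketched in plain Python as a brute-force search. This does not use the Vectorize API and is not a description of how Vectorize works internally: the `top_k` helper, the ids and the 3-dimension vectors are invented for the example, and Euclidean distance is an arbitrary choice of metric.

```python
import math

def top_k(query, vectors, k=3):
    """Return the ids of the k stored vectors closest to `query`."""
    # Brute force: measure the Euclidean distance to every stored vector.
    scored = [(math.dist(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort()  # shortest distance ("most similar") first
    return [vec_id for _, vec_id in scored[:k]]

# Hypothetical embeddings, keyed by the id of the source object they describe.
stored = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.8, 0.2, 0.1],
    "doc-c": [0.0, 0.1, 0.9],
}

query_vector = [1.0, 0.0, 0.0]
nearest_ids = top_k(query_vector, stored, k=2)
```

The returned ids (rather than the vectors themselves) are what you would use to look up the original objects in separate storage.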
Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is an approach used to improve the context provided to an LLM (Large Language Model) in generative AI use-cases, including chatbot and general question-answer applications. The vector database is used to enhance the prompt passed to the LLM by adding relevant context alongside the query.
Instead of passing the prompt directly to the LLM, in the RAG approach you:
- Generate vector embeddings from an existing dataset or corpus - e.g. the dataset you want to use to add additional context to the LLM's response. This could be product documentation, research data, technical specifications, or your product catalog and descriptions.
- Store the output embeddings in a Vectorize database index.
When a user initiates a prompt, instead of passing it (without additional context) to the LLM, we augment it with additional context:
- The user prompt is passed into the same ML model used for our dataset, returning a vector embedding representation of the query.
- This embedding is used as the query (semantic search) against the vector database, which returns similar vectors.
- These vectors are used to look up the content they relate to (if not embedded directly alongside the vectors as metadata).
- This content is provided as context alongside the original user prompt, allowing the LLM to return an answer that is likely to be far more relevant than one generated from the standalone prompt.
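A minimal sketch of the augmentation steps above, in plain Python. Everything here is a stand-in chosen for illustration: `embed` is a toy word counter rather than a real embedding model, the in-memory `index` list takes the place of a vector database, and the prompt template is an assumption rather than a recommended format.

```python
import math

def embed(text, vocabulary):
    """Toy stand-in for an embedding model: counts each vocabulary word in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]

vocabulary = ["refund", "shipping", "password"]
documents = [
    "refund requests are accepted within 30 days",
    "shipping takes 5 business days",
]

# Embed the corpus once and keep each vector next to its source text.
index = [(embed(doc, vocabulary), doc) for doc in documents]

def augment(question):
    """Return the question prefixed with the closest document as context."""
    # The question goes through the same embedding function as the corpus.
    query_vector = embed(question, vocabulary)
    _, context = min(index, key=lambda entry: math.dist(query_vector, entry[0]))
    return f"Context: {context}\n\nQuestion: {question}"

prompt = augment("how long does shipping take")
```

The resulting `prompt` string, not the bare question, is what would be sent to the LLM.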
Visit the RAG tutorial using Workers AI to learn how to combine Workers AI and Vectorize for generative AI use-cases.
You can learn more about the theory behind RAG by reading the RAG paper.
Terminology
Databases and indexes
In Vectorize, a database and an index are the same concept: each index you create is separate from other indexes you create. Vectorize automatically manages optimizing and re-generating the index for you when you insert new data.
Vector Embeddings
Vector embeddings represent the features of a machine learning model as a numerical vector (array of numbers). They are a one-way representation that encodes how a machine learning model understands the input(s) provided to it, based on how the model was originally trained and its internal structure.
For example, a text embedding model available in Workers AI is able to take text input and represent it as a 768-dimension vector. The text “This is a story about an orange cloud”, when represented as a vector embedding, resembles the following:
[-0.019273685291409492,-0.01913292706012726,<764 dimensions here>,0.0007094172760844231,0.043409910053014755]
When a model considers the features of two inputs to be “similar” (based on its understanding), the vector embeddings for those inputs will typically have a short distance between them.
Dimensions
Vector dimensions describe the width of a vector embedding: the number of floating point elements that comprise a given vector.
The number of dimensions is defined by the machine learning model used to generate the vector embeddings, and how it represents input features based on its internal model and complexity. More dimensions (“wider” vectors) may provide more accuracy at the cost of compute and memory resources, as well as latency (speed) of vector search.
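To make the memory cost concrete, here is a rough back-of-the-envelope calculation in Python. It assumes each element is stored as a 32-bit (4-byte) float and counts raw vector values only, ignoring any index structures or metadata; the vector counts and the 384-dimension comparison are hypothetical, and none of this describes how Vectorize stores data.

```python
def raw_vector_bytes(vector_count, dimensions, bytes_per_value=4):
    """Raw size of the vector values alone (assumes 4-byte floats by default)."""
    return vector_count * dimensions * bytes_per_value

# One million vectors at two different widths.
narrow = raw_vector_bytes(1_000_000, 384)
wide = raw_vector_bytes(1_000_000, 768)

print(f"384 dimensions: {narrow / 1e9:.3f} GB")
print(f"768 dimensions: {wide / 1e9:.3f} GB")
```

Under these assumptions, doubling the number of dimensions doubles the raw storage needed for the same number of vectors.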
Refer to the dimensions documentation to learn how to configure the accepted vector dimension size when creating a Vectorize index.
Distance metrics
The distance metric an index uses for vector search defines how it determines how “close” your query vector is to other vectors within the index.
- Distance metrics determine how the vector search engine assesses “similarity” between vectors.
- Cosine, Euclidean (L2), and Dot Product are the most commonly used distance metrics in vector search.
- The machine learning model and type of embedding you use will typically determine which distance metric is best suited for your use-case.
- Different metrics determine different scoring characteristics. For example, the cosine distance metric is well suited to text, sentence similarity and/or document search use-cases; whereas euclidean can be better suited to image or speech recognition use-cases.
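The three metrics named above can be written out in a few lines of Python. This is a sketch using the standard textbook definitions, with helper names and sample vectors chosen for the example; it is not Vectorize's implementation, and a given system may report or scale scores differently.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance; 0 means the vectors are identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores their magnitudes."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print(dot_product(a, b))
print(euclidean_distance(a, b))
print(cosine_similarity(a, b))
```

Because `b` points in the same direction as `a`, the cosine similarity is at its maximum even though the Euclidean distance between the two is not zero. This is one reason the choice of metric should follow from the embedding model you use.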
Refer to the distance metrics documentation to learn how to configure a distance metric when creating a Vectorize index.