Start using gyaan vector database locally within 15 minutes
We recommend using uv to manage your Python projects. It lets you manage dependencies at the project level, without disturbing your global Python installation.
Run the following command at the intended location for your project to set up a new uv project.
With the environment set up, let us now create our first gyaan vector storage. Since gyaan is an on-disk vector database (vector embeddings are read from the disk on-demand), it is prudent to keep most operations async.
Therefore, gyaan is used from asynchronous Python code (async/await). Let us create our first vector database:
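A minimal sketch of the async calling pattern follows. The VectorStorage class defined here is a local stand-in, not the gyaan class: it only mirrors the create() class method and the store_id attribute described in this guide, so that the snippet runs without gyaan installed. That create() is awaitable is an assumption based on the async requirement stated above, and uuid4 stands in for the UUIDv6 that gyaan is described as assigning.

```python
import asyncio
import uuid


class VectorStorage:
    """Local stand-in for illustration only; NOT the gyaan class."""

    def __init__(self, store_id: str) -> None:
        self.store_id = store_id

    @classmethod
    async def create(cls) -> "VectorStorage":
        # Placeholder for the on-disk setup a real store would perform.
        await asyncio.sleep(0)
        # uuid4 is a stand-in; the guide describes a UUIDv6 store_id.
        return cls(store_id=str(uuid.uuid4()))


async def main() -> VectorStorage:
    # Coroutines must be awaited from inside another coroutine.
    store = await VectorStorage.create()
    return store


if __name__ == "__main__":
    # asyncio.run starts an event loop and runs main() to completion.
    created = asyncio.run(main())
    print(created.store_id)
```

The key point is the shape: an async def entry point, await on each storage operation, and a single asyncio.run call at the top level of the script.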
Explanation
The create() class method creates a new VectorStorage object that you can use later in your code. It assigns a random UUIDv6 as the store_id of this object, for time-stamped uniqueness.
create() also allows you to specify additional metadata such as title, description, keywords, and even a custom store_id, which is useful when you want to attach a vector store to a particular AI agent or to one of your app’s users via their ID.
The metadata makes it easy to quickly load an existing vector database into your Python script for further computation.
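To illustrate why a custom store_id plus metadata makes reloading straightforward, here is a sketch that is independent of gyaan. StoreMetadata, save_metadata, and load_metadata are hypothetical names, and the JSON-file layout is an assumption made for the example, not gyaan's on-disk format. The field names follow the metadata listed above.

```python
import json
import tempfile
import uuid
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class StoreMetadata:
    """Hypothetical metadata record; not part of gyaan."""

    title: str = ""
    description: str = ""
    keywords: list = field(default_factory=list)
    # uuid4 is a stand-in; the guide describes a UUIDv6 default.
    store_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def save_metadata(meta: StoreMetadata, root: str) -> Path:
    # One JSON file per store, named after its store_id.
    path = Path(root) / f"{meta.store_id}.json"
    path.write_text(json.dumps(asdict(meta)), encoding="utf-8")
    return path


def load_metadata(store_id: str, root: str) -> StoreMetadata:
    # Knowing the store_id is enough to locate the record again.
    path = Path(root) / f"{store_id}.json"
    return StoreMetadata(**json.loads(path.read_text(encoding="utf-8")))


with tempfile.TemporaryDirectory() as root:
    created = StoreMetadata(
        title="Demo store",
        description="Quickstart example",
        keywords=["demo"],
        store_id="agent-42",  # custom ID, e.g. tied to an agent or a user
    )
    save_metadata(created, root)
    loaded = load_metadata("agent-42", root)

print(loaded.title, loaded.keywords)
```

Because the ID is chosen by the caller, the application can derive it from an agent or user ID and never has to remember a generated value.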
Let us retrieve the database that we created in the previous step.
If you added the optional metadata fields, you can quickly check whether the database loaded correctly by printing them. Here is the full example: creating a vector database with metadata, then retrieving it from disk and displaying the metadata fields.
Let us now come to the core operation: inserting vector embeddings, and the information they represent, into the vector database.
To enable multi-modal support out of the box, each piece of information is stored as a file, handled as raw bytes. This allows you to store text, PDFs, media files and even code scripts as the content in gyaan vector database.
While gyaan supports inserting a single vector-content pair (a batch of size 1), we recommend inserting multiple vectors in one go to take advantage of the parallelism built into the implementation.
To insert a vector-content pair, we need 3 things: the vector embedding, a filename for the content, and the content itself as raw bytes.
You do not have to worry about filename uniqueness. Internally, gyaan takes care of that by attaching a unique UUID v6 suffix to each filename. This suffix corresponds to the unique ID that gyaan assigns to each vector-content pair.
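The idea of a unique suffix can be sketched as follows. The function name unique_filename, the underscore separator, and placing the suffix before the extension are all assumptions for illustration. uuid4 is used as a stand-in for UUIDv6, so this is not gyaan's actual naming scheme.

```python
import uuid
from pathlib import Path


def unique_filename(filename: str) -> str:
    """Append a unique ID to the file stem (illustrative scheme only)."""
    path = Path(filename)
    # uuid4 is a stand-in here; the guide describes a UUIDv6 suffix.
    return f"{path.stem}_{uuid.uuid4()}{path.suffix}"


first = unique_filename("notes.txt")
second = unique_filename("notes.txt")
print(first)
print(second)
```

Two inserts with the same user-supplied filename therefore end up under different stored names.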
Let us insert one vector-content pair into the vector database that we created in the previous step:
Note that the embedding is not normalized. Internally, gyaan automatically normalizes each embedding vector.
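For reference, normalizing a vector usually means scaling it to unit length. The sketch below assumes L2 (Euclidean) normalization, which is the common choice for cosine similarity. The guide does not state which norm gyaan uses, and l2_normalize is a hypothetical helper, not a gyaan function.

```python
import math


def l2_normalize(vector: list) -> list:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


unit = l2_normalize([3.0, 4.0])
print(unit)
```

After normalization, the cosine similarity of two vectors reduces to their dot product.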
Let us now insert 3 more vectors, leveraging the parallelism of batch insertion. For demo purposes, we are inserting 3 randomly generated vectors using numpy, but in practice you would use the embedding provider of your choice.
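As a sketch of how the inputs for such a batch might be prepared, the snippet below builds three aligned lists of embeddings, filenames, and file contents. It uses the standard library's random module instead of numpy so that it has no dependencies. The embedding dimension of 8 and the filenames are arbitrary assumptions, and the actual insert call is left out because its signature is not shown here.

```python
import random

EMBEDDING_DIM = 8  # assumed for the demo; use your embedding model's dimension


def random_embedding(dim: int, rng: random.Random) -> list:
    """Return a random demo vector; real embeddings come from a model."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


rng = random.Random(0)  # seeded so the demo is reproducible
embeddings = [random_embedding(EMBEDDING_DIM, rng) for _ in range(3)]
filenames = [f"doc_{i}.txt" for i in range(3)]
filebytes = [f"Content of document {i}".encode("utf-8") for i in range(3)]

# The three lists must line up index by index: one embedding,
# one filename, and one bytes payload per vector-content pair.
batch = list(zip(embeddings, filenames, filebytes))
print(len(batch))
```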
You can view the most recently inserted vector embeddings via the get_vectors method, which returns up to 100 vectors by default.
Internally, the vector database is broken down into clusters. For each query, we first use the index to find the relevant clusters, and then perform a full search within those clusters to get the desired number of vector-content pairs.
Therefore, for each search operation, we require 3 values:
top_k: Maximum number of desired search results. Default value is 10.
centroid_search_radius: Maximum number of clusters to filter for performing full search. Default value is 3.
distance_metric: Currently, we support “euclidean” and “cosine”. Default value is “cosine”.
The objective of clustering is to efficiently store and retrieve vectors.
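The two-stage search described above can be sketched in plain Python. This is a generic cluster-then-scan illustration, not gyaan's implementation: clustered_search and the in-memory cluster layout are hypothetical, and only the parameter names and defaults (top_k, centroid_search_radius, distance_metric) are taken from this guide.

```python
import math


def cosine_distance(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a: list, b: list) -> float:
    return math.dist(a, b)


METRICS = {"cosine": cosine_distance, "euclidean": euclidean_distance}


def clustered_search(query, clusters, top_k=10, centroid_search_radius=3,
                     distance_metric="cosine"):
    """clusters: list of (centroid, members); members: list of (vector, filename)."""
    distance = METRICS[distance_metric]
    # Stage 1: keep only the clusters whose centroids are nearest to the query.
    nearest = sorted(clusters, key=lambda c: distance(query, c[0]))
    nearest = nearest[:centroid_search_radius]
    # Stage 2: full (brute-force) search inside the selected clusters.
    candidates = [
        (distance(query, vector), filename)
        for _, members in nearest
        for vector, filename in members
    ]
    candidates.sort(key=lambda item: item[0])
    return [filename for _, filename in candidates[:top_k]]


clusters = [
    ([1.0, 0.0], [([0.9, 0.1], "a.txt"), ([1.0, 0.2], "b.txt")]),
    ([0.0, 1.0], [([0.1, 0.9], "c.txt")]),
    ([-1.0, 0.0], [([-0.9, 0.1], "d.txt")]),
]
results = clustered_search([1.0, 0.05], clusters, top_k=2,
                           centroid_search_radius=1)
print(results)
```

A larger centroid_search_radius scans more clusters, which improves recall at the cost of more distance computations. A smaller value is faster but can miss neighbors that sit just across a cluster boundary.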
Our search function returns the lists of filenames and filebytes as the response. For demo purposes, we are using a randomly generated numpy array with appropriate dimensions.