Setup
We recommend using uv to manage your Python projects. It lets you manage dependencies at the project level without disturbing your global Python installation.
Install uv if you are currently using pip (recommended)
Create a new uv project
Run the following command at the intended location for your project to set up a new uv project.
Create a new virtual environment
Add gyaan
Create your first vector database
With the environment set up, let us now create our first gyaan vector storage. Since gyaan is an on-disk vector database (vector embeddings are read from disk on demand), it is prudent to keep most operations async. Therefore, gyaan has to be used from asynchronous (async/await) Python code. Let us create our first vector database.
Explanation
- The create() class method creates a new VectorStorage object that you can use later in your code. It assigns a random UUIDv6 as the store_id to this object, for time-stamped uniqueness.
- You have to specify the relative path for storage and the dimension of your vector embeddings upfront as the bare minimum params.
- It also creates 4 directories inside the specified relative path. In our case, we specify the path as “first_store”, which would create a sub-directory titled “first_store” in the directory where you are running your python script.
- These 4 directories are:
  (a) kv: stores the actual embedding (key) and file path (value) pairs
  (b) files: the “values”, that is, the information for which you are generating embeddings. Storing values as files allows multi-modal vector stores, since a file can be of any format
  (c) metadata: the “definition” of the vector store
  (d) index: the index for your vector embeddings, for efficient storage and retrieval
- The create() method also allows you to specify additional metadata such as title, description, keywords, and even a custom store_id, which is useful when you want to attach a vector store to a particular AI Agent or your app’s user via their ID.
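To make the points above concrete, here is a minimal sketch of the on-disk layout they describe. It does not use gyaan at all: create_store_layout is a hypothetical stand-in written for this illustration, the metadata file name store.json and its fields are assumptions, and uuid4 replaces the UUIDv6 that gyaan is said to assign. It only shows the idea of an async call that builds the four directories and records a store definition.

```python
import asyncio
import json
import tempfile
import uuid
from pathlib import Path

# The four sub-directories named in the explanation above.
SUBDIRS = ("kv", "files", "metadata", "index")


async def create_store_layout(path, dimension, store_id=None, title=""):
    """Hypothetical stand-in for the behaviour described for create()."""
    root = Path(path)
    for name in SUBDIRS:
        # Directory creation blocks, so hand it to a worker thread.
        await asyncio.to_thread((root / name).mkdir, parents=True, exist_ok=True)
    definition = {
        # uuid4 is a stand-in; the text says gyaan assigns a UUIDv6.
        "store_id": store_id or str(uuid.uuid4()),
        "dimension": dimension,
        "title": title,
    }
    # The file name "store.json" is an assumption made for this sketch.
    meta_file = root / "metadata" / "store.json"
    await asyncio.to_thread(meta_file.write_text, json.dumps(definition))
    return definition


async def main():
    with tempfile.TemporaryDirectory() as tmp:
        store_path = Path(tmp) / "first_store"
        store = await create_store_layout(store_path, dimension=8)
        created = sorted(p.name for p in store_path.iterdir())
        return store, created


store, created = asyncio.run(main())
print(created)
print(store["store_id"])
```

Passing a custom store_id (for example, a user ID from your app) simply replaces the generated one in this sketch, which mirrors the use case described in the last point.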
Loading an existing vector database
The metadata makes it easy to quickly load an existing vector database into your Python script for further computation. Let us retrieve the database that we created in the previous step.
Inserting vector embeddings and corresponding content
Let us now come to the core operation: inserting vector embeddings, and the information they represent, into the vector database. To enable multi-modal support out of the box, each piece of information is stored as a file and handled as raw bytes. This allows you to store text, PDFs, media files and even code scripts as content in a gyaan vector database.
Inserting a batch of vector embedding-content pairs (recommended)
While gyaan supports inserting a single vector-content pair (simply keep the batch size at 1), we recommend inserting multiple vectors in one go to take advantage of the parallelization built into the implementation. To insert a vector embedding-content pair, we need 3 things:
- The vector embedding, having the same dimension as specified during database initialization
- The content, as bytes
- The filename with extension
You do not have to worry about filename uniqueness. Internally, gyaan takes care of that by attaching a unique UUID v6 suffix to each filename. This suffix corresponds to the unique ID that gyaan assigns to each vector-content pair.
Note that the embedding is not normalized. Internally, gyaan automatically normalizes each embedding vector.
The example generates its embeddings with numpy, but in practice you would be using the embedding provider of your choice.
Stored vectors can be retrieved with the get_vectors method, which returns up to 100 vectors by default.
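The three inputs listed above, together with the two things gyaan is described as doing internally (normalizing each embedding and making filenames unique), can be sketched in plain Python. This is an illustration only, not gyaan's code: normalize and unique_filename are hypothetical helpers, the suffix format is an assumption, uuid4 stands in for the UUID v6 mentioned in the text, and random numbers stand in for real embeddings.

```python
import math
import random
import uuid


def normalize(vector):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def unique_filename(filename):
    """Attach a unique suffix before the extension.

    uuid4 is a stand-in; the text says gyaan uses a UUID v6 suffix.
    """
    stem, dot, ext = filename.rpartition(".")
    suffix = uuid.uuid4().hex
    if not dot:
        # No extension present: append the suffix to the whole name.
        return f"{filename}_{suffix}"
    return f"{stem}_{suffix}.{ext}"


DIMENSION = 4  # must match the dimension chosen when the store was created
rng = random.Random(0)

# Two items that deliberately share the same filename.
items = [("note.txt", b"hello"), ("note.txt", b"world")]

batch = []
for name, content in items:
    # Random values stand in for the output of an embedding provider.
    embedding = [rng.uniform(-1.0, 1.0) for _ in range(DIMENSION)]
    batch.append((normalize(embedding), content, unique_filename(name)))

for embedding, content, name in batch:
    print(name, len(embedding), content)
```

Because every stored name carries its own suffix, the two note.txt entries end up as distinct files, which is why callers do not need to deduplicate filenames themselves.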
Querying the vector database
Internally, the vector database is broken down into clusters. For each query, we first use the index to find the relevant clusters, and then perform a full search within those clusters to get the desired number of vector embedding-content pairs. Therefore, each search operation takes the following values:
- Query vector embedding, having the same dimension as specified during database initialization
- top_k: Maximum number of desired search results. Default value is 10
- centroid_search_radius: Maximum number of clusters to filter for performing full search. Default value is 3
- distance_metric: Currently, we support “euclidean” and “cosine”. Default value is “cosine”.
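The two-stage search described above (pick the nearest clusters, then search fully inside them) can be sketched as follows. This is a simplified in-memory illustration under stated assumptions, not gyaan's implementation: the search function, the clusters data structure, and the use of cosine distance as 1 minus cosine similarity are all choices made for this example. Only the parameter names and their defaults come from the text.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine distance, taken here as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


METRICS = {"euclidean": euclidean, "cosine": cosine}


def search(query, clusters, top_k=10, centroid_search_radius=3,
           distance_metric="cosine"):
    """clusters is a list of (centroid, [(vector, content), ...]) pairs."""
    distance = METRICS[distance_metric]
    # Stage 1: keep only the clusters whose centroids are closest to the query.
    nearest = sorted(clusters, key=lambda c: distance(query, c[0]))
    nearest = nearest[:centroid_search_radius]
    # Stage 2: full search over every vector inside the selected clusters.
    candidates = [
        (distance(query, vector), content)
        for _, members in nearest
        for vector, content in members
    ]
    candidates.sort(key=lambda pair: pair[0])
    return candidates[:top_k]


# Three tiny clusters in two dimensions.
clusters = [
    ([1.0, 0.0], [([1.0, 0.1], b"a"), ([0.9, -0.1], b"b")]),
    ([0.0, 1.0], [([0.1, 1.0], b"c")]),
    ([-1.0, 0.0], [([-1.0, 0.1], b"d")]),
]

results = search([1.0, 0.0], clusters, top_k=2, centroid_search_radius=1)
for dist, content in results:
    print(round(dist, 4), content)
```

A larger centroid_search_radius scans more clusters and so trades speed for recall, while top_k only caps how many of the scanned candidates are returned.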