Setup

We recommend using uv for managing your Python projects. It allows you to seamlessly manage dependencies at the project level without disturbing your global Python installation.

pip install uv

Create a new uv project

Run the following command at the intended location for your project to set up a new uv project.

uv init

Create a new virtual environment

uv venv

Add gyaan

uv add gyaan

Create your first vector database

With the environment set up, let us now create our first gyaan vector store. Since gyaan is an on-disk vector database (vector embeddings are read from disk on demand), it is prudent to keep most operations async.

Therefore, to use gyaan, we have to write async Python code. Let us create our first vector database:

import asyncio
from gyaan import VectorStorage

async def main():
    vector_store = await VectorStorage.create(
        vector_path="first_store",
        embedding_dim=1536)

asyncio.run(main())

Loading an existing vector database

The stored metadata makes it quick to load an existing vector database into your Python script for further computation.

Let us retrieve the database that we created in the previous step.

import asyncio
from gyaan import VectorStorage

async def main():
    retrieved_store = await VectorStorage.load(
        vector_path="first_store")

asyncio.run(main())

If you added the optional metadata fields, you can quickly validate that the database loaded correctly by printing those fields. Here is the full example: creating a vector database with metadata, then retrieving it from disk and displaying the metadata fields.

import asyncio
from gyaan import VectorStorage

async def main():
    vector_store = await VectorStorage.create(
        vector_path="first_store",
        embedding_dim=1536,
        title="Your specified title",
        description="You can also add some description",
        store_id="abcdefgh", 
        keywords=["depths-ai","gyaan","vector database"])
    
    retrieved_store = await VectorStorage.load(
        vector_path="first_store")

    print("Title:", retrieved_store.title)
    print("Description:", retrieved_store.description)
    print("Store ID:", retrieved_store.id)
    print("Keywords:", retrieved_store.keywords)

asyncio.run(main())

Inserting vector embeddings and corresponding content

Let us now come to the core operation: inserting vector embeddings, and the information they represent, into the vector database.

To enable multi-modal support out of the box, each piece of information is stored as a file, handled as raw bytes. This allows you to store text, PDFs, media files and even code scripts as the content in gyaan vector database.

While gyaan supports inserting a single vector-content pair (simply use a batch size of 1), we recommend inserting multiple vectors in one go to take advantage of the inherent parallelization in the implementation.

For inserting a vector embedding-content pair, we need 3 things:

  1. The vector embedding, having the same dimension as specified during database initialization
  2. The content, as bytes
  3. The filename with extension

You do not have to worry about filename uniqueness. Internally, gyaan takes care of that by attaching a unique UUID v6 suffix to each filename. This suffix corresponds to the unique ID that gyaan assigns to each vector-content pair.
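Since content is handled as raw bytes (item 2 above), any file on disk can be read straight into the form that batch insertion expects. A minimal sketch using only the standard library (the sample path and file name here are illustrative, not part of gyaan's API):

```python
import tempfile
from pathlib import Path

# gyaan stores content as raw bytes, so any file type works: text,
# PDFs, media files, or code scripts. Here we write a sample file and
# read it back in the form expected for insertion.
sample = Path(tempfile.mkdtemp()) / "notes.txt"
sample.write_text("Sample content")

content = sample.read_bytes()   # raw bytes for files_bytes
filename = sample.name          # filename with extension for file_names

print(filename, content)
```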

Let us insert one vector-content pair into the vector database that we created in the previous step:

content = b"Sample content"
embedding = [1.0] * 1536
filename = "example.txt"

await vector_store.batch_insert_vectors(
    embeddings=[embedding],
    files_bytes=[content],
    file_names=[filename])

Note that the embedding is not normalized. Internally, gyaan automatically normalizes each embedding vector.
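To see what that normalization amounts to, here is a small numpy sketch (purely illustrative; gyaan performs this step for you internally):

```python
import numpy as np

embedding = np.array([1.0] * 1536)

# L2-normalize: divide the vector by its Euclidean norm so that it has
# unit length, which makes cosine similarity a plain dot product.
normalized = embedding / np.linalg.norm(embedding)

print(round(float(np.linalg.norm(normalized)), 6))
```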

Let us now insert 3 more vectors, leveraging the parallelism of batch insertion. Note that for demo purposes we are inserting 3 randomly generated vectors using numpy, but in practice you would use the embedding provider of your choice.

import numpy as np

content = [b"cat", b"dog", b"horse"]
embeddings = np.random.random((3, 1536)).tolist()
filenames = ["cat.txt", "dog.txt", "horse.txt"]

await vector_store.batch_insert_vectors(
    embeddings=embeddings,
    files_bytes=content,
    file_names=filenames
)

You can view the most recently inserted vector embeddings via the get_vectors method, which returns up to 100 vectors by default:

print(await vector_store.get_vectors())

Querying the vector database

Internally, the vector database is broken down into clusters. For each query, we first use the index to find the relevant clusters, and then perform a full search within those clusters to get the desired number of vector embedding-content pairs.

Therefore, each search operation takes four parameters:

  1. Query vector embedding, having the same dimension as specified during database initialization
  2. top_k: Maximum number of desired search results. Default value is 10
  3. centroid_search_radius: Maximum number of clusters to select for the full search. Default value is 3
  4. distance_metric: Currently, we support "euclidean" and "cosine". Default value is "cosine".

Clustering keeps search efficient: only the vectors in the selected clusters are read from disk for each query, rather than the whole database.
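To build intuition for this two-stage search, here is a small self-contained numpy sketch (not gyaan's actual implementation; the cluster layout and function name are made up for illustration): centroids are compared first, then only the vectors inside the nearest clusters are searched exhaustively.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_clusters, per_cluster = 8, 5, 20

# Toy index: one centroid per cluster, plus the vectors assigned to it.
centroids = rng.random((n_clusters, dim))
clusters = [rng.random((per_cluster, dim)) for _ in range(n_clusters)]

def two_stage_search(query, top_k=10, centroid_search_radius=3):
    # Stage 1: pick the nearest clusters by centroid distance.
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    nearest = np.argsort(centroid_dists)[:centroid_search_radius]

    # Stage 2: full (exhaustive) search inside those clusters only.
    candidates = np.vstack([clusters[i] for i in nearest])
    scores = np.linalg.norm(candidates - query, axis=1)
    return candidates[np.argsort(scores)[:top_k]]

results = two_stage_search(rng.random(dim))
print(results.shape)
```

Only 3 of the 5 clusters (60 of 100 vectors) are scanned for the query, which is the trade-off centroid_search_radius controls.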

Our search function returns lists of filenames and file bytes as the response. For demo purposes, we use a randomly generated numpy array with the appropriate dimensions:

query = np.random.random((1, 1536)).tolist()[0]  # replace with an actual query embedding

filenames, filebytes = await vector_store.search_vectors(query=query)

Soft-delete vector embeddings

Permanently delete vector embedding-content pairs