Setup

We recommend using uv for managing your Python projects. It allows you to seamlessly manage dependencies at the project level without disturbing your global Python installation.

pip install uv

Create a new uv project

Run the following command at the intended location for your project to set up a new uv project.

uv init

Create a new virtual environment

uv venv

Add gyaan

uv add gyaan

Create your first vector database

With the environment set up, let us now create our first gyaan vector store. Since gyaan is an on-disk vector database (vector embeddings are read from disk on demand), it is prudent to keep most operations async.

Therefore, to use gyaan, we have to write async Python code. Let us create our first vector database:

import asyncio
from gyaan import VectorStorage

async def main():
    vector_store = await VectorStorage.create(
        vector_path="first_store",
        embedding_dim=1536)

asyncio.run(main())

Loading an existing vector database

The stored metadata makes it quick to load an existing vector database into your Python script for further computation.

Let us retrieve the database that we created in the previous step.

import asyncio
from gyaan import VectorStorage

async def main():
    retrieved_store = await VectorStorage.load(
        vector_path="first_store")

asyncio.run(main())

If you added the optional metadata fields, you can quickly validate that the database loaded correctly by printing those fields. Here is the full example: creating a vector database with metadata, then retrieving it from disk and displaying the metadata fields.

import asyncio
from gyaan import VectorStorage

async def main():
    vector_store = await VectorStorage.create(
        vector_path="first_store",
        embedding_dim=1536,
        title="Your specified title",
        description="You can also add some description",
        store_id="abcdefgh", 
        keywords=["depths-ai","gyaan","vector database"])
    
    retrieved_store = await VectorStorage.load(
        vector_path="first_store")

    print("Title:", retrieved_store.title)
    print("Description:", retrieved_store.description)
    print("Store ID:", retrieved_store.id)
    print("Keywords:", retrieved_store.keywords)

asyncio.run(main())

Inserting vector embeddings and corresponding content

Let us now come to the core operation: inserting vector embeddings, and the information they represent, into the vector database.

To enable multi-modal support out of the box, each piece of information is stored as a file, handled as raw bytes. This allows you to store text, PDFs, media files and even code scripts as the content in gyaan vector database.

While gyaan supports inserting a single vector-content pair (simply use a batch size of 1), we recommend inserting multiple vectors in one go to take advantage of the inherent parallelization in the implementation.

For inserting a vector embedding-content pair, we need 3 things:

  1. The vector embedding, having the same dimension as specified during database initialization
  2. The content, as bytes
  3. The filename with extension

You do not have to worry about filename uniqueness. Internally, gyaan takes care of that by attaching a unique UUID v6 suffix to each filename. This suffix corresponds to the unique ID that gyaan assigns to each vector-content pair.
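Since content is handled as raw bytes (item 2 above), any file on disk can be read straight into the form that batch insertion expects. A minimal sketch using only the standard library (the sample path and file name here are illustrative, not part of gyaan's API):

```python
import tempfile
from pathlib import Path

# gyaan stores content as raw bytes, so any file type works: text,
# PDFs, media files, or code scripts. Here we write a sample file and
# read it back in the form expected for insertion.
sample = Path(tempfile.mkdtemp()) / "notes.txt"
sample.write_text("Sample content")

content = sample.read_bytes()   # raw bytes for files_bytes
filename = sample.name          # filename with extension for file_names

print(filename, content)
```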

Let us insert one vector-content pair into the vector database that we created in the previous step:

content = b"Sample content"
embedding = [1.0] * 1536
filename = "example.txt"

await vector_store.batch_insert_vectors(
    embeddings=[embedding],
    files_bytes=[content],
    file_names=[filename])

Note that the embedding is not normalized. Internally, gyaan automatically normalizes each embedding vector.
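To see what that normalization amounts to, here is a small numpy sketch (purely illustrative; gyaan performs this step for you internally):

```python
import numpy as np

embedding = np.array([1.0] * 1536)

# L2-normalize: divide the vector by its Euclidean norm so that it has
# unit length, which makes cosine similarity a plain dot product.
normalized = embedding / np.linalg.norm(embedding)

print(round(float(np.linalg.norm(normalized)), 6))
```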

Let us now insert 3 more vectors, leveraging the parallelism of batch insertion. Note that for demo purposes we are inserting 3 randomly generated vectors using numpy, but in practice you would use the embedding provider of your choice.

import numpy as np

content = [b"cat", b"dog", b"horse"]
embeddings = np.random.random((3, 1536)).tolist()
filenames = ["cat.txt", "dog.txt", "horse.txt"]

await vector_store.batch_insert_vectors(
    embeddings=embeddings,
    files_bytes=content,
    file_names=filenames
)

You can view the most recently inserted vector embeddings via the get_vectors method, which returns up to 100 vectors by default:

print(await vector_store.get_vectors())

Querying the vector database

Internally, the vector database is broken down into clusters. For each query, we first use the index to find the relevant clusters, and then perform a full search within those clusters to get the desired number of vector embedding-content pairs.

Therefore, each search operation takes four parameters:

  1. Query vector embedding, having the same dimension as specified during database initialization
  2. top_k: Maximum number of desired search results. Default value is 10
  3. centroid_search_radius: Maximum number of clusters to select for the full search. Default value is 3
  4. distance_metric: Currently, we support "euclidean" and "cosine". Default value is "cosine".

Clustering keeps search efficient: only the vectors in the selected clusters are read from disk for each query, rather than the whole database.
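To build intuition for this two-stage search, here is a small self-contained numpy sketch (not gyaan's actual implementation; the cluster layout and function name are made up for illustration): centroids are compared first, then only the vectors inside the nearest clusters are searched exhaustively.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_clusters, per_cluster = 8, 5, 20

# Toy index: one centroid per cluster, plus the vectors assigned to it.
centroids = rng.random((n_clusters, dim))
clusters = [rng.random((per_cluster, dim)) for _ in range(n_clusters)]

def two_stage_search(query, top_k=10, centroid_search_radius=3):
    # Stage 1: pick the nearest clusters by centroid distance.
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    nearest = np.argsort(centroid_dists)[:centroid_search_radius]

    # Stage 2: full (exhaustive) search inside those clusters only.
    candidates = np.vstack([clusters[i] for i in nearest])
    scores = np.linalg.norm(candidates - query, axis=1)
    return candidates[np.argsort(scores)[:top_k]]

results = two_stage_search(rng.random(dim))
print(results.shape)
```

Only 3 of the 5 clusters (60 of 100 vectors) are scanned for the query, which is the trade-off centroid_search_radius controls.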

Our search function returns lists of filenames and file bytes as the response. For demo purposes, we use a randomly generated numpy array with the appropriate dimensions:

query = np.random.random((1, 1536)).tolist()[0]  # replace with an actual query embedding

filenames, filebytes = await vector_store.search_vectors(query=query)

Soft-delete vector embeddings

Permanently delete vector embedding-content pairs