How to use Milvus to Store and Query Vector Embeddings

Stephen Collins · Jan 6, 2024 · 3 mins

What you will learn

  • What is Milvus and why is it important for AI applications?
  • Milvus is an open-source vector database specifically designed for handling large datasets in AI applications. It provides scalable, reliable, and fast search capabilities for high-dimensional vector data, making it suitable for tasks such as image recognition, natural language processing, and recommendation systems.
  • How do you establish a connection to a Milvus server using Python?
  • You can connect to a Milvus server using the `connections.connect` method from the `pymilvus` library. This method requires specifying the server's host and port, and it's important to include error handling to catch any connection issues.
  • What steps are involved in creating a collection in Milvus?
  • Creating a collection in Milvus involves defining a schema with fields that specify the data's structure, such as primary keys and embeddings. You use the `CollectionSchema` to define the structure and then create the collection with a specified name and schema.
  • How are text embeddings generated for insertion into a Milvus collection?
  • Text embeddings are generated by using a pre-trained model from the `transformers` library to convert text into numerical vectors. This conversion allows the text to be represented as vectors, which can then be inserted into the Milvus collection for storage and retrieval.
  • What is the purpose of indexing in Milvus, and how is it done?
  • Indexing in Milvus is used to enhance the efficiency of vector search operations. By creating an index, such as the IVF_FLAT index type, on the embeddings field, search operations can be performed more quickly, enabling faster retrieval of similar vectors based on specific metrics.

In today’s data-driven world, managing and searching through large datasets has become increasingly important. One powerful tool for handling this challenge is Milvus, an open-source vector database designed for AI applications. In this blog post, we’ll explore a practical implementation of Milvus using Python, showcasing how it can be integrated with text embedding techniques to create an efficient search system.

All code for this blog post can be found in this companion GitHub repository.

Milvus: The Vector Database

Milvus is designed to provide scalable, reliable, and fast search capabilities for vector data. It’s particularly suited for applications like image and video recognition, natural language processing, and recommendation systems, where data can be represented as high-dimensional vectors.

Setting Up Milvus

Before diving into the code, ensure you have Milvus installed and running. The first step in our Python script is to establish a connection with the Milvus server:

from pymilvus import connections

def connect_to_milvus():
    try:
        connections.connect("default", host="localhost", port="19530")
        print("Connected to Milvus.")
    except Exception as e:
        print(f"Failed to connect to Milvus: {e}")
        raise

This function attempts to connect to a Milvus server running on the local machine. Error handling is crucial to catch and understand any issues that might arise during the connection.

Creating a Collection in Milvus

A collection in Milvus is like a table in a traditional database. It’s where our data will be stored. Each collection can have multiple fields, akin to columns in a table. In our example, we create a collection with three fields: a primary key (pk), a source text (source), and embeddings (embeddings):

from pymilvus import FieldSchema, CollectionSchema, DataType, Collection

def create_collection(name, fields, description):
    schema = CollectionSchema(fields, description)
    collection = Collection(name, schema, consistency_level="Strong")
    return collection

# Define fields for our collection
fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=768)
]

collection = create_collection("hello_milvus", fields, "Collection for demo purposes")

In this code snippet, the embeddings field has a dimension of 768, matching the output of the specific embedding model we use (introduced in the next section); whichever model you choose, the dim value must align with its output size.

Generating Text Embeddings in Python

Before we insert data into our Milvus collection, we need to generate text embeddings. This involves using a pre-trained model from the transformers library to convert text into numerical vectors; in our code, we use the thenlper/gte-base model for this purpose. In our app, this step is abstracted into an embedding_util.py module that handles creating the vector embeddings.

For more details on how our custom embedding_util.py module works for creating vector embeddings, check out my blog post on how to use weaviate to store and query vector embeddings.
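The exact implementation lives in that module, but for context, here is a minimal sketch of what a generate_embeddings function could look like with transformers and the thenlper/gte-base model (the mean-pooling strategy and truncation length are assumptions; the repository's embedding_util.py may differ):

import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical sketch; the companion repo's embedding_util.py may differ in detail.
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
model = AutoModel.from_pretrained("thenlper/gte-base")

def generate_embeddings(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings, weighted by the attention mask.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.squeeze(0).tolist()

For gte-base this returns a 768-dimensional list of floats, matching the dim=768 we set on the embeddings field.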

Generating and Inserting Data

To generate embeddings from text, we use the previously mentioned pre-trained model from the transformers library. This model converts text into numerical vectors that can be stored in our Milvus collection:

from embedding_util import generate_embeddings

documents = [...]
embeddings = [generate_embeddings(doc) for doc in documents]
entities = [
    [str(i) for i in range(len(documents))],
    [str(doc) for doc in documents],
    embeddings
]

insert_result = insert_data(collection, entities)

The insert_data function inserts our data into the Milvus collection and then flushes the operations to ensure data persistence.
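insert_data itself is a thin wrapper around pymilvus; a minimal sketch, assuming the standard insert and flush calls, might look like this:

def insert_data(collection, entities):
    # Insert the entities (column order matching the schema: pk, source, embeddings),
    # then flush so the data is persisted before we index and search it.
    insert_result = collection.insert(entities)
    collection.flush()
    return insert_result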

Creating an Index for Efficient Searching

Milvus uses indexes to speed up the search process. Here, we create an IVF_FLAT index on the embeddings field:

def create_index(collection, field_name, index_type, metric_type, params):
    index = {"index_type": index_type, "metric_type": metric_type, "params": params}
    collection.create_index(field_name, index)

create_index(collection, "embeddings", "IVF_FLAT", "L2", {"nlist": 128})

With our data indexed, we can now perform searches based on vector similarity:

def search_and_query(collection, search_vectors, search_field, search_params):
    collection.load()
    result = collection.search(search_vectors, search_field, search_params, limit=3, output_fields=["source"])
    print_search_results(result, "Vector search results:")

query = "Give me some content about the ocean"
query_vector = generate_embeddings(query)
search_and_query(collection, [query_vector], "embeddings", {"metric_type": "L2", "params": {"nprobe": 10}})

In this search, we’re looking for the top 3 documents most similar to the query “Give me some content about the ocean”.
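The print_search_results helper used above is just a formatting convenience, not part of the pymilvus API; a minimal sketch might be:

def print_search_results(result, title):
    # The result contains one set of hits per query vector; print each hit's id,
    # distance, and the stored "source" field.
    print(title)
    for hits in result:
        for hit in hits:
            print(f"Hit: {hit}, source field: {hit.entity.get('source')}")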

If you are able to run the app successfully, you should see the following vector search results, sorted by L2 distance (smaller is more semantically similar):

Vector search results:
Hit: id: 6, distance: 0.39819106459617615, entity: {'source': 'The sunset paints the sky with shades of orange, pink, and purple, reflecting on the calm sea.'}, source field: The sunset paints the sky with shades of orange, pink, and purple, reflecting on the calm sea.
Hit: id: 4, distance: 0.4780573844909668, entity: {'source': 'The ancient tree, with its gnarled branches and deep roots, whispers secrets of the past.'}, source field: The ancient tree, with its gnarled branches and deep roots, whispers secrets of the past.
Hit: id: 0, distance: 0.4835127890110016, entity: {'source': 'A group of vibrant parrots chatter loudly, sharing stories of their tropical adventures.'}, source field: A group of vibrant parrots chatter loudly, sharing stories of their tropical adventures.

Cleaning Up

After completing our operations, it’s good practice to clean up by deleting entities and dropping the collection:

delete_entities(collection, f'pk in ["{insert_result.primary_keys[0]}", "{insert_result.primary_keys[1]}"]')
drop_collection("hello_milvus")
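Both helpers are thin wrappers around pymilvus; a minimal sketch, assuming the standard delete expression and utility calls, might be:

from pymilvus import utility

def delete_entities(collection, expr):
    # Delete all entities matching the boolean expression (here, a list of primary keys).
    collection.delete(expr)

def drop_collection(name):
    # Remove the collection and all of its data from Milvus.
    utility.drop_collection(name)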

Conclusion

Milvus offers a powerful and flexible way to work with vector data. By combining it with natural language processing techniques, we can build sophisticated search and recommendation systems. The Python script demonstrated here is a basic example, but the potential applications are vast and varied.

Whether you’re dealing with large-scale image databases, complex recommendation systems, or advanced NLP tasks, Milvus can be an invaluable tool in your AI arsenal.