Skip to content

feat: support milvus #1666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jul 2, 2023
Merged

Conversation

maxwelljin
Copy link
Contributor

@maxwelljin maxwelljin commented Jun 21, 2023

This PR aims to support Milvus as a storage backend

  • Index creation
  • Get documents
  • Search
  • Filter
  • Subindex
  • Tests

Copy link
Member

@JohannesMessner JohannesMessner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are on a good path here! Just a few comments


def _init_index(self) -> Collection:
if not utility.has_collection(self._db_config.collection_name):
print(self._db_config.collection_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to remove the print

Comment on lines 134 to 136
name="doc_id" if column_name == "id" else column_name,
dtype=DataType.VARCHAR if column_name == "id" else info.db_type,
is_primary=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't id``be completely skipped here since it is alrady added to the schema above? Won't this add the id twice, once as idand once asdoc_id`?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! I'll only keep doc_id

name=self._db_config.collection_name,
schema=CollectionSchema(
fields=fields,
description="Collection Schema",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the description could also be passes as a config?

f"Index '{self._db_config.index_name}' has been successfully created"
)

def index(self, docs: Union[BaseDoc, Sequence[BaseDoc]], **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for implementing index() instead of implementing _index() and utilizing index() from the base class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my new implementation, the index() function would also do the serialization (there's a column for serialized data), so the index function need to be utilized

Comment on lines 239 to 240
The database can only store float vectors, so this method is used to convert
TensorFlow or PyTorch tensors to a format compatible with the database.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base class should already take care of torch, tf etc conversions. Are you sure you need to do this again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very sure how the base class handles conversions between different formats. The vector embedding is still stored in its raw format if we use the __getattr__ method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found base class' implementation. I'll remove my implementation here. Thanks for your suggestion!

Comment on lines 252 to 254
elif torch.is_tensor(column_value):
return column_value.float().numpy().tolist()
elif tf.is_tensor(column_value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not safe, because not every user has torch and/or tf installed. Therefore we can't directly use torch/tf. Here you can see how that is handled in other places.

],
)

return DocList[self._schema]([self._schema(**ret[i]) for i in range(len(ret))])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to cast the output to DocList, the base class does it for you

Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
@@ -34,6 +34,9 @@ def __getattr__(name: str):
elif name == 'WeaviateDocumentIndex':
import_library('weaviate', raise_error=True)
import docarray.index.backends.weaviate as lib
elif name == "MilvusDocumentIndex":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single quotes please

utility,
)
else:
hnswlib = import_library('hnswlib', raise_error=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this library?

self._create_collection_name()
self._collection = self._init_index()
self._build_index()
self._logger.info(f"{self.__class__.__name__} has been initialized")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use single quotes


torch_available, tf_available = is_torch_available(), is_tf_available()

if torch_available:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to import these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because when we perform a vector search, it requires the input in Python list format. If the user's input tensor is in torch/tensorflow format, we need to perform a conversion.

Copy link
Member

@JoanFM JoanFM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also make sure subindex is covered

Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
@jupyterjazz jupyterjazz marked this pull request as ready for review July 2, 2023 19:16
@jupyterjazz
Copy link
Contributor

Great work @maxwelljin, I'll take over from here!

@jupyterjazz jupyterjazz changed the base branch from main to feat-support-milvus July 2, 2023 19:17
@jupyterjazz jupyterjazz merged commit 0e24ce9 into docarray:feat-support-milvus Jul 2, 2023
@jupyterjazz jupyterjazz mentioned this pull request Jul 2, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: please add milvus as storage backend to docarray v2
4 participants