-
Notifications
You must be signed in to change notification settings - Fork 233
feat: support milvus #1666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support milvus #1666
Conversation
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are on a good path here! Just a few comments
docarray/index/backends/milvus.py
Outdated
|
||
def _init_index(self) -> Collection: | ||
if not utility.has_collection(self._db_config.collection_name): | ||
print(self._db_config.collection_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to remove the print
docarray/index/backends/milvus.py
Outdated
name="doc_id" if column_name == "id" else column_name, | ||
dtype=DataType.VARCHAR if column_name == "id" else info.db_type, | ||
is_primary=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't id``be completely skipped here since it is alrady added to the schema above? Won't this add the id twice, once as
idand once as
doc_id`?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback! I'll only keep doc_id
docarray/index/backends/milvus.py
Outdated
name=self._db_config.collection_name, | ||
schema=CollectionSchema( | ||
fields=fields, | ||
description="Collection Schema", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the description could also be passes as a config?
f"Index '{self._db_config.index_name}' has been successfully created" | ||
) | ||
|
||
def index(self, docs: Union[BaseDoc, Sequence[BaseDoc]], **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for implementing index()
instead of implementing _index()
and utilizing index()
from the base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my new implementation, the index()
function would also do the serialization (there's a column for serialized data), so the index function need to be utilized
docarray/index/backends/milvus.py
Outdated
The database can only store float vectors, so this method is used to convert | ||
TensorFlow or PyTorch tensors to a format compatible with the database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base class should already take care of torch, tf etc conversions. Are you sure you need to do this again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very sure how the base class handles conversions between different formats. The vector embedding is still stored in its raw format if we use the __getattr__
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just found base class' implementation. I'll remove my implementation here. Thanks for your suggestion!
docarray/index/backends/milvus.py
Outdated
elif torch.is_tensor(column_value): | ||
return column_value.float().numpy().tolist() | ||
elif tf.is_tensor(column_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not safe, because not every user has torch and/or tf installed. Therefore we can't directly use torch/tf. Here you can see how that is handled in other places.
docarray/index/backends/milvus.py
Outdated
], | ||
) | ||
|
||
return DocList[self._schema]([self._schema(**ret[i]) for i in range(len(ret))]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to cast the output to DocList, the base class does it for you
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
docarray/index/__init__.py
Outdated
@@ -34,6 +34,9 @@ def __getattr__(name: str): | |||
elif name == 'WeaviateDocumentIndex': | |||
import_library('weaviate', raise_error=True) | |||
import docarray.index.backends.weaviate as lib | |||
elif name == "MilvusDocumentIndex": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
single quotes please
docarray/index/backends/milvus.py
Outdated
utility, | ||
) | ||
else: | ||
hnswlib = import_library('hnswlib', raise_error=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this library?
docarray/index/backends/milvus.py
Outdated
self._create_collection_name() | ||
self._collection = self._init_index() | ||
self._build_index() | ||
self._logger.info(f"{self.__class__.__name__} has been initialized") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use single quotes
docarray/index/backends/milvus.py
Outdated
|
||
torch_available, tf_available = is_torch_available(), is_tf_available() | ||
|
||
if torch_available: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to import these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's because when we perform a vector search, it requires the input in Python list format. If the user's input tensor is in torch/tensorflow format, we need to perform a conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also make sure subindex is covered
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Signed-off-by: maxwelljin2 <[email protected]>
Great work @maxwelljin, I'll take over from here! |
This PR aims to support Milvus as a storage backend