Skip to content

fix: protobuf (de)ser for docvec #1639

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 13, 2023
Merged

fix: protobuf (de)ser for docvec #1639

merged 9 commits into from
Jun 13, 2023

Conversation

JohannesMessner
Copy link
Member

@JohannesMessner JohannesMessner commented Jun 12, 2023

closes #1561

The problem here was that deser from proto to docvec did not acutally deserialize the columns. Instead, it stored raw protos as columns instead of deserialized values.

A special case is when a column is None, and this PR introduces a convention of how those columns are represented in the proto. This avoid having to change the proto definition.

Limitation: Right now this only works for np columns, not tf or torch. Support for those two will come in a separate PR.

Breaking Change: The .proto definition is changed in this PR. I added ListOfDocVecProto to properly represent a DocVec's doc_vec_columns, and adjusted DocVecProto accordingly. This will break backwards compatibility for DocVecs serialiazed using older versions of docarray. But those older version were broken in that regard anyways, so it should not be an issue.

TODO:
None handling for

  • tensor columns
  • doc columns
  • doc_vec columns
  • any columns (not sure if this can even be None). Edit: Indeed it cannot be None, so it is already handled. Added test to make sure it works.

Signed-off-by: Johannes Messner <[email protected]>
Signed-off-by: Johannes Messner <[email protected]>
# handle values that were None before serialization
tensor_columns[tens_col_name] = None
else:
# TODO(johannes): handle torch, tf, numpy
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will do this in a separate PR, it might require a proto change (but not sure)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that it require a proto change. I think here you should look at the tensor type of this field in the doc type and use it to load the proto

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same way you do it for doc_columns

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's discuss that in a separate PR

Signed-off-by: Johannes Messner <[email protected]>
Signed-off-by: Johannes Messner <[email protected]>
@JohannesMessner JohannesMessner requested review from samsja and JoanFM June 13, 2023 12:58
Signed-off-by: Johannes Messner <[email protected]>
@JohannesMessner JohannesMessner marked this pull request as ready for review June 13, 2023 13:10
@github-actions
Copy link

📝 Docs are deployed on https://ft-fix-docvec-proto--jina-docs.netlify.app 🎉

Copy link
Member

@samsja samsja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, added on small comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: DocVec.to_protobuf()
2 participants