Skip to content

feat: i/o for DocVec #1562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jun 28, 2023
Merged

feat: i/o for DocVec #1562

merged 26 commits into from
Jun 28, 2023

Conversation

JohannesMessner
Copy link
Member

@JohannesMessner JohannesMessner commented May 22, 2023

This PR implements the IOMixinArray for DocVec, enabling the following (de)serializations:

Depends on #1561 (protobuf serialization bug)

JSON data now contains four columns, like this:

{"tensor_columns":{"tens":[[0.18736880416567359,0.9418134463574087,0.23608414651162635,0.200237847284912,0.714415793571785],[0.2980003926683161,0.9763157895635218,0.605654616422409,0.3988857003524561,0.9359414534852715],[0.6382713675712178,0.7725828308842624,0.1442529379493993,0.9327649675846935,0.7312168445052849],[0.07432756375020888,0.84017563415619,0.895668448511279,0.636209667658674,0.299134816130166],[0.1693478076528051,0.039113459980241405,0.3005636243582387,0.06480147049887997,0.09461133654601517]],"tens_none":null},"doc_columns":{"inner":{"tensor_columns":{"tens":[[0.27449474647200733,0.10957290479279247,0.6840416906244843,0.02922130361087716,0.1610742166462915],[0.27449474647200733,0.10957290479279247,0.6840416906244843,0.02922130361087716,0.1610742166462915],[0.27449474647200733,0.10957290479279247,0.6840416906244843,0.02922130361087716,0.1610742166462915],[0.27449474647200733,0.10957290479279247,0.6840416906244843,0.02922130361087716,0.1610742166462915],[0.27449474647200733,0.10957290479279247,0.6840416906244843,0.02922130361087716,0.1610742166462915]]},"doc_columns":{},"docs_vec_columns":{"tens":[[0.27449474647200733,0.10957290479279247,0.6840416906244843,0.02922130361087716,0.1610742166462915],[0.27449474647200733,0.10957290479279247,0.6840416906244843,0.02922130361087716,0.1610742166462915],[0.27449474647200733,0.10957290479279247,0.6840416906244843,0.02922130361087716,0.1610742166462915],[0.27449474647200733,0.10957290479279247,0.6840416906244843,0.02922130361087716,0.1610742166462915],[0.27449474647200733,0.10957290479279247,0.6840416906244843,0.02922130361087716,0.1610742166462915]]},"any_columns":{"id":["1fa4159e51fdfa029b2e2bbb45be0e58","1fa4159e51fdfa029b2e2bbb45be0e58","1fa4159e51fdfa029b2e2bbb45be0e58","1fa4159e51fdfa029b2e2bbb45be0e58","1fa4159e51fdfa029b2e2bbb45be0e58"]}},"inner_none":null},"docs_vec_columns":{"tens":[[0.18736880416567359,0.9418134463574087,0.23608414651162635,0.200237847284912,0.714415793571785],[0.2980003926683161,0.9763157895635218,0.605654616422409,0.3988857003524561,0.9359414534852715],[0.6382713675712178,0.7725828308842624,0.1442529379493993,0.9327649675846935,0.7312168445052849],[0.07432756375020888,0.84017563415619,0.895668448511279,0.636209667658674,0.299134816130166],[0.1693478076528051,0.039113459980241405,0.3005636243582387,0.06480147049887997,0.09461133654601517]],"tens_none":null},"any_columns":{"id":["5f2575c95a52380be1a9a68e26cd19c5","a5015ccad05d1b7ead1dce92650e5e8f","f14b45ea6384f235f990f40cddf39444","fe0004643fb7aadb7813438727f42dd1","489f382c9887c3fe3ef075a735722ff4"],"text":["0","1","2","3","4"],"num":[null,null,null,null,null]}}

CSV is not supported, since it is a row-based format.

All other formats are used the same way as in DocList, but have a different serialized representation.

Closes #1330

Signed-off-by: Johannes Messner <[email protected]>
Signed-off-by: Johannes Messner <[email protected]>
Signed-off-by: Johannes Messner <[email protected]>
Signed-off-by: Johannes Messner <[email protected]>
Signed-off-by: Johannes Messner <[email protected]>
Signed-off-by: Johannes Messner <[email protected]>
Signed-off-by: Johannes Messner <[email protected]>
@JohannesMessner JohannesMessner requested review from samsja and JoanFM June 14, 2023 15:35
@JohannesMessner JohannesMessner marked this pull request as ready for review June 14, 2023 15:35
@JohannesMessner JohannesMessner marked this pull request as draft June 14, 2023 15:36
Copy link
Member

@samsja samsja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets discuss I don't think we should implement this that way.

If I am understanding correctly here the json representation of a DocList and a DocVec is the same right ?

IMO DocVec should have its own column based json representation.

For csv etcc I would say we ask first user to instantiate a DocList then cast to docVec

@JohannesMessner
Copy link
Member Author

IMO DocVec should have its own column based json representation.

Fair

For csv etcc I would say we ask first user to instantiate a DocList then cast to docVec

Why? The whole point of this issue is to natively enable these, no? Why do you think DocVec should not have these methods?

@samsja
Copy link
Member

samsja commented Jun 19, 2023

IMO DocVec should have its own column based json representation.

Fair

For csv etcc I would say we ask first user to instantiate a DocList then cast to docVec

Why? The whole point of this issue is to natively enable these, no? Why do you think DocVec should not have these methods?

My point is that saving line by line in csv for instance is a bit weird for a column base format. But maybe yes we can have them, I just don't want it to be confusing

@JohannesMessner
Copy link
Member Author

IMO DocVec should have its own column based json representation.

Fair

For csv etcc I would say we ask first user to instantiate a DocList then cast to docVec

Why? The whole point of this issue is to natively enable these, no? Why do you think DocVec should not have these methods?

My point is that saving line by line in csv for instance is a bit weird for a column base format. But maybe yes we can have them, I just don't want it to be confusing

yeah this is fair, I will address that

@JohannesMessner JohannesMessner requested a review from samsja June 27, 2023 13:13
@JohannesMessner JohannesMessner marked this pull request as ready for review June 27, 2023 13:13
Copy link
Member

@JoanFM JoanFM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very nice!

@JohannesMessner JohannesMessner merged commit 8f25887 into main Jun 28, 2023
@JohannesMessner JohannesMessner deleted the feat-docvec-io branch June 28, 2023 07:18
@github-actions
Copy link

📝 Docs are deployed on https://ft-feat-docvec-io--jina-docs.netlify.app 🎉

@JoanFM JoanFM mentioned this pull request Jul 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IOMixin for DocVec (to/from)
3 participants