pyspark dataframe to ELWC format? #328

Open
zahraegh opened this issue Jul 21, 2022 · 1 comment

Comments

@zahraegh

I have seen this solution for writing data in ELWC format, but it does not scale to very large datasets. Is there a way to write in parallel with tf.io.TFRecordWriter, or any other approach that scales to big data?

@hengdashi

hengdashi commented Oct 5, 2022

Hi, I'm not the maintainer of this package, but since I have some experience with it, I'll try to answer this question.

The way I've done this is to use the spark-tensorflow-connector (or spark-tfrecord, if you want partitionBy support) and write the DataFrame out in SequenceExample format. Then, when you build the dataset, you can use SEQ instead of ELWC in the dataset builder. (Note that for the plugin to correctly recognize a feature as an example feature, you need to wrap it in an array of arrays of some primitive type.)
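A minimal sketch of that reshaping, assuming a DataFrame with one row per (query, item) pair. The column names, `nest_item_features`, and `write_sequence_examples` are hypothetical; the `"tfrecords"` format name and the `recordType` option are the ones I recall from the spark-tensorflow-connector README (spark-tfrecord uses `"tfrecord"`), so check them against the version you install.

```python
from collections import defaultdict


def nest_item_features(rows, query_col, item_cols):
    """Plain-Python picture of the target shape.

    One record per query; each per-item feature becomes a list of
    one-element lists, i.e. an "array of arrays" of a primitive type.
    """
    nested = defaultdict(lambda: {c: [] for c in item_cols})
    for row in rows:
        for c in item_cols:
            nested[row[query_col]][c].append([row[c]])
    return dict(nested)


def write_sequence_examples(df, path, query_col, item_cols):
    """Do the same reshaping in Spark, then write SequenceExample records.

    Assumes the connector jar is on the Spark classpath.
    """
    from pyspark.sql import functions as F  # lazy import: only needed on Spark

    # array(col) wraps each value; collect_list gathers them per query,
    # giving array<array<T>> columns. Item order within a query is not
    # guaranteed by collect_list.
    aggs = [F.collect_list(F.array(F.col(c))).alias(c) for c in item_cols]
    grouped = df.groupBy(query_col).agg(*aggs)

    (grouped.write.format("tfrecords")          # assumed connector format name
        .option("recordType", "SequenceExample")
        .mode("overwrite")
        .save(path))
```

Columns left as scalars or flat arrays should end up as context features, and the nested ones as per-example features. On the reading side this pairs with the SEQ data format (`tfr.data.SEQ`) rather than ELWC.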

But if you do want ELWC format from Spark, what I've also done is construct the ELWC examples with an apply function, write each partition to its own TFRecord file, and upload it to your cloud storage.
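A rough sketch of the per-partition approach, not a definitive implementation. It uses `mapPartitionsWithIndex` rather than a DataFrame apply, and it assumes every feature is a float, that `tensorflow` and `tensorflow_serving` are installed on the executors, and that `out_dir` is a path each executor can write to. All function and column names here are hypothetical, and the upload step is left out.

```python
import os
from itertools import groupby


def group_by_query(rows, query_col="qid"):
    """Yield (query_id, [rows]) for consecutive rows sharing a query id.

    Assumes rows arrive sorted by query_col (see sortWithinPartitions below).
    """
    for qid, group in groupby(rows, key=lambda r: r[query_col]):
        yield qid, list(group)


def _float_example(values):
    """Build a tf.train.Example from a {name: number} dict (floats only)."""
    import tensorflow as tf  # lazy import: runs on the executors

    feature = {
        name: tf.train.Feature(float_list=tf.train.FloatList(value=[float(v)]))
        for name, v in values.items()
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def write_partition(index, rows, out_dir, context_cols, item_cols, query_col="qid"):
    """Write one TFRecord file of ELWC protos for a single partition."""
    import tensorflow as tf
    from tensorflow_serving.apis import input_pb2

    path = os.path.join(out_dir, f"part-{index:05d}.tfrecord")
    count = 0
    with tf.io.TFRecordWriter(path) as writer:
        for _, group in group_by_query(rows, query_col):
            elwc = input_pb2.ExampleListWithContext()
            # Context features are taken from the first row of the query.
            elwc.context.CopyFrom(
                _float_example({c: group[0][c] for c in context_cols}))
            for row in group:
                elwc.examples.add().CopyFrom(
                    _float_example({c: row[c] for c in item_cols}))
            writer.write(elwc.SerializeToString())
            count += 1
    yield path, count


def write_elwc(df, out_dir, num_files, context_cols, item_cols, query_col="qid"):
    """Co-locate each query in one partition, then write the files in parallel."""
    prepared = (df.repartition(num_files, query_col)
                  .sortWithinPartitions(query_col))
    return prepared.rdd.mapPartitionsWithIndex(
        lambda i, rows: write_partition(
            i, rows, out_dir, context_cols, item_cols, query_col)
    ).collect()
```

Repartitioning by the query column keeps all rows of a query in the same partition, so each ELWC proto is complete and the writers never need to coordinate. `write_elwc` returns the list of (path, record count) pairs, which is handy for a follow-up upload step.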
