[MAPREDUCE-7185] Parallelize part files move in FileOutputCommitter - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.0, 2.9.2
Fix Version/s: None
Component/s: None
Labels: None
Target Version/s: 3.5.0

Description

If a map task outputs multiple files, moving them from the temporary directory to the output directory can be slow on object stores (GCS, S3, etc.).

To improve performance, FileOutputCommitter should move the files in parallel whenever there is more than one file to move.
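
The idea can be illustrated with a minimal sketch. This is not the attached patch: it uses `java.nio.file` on a local file system as a stand-in for Hadoop's `FileSystem` rename calls, and the class name `ParallelMoveSketch`, the method `moveAll`, and the `threads` parameter are hypothetical names chosen for illustration. It submits one move per file to a fixed-size thread pool and waits for all of them, so the per-file latency of the store overlaps instead of adding up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not part of Hadoop and not the attached patch.
public class ParallelMoveSketch {

    /**
     * Moves every regular file directly under srcDir into dstDir,
     * running the moves concurrently on a fixed-size thread pool.
     * Returns the number of files moved.
     */
    public static int moveAll(Path srcDir, Path dstDir, int threads)
            throws IOException, InterruptedException {
        Files.createDirectories(dstDir);

        // Collect the files first so the directory stream is closed before moving.
        List<Path> files;
        try (Stream<Path> listing = Files.list(srcDir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        if (files.isEmpty()) {
            return 0;
        }

        // No point in starting more threads than there are files.
        int poolSize = Math.max(1, Math.min(threads, files.size()));
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Path file : files) {
                pending.add(pool.submit(() -> {
                    try {
                        Files.move(file, dstDir.resolve(file.getFileName()),
                                StandardCopyOption.REPLACE_EXISTING);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }));
            }
            // Wait for every move; surface the first failure as an IOException.
            for (Future<?> move : pending) {
                try {
                    move.get();
                } catch (ExecutionException e) {
                    throw new IOException("Parallel move failed", e.getCause());
                }
            }
            return files.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `ParallelMoveSketch.moveAll(tempDir, outputDir, 8)` would move all files with up to eight moves in flight. A real committer would also have to handle nested directories and decide how to clean up after a partial failure, which this sketch leaves out.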

Repro:
Start spark-shell:

spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=2

From spark-shell:

val df = (1 to 10000).toList.toDF("value").withColumn("p", $"value" % 10).repartition(50)
df.write.partitionBy("p").mode("overwrite").format("parquet").options(Map("path" -> s"gs://some/path")).saveAsTable("parquet_partitioned_bench")

With the fix, execution time drops from 130 seconds to 50 seconds.

Attachments

MAPREDUCE-7185.patch (4 kB), attached 11/Feb/19 23:03 by Igor Dvorzhak

Issue Links

relates to: MAPREDUCE-7267 During commitJob, enable merge paths with multi threads (Open)


People

Assignee: Igor Dvorzhak
Reporter: Igor Dvorzhak
Votes: 0
Watchers: 8

Dates

Created: 11/Feb/19 22:51
Updated: 04/Jan/24 09:33
