[SPARK-39910] DataFrameReader API cannot read files from hadoop archives (.har) - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.0.3, 3.1.3, 3.3.0, 3.2.2
Fix Version/s: 4.0.0, 3.5.1
Component/s: SQL
Labels:
- DataFrameReader
- pull-request-available

Description

Reading a file from a Hadoop archive (.har) with the DataFrameReader API returns an empty Dataset:

scala> val df = spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
df: org.apache.spark.sql.Dataset[String] = [value: string]
scala> df.count
res7: Long = 0

Reading the same file from the same Hadoop archive with the RDD API, on the other hand, yields the correct result:

scala> val df = sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.count
res8: Long = 5589
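
For versions listed under Affects Version/s, the RDD-based read shown above could be wrapped in a small helper so that callers still receive a Dataset[String]. This is only a sketch of a possible workaround, not the fix that was merged: the object name HarTextReader and the method textFileViaRdd are hypothetical and not part of Spark.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical helper, not part of Spark: a workaround sketch for affected versions.
object HarTextReader {

  // Reads text through the RDD API (SparkContext.textFile), which the report
  // shows returning rows for har:// paths, then converts the RDD[String]
  // into a Dataset[String] with a single "value" column.
  def textFileViaRdd(spark: SparkSession, path: String): Dataset[String] = {
    import spark.implicits._ // brings the String encoder required by toDS() into scope
    spark.sparkContext.textFile(path).toDS()
  }
}
```

In spark-shell this might be called as `HarTextReader.textFileViaRdd(spark, "har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")`. Note that this route bypasses DataFrameReader entirely, so DataFrameReader options and file-source features do not apply to it.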

Issue Links

contains: SPARK-26631 Issue while reading Parquet data from Hadoop Archive files (.har) (Resolved)

links to: GitHub Pull Request #43463

People

Assignee: Christophe Préaud

Reporter: Christophe Préaud

Votes: 0

Watchers: 3

Dates

Created: 28/Jul/22 12:52

Updated: 08/Feb/24 12:30

Resolved: 08/Feb/24 12:30