Spark / SPARK-39910

DataFrameReader API cannot read files from Hadoop archives (.har)

Details

    Description

      Reading a file from a Hadoop archive with the DataFrameReader API returns an empty Dataset:

      scala> val df = spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
      df: org.apache.spark.sql.Dataset[String] = [value: string]
      scala> df.count
      res7: Long = 0 

       

      By contrast, reading the same file from the same Hadoop archive with the RDD API yields the correct result:

      scala> val df = sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
      df: org.apache.spark.sql.DataFrame = [value: string]
      scala> df.count
      res8: Long = 5589 
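
      Until the DataFrameReader behaviour is fixed, the RDD-based read shown above could be wrapped so that callers still get a Dataset[String]. The following is only a minimal sketch for spark-shell: textFileViaRdd is a hypothetical helper name, not part of Spark, and it has only the RDD result reported above to support it.

      ```scala
      import org.apache.spark.sql.{Dataset, SparkSession}

      // Hypothetical helper (not a Spark API): read a text file through the RDD API
      // and expose the lines as a Dataset[String], the same type that
      // spark.read.textFile returns.
      def textFileViaRdd(spark: SparkSession, path: String): Dataset[String] = {
        import spark.implicits._ // provides the String encoder and the toDS() conversion
        spark.sparkContext.textFile(path).toDS()
      }

      // Usage with the path from this report:
      val ds = textFileViaRdd(spark, "har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
      ds.count
      ```

      This sidesteps DataFrameReader entirely, so it does not explain why the reader returns no rows for har:// paths; it only keeps downstream code working on the Dataset API.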

            People

              preaudc Christophe Préaud