Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-20187

Incorrect query results in hive when hive.convert.join.bucket.mapjoin.tez is set to true

Log workAgile BoardRank to TopRank to BottomBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 3.1.0
    • None
    • None
    • None
    • Hive 3.0 and Tez 0.91

    Description

      When hive.convert.join.bucket.mapjoin.tez=true and bucketed column is in select clause but not in where clause, hive is performing a bucket map join and returning incorrect results. When the bucketed column is removed from select clause or  hive.convert.join.bucket.mapjoin.tez=false, returned query results are correct.

       

      create table my_fact(AMT decimal(20,3),bucket_col string ,join_col string )

      PARTITIONED BY (FISCAL_YEAR string ,ACCOUNTING_PERIOD string )
      CLUSTERED BY (bucket_col) INTO 10
      BUCKETS
      stored as ORC
      ;
      create table my_dim(join_col string,filter_col string) stored as orc;

      After populating and analyzing above tables, explain  plan looks as below when  hive.convert.join.bucket.mapjoin.tez=TRUE:

       

      explain  select T4.join_col as account1,my_fact.accounting_period
      FROM my_fact JOIN my_dim T4 ON my_fact.join_col = T4.join_col
      WHERE my_fact.fiscal_year = '2015'
      AND T4.filter_col IN ( 'VAL1', 'VAL2' )
      and my_fact.accounting_period in (10);

      Vertex dependency in root stage
      Map 1 <- Map 2 (CUSTOM_EDGE)

      Stage-0
      Fetch Operator
      limit:-1
      Stage-1
      Map 1 vectorized, llap
      File Output Operator [FS_24]
      Select Operator [SEL_23] (rows=15282589 width=291)
      Output:["_col0","_col1","_col2"]
      Map Join Operator [MAPJOIN_22] (rows=15282589 width=291)
      BucketMapJoin:true,Conds:SEL_21._col1=RS_19._col0(Inner),Output:["_col0","_col3","_col4"]
      <-Map 2 [CUSTOM_EDGE] vectorized, llap
      MULTICAST [RS_19]
      PartitionCols:_col0
      Select Operator [SEL_18] (rows=818 width=186)
      Output:["_col0"]
      Filter Operator [FIL_17] (rows=818 width=186)
      predicate(filter_col) IN ('VAL1', 'VAL2') and join_col is not null)
      TableScan [TS_3] (rows=1635 width=186)
      default@my_dim,t4,Tbl:COMPLETE,Col:NONE,Output:["join_col","filter_col"]
      <-Select Operator [SEL_21] (rows=13893263 width=291)
      Output:["_col0","_col1","_col3"]
      Filter Operator [FIL_20] (rows=13893263 width=291)
      predicate:join_col is not null
      TableScan [TS_0] (rows=13893263 width=291)
      default@my_fact,my_fact,Tbl:COMPLETE,Col:NONE,Output:["bucket_col","join_col"]

      extended_explain.txt has more detailed plan.

      When  hive.convert.join.bucket.mapjoin.tez=false,  plan no longer has bucketjoin and query results are correct.

      Vertex dependency in root stage
      Map 1 <- Map 2 (BROADCAST_EDGE)

      Stage-0
      Fetch Operator
      limit:-1
      Stage-1
      Map 1 vectorized, llap
      File Output Operator [FS_24]
      Select Operator [SEL_23] (rows=15282589 width=291)
      Output:["_col0","_col1","_col2"]
      Map Join Operator [MAPJOIN_22] (rows=15282589 width=291)
      Conds:SEL_21._col1=RS_19._col0(Inner),Output:["_col0","_col3","_col4"]
      <-Map 2 [BROADCAST_EDGE] vectorized, llap
      BROADCAST [RS_19]
      PartitionCols:_col0
      Select Operator [SEL_18] (rows=818 width=186)
      Output:["_col0"]
      Filter Operator [FIL_17] (rows=818 width=186)
      predicate(filter_col) IN ('VAL1', 'VAL2') and join_col is not null)
      TableScan [TS_3] (rows=1635 width=186)
      default@my_dim,t4,Tbl:COMPLETE,Col:NONE,Output:["join_col","filter_col"]
      <-Select Operator [SEL_21] (rows=13893263 width=291)
      Output:["_col0","_col1","_col3"]
      Filter Operator [FIL_20] (rows=13893263 width=291)
      predicate:join_col is not null
      TableScan [TS_0] (rows=13893263 width=291)
      default@my_fact,my_fact,Tbl:COMPLETE,Col:NONE,Output:

       

      Per suggestion in bug# TEZ-3971, creating this case under Hive project.

       

       

      Attachments

        1. extended_explain.txt
          12 kB
          Karthik
        2. HIVE-20187.1.patch
          4 kB
          Jaume M
        3. HIVE-20187.2.patch
          16 kB
          Deepak Jaiswal
        4. HIVE-20187.3.patch
          39 kB
          Deepak Jaiswal

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            djaiswal Deepak Jaiswal Assign to me
            Palaniappan Karthik
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment