Question

Running the following Hive query returns special characters:

SELECT t6.amt amt2,t6.color color
FROM(
 SELECT t5.color color, t5.c1 amt
 FROM(
  SELECT t1.c1 c1, t1.c2 AS color 
  from(
   SELECT  7716 AS c1, "Red" AS c2 UNION 
   SELECT  6203 AS c1, "Blue" AS c2
  ) t1
 ) t5
order by color) t6
ORDER BY color

It returns the result as

amt color
4   �
3   �

Is this a known Hive bug?

Explain plan

    Map 5 <- Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 4
         File Output Operator [FS_331359]
            compressed:false
            Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [SEL_331358]
            |  outputColumnNames:["_col0","_col1"]
            |  Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
            |<-Reducer 3 [SIMPLE_EDGE]
               Reduce Output Operator [RS_331357]
                  key expressions:_col1 (type: int)
                  sort order:+
                  Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions:_col0 (type: string)
                  Select Operator [SEL_331351]
                     outputColumnNames:["_col0","_col1"]
                     Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                     Group By Operator [GBY_331350]
                     |  keys:KEY._col0 (type: int), KEY._col1 (type: string)
                     |  outputColumnNames:["_col0","_col1"]
                     |  Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                     |<-Union 2 [SIMPLE_EDGE]
                        |<-Map 1 [CONTAINS]
                        |  Reduce Output Operator [RS_331349]
                        |     key expressions:_col0 (type: int), _col1 (type: string)
                        |     Map-reduce partition columns:_col0 (type: int), _col1 (type: string)
                        |     sort order:++
                        |     Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                        |     Group By Operator [GBY_331348]
                        |        keys:_col0 (type: int), _col1 (type: string)
                        |        outputColumnNames:["_col0","_col1"]
                        |        Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                        |        Select Operator [SEL_331342]
                        |           outputColumnNames:["_col0","_col1"]
                        |           Statistics:Num rows: 1 Data size: 91 Basic stats: COMPLETE Column stats: COMPLETE
                        |           TableScan [TS_331341]
                        |              alias:_dummy_table
                        |              Statistics:Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: COMPLETE
                        |<-Map 5 [CONTAINS]
                           Reduce Output Operator [RS_331349]
                              key expressions:_col0 (type: int), _col1 (type: string)
                              Map-reduce partition columns:_col0 (type: int), _col1 (type: string)
                              sort order:++
                              Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                              Group By Operator [GBY_331348]
                                 keys:_col0 (type: int), _col1 (type: string)
                                 outputColumnNames:["_col0","_col1"]
                                 Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                                 Select Operator [SEL_331344]
                                    outputColumnNames:["_col0","_col1"]
                                    Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                                    TableScan [TS_331343]
                                       alias:_dummy_table
                                       Statistics:Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: COMPLETE

Would disabling or enabling a configuration parameter help here?

If I reverse the order of the columns in the outermost select, the query returns the expected result. I was expecting the result to be

color amt
Blue  6203
Red   7716

Answer 1

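I tried the same query on Hive 2.3 with both MR and Tez and got the same result as you. I turned off all query optimizations, statistics collection and rcp, but the result stayed the same. The problem is that Hive performs ORDER BY on a single reducer, and because you have two consecutive ORDER BY clauses, Hive merges them into a single reduce stage (this is easy to see if you look at the explain, extended or formatted query plan). More precisely, Hive uses _col0, _col1 and so on as column aliases. In the t5 subquery your key is _col0, but in t6 it is _col1, which is why in the select operator you see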

expressions:: "_col1 (type: string), _col0 (type: int)"

and in the reduce output operator

key expressions:: "_col1 (type: int)"

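So this looks like a bug in how the key types are switched when the select columns are swapped: the key is labelled int although the ORDER BY column, color, is a string. If the order of the types is the same in t5 and t6, there is no problem: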

key expressions:: "_col0 (type: string)"
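
The operator lines quoted in this answer come from the query plan. As a minimal sketch of how to reproduce them, assuming the query from the question is run unchanged, the statement can be prefixed with EXPLAIN FORMATTED (EXPLAIN EXTENDED is the more verbose text alternative). Operator IDs and statistics will differ between installations.

```sql
-- Print the plan instead of executing the query.
-- Replace FORMATTED with EXTENDED for the verbose text plan.
EXPLAIN FORMATTED
SELECT t6.amt amt2, t6.color color
FROM (
  SELECT t5.color color, t5.c1 amt
  FROM (
    SELECT t1.c1 c1, t1.c2 AS color
    FROM (
      SELECT 7716 AS c1, "Red" AS c2 UNION
      SELECT 6203 AS c1, "Blue" AS c2
    ) t1
  ) t5
  ORDER BY color
) t6
ORDER BY color;
```

In the output, the "key expressions" entry of the final Reduce Output Operator is the one to compare with the type of the ORDER BY column.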

As for how to avoid this, I really don't know, because running consecutive ORDER BYs in a single reducer is not the result of an extra optimization.
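
One thing that may be worth trying, since the problem only appears when two consecutive ORDER BY clauses are merged, is to drop the inner ORDER BY. It is redundant here because the outer query sorts on the same column anyway. This is an untested sketch, not a confirmed fix: it has not been verified on Hive 1.2 or 2.3, and it assumes the inner sort is not needed for any other reason.

```sql
-- Same query as in the question, with the inner "order by color" removed
-- so that only one ORDER BY remains for Hive to plan.
SELECT t6.amt amt2, t6.color color
FROM (
  SELECT t5.color color, t5.c1 amt
  FROM (
    SELECT t1.c1 c1, t1.c2 AS color
    FROM (
      SELECT 7716 AS c1, "Red" AS c2 UNION
      SELECT 6203 AS c1, "Blue" AS c2
    ) t1
  ) t5
) t6
ORDER BY color;
```

If the inner sort must stay, the question already notes that listing the columns in the outermost select in the same order as in the subquery (color first, then amt) returns the expected rows.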

Hive 1.2 sql returning unexpected special characters
