Using sorted tables in Hive

Time: 2011-08-03 23:01:20

Tags: hadoop hive

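Summary: I think my system is ignoring the notion of a presorted table. I was hoping to save time on the sorting step, since I am using presorted data, but the query plan seems to indicate an intermediate sort step.

The gory details follow:

Setup =======

I set the following flags: =============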

set hive.enforce.bucketing = true;
set mapred.reduce.tasks=8;
set mapred.map.tasks=8;

Here I create a table to hold a temporary copy of the data on disk ========

CREATE TABLE trades
      (symbol STRING, exchange STRING, price FLOAT, volume INT, cond INT,
       bid FLOAT, ask FLOAT, time STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE;

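Here I copy the data on disk into the table. By the way, the data here is already clustered by symbol and sorted by time. I can't seem to get Hive to make use of that fact, i.e. to avoid sorting again.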

LOAD DATA LOCAL INPATH '%(dir)s2010-05-07'
INTO TABLE trades
partition (dt='2010-05-07');

I use the following final table to enforce bucketing and impose the sort order ===========

CREATE TABLE alltrades
      (symbol STRING, exchange STRING, price FLOAT, volume INT, cond INT,
       bid FLOAT, ask FLOAT, time STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE;

The data is loaded from the Hive table ==========

insert overwrite table alltrades
select symbol, exchange, price, volume, cond, bid, ask, time
from trades
distribute by symbol sort by symbol, time;

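It is disappointing to see that every query against alltrades that needs rows sorted by symbol, time sorts them again. Is there any way around this? Also, is there a way to make the whole process work in 1 query step instead of 2?

Why SORTING doesn't seem to work =======

Note that the table was constructed and populated using a sort by clause. My worry is that dropping these clauses would cause future reducers to behave as if no sorting were needed.

Below is the plan for a query that I think should not involve a sort, but actually does. ========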
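One way to see what Hive has recorded about the table's layout is to inspect its metadata. This is a minimal sketch, assuming a Hive version that supports DESCRIBE FORMATTED; it is not part of the original question. It reports only the bucketing and sort columns declared in the DDL, not whether the files on disk actually honor them.

```sql
-- Assumption: DESCRIBE FORMATTED is available in the Hive version in use.
-- Look for the bucket count, bucket columns, and sort columns in the output.
describe formatted alltrades;
```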

hive> explain select symbol, time, price from alltrades sort by symbol, time;
OK
ABSTRACT SYNTAX TREE:
 (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades)))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL
time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol))
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time)))))

STAGE DEPENDENCIES:
 Stage-1 is a root stage
 Stage-0 is a root stage

STAGE PLANS:
 Stage: Stage-1
   Map Reduce
     Alias -> Map Operator Tree:
       alltrades
         TableScan
           alias: alltrades
           Select Operator
             expressions:
                   expr: symbol
                   type: string
                   expr: time
                   type: string
                   expr: price
                   type: float
             outputColumnNames: _col0, _col1, _col2
             Reduce Output Operator
               key expressions:
                     expr: _col0
                     type: string
                     expr: _col1
                     type: string
               sort order: ++
               tag: -1
               value expressions:
                     expr: _col0
                     type: string
                     expr: _col1
                     type: string
                     expr: _col2
                     type: float
     Reduce Operator Tree:
       Extract
         File Output Operator
           compressed: false
           GlobalTableId: 0
           table:
               input format: org.apache.hadoop.mapred.TextInputFormat
               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

 Stage: Stage-0
   Fetch Operator
     limit: -1

3 answers:

Answer 0: (score: 4)

Have you checked the effect of set hive.enforce.sorting=true? From http://svn.apache.org/repos/asf/hive/branches/branch-0.7/conf/hive-default.xml

<property>
  <name>hive.enforce.sorting</name>
  <value>false</value>
  <description>Whether sorting is enforced. If true, while inserting into the table, sorting is enforced. </description>
</property>

You may also find it useful to read the implementation of org.apache.hadoop.hive.ql.parse.SemanticAnalyzer#genBucketingSortingDest:

http://svn.apache.org/repos/asf/hive/branches/branch-0.7/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
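As a minimal sketch of what this suggestion could look like in practice, assuming a Hive release (such as the 0.7 branch linked above) that still exposes hive.enforce.sorting: with both enforcement flags on, Hive is expected to add the distribution and sort steps itself when inserting into a bucketed, sorted table, so the explicit distribute by / sort by clauses can be dropped. The table and column names are those from the question; the query itself is illustrative and was not posted by the answerer.

```sql
-- Assumption: Hive 0.7-era settings; both properties are off by default there.
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;

-- With enforcement on, Hive derives the number of reducers (8) and the
-- per-bucket sort order (symbol, time) from the alltrades table definition.
insert overwrite table alltrades
select symbol, exchange, price, volume, cond, bid, ask, time
from trades;
```

This affects only how data is written. A later query with its own sort by clause will still show a reduce-side sort in its plan.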

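Answer 1: (score: 4)

hive.enforce.bucketing does not sort the data set globally. Instead, it writes data that is sorted within each bucket (in your case, 8 buckets per partition). A global sort step is therefore still needed to satisfy the query you are running.

Hope this helps, Nat

Answer 2: (score: 0)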
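To make the distinction between per-bucket and global ordering concrete, here is a small sketch contrasting the two sorting clauses on the alltrades table from the question. It relies only on standard HiveQL semantics; neither query comes from the answer itself.

```sql
-- Per-reducer ordering: each reducer's output is sorted by (symbol, time),
-- and DISTRIBUTE BY keeps all rows for a given symbol on the same reducer.
select symbol, time, price
from alltrades
distribute by symbol
sort by symbol, time;

-- Total ordering across the whole result. Hive funnels this through a
-- single reducer, so it can be slow on large inputs.
select symbol, time, price
from alltrades
order by symbol, time;
```

If only per-symbol time order matters, the first form is usually sufficient and scales across reducers.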

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

The CLUSTERED BY and SORTED BY creation commands do not affect how
data is inserted into a table – only how it is read. This means that
users must be careful to insert data correctly by specifying the
number of reducers to be equal to the number of buckets, and using
CLUSTER BY and SORT BY commands in their query.

See also https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy