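Summary: I think my system is ignoring the notion of pre-sorted tables. - I expect to save time on the sorting step because I am using pre-sorted data, but the query plan seems to indicate an intermediate sort step.
The gory details follow:
Setup =======
I set the following flags: =============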
set hive.enforce.bucketing = true;
set mapred.reduce.tasks=8;
set mapred.map.tasks=8;
Here I create a table to hold a temporary copy of the data on disk ========
CREATE TABLE trades
(symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
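Here I copy the data from disk into the table. Incidentally, the data here is already clustered by symbol and sorted by time. I can't seem to get Hive to make use of this ... i.e., to avoid sorting again.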
LOAD DATA LOCAL INPATH '%(dir)s2010-05-07'
INTO TABLE trades
partition (dt='2010-05-07');
I use the following final table to enforce the bucketing =========== and to impose the sort order ===========
CREATE TABLE alltrades
(symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
The data is loaded from the Hive table ==========
insert overwrite table alltrades
select symbol, exchange, price, volume, cond, bid, ask, time
from trades
distribute by symbol sort by symbol, time;
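It is disappointing to see that every query on alltrades that needs the data sorted by symbol, time sorts it again ... is there a way around this? Also, is there a way to make the whole process work in 1 query step instead of 2?
Why SORTING does not seem to work =======
Note that the table was constructed and populated using the sort by clause. I am worried that dropping these clauses would make future reducers behave as though no sorting were needed.
Below is the plan for a query that I believe should not involve a sort ... but actually does. ========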
hive> explain select symbol, time, price from alltrades sort by symbol, time;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades)))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL
time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol))
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
alltrades
TableScan
alias: alltrades
Select Operator
expressions:
expr: symbol
type: string
expr: time
type: string
expr: price
type: float
outputColumnNames: _col0, _col1, _col2
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
sort order: ++
tag: -1
value expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: float
Reduce Operator Tree:
Extract
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Answer 0: (score: 4)
Have you checked the effect of set hive.enforce.bucketing=true?
From http://svn.apache.org/repos/asf/hive/branches/branch-0.7/conf/hive-default.xml:
<property>
<name>hive.enforce.sorting</name>
<value>false</value>
<description>Whether sorting is enforced. If true, while inserting into the table, sorting is enforced. </description>
</property>
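As a minimal sketch of how this flag might be used, assuming a Hive release that still honors both hive.enforce.bucketing and hive.enforce.sorting (the branch-0.7 configuration quoted above documents the latter), the insert from the question could be written without its explicit distribute by / sort by clauses and left to Hive to cluster and sort for the target table. Whether the resulting plan matches the asker's hand-written version is an assumption to verify with explain, not a confirmed result.

```sql
-- Assumption: both flags are honored by the Hive version in use.
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;

-- Same source and target tables as in the question; the clustering and
-- sorting are expected to come from the table definition of alltrades
-- (CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS).
insert overwrite table alltrades
select symbol, exchange, price, volume, cond, bid, ask, time
from trades;
```

This only concerns how the data is written; it does not by itself remove the sort stage from later queries that ask for sort by.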
You may also find it useful to read the implementation of org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer#genBucketingSortingDest.
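Answer 1: (score: 4)
hive.enforce.bucketing
does not sort the dataset globally. Instead, it writes data that is sorted within each bucket (in your case, 8 buckets per partition). A global sort step is therefore still required to satisfy the query you are after.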
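To make the distinction between per-bucket and global ordering concrete, here is a small sketch against the asker's alltrades table. In Hive, sort by orders rows only within each reducer's output, while order by produces a single total order over the whole result (Hive runs it through one reducer).

```sql
-- Per-reducer ordering: each reducer's output is sorted by (symbol, time),
-- but rows from different reducers are not ordered relative to each other.
select symbol, time, price
from alltrades
sort by symbol, time;

-- Total ordering: one globally sorted result.
select symbol, time, price
from alltrades
order by symbol, time;
```

Running explain on both statements is one way to compare the sort stages Hive plans for each.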
Hope this helps, Nat
Answer 2: (score: 0)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
See also https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy