Hive query not working correctly

Date: 2014-12-31 15:09:45

Tags: hive hiveql

I am trying to write a HiveQL equivalent of a MySQL query.

In MySQL, I have a table like this:

CREATE TABLE votes(
 user_id INT UNSIGNED NOT NULL,
 list_id INT UNSIGNED NOT NULL,
 node_id INT UNSIGNED NOT NULL,
 direction ENUM('UP', 'DOWN') NOT NULL, 
 PRIMARY KEY (user_id, list_id, node_id)
) ENGINE=innodb;

I created a similar table in Hive:

CREATE TABLE votes (
 user_id INT,
 list_id INT,
 node_id INT,
 direction STRING
) ROW FORMAT DELIMITED  
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

I copied 6 rows from the MySQL table into the Hive table, so in Hive I have:

hive> SELECT * FROM votes;
OK
28      390     400058  "UP"
28      390     400059  "DOWN"
90113   390     400058  "DOWN"
90113   390     400059  "UP"
323694  390     400058  "UP"
323694  390     400059  "UP"
Time taken: 0.059 seconds, Fetched: 6 row(s)

The following statement works fine in MySQL:

SELECT v1.list_id, v1.node_id, v2.list_id, v2.node_id, 
SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu, 
SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud, 
SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du, 
SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
FROM votes v1
JOIN votes v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;

Output:

390 400058  390 400058  2   0   0   1
390 400058  390 400059  1   1   1   0
390 400059  390 400058  1   1   1   0
390 400059  390 400059  2   0   0   1

However, the same statement does not give the correct counts in Hive:

hive> SELECT v1.list_id AS lid, v1.node_id AS nid, v2.list_id AS rlid, v2.node_id AS rnid,
    > SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu,
    > SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud,
    > SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du,
    > SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
    > FROM votes v1
    > JOIN votes v2 ON v1.user_id=v2.user_id
    > GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;

...

Status: Finished successfully
OK
390     400058  390     400058  0       0       0       0
390     400058  390     400059  0       0       0       0
390     400059  390     400058  0       0       0       0
390     400059  390     400059  0       0       0       0
Time taken: 19.127 seconds, Fetched: 4 row(s)

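How can I fix this?

Also, I found a post where someone mentioned that self-joins are best avoided in Hive. If that is true, could you explain why, and suggest a better query to get what I want?

2 answers:

Answer 0 (score: 1)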

I suggest you use the Hive CSVSerde when creating the table. The double quotes are then handled automatically in SELECT queries, because DEFAULT_QUOTE_CHARACTER in the CSVSerde is "

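-- Note: OpenCSVSerde reads every column as STRING, whatever type is declared here.
CREATE TABLE votes (user_id INT, list_id INT, node_id INT, direction STRING) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
     WITH SERDEPROPERTIES ("separatorChar" = ",") -- matches the comma-delimited file from the question; , is also the default separator
STORED AS TEXTFILE; -- Hive has no CSVFILE storage format; CSV data is stored as TEXTFILE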

Run the SELECT query:

SELECT v1.list_id AS lid, v1.node_id AS nid, 
     v2.list_id AS rlid, v2.node_id AS rnid,
     SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu,
     SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud,
     SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du,
     SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
     FROM votes v1
     JOIN votes v2 ON v1.user_id=v2.user_id
     GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;


+------+---------+-------+---------+-----+-----+-----+-----+--+
| lid  |   nid   | rlid  |  rnid   | uu  | ud  | du  | dd  |
+------+---------+-------+---------+-----+-----+-----+-----+--+
| 390  | 400058  | 390   | 400058  | 2   | 0   | 0   | 1   |
| 390  | 400058  | 390   | 400059  | 1   | 1   | 1   | 0   |
| 390  | 400059  | 390   | 400058  | 1   | 1   | 1   | 0   |
| 390  | 400059  | 390   | 400059  | 2   | 0   | 0   | 1   |
+------+---------+-------+---------+-----+-----+-----+-----+--+
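To confirm that the SerDe really removed the quotes, a quick check along these lines may help. It is only a sketch: it assumes the votes table defined above and uses Hive's built-in length function. If the quotes were stripped, the two values should have lengths 2 and 4; if they are still stored as part of the string, the lengths would be 4 and 6.

```sql
-- List each distinct direction value with its character length.
-- UP / DOWN without quotes have lengths 2 / 4;
-- "UP" / "DOWN" with the quotes stored have lengths 4 / 6.
SELECT direction,
       length(direction) AS len,
       COUNT(*)          AS n
FROM votes
GROUP BY direction;
```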

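Answer 1 (score: 0)

It looks like the quotes are actually part of the UP / DOWN string values, so you need to include them in the comparisons. I was able to get the result you expect with this Hive query: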

SELECT v1.list_id, v1.node_id, v2.list_id, v2.node_id,
  SUM(IF(v1.direction='"UP"' AND v2.direction='"UP"', 1, 0)) AS uu,
  SUM(IF(v1.direction='"UP"' AND v2.direction='"DOWN"', 1, 0)) AS ud,
  SUM(IF(v1.direction='"DOWN"' AND v2.direction='"UP"', 1, 0)) AS du,
  SUM(IF(v1.direction='"DOWN"' AND v2.direction='"DOWN"', 1, 0)) AS dd
FROM votes v1
JOIN votes v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;

Note that the UP / DOWN values are now wrapped in single quotes, so that the double quotes are interpreted as part of the value.
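If you would rather keep the comparisons identical to the MySQL version, another option might be to strip the quotes once in a view and query that instead. This is a sketch rather than a tested solution: the view name votes_unquoted is made up for illustration, and it assumes the quotes only ever appear as wrapping characters around the value.

```sql
-- Hypothetical view name; removes every double quote from direction
-- so that the stored value "UP" compares equal to the literal 'UP'.
CREATE VIEW votes_unquoted AS
SELECT user_id,
       list_id,
       node_id,
       regexp_replace(direction, '"', '') AS direction
FROM votes;

-- Same aggregation as before, now against the cleaned view.
SELECT v1.list_id, v1.node_id, v2.list_id, v2.node_id,
  SUM(IF(v1.direction='UP' AND v2.direction='UP', 1, 0)) AS uu,
  SUM(IF(v1.direction='UP' AND v2.direction='DOWN', 1, 0)) AS ud,
  SUM(IF(v1.direction='DOWN' AND v2.direction='UP', 1, 0)) AS du,
  SUM(IF(v1.direction='DOWN' AND v2.direction='DOWN', 1, 0)) AS dd
FROM votes_unquoted v1
JOIN votes_unquoted v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
```

The trade-off is that the regular expression runs on every read of the view; cleaning the data at load time avoids that cost.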