I am trying to write a HiveQL equivalent of a MySQL query.
In MySQL, I have a table like this:
CREATE TABLE votes(
user_id INT UNSIGNED NOT NULL,
list_id INT UNSIGNED NOT NULL,
node_id INT UNSIGNED NOT NULL,
direction ENUM('UP', 'DOWN') NOT NULL,
PRIMARY KEY (user_id, list_id, node_id)
) ENGINE=innodb;
I created a similar table in Hive:
CREATE TABLE votes (
user_id INT,
list_id INT,
node_id INT,
direction STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
I copied 6 rows from the MySQL table into the Hive table, so in Hive I get:
hive> SELECT * FROM votes;
OK
28 390 400058 "UP"
28 390 400059 "DOWN"
90113 390 400058 "DOWN"
90113 390 400059 "UP"
323694 390 400058 "UP"
323694 390 400059 "UP"
Time taken: 0.059 seconds, Fetched: 6 row(s)
The following statement works fine in MySQL:
SELECT v1.list_id, v1.node_id, v2.list_id, v2.node_id,
SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu,
SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud,
SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du,
SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
FROM votes v1
JOIN votes v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
Output:
390 400058 390 400058 2 0 0 1
390 400058 390 400059 1 1 1 0
390 400059 390 400058 1 1 1 0
390 400059 390 400059 2 0 0 1
However, the same statement does not give the correct counts in Hive:
hive> SELECT v1.list_id AS lid, v1.node_id AS nid, v2.list_id AS rlid, v2.node_id AS rnid,
> SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu,
> SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud,
> SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du,
> SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
> FROM votes v1
> JOIN votes v2 ON v1.user_id=v2.user_id
> GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
...
Status: Finished successfully
OK
390 400058 390 400058 0 0 0 0
390 400058 390 400059 0 0 0 0
390 400059 390 400058 0 0 0 0
390 400059 390 400059 0 0 0 0
Time taken: 19.127 seconds, Fetched: 4 row(s)
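How can I fix this?
Also, I found a post in which someone mentioned that it is best to avoid self-joins in Hive. If that is true, can you explain why, and suggest a better query to achieve what I want?
Answer 0 (score: 1)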
I suggest you use the Hive CSVSerde when creating the table. The double quotes will then be handled automatically in SELECT queries, because DEFAULT_QUOTE_CHARACTER in CSVSerde is ".
CREATE TABLE votes (user_id INT, list_id INT, node_id INT, direction STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",") -- "," is the default separator; use "\t" for tab-separated data
STORED AS TEXTFILE;
Then run the SELECT query:
SELECT v1.list_id AS lid, v1.node_id AS nid,
v2.list_id AS rlid, v2.node_id AS rnid,
SUM(IF(v1.direction="UP" AND v2.direction="UP", 1, 0)) AS uu,
SUM(IF(v1.direction="UP" AND v2.direction="DOWN", 1, 0)) AS ud,
SUM(IF(v1.direction="DOWN" AND v2.direction="UP", 1, 0)) AS du,
SUM(IF(v1.direction="DOWN" AND v2.direction="DOWN", 1, 0)) AS dd
FROM votes v1
JOIN votes v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
+------+---------+-------+---------+-----+-----+-----+-----+--+
| lid | nid | rlid | rnid | uu | ud | du | dd |
+------+---------+-------+---------+-----+-----+-----+-----+--+
| 390 | 400058 | 390 | 400058 | 2 | 0 | 0 | 1 |
| 390 | 400058 | 390 | 400059 | 1 | 1 | 1 | 0 |
| 390 | 400059 | 390 | 400058 | 1 | 1 | 1 | 0 |
| 390 | 400059 | 390 | 400059 | 2 | 0 | 0 | 1 |
+------+---------+-------+---------+-----+-----+-----+-----+--+
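To confirm that the SerDe is actually stripping the quotes, a quick check of the distinct values and their lengths can help. This is only a sketch against the votes table above; the aliases len and n are made up for illustration. If the quotes are gone, UP should have length 2 and DOWN length 4; lengths of 4 and 6 mean the quotes are still stored as part of the value.

```sql
-- Inspect the stored direction values and their lengths.
-- Lengths 2 (UP) and 4 (DOWN) mean the quotes were stripped;
-- lengths 4 ("UP") and 6 ("DOWN") mean they are still in the data.
SELECT direction,
       length(direction) AS len,
       COUNT(*)          AS n
FROM votes
GROUP BY direction;
```

Note that OpenCSVSerde reads every column as a string regardless of the declared type, so an explicit CAST may be needed wherever numeric behavior on the ID columns matters.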
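Answer 1 (score: 0)
It looks like the quotes are actually part of the UP/DOWN string values, so you need to include them in the comparisons. I was able to get the results you expect with this Hive query: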
SELECT v1.list_id, v1.node_id, v2.list_id, v2.node_id,
SUM(IF(v1.direction='"UP"' AND v2.direction='"UP"', 1, 0)) AS uu,
SUM(IF(v1.direction='"UP"' AND v2.direction='"DOWN"', 1, 0)) AS ud,
SUM(IF(v1.direction='"DOWN"' AND v2.direction='"UP"', 1, 0)) AS du,
SUM(IF(v1.direction='"DOWN"' AND v2.direction='"DOWN"', 1, 0)) AS dd
FROM votes v1
JOIN votes v2 ON v1.user_id=v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
Note that the UP/DOWN values are now wrapped in single quotes, so that the double quotes are interpreted as part of the value.
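If you would rather keep the comparisons as plain UP/DOWN, another option is to strip the quotes at read time. The following is a minimal sketch, assuming a Hive version with common table expression support (0.13 or later); the CTE name clean is made up for illustration.

```sql
-- Strip the embedded double quotes once, then self-join the cleaned rows.
WITH clean AS (
  SELECT user_id,
         list_id,
         node_id,
         regexp_replace(direction, '"', '') AS direction  -- '"UP"' becomes 'UP'
  FROM votes
)
SELECT v1.list_id AS lid, v1.node_id AS nid,
       v2.list_id AS rlid, v2.node_id AS rnid,
       SUM(IF(v1.direction = 'UP'   AND v2.direction = 'UP',   1, 0)) AS uu,
       SUM(IF(v1.direction = 'UP'   AND v2.direction = 'DOWN', 1, 0)) AS ud,
       SUM(IF(v1.direction = 'DOWN' AND v2.direction = 'UP',   1, 0)) AS du,
       SUM(IF(v1.direction = 'DOWN' AND v2.direction = 'DOWN', 1, 0)) AS dd
FROM clean v1
JOIN clean v2 ON v1.user_id = v2.user_id
GROUP BY v1.list_id, v1.node_id, v2.list_id, v2.node_id;
```

This leaves the stored data untouched, so it is probably best treated as a stopgap; cleaning the values at load time avoids repeating the regexp_replace in every query.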