I have two tables:
items:
CREATE TABLE items (
ID int,
TXT string,
CODE string
);
INSERT INTO items VALUES (1,'AA BB CC','ZZ-100');
INSERT INTO items VALUES (2,'BB CC DD','ZZ-200');
INSERT INTO items VALUES (3,'AA CC EE','ZZ-300');
INSERT INTO items VALUES (4,'EE FF GG','ZZ-400');
INSERT INTO items VALUES (5,'CC HH II','ZZ-500');
+----+----------+--------+
| id | txt | code |
+----+----------+--------+
| 1 | AA BB CC | ZZ-100 |
| 2 | BB CC DD | ZZ-200 |
| 3 | AA CC EE | ZZ-300 |
| 4 | EE FF GG | ZZ-400 |
| 5 | CC HH II | ZZ-500 |
+----+----------+--------+
and regex_table:
CREATE TABLE regex_table (
ID int,
REGEXSTR string,
CODE string
);
INSERT INTO regex_table VALUES(1,'AA','ZZ-100');
INSERT INTO regex_table VALUES(1,'CC','ZZ-100');
INSERT INTO regex_table VALUES(2,'AA','ZZ-100');
INSERT INTO regex_table VALUES(2,'BB','ZZ-200');
INSERT INTO regex_table VALUES(2,'CC','ZZ-200');
INSERT INTO regex_table VALUES(3,'DD','ZZ-100');
INSERT INTO regex_table VALUES(3,'DD','ZZ-300');
+----+----------+--------+
| id | regexstr | code |
+----+----------+--------+
| 1 | AA | ZZ-100 |
| 1 | CC | ZZ-100 |
| 2 | BB | ZZ-200 |
| 2 | AA | ZZ-100 |
| 2 | CC | ZZ-200 |
| 3 | DD | ZZ-100 |
| 3 | DD | ZZ-300 |
+----+----------+--------+
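I want to replace parts of items.txt using the search strings in regex_table.regexstr, taking only the regex_table rows whose id and code both match the items row.
For example: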
Scenario 1: for id=1 the code is ZZ-100, so the search string is AA|CC:
SELECT id,regexp_replace(txt,'AA|CC','<NA>'),code from items where id=1;
+----+--------------------------------------+--------+
| id | regexp_replace(txt, 'aa|cc', '<na>') | code |
+----+--------------------------------------+--------+
| 1 | <NA> BB <NA> | ZZ-100 |
+----+--------------------------------------+--------+
Scenario 2: for id=2 the code is ZZ-200, so the search string is BB|CC:
SELECT id,regexp_replace(txt,'BB|CC','<NA>'),code from items where id=2;
+----+--------------------------------------+--------+
| id | regexp_replace(txt, 'bb|cc', '<na>') | code |
+----+--------------------------------------+--------+
| 2 | <NA> <NA> DD | ZZ-200 |
+----+--------------------------------------+--------+
Scenario 3: for id=3 the code is ZZ-300, so the search string is DD:
SELECT id,regexp_replace(txt,'DD','<NA>'),code from items where id=3;
+----+-----------------------------------+--------+
| id | regexp_replace(txt, 'dd', '<na>') | code |
+----+-----------------------------------+--------+
| 3 | AA CC EE | ZZ-300 |
+----+-----------------------------------+--------+
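So, basically, the search string has to be dynamic, depending on the id and code coming from the other table.
Is it possible to do this in a single query in Impala (important) and Hive (less important)?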
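Notes:
The id and code values are dynamic, and new rows can be added to both tables, so they cannot be hard-coded in the SQL. They have to be queried.
I would like to avoid a JOIN if possible. I wonder whether there is a way to do this with a subquery.
One idea is to pass in a single string containing all the concatenated regex search strings, and then use some regex trick to drop the parts whose id and code do not belong to the current row.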
Update 1
I tried:
SELECT i.id, regexp_replace(txt, pattern, '<NA>'), i.code FROM items i INNER JOIN (SELECT id, group_concat('|', regexstr) AS pattern, regex_table.code FROM regex_table GROUP BY regex_table.id, regex_table.code) r ON r.id = i.id AND r.code = i.code;
and got:
+----+----------------------------------------------+--------+
| id | regexp_replace(txt, pattern, '<na>') | code |
+----+----------------------------------------------+--------+
| 1 | <NA>A<NA>A<NA> <NA>B<NA>B<NA> <NA> | ZZ-100 |
| 3 | <NA>A<NA>A<NA> <NA>C<NA>C<NA> <NA>E<NA>E<NA> | ZZ-300 |
| 2 | <NA>B<NA>B<NA> <NA> <NA>D<NA>D<NA> | ZZ-200 |
+----+----------------------------------------------+--------+
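A likely cause of this output, offered as a sketch rather than a confirmed diagnosis: Impala's group_concat takes the expression first and the separator second, so group_concat('|', regexstr) concatenates literal '|' characters and uses regexstr as the separator. The resulting pattern presumably contains empty alternatives, which match at every position in txt. Swapping the arguments should give patterns such as AA|CC. The aliases r, pattern and masked are kept from the attempt above or chosen for illustration.

```sql
-- Same query as the attempt above, with the group_concat arguments swapped:
-- the expression to aggregate comes first, the separator second.
SELECT i.id,
       regexp_replace(i.txt, r.pattern, '<NA>') AS masked,
       i.code
FROM items i
INNER JOIN (SELECT id,
                   code,
                   group_concat(regexstr, '|') AS pattern
            FROM regex_table
            GROUP BY id, code) r
        ON r.id = i.id
       AND r.code = i.code;
```

Because this is an INNER JOIN, items rows with no matching (id, code) pair in regex_table (ids 4 and 5 in the sample data) are dropped from the result.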
Update 2
I got this to work:
SELECT o.id,
o.code,
items.txt,
o.regexstr,
IF(o.regexstr IS NOT NULL, regexp_replace(items.txt, o.regexstr,
'<NA>'), items.txt) masked
FROM items
LEFT JOIN (SELECT i.id id,
i.code code,
group_concat(r.regexstr, '|') regexstr
FROM items i
left join (SELECT id,
regexstr,
regex_table.code
FROM regex_table) r
ON r.id = i.id
AND r.code = i.code
GROUP BY i.id,
i.code) o
ON items.id = o.id
AND items.code = o.code;
Output:
+----+--------+----------+----------+--------------+
| id | code | txt | regexstr | masked |
+----+--------+----------+----------+--------------+
| 5 | ZZ-500 | CC HH II | NULL | CC HH II |
| 2 | ZZ-200 | BB CC DD | BB|CC | <NA> <NA> DD |
| 4 | ZZ-400 | EE FF GG | NULL | EE FF GG |
| 3 | ZZ-300 | AA CC EE | DD | AA CC EE |
| 1 | ZZ-100 | AA BB CC | CC|AA | <NA> BB <NA> |
+----+--------+----------+----------+--------------+
But it looks rather complicated. Any ideas on how to make it more concise?
Answer 0 (score: 1)
You can use a CASE expression to put everything together:
SELECT
id,
CASE WHEN id = 1 THEN regexp_replace(txt, 'AA|CC', '<NA>')
WHEN id = 2 THEN regexp_replace(txt, 'BB|CC', '<NA>')
WHEN id = 3 THEN regexp_replace(txt, 'DD', '<NA>') END AS output,
code
FROM items
WHERE id IN (1, 2, 3);
Answer 1 (score: 0)
SELECT o.id,
o.code,
items.txt,
o.regexstr,
IF(o.regexstr IS NOT NULL, regexp_replace(items.txt, o.regexstr,
'<NA>'), items.txt) masked
FROM items
LEFT JOIN (SELECT i.id id,
i.code code,
group_concat(r.regexstr, '|') regexstr
FROM items i
left join (SELECT id,
regexstr,
regex_table.code
FROM regex_table) r
ON r.id = i.id
AND r.code = i.code
GROUP BY i.id,
i.code) o
ON items.id = o.id
AND items.code = o.code;
Output:
+----+--------+----------+----------+--------------+
| id | code | txt | regexstr | masked |
+----+--------+----------+----------+--------------+
| 5 | ZZ-500 | CC HH II | NULL | CC HH II |
| 2 | ZZ-200 | BB CC DD | BB|CC | <NA> <NA> DD |
| 4 | ZZ-400 | EE FF GG | NULL | EE FF GG |
| 3 | ZZ-300 | AA CC EE | DD | AA CC EE |
| 1 | ZZ-100 | AA BB CC | CC|AA | <NA> BB <NA> |
+----+--------+----------+----------+--------------+
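A possibly more concise variant, as a sketch only: since regex_table already carries id and code, it can be aggregated on its own and then joined to items once, rather than joining items to regex_table and then back to items. The CTE name patterns and the column alias pattern are made up for this example. It still uses one LEFT JOIN, so it does not meet the wish to avoid joins entirely.

```sql
-- Build one alternation pattern per (id, code) pair, e.g. 'AA|CC'.
WITH patterns AS (
    SELECT id,
           code,
           group_concat(regexstr, '|') AS pattern
    FROM regex_table
    GROUP BY id, code
)
-- LEFT JOIN keeps items that have no pattern; their txt is returned unchanged.
SELECT i.id,
       i.code,
       i.txt,
       p.pattern AS regexstr,
       IF(p.pattern IS NULL,
          i.txt,
          regexp_replace(i.txt, p.pattern, '<NA>')) AS masked
FROM items i
LEFT JOIN patterns p
       ON p.id = i.id
      AND p.code = i.code;
```

On the sample data this should return the same five rows as the query above, although the order of the alternatives inside regexstr is not guaranteed. Hive has no group_concat; there, concat_ws('|', collect_list(regexstr)) is the usual substitute for the aggregation step.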