I have two tables, tabl1:
+-------+--------+--------+----------+
| att1  | att2   | att3   | att4     |
+-------+--------+--------+----------+
| abcd  | ava012 | df012f | afsdaldf |
.......
and tabl2:
+-----+
| val |
+-----+
| 012 |
...
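Each value in tabl2 is a substring of one or more of the four columns of tabl1.
Both tables are large, with millions of records each.
I tried concatenating the tabl1 columns and searching the result, but the query never finishes.
Is there an efficient way to do this? Perhaps by converting the whole table to a txt file and searching in that?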
I have also been following this question.
Here are some examples of what I tried (all in Hive):
SELECT a.*, b.*
from tabl1 a, tabl2 b
where
instr (
    concat (cast (a.att1 as string), cast (a.att2 as string),
            cast (a.att3 as string), cast (a.att4 as string)),
    cast (b.val as string)) > 0
or
SELECT a.*, b.*
from tabl1 a, tabl2 b
where
concat (cast (a.att1 as string), cast (a.att2 as string),
        cast (a.att3 as string), cast (a.att4 as string))
like concat ('%', cast (b.val as string), '%')
I also tried some regex variants, but they all ran endlessly...
Answer 0 (score: 1)
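select *
from (select *
      from tabl1 t1
      lateral view explode(
          split(
              regexp_replace(
                  -- join the four columns, then turn every run of non-digits into one space
                  trim(regexp_replace(concat_ws(',',att1,att2,att3,att4),'\\D+',' ')),
                  -- drop a token when the same token appears again later in the string
                  '(?<=^| )(?<token>.*?) (?=.*(?<= )\\k<token>(?= |$))',
                  ''),
              ' ')) e as val  -- one output row per distinct digit token
     ) t1
join tabl2 t2
on t2.val = t1.val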
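
To see what the nested regexp_replace / split / explode calls in the query above produce, here is a small Python sketch that mimics the same steps on the sample row from the question. It is not Hive code, and the names digit_tokens, row and tabl2_vals are made up for illustration. It assumes, as the query itself does, that every value in tabl2 is a whole run of digits.

```python
import re

def digit_tokens(*columns):
    """Return each distinct run of digits found in the given columns, once."""
    joined = ",".join(columns)             # plays the role of concat_ws(',', att1, ..., att4)
    runs = re.findall(r"[0-9]+", joined)   # same tokens as replacing non-digit runs with ' ' and splitting
    # dict.fromkeys drops repeated tokens and keeps first-seen order; the Hive
    # regex keeps the last occurrence instead, but the set of tokens is the same.
    return list(dict.fromkeys(runs))

# Sample row from the question and a hypothetical in-memory stand-in for tabl2.
row = ("abcd", "ava012", "df012f", "afsdaldf")
tabl2_vals = {"012"}

tokens = digit_tokens(*row)
matches = [tok for tok in tokens if tok in tabl2_vals]  # equality lookup, like the join condition
print(tokens)   # ['012']
print(matches)  # ['012']
```

The point of the rewrite is that the join condition becomes a plain equality on val, so there is no need to run a substring search for every pair of rows from the two tables. The trade-off is that only whole digit runs match: a tabl2 value such as 12 would not be found inside ava012 with this approach.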