Efficient way to join large tables on a substring using Hive / Impala or other means

Time: 2017-06-14 07:37:26

Tags: string hadoop join hive impala

I have two tables.

tabl1

+-------+--------+--------+----------+
| att1  |  att2  | att3   | att4     |
+-------+--------+--------+----------+
|  abcd | ava012 | df012f | afsdaldf |
.......

tabl2

+----+
| val|
+----+
| 012|
...

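tabl2 contains substrings that occur in one or more of the 4 columns of tabl1. Both are large tables with millions of records. I tried concatenating the tabl1 columns and searching in the result, but the query never finishes. Is there an efficient way to do this? Perhaps by converting the whole table to a txt file and searching in that? I have also been following this question.

Here are some examples of my attempts (all in Hive):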

SELECT a.*, b.*
from tabl1 a, tabl2 b
where
instr (
concat (cast (a.att1 as string), cast (a.att2 as string),
        cast (a.att3 as string), cast (a.att4 as string)),
cast (b.val as string) ) > 0

SELECT a.*, b.*
from tabl1 a, tabl2 b
where
concat (cast (a.att1 as string), cast (a.att2 as string),
        cast (a.att3 as string), cast (a.att4 as string))
like concat ('%', cast (b.val as string), '%')

I also tried some regex variants, but the run time was endless...

1 answer:

Answer 0 (score: 1)

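-- Inner query: replace every run of non-digit characters with a space,
-- remove tokens that occur again later in the string, then explode the
-- remaining digit tokens into one row per token.
-- Outer query: a plain equi-join on the token instead of a cross join
-- with a substring test.
select  *
from   (select  *
        from    tabl1 t1
                lateral view explode(split(regexp_replace(trim(regexp_replace(concat_ws(',',att1,att2,att3,att4),'\\D+',' ')),'(?<=^| )(?<token>.*?) (?=.*(?<= )\\k<token>(?= |$))',''),' ')) e as val
        ) t1
        join    tabl2 t2
        on      t2.val = t1.val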
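
The query above is posted without explanation, so here is a minimal Python sketch of the logic it appears to implement. It is an illustration only, not part of the Hive solution. The helper names `digit_tokens` and `join_on_tokens`, the second sample row, and the extra lookup values `77` and `01` are all made up for the example. The sketch also shows an assumption the query relies on: a value from tabl2 only matches when it equals a complete run of digits in tabl1, not an arbitrary substring.

```python
import re


def digit_tokens(*columns):
    """Return the distinct maximal runs of ASCII digits in the given columns.

    Mirrors the Hive expression: join the columns with ',', replace every
    non-digit run with a separator, split, and drop duplicate tokens.
    """
    joined = ",".join(str(c) for c in columns)
    return set(re.findall(r"[0-9]+", joined))


def join_on_tokens(tabl1_rows, tabl2_vals):
    """Pair each tabl1 row with every tabl2 value equal to one of its tokens.

    This is an equality lookup per token, which is what lets the engine use
    a regular equi-join instead of comparing every row pair.
    """
    wanted = set(tabl2_vals)
    matches = []
    for row in tabl1_rows:
        for token in sorted(digit_tokens(*row) & wanted):
            matches.append((row, token))
    return matches


# Hypothetical sample data; only the first row comes from the question.
tabl1 = [
    ("abcd", "ava012", "df012f", "afsdaldf"),
    ("x9", "y", "z77", "w"),
]
tabl2 = ["012", "77", "01"]

result = join_on_tokens(tabl1, tabl2)
for row, val in result:
    print(row, val)
```

In this sketch `012` and `77` each find their row, while `01` finds nothing even though it is a substring of `ava012`. If the values in tabl2 can be partial digit runs or can contain non-digit characters, the token-based join would miss matches that the `instr` or `like` versions would find.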