Question

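I am running the code below in Hive, trying to join two tables on the field "word". It is taking forever, and I would like to know what I can do to speed it up. In one table the "word" field contains a mix of uppercase and lowercase letters, while in the other table it is all uppercase.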

Code:

set hive.exec.compress.output=false;
set hive.mapred.mode=nonstrict;


DROP TABLE IF EXISTS newTable;
CREATE TABLE newTable AS
SELECT
      bh.inkey,
      bh.prid,
      bh.hname,
      bh.ptype,
      bh.band,
      bh.sles,
      bh.num_ducts,
      urg.R_NM

from table1 AS bh
INNER JOIN table2 AS urg
    ON LOWER(bh.word)=LOWER(urg.word);

Answer 1

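I would create a temporary table based on table1 with the word converted to uppercase. Then join this table to table2 without using any string functions, since table2.word is already all uppercase. Besides table1 and table2 themselves, the LOWER string function in the join condition also slows the query down.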

DROP TABLE IF EXISTS tmpTable;
CREATE TABLE tmpTable AS
SELECT UPPER(bh.word) AS word,  -- normalize to uppercase so it matches table2.word
      bh.inkey,
      bh.prid,
      bh.hname,
      bh.ptype,
      bh.band,
      bh.sles,
      bh.num_ducts
from table1 AS bh;

DROP TABLE IF EXISTS newTable;
CREATE TABLE newTable AS
SELECT
      tmp.inkey,
      tmp.prid,
      tmp.hname,
      tmp.ptype,
      tmp.band,
      tmp.sles,
      tmp.num_ducts,
      urg.R_NM
from tmpTable AS tmp
INNER JOIN table2 AS urg
    ON tmp.word=urg.word;
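
This approach relies on table2.word really being all uppercase, as stated in the question. A minimal sketch of a sanity check, assuming only the table and column names from the question (the alias non_upper_rows is made up for illustration):

```sql
-- Count rows in table2 whose word is not already uppercase.
-- String comparison in Hive is case-sensitive, so any row that
-- differs from its uppercased form is counted here.
-- Rows where word is NULL are not counted and never match in an inner join.
SELECT COUNT(*) AS non_upper_rows
FROM table2
WHERE word <> UPPER(word);
```

If the count is not zero, the plain equality join would silently drop those rows, and table2 would need the same UPPER normalization as table1.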

Answer 2

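I once ran into a problem where my join condition was too complex and Hive ended up using only 1 reducer to compute it. Because of the LOWER conversion, the same thing may be happening here. Can you check how many reducers are being used?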
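
One way to look into this is sketched below, assuming the query runs on the MapReduce execution engine. EXPLAIN prints the plan without running the query, and the console output of the actual job normally reports how many reduce tasks were launched. The numeric values are illustrative only, not recommendations.

```sql
-- Print the query plan without executing it; the join stage and
-- its key expressions show up in the output.
EXPLAIN
SELECT bh.inkey, urg.R_NM
FROM table1 AS bh
INNER JOIN table2 AS urg
    ON LOWER(bh.word) = LOWER(urg.word);

-- If only one reducer is used, either lower the amount of input
-- data assigned to each reducer so that Hive estimates more of them...
SET hive.exec.reducers.bytes.per.reducer=67108864;  -- 64 MB, illustrative value

-- ...or request a fixed number of reducers for the job.
SET mapreduce.job.reduces=32;  -- illustrative value
```

These settings only change how the work is distributed. If most rows share a few join keys, a single reducer can still end up with most of the data.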

You can use CTEs as a preparatory step that applies LOWER, which leaves a simple join condition:

CREATE TABLE newTable AS
with 
    table1_lower as (SELECT *, lower(word) as lword from table1),
    table2_lower as (SELECT *, lower(word) as lword from table2)
select
    bh.inkey,
    bh.prid,
    bh.hname,
    bh.ptype,
    bh.band,
    bh.sles,
    bh.num_ducts,
    urg.R_NM
from table1_lower AS bh
INNER JOIN table2_lower AS urg
    ON bh.lword=urg.lword;

Speeding up a Hive join on string values
