Select distinct values from Hive

Time: 2016-03-31 14:15:54

Tags: hadoop mapreduce hive distinct

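I am trying to select the distinct values from every column of a given table. My query performs poorly because it spawns many MapReduce jobs, and I am looking for a better solution.

My table contains the following values: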

last_30: a  
last_90: a, b, a    
last_180: b, c 

The desired output looks like this:

last_30#a  
last_90#a  
last_90#b   
last_180#b  
last_180#c

The following query gives me the desired output, but it is not very efficient because it scans the table several times:

    SELECT distinct concat('last_30#', exploded_last_30.key)
    FROM table
    LATERAL VIEW explode(last_30) exploded_last_30 AS key
    UNION ALL
    SELECT distinct concat('last_90#', exploded_last_90.key)
    FROM table
    LATERAL VIEW explode(last_90) exploded_last_90 AS key
    UNION ALL
    SELECT distinct concat('last_180#', exploded_last_180.key)
    FROM table
    LATERAL VIEW explode(last_180) exploded_last_180 AS key

Can you think of a faster way to produce the desired output?

Regards

:::UPDATE:::

Using your solution, I came up with the following query:

    select distinct *
    from (
        select explode( map_keys( map(
                                      concat('firstname#',a.exploded_firstname), '1', 
                                      concat('lastname#', a.exploded_lastname), '1', 
                                      concat('gender#', a.exploded_gender), '1',
                                      concat('last_30#', a.exploded_last_30), '1',
                                      concat('last_90#', a.exploded_last_90), '1'   
                                     ) 
                                )  
                      )
        from (
              select
                exploded_firstname.key as exploded_firstname, 
                exploded_lastname.key as exploded_lastname, 
                exploded_gender.key as exploded_gender,
                exploded_last_30.key as exploded_last_30,
                exploded_last_90.key as exploded_last_90
              from table
              LATERAL VIEW explode(firstname) exploded_firstname AS key, value
              LATERAL VIEW explode(lastname) exploded_lastname AS key, value
              LATERAL VIEW explode(gender) exploded_gender AS key, value
              LATERAL VIEW explode(last_30) exploded_last_30 AS key
              LATERAL VIEW explode(last_90) exploded_last_90 AS key
          ) as a 
      ) as b;

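I am still facing two problems:

  • I did not describe the problem completely: the sample data I provided contains only primitive data types, whereas the actual table also contains maps and arrays. As soon as the query hits a single array or map that contains only 'NULL' values, it returns no output at all.
  • Secondly, adding more fields to this query holds up the compiler when it creates the MapReduce job that executes the request. Here are the MapReduce timings for 14 and 15 fields, respectively: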

    Total MapReduce CPU Time Spent: 26 seconds 60 msec
    OK
    Time taken: 142.896 seconds
    
    Total MapReduce CPU Time Spent: 29 seconds 310 msec
    OK
    Time taken: 257.807 seconds
    

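As you can see, the total MapReduce CPU time grows roughly linearly, while the total elapsed time increases sharply. Do you have any suggestions for these two problems?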

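1 answer:

Answer 0: (score: 0)

UNION forces the table to be read multiple times. To avoid this, you can use a map to deduplicate the values within a single row and then explode it (which pivots your data). For the deduplication, use the column values as map keys and a constant as the map value.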
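To make the deduplication step concrete, here is a minimal sketch that uses literal strings in place of column values. It assumes a Hive version that accepts SELECT without a FROM clause, and it relies on the behavior described above, namely that a repeated map key is stored only once:

    -- 'a' is passed twice on purpose; the resulting map holds each key once,
    -- so map_keys() should return an array containing 'a' and 'b' one time each.
    SELECT map_keys(map('a', '1', 'b', '1', 'a', '1')) AS distinct_keys;

The constant '1' is only a placeholder value; the keys carry the data.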

If there are no duplicate values across rows, this is a single scan operation:

  select explode( map_keys( map(concat(customer_id, '#', customer_fname), '1'
             , concat(customer_id, '#', customer_lname), '1'
             , concat(customer_id, '#', customer_email), '1'
             , concat(customer_id, '#', customer_street), '1'
             , concat(customer_id, '#', customer_city), '1'
             , concat(customer_id, '#', customer_state), '1'
             , concat(customer_id, '#', customer_zipcode), '1'
      ) ) ) from customers 

If different rows produce duplicates, add distinct, but that forces a reduce phase and will be slower.
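As a sketch of the distinct variant, the exploded keys can be given an alias in a subquery and deduplicated in the outer query. Only two of the columns are shown for brevity, and the aliases kv and t are names chosen here for illustration:

    -- The inner query deduplicates within each row (map side);
    -- the outer DISTINCT removes duplicates across rows (needs a reduce phase).
    SELECT DISTINCT kv
    FROM (
        SELECT explode(map_keys(map(
            concat(customer_id, '#', customer_fname), '1',
            concat(customer_id, '#', customer_lname), '1'
        ))) AS kv
        FROM customers
    ) t;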

Also, maps can be used to pivot data :D
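For the array columns from the question, one possible way to pivot with a map is sketched below. It assumes that last_30, last_90 and last_180 are all array<string> columns, and my_table is a hypothetical table name. The map turns each row into one (column name, array) pair per column, and a second lateral view flattens each array separately:

    -- my_table is a placeholder name; the three columns are assumed to be array<string>.
    SELECT DISTINCT concat(col_name, '#', val) AS tagged_value
    FROM my_table
    -- Pivot: one row per (column name, array) pair of the current source row.
    LATERAL VIEW explode(map(
        'last_30',  last_30,
        'last_90',  last_90,
        'last_180', last_180
    )) cols AS col_name, col_values
    -- Flatten each array on its own; an empty or NULL array only drops
    -- the rows for that one column, not the whole source row.
    LATERAL VIEW explode(col_values) vals AS val;

This differs from chaining one LATERAL VIEW per column, which combines the exploded columns as a cross product, so a single empty or NULL array removes the entire row from the result. If map columns need to be included as well, wrapping them in map_keys() to obtain arrays might work, provided their keys are strings too. Where a chain of lateral views has to be kept, LATERAL VIEW OUTER retains the source row and yields NULL for an empty array.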