Question

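I am trying to select the distinct values from each column of a given table. My query performs poorly because it spawns many MapReduce jobs, and I am looking for a better solution.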

My table contains the following values:

last_30: a  
last_90: a, b, a    
last_180: b, c

The desired output is as follows:

last_30#a  
last_90#a  
last_90#b   
last_180#b  
last_180#c

The following query gives me the desired output, but it is not very efficient because it scans the table several times:

    -- one full table scan per branch of the UNION ALL
    SELECT distinct concat('last_30#', exploded_last_30.key)
    FROM table
    LATERAL VIEW explode(last_30) exploded_last_30 AS key
    UNION ALL
    SELECT distinct concat('last_90#', exploded_last_90.key)
    FROM table
    LATERAL VIEW explode(last_90) exploded_last_90 AS key
    UNION ALL
    SELECT distinct concat('last_180#', exploded_last_180.key)
    FROM table
    LATERAL VIEW explode(last_180) exploded_last_180 AS key

Can you think of a faster way to produce the desired output?

Regards

:::UPDATE:::

Using your solution, I came up with the following query:

    select distinct *
    from (
        select explode( map_keys( map(
                                      concat('firstname#',a.exploded_firstname), '1', 
                                      concat('lastname#', a.exploded_lastname), '1', 
                                      concat('gender#', a.exploded_gender), '1',
                                      concat('last_30#', a.exploded_last_30), '1',
                                      concat('last_90#', a.exploded_last_90), '1'   
                                     ) 
                                )  
                      )
        from (
              select
                exploded_firstname.key as exploded_firstname, 
                exploded_lastname.key as exploded_lastname, 
                exploded_gender.key as exploded_gender,
                exploded_last_30.key as exploded_last_30,
                exploded_last_90.key as exploded_last_90
              from table
              LATERAL VIEW explode(firstname) exploded_firstname AS key, value
              LATERAL VIEW explode(lastname) exploded_lastname AS key, value
              LATERAL VIEW explode(gender) exploded_gender AS key, value
              LATERAL VIEW explode(last_30) exploded_last_30 AS key
              LATERAL VIEW explode(last_90) exploded_last_90 AS key
          ) as a 
      ) as b;

I am still facing two problems:

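First, I did not describe the problem completely: the sample data I provided includes only primitive data types. The real table also contains maps and arrays. As soon as a single array or map contains a 'NULL' value, no output is returned at all.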

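Second, adding more fields to this query holds up the compiler when it creates the MapReduce jobs that execute the request. Here are the MapReduce times for 14 and 15 fields, respectively: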

Total MapReduce CPU Time Spent: 26 seconds 60 msec
OK
Time taken: 142.896 seconds

Total MapReduce CPU Time Spent: 29 seconds 310 msec
OK
Time taken: 257.807 seconds

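As you can see, the total MapReduce CPU time grows roughly linearly, whereas the total elapsed time increases dramatically. Do you have any suggestions for these two problems?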

Answer 1

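UNION forces the table to be read multiple times. To avoid this, you can use a map to deduplicate the values within a single row and then explode it (this also pivots your data). For the deduplication, use the column values as map keys and a constant as the map value.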

If there are no duplicate values across rows, this will be a single scan operation:

  select explode( map_keys( map(concat(customer_id, '#', customer_fname), '1'
             , concat(customer_id, '#', customer_lname), '1'
             , concat(customer_id, '#', customer_email), '1'
             , concat(customer_id, '#', customer_street), '1'
             , concat(customer_id, '#', customer_city), '1'
             , concat(customer_id, '#', customer_state), '1'
             , concat(customer_id, '#', customer_zipcode), '1'
      ) ) ) from customers

If different rows produce duplicates, add DISTINCT, but this forces a reduce stage and will be slower.
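
As a rough sketch of the DISTINCT variant, reusing the customers table from the example above: the 'city#', 'state#' and 'zipcode#' prefixes and the aliases kv and t are hypothetical labels chosen for illustration, not something from the original query. The inner query still deduplicates within each row through the map keys; the outer DISTINCT removes duplicates that come from different rows, at the cost of a reduce stage.

    -- Sketch only: assumes the customers table used above.
    -- The 'city#' / 'state#' / 'zipcode#' prefixes, kv and t are hypothetical names.
    SELECT DISTINCT kv
    FROM (
        -- map keys collapse duplicates inside one row; explode() turns the keys into rows
        SELECT explode(map_keys(map(
                   concat('city#',    customer_city),    '1',
                   concat('state#',   customer_state),   '1',
                   concat('zipcode#', customer_zipcode), '1'
               ))) AS kv
        FROM customers
    ) t;

The table is still read only once; the extra cost compared with the map-only version is the shuffle needed to compare values across rows.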

Also, maps can be used to pivot data :D
