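I am trying to select the distinct values of each column in a given table. My query performs poorly because it spawns many MapReduce jobs, and I am looking for a better solution.
My table contains the following values: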
last_30: a
last_90: a, b, a
last_180: b, c
The desired output is:
last_30#a
last_90#a
last_90#b
last_180#b
last_180#c
The following query gives me the desired output, but it is not very efficient because it scans the table several times:
SELECT distinct concat('last_30#', exploded_last_30.key)
FROM table
LATERAL VIEW explode(last_30) exploded_last_30 AS key
UNION ALL
SELECT distinct concat('last_90#', exploded_last_90.key)
FROM table
LATERAL VIEW explode(last_90) exploded_last_90 AS key
UNION ALL
SELECT distinct concat('last_180#', exploded_last_180.key)
FROM table
LATERAL VIEW explode(last_180) exploded_last_180 AS key
Can you think of a faster way to produce the desired output?
Regards
:::UPDATE:::
Using your solution, I came up with the following query:
select distinct *
from (
    select explode(map_keys(map(
        concat('firstname#', a.exploded_firstname), '1',
        concat('lastname#', a.exploded_lastname), '1',
        concat('gender#', a.exploded_gender), '1',
        concat('last_30#', a.exploded_last_30), '1',
        concat('last_90#', a.exploded_last_90), '1'
    )))
    from (
        select
            exploded_firstname.key as exploded_firstname,
            exploded_lastname.key as exploded_lastname,
            exploded_gender.key as exploded_gender,
            exploded_last_30.key as exploded_last_30,
            exploded_last_90.key as exploded_last_90
        from table
        LATERAL VIEW explode(firstname) exploded_firstname AS key, value
        LATERAL VIEW explode(lastname) exploded_lastname AS key, value
        LATERAL VIEW explode(gender) exploded_gender AS key, value
        LATERAL VIEW explode(last_30) exploded_last_30 AS key
        LATERAL VIEW explode(last_90) exploded_last_90 AS key
    ) as a
) as b;
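Two problems remain:
Second, adding more fields to this query holds up the compiler before it creates the MapReduce job that executes the request. The timings for 14 and 15 fields, respectively, are: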
Total MapReduce CPU Time Spent: 26 seconds 60 msec
OK
Time taken: 142.896 seconds
Total MapReduce CPU Time Spent: 29 seconds 310 msec
OK
Time taken: 257.807 seconds
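As you can see, the total MapReduce CPU time grows roughly linearly, while the total elapsed time rises sharply, from about 143 to about 258 seconds for one extra field. Do you have any suggestions for these two problems?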
Answer 0 (score: 0)
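UNION forces the table to be read multiple times. To avoid that, you can use a map to deduplicate the values within a single row and then explode it (which also pivots your data). For the deduplication, use the column values as map keys and a constant as the map value.
If there are no duplicate values across rows, this is a single scan operation: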
select explode( map_keys( map(concat(customer_id, '#', customer_fname), '1'
, concat(customer_id, '#', customer_lname), '1'
, concat(customer_id, '#', customer_email), '1'
, concat(customer_id, '#', customer_street), '1'
, concat(customer_id, '#', customer_city), '1'
, concat(customer_id, '#', customer_state), '1'
, concat(customer_id, '#', customer_zipcode), '1'
) ) ) from customers
If duplicates come from different rows, add distinct, but that forces a reduce phase and will be slower.
A map can also be used to pivot the data :D
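As a rough sketch of that pivoting idea applied to the array columns from the question, something like the following might work. It assumes last_30, last_90 and last_180 are array<string> columns. The table name my_table and the aliases pivoted, col_name, col_values, vals and val are placeholders invented for this example. I have not run it against the asker's data.

```sql
-- Sketch only: my_table is a placeholder table name; last_30, last_90 and
-- last_180 are assumed to be array<string> columns, as in the question.
SELECT DISTINCT concat(pivoted.col_name, '#', vals.val) AS tagged_value
FROM my_table
-- Pivot: one row per (column name, array) pair, produced from a single scan.
LATERAL VIEW explode(map(
    'last_30',  last_30,
    'last_90',  last_90,
    'last_180', last_180
)) pivoted AS col_name, col_values
-- Flatten each array into one row per element.
LATERAL VIEW explode(pivoted.col_values) vals AS val;
```

For the sample data in the question this should produce the five desired rows, in no guaranteed order. Because each array is exploded on its own rather than chained next to the others, the row count grows with the total number of array elements instead of their product. The DISTINCT still forces a reduce phase, as noted above.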