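I have some records that contain bags as fields, and I am trying to merge the bags of records that share the same key fields (I am discarding some of the other fields).
The data looks like this: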
u08 u08an id {(web)} 0 0 {(GB),(US)} an
u08 u08an id {(ars)} 0 0 {(GB),(RU)} an
u09 u09an id {(web)} 0 0 {(GB)} an
u09 u09an id {(web)} 0 0 {(US)} an
u10 u10an id {(web)} 0 0 {(GB)} an
u10 u10an id {(ars)} 0 0 {(GB)} an
u11 u11an id {(web)} 0 0 {(GB)} an
u11 u11an id {(web)} 0 0 {(GB)} an
I would like to get (after dropping the irrelevant fields and reordering the rest):
u08 u08an an {(GB),(US),(RU)}
u09 u09an an {(GB),(US)}
u10 u10an an {(GB)}
u11 u11an an {(GB)}
The input is loaded with the following schema:
user_identities = LOAD '$partner_user_identities_location' AS (
user_id: chararray,
partner_user_id: chararray,
partner_user_id_type: chararray,
sync_types: bag{tuple(chararray)},
synced_first_timestamp: double,
synced_last_timestamp: double,
country_codes: bag{tuple(chararray)},
partner_id: chararray
);
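If I merge both sync_types and country_codes, everything works as expected. But if I generate only country_codes, the records are not sorted before DISTINCT is applied, so non-adjacent duplicates remain in the output.
Running the following snippet (in local mode):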
user_identities = GROUP user_identities BY (user_id, partner_user_id, partner_id);
user_identities = FOREACH user_identities {
    sync_types = FOREACH user_identities GENERATE flatten(sync_types);
    sync_types = DISTINCT sync_types;
    country_codes = FOREACH user_identities GENERATE flatten(country_codes);
    country_codes = DISTINCT country_codes;
    GENERATE flatten(group) AS (user_id, partner_user_id, partner_id), sync_types, country_codes;
}
DUMP user_identities;
Output:
(u08,u08an,an,{(ars),(web)},{(GB),(RU),(US)})
(u09,u09an,an,{(web)},{(GB),(US)})
(u10,u10an,an,{(ars),(web)},{(GB)})
(u11,u11an,an,{(web)},{(GB)})
However, if I change the inner GENERATE statement to
GENERATE flatten(group) AS (user_id, partner_user_id, partner_id), country_codes;
omitting sync_types, I get the following output (note the duplicate (GB) in the first record):
(u08,u08an,an,{(GB),(RU),(GB),(US)})
(u09,u09an,an,{(GB),(US)})
(u10,u10an,an,{(GB)})
(u11,u11an,an,{(GB)})
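Since I do not use sync_types anywhere else in my script, I see no reason to generate it other than as a workaround for this problem.
Is this a known bug in Pig (or in Pig's local mode)? Or am I merging the bags incorrectly?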
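Answer 0 (score: 0)
I am not sure whether you are still looking for an answer. I used DISTINCT in Pig 0.12.x and it seems to work as expected, but I modified the query as follows to get the desired result: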
user_identities = LOAD 'userPig.txt' AS (
user_id: chararray,
partner_user_id: chararray,
partner_user_id_type: chararray,
sync_types: bag{tuple(chararray)},
synced_first_timestamp: double,
synced_last_timestamp: double,
country_codes: bag{tuple(chararray)},
partner_id: chararray);
-- Keep only the grouping keys and the bag to merge.
user_identities_filter = FOREACH user_identities GENERATE user_id,
    partner_user_id, partner_id, country_codes;
user_identities_group = GROUP user_identities_filter BY (user_id,
    partner_user_id, partner_id);
user_stats = FOREACH user_identities_group {
    uniq_country_codes = FOREACH user_identities_filter GENERATE
        FLATTEN(country_codes);
    uniq_country_codes = DISTINCT uniq_country_codes;
    GENERATE FLATTEN(group), uniq_country_codes AS uniq_country_codes;
}
DUMP user_stats;
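If the nested DISTINCT still leaves duplicates on your Pig version, one possible alternative is to avoid the nested block altogether: flatten the bags first, deduplicate at the top level, and only then group. The following is a minimal sketch under that assumption, not a confirmed fix for the behaviour described in the question. The relation names pairs, unique_pairs, grouped and merged are made up for illustration; the load path and schema are the ones used earlier in this answer.

```pig
-- Load path and schema as used earlier in this answer.
user_identities = LOAD 'userPig.txt' AS (
    user_id: chararray,
    partner_user_id: chararray,
    partner_user_id_type: chararray,
    sync_types: bag{tuple(chararray)},
    synced_first_timestamp: double,
    synced_last_timestamp: double,
    country_codes: bag{tuple(chararray)},
    partner_id: chararray);

-- One row per (key fields, country code); the unused fields are dropped here.
pairs = FOREACH user_identities GENERATE
    user_id, partner_user_id, partner_id,
    FLATTEN(country_codes) AS country_code;

-- Top-level DISTINCT compares whole tuples, so repeated
-- (key, country code) rows collapse to one.
unique_pairs = DISTINCT pairs;

-- Rebuild one bag of country codes per key.
grouped = GROUP unique_pairs BY (user_id, partner_user_id, partner_id);
merged = FOREACH grouped GENERATE
    FLATTEN(group) AS (user_id, partner_user_id, partner_id),
    unique_pairs.country_code AS country_codes;

DUMP merged;
```

Two things to keep in mind with this variant: FLATTEN produces no rows for an empty bag, so a record whose country_codes bag is empty disappears from the result, and the order of the codes inside each output bag is not guaranteed.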