I have a large (21GB) tab-separated data frame of the form
DOCID_1 TERMID_1 TITLE_1 YEAR_1 AUTHOR_1
DOCID_1 TERMID_2 TITLE_1 YEAR_1 AUTHOR_1
...
DOCID_n TERMID_n TITLE_n YEAR_n AUTHOR_n
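That is, the (DOCID, TERMID) pair always uniquely identifies a row. What I need is a data frame in which DOCID alone uniquely identifies a row and the TERMIDs are collapsed into a comma-separated chararray list. For example,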
DOCID_1 TERMID_11, TERMID_12, ..., TERMID_n TITLE_1 YEAR_1 AUTHOR_1
...
DOCID_n TERMID_n1, TERMID_n2, ..., TERMID_n TITLE_n YEAR_n AUTHOR_n
Can anyone think of a good way to do this in Pig?
Answer 0 (score: 1)
-- The input is tab-separated, as described in the question.
SEMINORMALIZED = LOAD 'so.txt' USING PigStorage('\t') AS (
doc_id:chararray
,term_id:chararray
,title:chararray
,year:chararray
,author:chararray
);
KEYS = FOREACH SEMINORMALIZED GENERATE
doc_id
,term_id
;
ATTRIBUTES = FOREACH SEMINORMALIZED GENERATE
doc_id
,title
,year
,author
;
ATTRIBUTES = DISTINCT ATTRIBUTES;
GROUPED = GROUP KEYS BY doc_id;
ZNF = FOREACH GROUPED GENERATE
group AS doc_id
,KEYS.term_id AS term_ids
;
DENORMALIZED = JOIN ZNF BY doc_id, ATTRIBUTES BY doc_id;
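
The script above leaves term_ids as a bag of tuples rather than the comma-separated chararray the question asks for. A possible variation, sketched here under some assumptions, groups on all per-document columns at once (so no join is needed) and flattens the bag with the built-in BagToString. It assumes title, year and author are identical for every row of a given DOCID, that the Pig version in use ships BagToString (it was added in Pig 0.11, as far as I know), and the relation names and the output path 'denormalized' are made up for illustration.

```pig
-- Assumption: tab-separated input, same file name as in the answer above.
ROWS = LOAD 'so.txt' USING PigStorage('\t') AS (
     doc_id:chararray
    ,term_id:chararray
    ,title:chararray
    ,year:chararray
    ,author:chararray
);

-- Group on every column that is constant per document, so each group
-- corresponds to exactly one output row and no join is required.
BY_DOC = GROUP ROWS BY (doc_id, title, year, author);

-- BagToString concatenates the values in the bag using the given delimiter,
-- turning the bag of term_id tuples into a single chararray.
RESULT = FOREACH BY_DOC GENERATE
     group.doc_id AS doc_id
    ,BagToString(ROWS.term_id, ', ') AS term_ids
    ,group.title AS title
    ,group.year AS year
    ,group.author AS author
;

-- Hypothetical output location.
STORE RESULT INTO 'denormalized' USING PigStorage('\t');
```

The order of term IDs inside each list is not guaranteed unless the bag is sorted first, for example with a nested ORDER inside the FOREACH.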