Removing duplicate pairs in Pig

Time: 2016-12-01 14:27:22

Tags: apache-pig

I am working with the following sample:

Update

OBR|1|METABOLIC PANEL
OBX|1|Glucose
OBX|2|BUN
OBX|3|CREATININE
OBR|2|RFLX TO VERIFICATION
OBX|1|EGFR
OBX|2|SODIUM
OBR|3|AMBIGUOUS DEFAULT
OBX|1|POTASSIUM

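In this sample, treat each OBR as a test. Each OBR is followed by OBX rows, which are the results for that OBR. Every OBR carries an id (for example 1, 2 and 3), and the OBX numbering under a given OBR restarts at 1. What I want is this: whenever I encounter an OBR, I create a unique ID and attach it to that OBR and to all the OBX rows that follow it, until the next OBR (for example the one with id 2), where I do the same thing again. My expected output is below.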

Expected result:

OBR|1|METABOLIC PANEL|OBR_filename_1
OBX|1|Glucose|OBR_filename_1
OBX|2|BUN|OBR_filename_1
OBX|3|CREATININE|OBR_filename_1
OBR|2|RFLX TO VERIFICATION|OBR_filename_2
OBX|1|EGFR|OBR_filename_2
OBX|2|SODIUM|OBR_filename_2
OBR|3|AMBIGUOUS DEFAULT|OBR_filename_3
OBX|1|POTASSIUM|OBR_filename_3

2 answers:

Answer 0: (score: 1)

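Use DISTINCT. Suppose your relation A contains duplicate records. The following statement removes the duplicates and stores the unique records in relation B: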

B = DISTINCT A;
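
For context, here is a minimal sketch of how that statement might sit in a full script. The file name 'hl_file' and the field names are borrowed from the other answer and are assumptions, not something this answer specifies.

```pig
-- Assumed input: a pipe-delimited file such as the sample in the question.
A = LOAD 'hl_file' USING PigStorage('|') AS (id:chararray, num:int, reason:chararray);

-- DISTINCT compares whole tuples, so only rows that match in every field collapse into one.
B = DISTINCT A;

DUMP B;
```

Because DISTINCT compares entire tuples, it would leave the nine sample rows in the question unchanged (no two of them are identical in every field), and it does not attach an ID to any row.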

Answer 1: (score: 1)

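I gave this a try; it looks like an HL file. You can use Stitch, Over and Lead to come up with something like the script below. There may be better solutions from a performance point of view, but I think this should work. Please let me know how it goes.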

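-- The Piggybank jar must be registered first; this path is an assumption, adjust it for your installation.
REGISTER piggybank.jar;

DEFINE Over org.apache.pig.piggybank.evaluation.Over('long');
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
-- Not referenced below: Over receives 'lead' as a string argument.
DEFINE lead org.apache.pig.piggybank.evaluation.Lead;

-- 'in' is a reserved word in recent Pig versions, so the relation is named raw here.
raw = LOAD 'hl_file' USING PigStorage('|') AS (id:chararray, num:int, reason:chararray);
-- RANK with no BY clause prepends a sequential row number to each tuple.
temp = RANK raw;
ranked = FOREACH temp GENERATE $0 AS row_no, $1 AS id:chararray, $2 AS orig_id:int, $3 AS reason:chararray;

-- For every OBR row, look up the row number of the next OBR (9999 when there is none).
OBR_data = FILTER ranked BY id == 'OBR';
next_row_num_OBR = FOREACH (GROUP OBR_data BY id) {
    sorted = ORDER OBR_data BY row_no;
    stitched = Stitch(sorted, Over(sorted.row_no, 'lead', 0, 1, 1, (long)9999));
    GENERATE FLATTEN(group) AS (id:chararray),
             FLATTEN(stitched.(row_no, orig_id, reason, result)) AS (row_no:long, orig_id:int, reason:chararray, next_row_no:long);
}

-- Keep each OBX row whose row number lies between its OBR and the next OBR.
OBX_data = FILTER ranked BY id == 'OBX';
Crossed = CROSS next_row_num_OBR, OBX_data;
result = FILTER Crossed BY (OBX_data::row_no > next_row_num_OBR::row_no AND OBX_data::row_no < next_row_num_OBR::next_row_no);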

This should produce something like:

(OBR,5,2,RFLX TO VERIFICATION,8,7,OBX,2,SODIUM)

(OBR,1,1,METABOLIC PANEL,5,2,OBX,1,Glucose)

(OBR,5,2,RFLX TO VERIFICATION,8,6,OBX,1,EGFR)

(OBR,8,3,AMBIGUOUS DEFAULT,9999,9,OBX,1,POTASSIUM)

(OBR,1,1,METABOLIC PANEL,5,3,OBX,2,BUN)

(OBR,1,1,METABOLIC PANEL,5,4,OBX,3,CREATININE)

It simply attaches the OBR record to each corresponding OBX row, rather than a filename or a constant.
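
To get from that relation to the layout the question asks for, one possible follow-up is sketched below. It is untested and assumes the relations `result` and `OBR_data` from the script above; the prefix 'OBR_filename_' is copied literally from the expected output, and the aliases `obx_tagged`, `obr_tagged`, `combined`, `ordered`, `tagged` and the output path 'tagged_output' are made up for illustration.

```pig
-- Tag each matched OBX row with the id of the OBR it belongs to.
obx_tagged = FOREACH result GENERATE
    OBX_data::row_no AS row_no,
    OBX_data::id AS id,
    OBX_data::orig_id AS orig_id,
    OBX_data::reason AS reason,
    CONCAT('OBR_filename_', (chararray)next_row_num_OBR::orig_id) AS tag;

-- Tag each OBR row with its own id.
obr_tagged = FOREACH OBR_data GENERATE
    row_no, id, orig_id, reason,
    CONCAT('OBR_filename_', (chararray)orig_id) AS tag;

-- Merge both sets and restore the original file order via the row number.
combined = UNION ONSCHEMA obr_tagged, obx_tagged;
ordered = ORDER combined BY row_no;
tagged = FOREACH ordered GENERATE id, orig_id, reason, tag;

STORE tagged INTO 'tagged_output' USING PigStorage('|');
```

The row number is carried through only so that the OBR and OBX rows can be interleaved in their original order again; it is dropped in the final projection.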