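Incorrect values picked up after multiple groups and joins in PIG

Time: 2013-11-10 06:07:56

Tags: hadoop apache-pig

I am trying to perform some data manipulation and joins using a PIG script and have run into a problem. Below are some parts of the script I am using. The full script uses a lot of joins and data manipulation, so I am only showing the part of it that concerns one specific clm_ofcnum.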

est = load 'test.txt' using PigStorage('\t') as (clm_ofc:int,clm_nbr:int,fea_nbr:int,item_nbr:int,cause:chararray,resrv_amt:float,suit:chararray,txn_code:chararray,client_type:int,os_expen:float,os_loss:float,pd_expen:float,pd_loss:float,salv_rcvr:float,subro_rcvr:float,ded_amt:chararray,expen_stat:chararray,loss_stat:chararray,avg_loss_ind:chararray,avg_expen_ind:chararray,loss_covg_cd:chararray,pip_calc_ind:chararray,unit:chararray,maj_per:chararray,pip_med_pd:float,pip_wage_pd:float,pip_oth_pd:float,expen_rcvr:float,oth_loss_rcvr:float,pms_rpt_ofc:chararray,pay_inst_cd:chararray,benefit_st:chararray,body_part:chararray,maj_per_seq:chararray,class_cd:chararray,cov_eff_dt:chararray,cov_exp_dt:chararray,iso_st_abbrev:int,loss_typ_id:int,aia_item_id1:int,open_dt:chararray,close_dt:chararray,reopen_dt:chararray,css_db_cd:chararray,staff_cd:chararray,aia_item_id3:int,aia_item_id5:int,aia_item_id7:int,covg_part_cd:chararray,fea_alpha_cd:chararray,subln_cd:chararray,cvg_ver_ind:chararray,typ_bur_id:chararray,cause_id:int,changed_user:chararray,clm_ofcnum:chararray,src:chararray,etl_dt:chararray);

est3 = foreach est generate clm_ofcnum,fea_nbr,loss_stat;
est4 = group est3 by clm_ofcnum;
est5 = foreach est4 {
    clm_ofcnum = est3.clm_ofcnum;
    sorted = order est3 by fea_nbr desc;
    top = limit sorted 1;
    generate flatten(top);
};

est8 = group est by clm_ofcnum;
est9 = foreach est8 generate
    flatten($1.$55) as clm_ofcnum,
    MIN($1.$40) as open_dt,
    MAX($1.$41) as close_dt,
    MIN($1.$42) as reopen_dt,
    MIN($1.$35) as cov_eff_dt,
    MAX($1.$36) as cov_exp_dt,
    (float)SUM($1.$12) as pd_loss,
    (float)SUM($1.$10) as os_loss,
    (float)SUM($1.$11) as pd_expen,
    (float)SUM($1.$9) as os_expen,
    (float)SUM($1.$13) as salv_rcvr,
    (float)SUM($1.$14) as subro_rcvr;
filt = filter est9 by (clm_ofcnum == '03-123767');
dump filt;

(03-123767,,2002-06-06 00:00:00,,2002-03-11 00:00:00,2002-08-25 00:00:00,1288.71,0.0,0.0,0.0,0.0,0.0) -- the cov_eff_dt here is 2002-03-11

join2 = join est5 by clm_ofcnum,est9 by clm_ofcnum;
join2d = distinct join2;

describe join2d;

join2d: {est5::top::clm_ofcnum: chararray,est5::top::fea_nbr: int,est5::top::loss_stat: chararray,est9::clm_ofcnum: chararray,est9::open_dt: chararray,est9::close_dt: chararray,est9::reopen_dt: chararray,est9::cov_eff_dt: chararray,est9::cov_exp_dt: chararray,est9::pd_loss: float,est9::os_loss: float,est9::pd_expen: float,est9::os_expen: float,est9::salv_rcvr: float,est9::subro_rcvr: float}


est12 = foreach join2d generate est5::clm_ofcnum,est9::cov_eff_dt;
filt = filter est12 by (clm_ofcnum == '03-123767');
dump filt;

(03-123767,2002-06-06 00:00:00) -- the cov_eff_dt here for the same clm_ofcnum is 2002-06-06

I cannot figure out why it picks up the wrong cov_eff_dt after the join. I am not sure what I am missing. Please provide some input.

1 Answer:

Answer 0: (score: 0)

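I played around with this and found two fixes that seem to have resolved the issue: 1) I fixed the nested foreach for the table. 2) I added the proper :: operators so that fields are referenced with the proper nesting.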

It looks like the combination of multiple joins and nesting was creating this issue.
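
The final script is not shown here, so the following is only a minimal sketch of what the two fixes might look like, assuming the relations est and est3 from the question. The alias top1, the use of group as the key, the explicit as clauses, and the trimmed column list are illustrative assumptions, not the actual corrected code.

```pig
-- Assumes est and est3 are already defined as in the question.

-- Fix 1 (sketch): nested foreach without the unused inner alias.
-- Renaming the flattened fields removes the top:: prefix from the schema.
est4 = group est3 by clm_ofcnum;
est5 = foreach est4 {
    sorted = order est3 by fea_nbr desc;
    top1 = limit sorted 1;  -- hypothetical alias name
    generate flatten(top1) as (clm_ofcnum, fea_nbr, loss_stat);
};

-- Aggregate by field name instead of position; group is the grouping key,
-- so each clm_ofcnum yields exactly one row.
-- The other aggregates from the question are omitted for brevity.
est8 = group est by clm_ofcnum;
est9 = foreach est8 generate
    group as clm_ofcnum,
    MIN(est.cov_eff_dt) as cov_eff_dt,
    MAX(est.cov_exp_dt) as cov_exp_dt;

-- Fix 2 (sketch): after the join, qualify every field with :: and
-- rename it, so later statements do not depend on the nested prefixes.
join2 = join est5 by clm_ofcnum, est9 by clm_ofcnum;
est12 = foreach join2 generate
    est5::clm_ofcnum as clm_ofcnum,
    est9::cov_eff_dt as cov_eff_dt;

filt = filter est12 by clm_ofcnum == '03-123767';
dump filt;
```

Running describe on est5 and join2 before writing est12 is a quick way to confirm which prefixes each field actually carries.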

Thanks