I have a semi-structured CSV file that looks like this:
VTS,01,0099,7022606164,SP,GP,33,060646,A,1258.9805,N,07735.9303,E,0.0,278.6,280515,0000,00,4000,11,999,842,4B61
VTS,01,0099,7022606164,NM,GP,20,060637,A,1258.9805,N,07735.9302,E,0.0,278.6,280515,0000,00,4000,11,999,841,7407+++
VTS,66,0065,7022606164,NM,0,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++
VTS,01,0099,7022606164,NM,GP,22,060656,A,1258.9804,N,07735.9301,E,0.0,278.6,280515,0000,00,4000,11,999,843,8FEB+++
VTS,01,0099,7022606164,NM,GP,22,060721,A,1258.9803,N,07735.9304,E,0.0,278.6,280515,0000,00,4000,11,999,845,044D++++++
VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++
VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE
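I want to build three separate tables from this data: one for the VTS,01 records, one for VTS,99 and one for VTS,66. I also need to strip the "+++" appended to the end of some lines, since it is an error. I have written this Pig script: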
data = load '/user/simulator/SKYTRACK/27thMay2015' using PigStorage('\n') as (f1:chararray);
splt = foreach data generate FLATTEN(STRSPLIT($0, '\\+++'));
data_pkt = FILTER splt BY $0 MATCHES '.*VTS,01+.*';
sos_pkt = FILTER splt BY $1 MATCHES '.*VTS,99+.*';
health_pkt = FILTER splt BY $2 MATCHES '.*VTS,66+.*';
When I test this script for each table separately, only one of them produces output; the others return nothing:
dump data_pkt;
dump sos_pkt;
dump health_pkt;
I am very new to Pig, so could anyone help me solve this? It would be greatly appreciated.
Answer 0: (score: 2)
This will filter your records based on the value of the second field.
a = load '/abc.txt' using PigStorage(',');
b1 = FILTER a by $1==01;
b66 = FILTER a by $1==66;
b99 = FILTER a by $1==99;
To remove the +++ you have to write a simple Pig UDF.
Output:
(VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++)
(VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE)
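A UDF may not be strictly necessary for the cleanup. As a rough sketch, assuming your Pig version ships the built-in REPLACE function and TextLoader, each record can be read as one whole line and the "+" runs stripped with a regular expression before the packet types are separated. The aliases raw, clean and line are placeholders chosen for this illustration; the path and the output aliases are the ones from the question.

```pig
-- Read every record as a single chararray so the '+' padding can be removed
-- before the line is inspected.
raw = LOAD '/user/simulator/SKYTRACK/27thMay2015' USING TextLoader() AS (line:chararray);

-- '[+]+' matches any run of one or more '+'. In the sample data '+' only occurs
-- as trailing padding, so removing every run is enough (assumption: real data
-- never contains '+' inside a field). The character class avoids escaping '+'.
clean = FOREACH raw GENERATE REPLACE(line, '[+]+', '') AS line;

-- MATCHES has to match the whole string, hence the trailing '.*'.
data_pkt   = FILTER clean BY line MATCHES 'VTS,01,.*';
health_pkt = FILTER clean BY line MATCHES 'VTS,66,.*';
sos_pkt    = FILTER clean BY line MATCHES 'VTS,99,.*';
```

All three filters test the same field (line), so none of them depends on a second or third column existing after the cleanup step.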
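Answer 1: (score: 2)
To remove the +++ you also need to escape every "+", not just the first one. You were not very specific about what those pluses mean. You can split with this regular expression: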
"\\+{3,}"
So, in your Pig script:
splt = foreach data generate FLATTEN(STRSPLIT($0, '\\+{3,}'));
Although Aman is right, I would rather use SPLIT than FILTER to separate the datasets:
a = load '/abc.txt' using PigStorage(',');
SPLIT a INTO
b01 IF $1 == 01,
b66 IF $1 == 66,
b99 IF $1 == 99;
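Because the packet type carries a leading zero (01), it may be safer to compare it as text rather than as a number. The following is a minimal sketch, assuming the file is comma-delimited as in the sample and that your Pig release supports the OTHERWISE branch of SPLIT; unknown is a hypothetical alias for records whose type is none of the three.

```pig
-- Split on commas so that $1 holds the packet type ('01', '66' or '99').
a = LOAD '/abc.txt' USING PigStorage(',');

-- Compare the type as a string; the explicit cast makes the intent clear.
SPLIT a INTO
    b01 IF (chararray)$1 == '01',
    b66 IF (chararray)$1 == '66',
    b99 IF (chararray)$1 == '99',
    unknown OTHERWISE;  -- hypothetical catch-all relation for other packet types
```

Dumping unknown is a quick way to check whether the input contains packet types you did not expect.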
Answer 2: (score: 0)
This works reasonably well:
data = load '/user/simulator/SKYTRACK/27thMay2015' using PigStorage(',');
splt = foreach data generate $0 as col0:chararray,$1 as col1:chararray,$2 as col2:chararray,$3 as col3:chararray,$4 as col4:chararray,$5 as col5:chararray,$6 as col6:chararray,$7 as col7:chararray,$8 as col8:chararray,$9 as col9:chararray,$10 as col10:chararray,$11 as col11:chararray,$12 as col12:chararray,$13, FLATTEN(STRSPLIT($14, '\\+++'));
data_pkt = FILTER splt BY $1 MATCHES '.*01+.*';
health_pkt = FILTER splt BY $1 MATCHES '.*66+.*';
sos_pkt = FILTER splt BY $1 MATCHES '.*99+.*';
But the problem is that it takes three steps.