I am new to Apache Pig and have the following data:
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
...
I am trying to extract it into the following records:
0012,1,23
0013,2,34
0015,1,45
0011,1,456
...
Here is the code I have tried:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
order_item:chararray,
order_pid: chararray,
order_qty: chararray,
order_price: chararray
);
It does not work.
Another attempt, this time storing the result in a bag:
a = LOAD 'a.txt' Using TextLoader() AS (line:chararray);
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'order=((\\d+),(\\d+),(\\d+))+')) AS
(
B: bag { T: tuple(
order_pid: chararray,
order_qty: chararray,
order_price: chararray
)}
);
This does not work either.
Answer 0 (score: 0)
Can you try this?
Input
order=0012,1,23
order=0013,2,34,0015,1,45
order=0011,1,456
PigScript:
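-- Load each line into one chararray field (the default loader splits on tabs, which this data lacks).
A = LOAD 'input' AS (line:chararray);
-- Strip the 'order=' prefix and split the remainder on commas into flat fields.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(REGEX_EXTRACT(line,'order=(.*)',1),','));
-- Regroup the fields three at a time; this covers at most two orders per line.
C = FOREACH B GENERATE FLATTEN(TOBAG(TOTUPLE($0..$2),TOTUPLE($3..$5)));
-- Drop the all-null second tuple produced by lines that hold a single order.
D = FILTER C BY $0 is not null;
DUMP D;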
Output:
(0012,1,23)
(0013,2,34)
(0015,1,45)
(0011,1,456)
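
One caveat: the script above hard-codes two tuples ($0..$2 and $3..$5), so a line carrying three or more orders would silently lose the extra ones. If that can happen in your data, the grouping step probably needs to work for any number of triples. Below is a minimal sketch of that logic in plain Python, not Pig. The function name split_orders and its fields_per_order parameter are made up for illustration. Something along these lines could presumably be adapted into a Jython UDF, but I have not tried that here, so treat it as an assumption.

```python
import re


def split_orders(line, fields_per_order=3):
    """Parse one 'order=...' line into a list of fixed-size tuples.

    Hypothetical helper for illustration; not part of Pig.
    """
    match = re.match(r'order=(.*)$', line.strip())
    if match is None:
        # Lines without the 'order=' prefix yield no records.
        return []
    values = match.group(1).split(',')
    # Step through the flat value list three at a time.
    # An incomplete trailing group is dropped rather than padded.
    last_start = len(values) - fields_per_order
    return [
        tuple(values[i:i + fields_per_order])
        for i in range(0, last_start + 1, fields_per_order)
    ]


lines = [
    'order=0012,1,23',
    'order=0013,2,34,0015,1,45',
    'order=0011,1,456',
]

# Flatten the per-line lists into one list of records, like FLATTEN in Pig.
records = [rec for line in lines for rec in split_orders(line)]

for rec in records:
    print(','.join(rec))
```

For the three sample lines this produces the same four records as the Pig output above, and a line with three or more orders would yield one record per complete triple.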