Question

I have data like this:

(a,b,c,d)
(g,b,v,n)
(n,h,l,o)
(,,,)
(,,,)
(,,,)
(,,,)

I want to remove the empty bags. Expected output:

(a,b,c,d)
(g,b,v,n)
(n,h,l,o)

Answer 1

It would help if you posted the code you have so far.

Approach 1: Filter by null values

-- load comma-delimited values into columns
A = load './input.txt' using PigStorage(',') as (one:chararray, two:chararray, three:chararray, four:chararray);
dump A;

-- keep records with at least one non-null column (drops records where every column is null)
B = FILTER A BY (one is not null) OR (two is not null) OR (three is not null) OR (four is not null);
dump B;

This assumes input.txt looks like this:

a,b,c,d
g,b,v,n
n,h,l,o
,,,
,,,
,,,
,,,

Run the script:

pig -x local clean.pig

Output of the first dump:

(a,b,c,d)
(g,b,v,n)
(n,h,l,o)
(,,,)
(,,,)
(,,,)
(,,,)

Output of the second dump:

(a,b,c,d)
(g,b,v,n)
(n,h,l,o)
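
The null filter above keeps a record as soon as any one column is non-null, which is the same as dropping a record only when all four columns are null. A minimal sketch of that equivalent form, using SPLIT so the dropped records can be inspected too. The alias names blank_rows and kept_rows are made up for illustration, and the OTHERWISE branch assumes a Pig release that supports it (0.10 or later, as far as I know):

```pig
-- load comma-delimited values into columns (same input.txt as above)
A = LOAD './input.txt' USING PigStorage(',') AS (one:chararray, two:chararray, three:chararray, four:chararray);

-- a record goes to blank_rows only when every column is null;
-- everything else goes to kept_rows
SPLIT A INTO blank_rows IF (one is null) AND (two is null) AND (three is null) AND (four is null),
             kept_rows OTHERWISE;

DUMP kept_rows;   -- the records the OR filter above would keep
DUMP blank_rows;  -- the records it would drop
```

This is handy when you want to check what is being thrown away before committing to the filter.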

Approach 2: Filter by column count

-- load comma-delimited values into columns
A = load './input.txt' using PigStorage(',') as (one:chararray, two:chararray, three:chararray, four:chararray);
dump A;

-- count the non-null columns in each record (COUNT ignores nulls) and prepend the count
B = FOREACH A GENERATE COUNT(TOBAG(*)),$0..;
dump B;

-- filter by column count
C = FILTER B BY $0 > 0;
dump C;

-- remove column count
D = FOREACH C GENERATE $1..;
dump D;

Output of dump A:

(a,b,c,d)
(g,b,v,n)
(n,h,l,o)
(,,,)
(,,,)
(,,,)
(,,,)

Output of dump B:

(4,a,b,c,d)
(4,g,b,v,n)
(4,n,h,l,o)
(0,,,,)
(0,,,,)
(0,,,,)
(0,,,,)

Output of dump C:

(4,a,b,c,d)
(4,g,b,v,n)
(4,n,h,l,o)

Output of dump D:

(a,b,c,d)
(g,b,v,n)
(n,h,l,o)

P.S.:

If the input file itself contains the parentheses, they may need to be handled separately.
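
One possible way to handle that case, as a rough sketch only: read each line whole, strip the parentheses, drop the lines that are nothing but separators, then split what is left into columns. It assumes every line looks like (a,b,c,d) or (,,,) with exactly four fields. The alias names raw, stripped and kept are made up for illustration, and I have not run this against real data:

```pig
-- read each line as a single string, e.g. "(a,b,c,d)" or "(,,,)"
raw = LOAD './input.txt' USING TextLoader() AS (line:chararray);

-- strip the parentheses (REPLACE takes a Java regular expression)
stripped = FOREACH raw GENERATE REPLACE(line, '[()]', '') AS line;

-- drop lines that contain nothing but commas and spaces
kept = FILTER stripped BY NOT (line MATCHES '[, ]*');

-- split the remaining lines into four columns
D = FOREACH kept GENERATE FLATTEN(STRSPLIT(line, ',', 4)) AS (one:chararray, two:chararray, three:chararray, four:chararray);
DUMP D;
```

The filter runs before the split because STRSPLIT would turn ",,," into empty strings rather than nulls, so the null checks from the first approach would not catch those rows.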
