How to filter out the first row of a variable in Pig

Time: 2017-05-06 01:07:15

Tags: apache-pig

I loaded a CSV file into a variable as follows:

basketball_players = load '/usr/data/basketball_players.csv' using PigStorage(',');
The output of the first 3 rows is shown below:

tmp = limit basketball_players 3;
dump tmp;

("playerID","year","stint","tmID","lgID","GP","GS","minutes","points","oRebounds","dRebounds","rebounds","assists","steals","blocks","turnovers","PF","fgAttempted","fgMade","ftAttempted","ftMade","threeAttempted","threeMade","PostGP","PostGS","PostMinutes","PostPoints","PostoRebounds","PostdRebounds","PostRebounds","PostAssists","PostSteals","PostBlocks","PostTurnovers","PostPF","PostfgAttempted","PostfgMade","PostftAttempted","PostftMade","PostthreeAttempted","PostthreeMade","note")
("abramjo01","1946","1","PIT","NBA","47","0","0","527","0","0","0","35","0","0","0","161","834","202","178","123","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0",)
("aubucch01","1946","1","DTF","NBA","30","0","0","65","0","0","0","20","0","0","0","46","91","23","35","19","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0",)

As you can see, the first row is the table header. I used the command below to filter out the first row, but it did not work.

grunt> players_raw = filter basketball_players by $1 > 0;
2017-05-06 11:03:36,389 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 6 time(s).

When I dump players_raw, it comes back empty. How can I filter the first row out of the variable?

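1 Answer:

Answer 0: (score: 0)

Use RANK to generate a new column that adds a row number to the dataset. With no schema declared, the new column is named rank_basketball_players. Filter on that column to drop the first row.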

basketball_players = load '/usr/data/basketball_players.csv' using PigStorage(',');
ranked = rank basketball_players;
basketball_players_without_header = Filter ranked by (rank_basketball_players > 1);
DUMP basketball_players_without_header;

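Another approach is to drop every row whose first field contains the header name:

basketball_players = load '/usr/data/basketball_players.csv' using PigStorage(',');
-- NOT is required: without it the filter keeps only the header row
basketball_players_without_header = Filter basketball_players by not ($0 matches '.*playerID.*');
DUMP basketball_players_without_header;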
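
As for why the original filter returned nothing: judging by the dump in the question, PigStorage keeps the double quotes as part of each field, so $1 holds "1946" with the quotes included. The implicit cast to int then most likely yields null for every row, and a null comparison is dropped by FILTER. A minimal sketch of a numeric filter that still works, assuming every field is quoted as in that dump (with_year and year_int are names made up for this illustration):

-- Sketch only: assumes all fields are wrapped in double quotes, as in the dump above.
basketball_players = load '/usr/data/basketball_players.csv' using PigStorage(',');
-- Strip the quotes from the year column, then cast it to int.
-- The header value "year" cannot be parsed as a number, so it becomes null.
with_year = foreach basketball_players generate *, (int) REPLACE((chararray) $1, '"', '') as year_int;
-- Rows with a null year_int (the header) are dropped here.
players_raw = filter with_year by year_int is not null and year_int > 0;
DUMP players_raw;

This removes the header as a side effect of the failed cast, so it would also drop any data row whose year field is not a number.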