Question

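I want to split a file into 4 equal parts using Apache Pig. For example, if the file has 100 lines, the first 25 should go to the first output file, and so on, with the last 25 lines going to the fourth output file. Can someone help me achieve this? I am using Apache Pig because the file will contain millions of records, and the previous steps that generate the file to be split also use Pig.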

Answer 1

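I did some digging on this because it came up in the Hortonworks sample exam for Hadoop. It does not seem to be well documented, but it is really quite simple. In this example I used the Country sample database available for download from dev.mysql.com: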

grunt> storeme = order data by $0 parallel 3;
grunt> store storeme into '/user/hive/countrysplit_parallel';

Then, if we look at the directory in HDFS:

[root@sandbox arthurs_stuff]# hadoop fs -ls /user/hive/countrysplit_parallel
Found 4 items
-rw-r--r--   3 hive hdfs          0 2016-04-08 10:19 /user/hive/countrysplit_parallel/_SUCCESS
-rw-r--r--   3 hive hdfs       3984 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00000
-rw-r--r--   3 hive hdfs       4614 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00001
-rw-r--r--   3 hive hdfs       4768 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00002

Hope that helps.
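To get four part files rather than three, the same approach would presumably be run with PARALLEL 4. The sketch below is only an illustration: the input path 'country.csv', the comma delimiter, and the output directory name are placeholders, not taken from the answer.

```pig
-- Placeholder input path and delimiter; adjust to the real data.
data = LOAD 'country.csv' USING PigStorage(',');

-- PARALLEL sets the number of reduce tasks for the ORDER,
-- so the output directory should contain four part-r-* files.
sorted_data = ORDER data BY $0 PARALLEL 4;

STORE sorted_data INTO '/user/hive/countrysplit_parallel4';
```

Note that this splits the data by ranges of the sort key, so the rows are reordered and the part files are not guaranteed to hold the same number of rows. The byte sizes in the listing above already differ from file to file.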

Answer 2

You can use some of the following Pig features to get the result you want.

SPLIT operator: http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
MultiStorage class: https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/piggybank/storage/MultiStorage.html
Writing a custom Pig store function: https://pig.apache.org/docs/r0.7.0/udf.html#Store+Functions

You will have to supply some condition based on your data.
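As a rough illustration of the SPLIT option, here is a minimal sketch that assumes the input has exactly 100 lines, as in the question. The path 'file', the output names, and the aliases part1 to part4 are placeholders. RANK without a BY clause is used only to prepend a sequential row number (named rank_A) that the conditions can test.

```pig
-- Assumption: 'file' holds exactly 100 lines, one record per line.
A = LOAD 'file' USING PigStorage() AS (line:chararray);

-- RANK with no BY clause prepends a sequential number as rank_A.
B = RANK A;

-- One condition per quarter; each row matches exactly one of them.
SPLIT B INTO
    part1 IF rank_A <= 25,
    part2 IF (rank_A > 25 AND rank_A <= 50),
    part3 IF (rank_A > 50 AND rank_A <= 75),
    part4 IF rank_A > 75;

STORE part1 INTO 'file1';
STORE part2 INTO 'file2';
STORE part3 INTO 'file3';
STORE part4 INTO 'file4';
```

The stored rows still carry the rank_A column. If that is unwanted, a FOREACH ... GENERATE line before each STORE would drop it.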

Answer 3

This can do it, but there may be better options.

A = LOAD 'file' using PigStorage() as (line:chararray);
B = RANK A;
C = FILTER B BY rank_A >= 1 and rank_A <= 25;
D = FILTER B BY rank_A > 25 and rank_A <= 50;
E = FILTER B BY rank_A > 50 and rank_A <= 75;
F = FILTER B BY rank_A > 75 and rank_A <= 100;
store C into 'file1';
store D into 'file2';
store E into 'file3';
store F into 'file4';
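The bounds in this script are hard-coded for a 100-line file. One possible way to derive them from the actual row count is sketched below, using a GROUP ... ALL with COUNT to attach the total to every row. The aliases G, W, X, part1 to part4 and the field names total_rows and bucket are made up for illustration, and this has not been run against real data.

```pig
A = LOAD 'file' USING PigStorage() AS (line:chararray);
B = RANK A;

-- Attach the total row count to every ranked row.
G = GROUP B ALL;
W = FOREACH G GENERATE COUNT(B) AS total_rows, FLATTEN(B) AS (rank_A, line);

-- Integer division maps ranks 1..N onto buckets 0..3.
-- With N = 100: ranks 1-25 -> 0, 26-50 -> 1, 51-75 -> 2, 76-100 -> 3.
X = FOREACH W GENERATE ((rank_A - 1) * 4) / total_rows AS bucket, line;

SPLIT X INTO
    part1 IF bucket == 0,
    part2 IF bucket == 1,
    part3 IF bucket == 2,
    part4 IF bucket == 3;

STORE part1 INTO 'file1';
STORE part2 INTO 'file2';
STORE part3 INTO 'file3';
STORE part4 INTO 'file4';
```

When the row count is not a multiple of 4, the four parts differ in size by at most one row. Keep in mind that GROUP ... ALL sends every row through a single reducer, which may be slow for very large inputs.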

Answer 4

My requirement changed a bit: I had to store the first 25% of the data in one file and the rest in another. Here is the Pig script that worked for me.

ip_file = LOAD 'input file' using PigStorage('|');
rank_file = RANK ip_file by $2;
rank_group = GROUP rank_file ALL;
with_max = FOREACH rank_group GENERATE COUNT(rank_file),FLATTEN(rank_file);
top_file = filter with_max by $1  <=  $0/4;
rest_file = filter with_max by $1  >  $0/4;
sort_top_file = order top_file by $1 parallel 1;
store sort_top_file into 'output file 1' using PigStorage('|');
store rest_file into 'output file 2' using PigStorage('|');

Splitting a file into 4 equal parts using Apache Pig
