Equivalent of the Linux 'diff' in Apache Pig

Date: 2011-05-06 07:08:25

Tags: hadoop apache-pig diff

I want to be able to do a standard diff on two large files. I have something that works, but it is nowhere near as fast as diff on the command line.

A = load 'A' as (line);
B = load 'B' as (line);
JOINED = join A by line full outer, B by line;
DIFF = FILTER JOINED by A::line is null or B::line is null;
DIFF2 = FOREACH DIFF GENERATE (A::line is null?B::line : A::line), (A::line is null?'REMOVED':'ADDED');
STORE DIFF2 into 'diff';

Does anyone have a better approach?

1 answer:

Answer 0: (score: 4)

I use the approaches below. (My JOIN approach is very similar to yours, but it does not replicate diff's behavior for duplicated lines.) As mentioned earlier, perhaps you were only using one reducer, since Pig only got an algorithm to adjust the number of reducers in 0.8?
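If Pig's estimate is not kicking in, or you want to override it, the reduce parallelism can also be set explicitly, either script-wide with default_parallel (0.8+) or per operator with the PARALLEL clause. A minimal sketch against the aliases from the question; the value 20 is only an illustrative guess:

-- Force reduce-side parallelism instead of relying on Pig's estimator.
SET default_parallel 20;  -- script-wide default number of reducers (Pig 0.8+)

A = load 'A' as (line);
B = load 'B' as (line);
-- PARALLEL overrides the default for this single reduce-side operator.
JOINED = join A by line full outer, B by line PARALLEL 20;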

  • Both approaches I use are within a few percent of each other in performance, but they do not treat duplicates the same way
  • The JOIN approach collapses duplicates (so, if one file has more duplicates than the other, this approach will not output the extra duplicates)
  • The UNION approach works like the Unix diff(1) tool and will return the correct number of extra duplicates for the correct file
  • Unlike the Unix diff(1) tool, order is not important (effectively the JOIN approach performs sort -u <foo.txt> | diff, while the UNION approach performs sort <foo.txt> | diff)
  • If you have an incredibly large (~thousands) number of duplicate lines, things will slow down because of the joins (if your use case allows, perform a DISTINCT on the raw data first)
  • If your lines are very long (e.g. >1KB in size), it is recommended to use the DataFu MD5 UDF, diff over the hashes only, and then JOIN against the original files to get the original lines back before output (see the sketch after this list)
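For the long-line case, here is a minimal sketch of that hash-then-diff idea, assuming DataFu's MD5 UDF is available (the jar name, the a.csv/b.csv paths, and the one-sided LEFT OUTER variant are illustrative; the original line is simply carried along next to its hash rather than joined back in a second pass):

-- Sketch: diff over MD5 hashes of the lines instead of the (long) lines themselves.
REGISTER datafu.jar;                      -- assumed name/path of the DataFu jar
DEFINE MD5 datafu.pig.hash.MD5();

a = LOAD 'a.csv' AS (line: chararray);
b = LOAD 'b.csv' AS (line: chararray);

-- Hash every line; keep the original line next to its hash on the "a" side.
a_hashed = FOREACH a GENERATE MD5(line) AS hash, line;
b_hashed = FOREACH b GENERATE MD5(line) AS hash;

-- Lines of a whose hash never appears in b, i.e. the "first_only" half of the diff.
joined     = JOIN a_hashed BY hash LEFT OUTER, b_hashed BY hash;
a_only     = FILTER joined BY b_hashed::hash IS NULL;
first_only = FOREACH a_only GENERATE a_hashed::line AS line;
STORE first_only INTO 'first_only_by_hash' USING PigStorage();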

Using JOIN:

SET job.name 'Diff(1) Via Join';

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS First: chararray;
b = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Second: chararray;

-- Combine Data
combined = JOIN a BY First FULL OUTER, b BY Second;

-- Output Data
SPLIT combined INTO first_raw IF Second IS NULL,
                    second_raw IF First IS NULL;
first_only = FOREACH first_raw GENERATE First;
second_only = FOREACH second_raw GENERATE Second;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();

Using UNION:

SET job.name 'Diff(1)';

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;

a_tagged = FOREACH a_raw GENERATE Row, (int)1 AS File;
b_tagged = FOREACH b_raw GENERATE Row, (int)2 AS File;

-- Combine Data
combined = UNION a_tagged, b_tagged;
c_group = GROUP combined BY Row;

-- Find Unique Lines
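-- NULL_BAG is a one-tuple placeholder (File == 0) emitted when a row occurs the
-- same number of times in both files; File == 0 matches neither branch of the
-- SPLIT below, so those rows are silently dropped. For unequal counts, TOP()
-- emits one tuple per surplus occurrence, tagged with the file holding the extras.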
%declare NULL_BAG 'TOBAG(((chararray)\'place_holder\',(int)0))'

counts = FOREACH c_group {
             firsts = FILTER combined BY File == 1;
             seconds = FILTER combined BY File == 2;
             GENERATE
                FLATTEN(
                        (COUNT(firsts) - COUNT(seconds) == (long)0 ? $NULL_BAG :
                            (COUNT(firsts) - COUNT(seconds) > 0 ?
                                TOP((int)(COUNT(firsts) - COUNT(seconds)), 0, firsts) :
                                TOP((int)(COUNT(seconds) - COUNT(firsts)), 0, seconds))
                        )
                ) AS (Row, File); };

-- Output Data
SPLIT counts INTO first_only_raw IF File == 1,
                  second_only_raw IF File == 2;
first_only = FOREACH first_only_raw GENERATE Row;
second_only = FOREACH second_only_raw GENERATE Row;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();

Performance:

  • Using LZO-compressed input with 18 nodes, it takes only about 10 minutes to diff over 200GB (1,055,687,930 lines).
  • Each approach takes only one Map/Reduce cycle.
  • This works out to roughly 1.8GB diffed per node per minute (not a great throughput, but on my system it seems diff(1) operates only in memory, while Hadoop leverages streaming disks).