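Comparing two variables in Pig

Time: 2016-07-21 05:56:55

Tags: hadoop apache-pig hadoop2

I have two documents, and I need to filter the words of the second document using the words of the first document.

I tried, but it did not work.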


2 answers:

Answer 0: (score: 0)

Instead of filtering, I used a join.

a. Inner join:

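-- Note: A and B both load abc.txt; point B at the second document to compare two files.
A = LOAD '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
B = LOAD '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);

-- Split each line into one word per row.
words1 = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
words2 = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) AS wordz;

-- Keep only the words that occur in both relations.
x = JOIN words1 BY word, words2 BY wordz;

grouped = GROUP x BY word;

-- Count the joined rows for each word.
D = FOREACH grouped GENERATE COUNT(x), group;

DUMP D;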

b. Cross join:

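-- Note: A and B both load abc.txt; point B at the second document to compare two files.
A = LOAD '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);
B = LOAD '/user/balanagaraju.maliset/Dump/abc.txt' AS (line:chararray);

words1 = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
words2 = FOREACH B GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- CROSS pairs every row of words1 with every row of words2,
-- which becomes expensive on large inputs.
C = CROSS words1, words2;
CC = FOREACH C GENERATE $0 AS first, $1 AS second;

-- Keep only the pairs in which both words are equal.
R = FILTER CC BY first == second;

grouped = GROUP R BY first;

D = FOREACH grouped GENERATE group, COUNT(R);

DUMP D;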

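Answer 1: (score: 0)

Your requirement seems to be:

You have two files, A and B, and you want to exclude from file B all the words that exist in file A. You can use a left outer join.

The script would look like this: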

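file1 = LOAD 'A' USING PigStorage() AS (word1:chararray);

file2 = LOAD 'B' USING PigStorage() AS (word2:chararray);

-- JOIN is a reserved word in Pig Latin, so the alias is named joined.
joined = JOIN file2 BY word2 LEFT OUTER, file1 BY word1;

filtered = FILTER joined BY word1 IS NULL;

DUMP filtered;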

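Explanation: LEFT OUTER ensures that every word from file2 is included in the join result. Words that appear in both file1 and file2 get a non-null word1, while words that appear only in file2 get a null word1. Keeping only the rows where word1 is NULL therefore leaves the words that exist in file2 but not in file1.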
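
The question talks about documents rather than one-word-per-line files, so the same left outer join could be combined with the tokenizing step from the first answer. The following is only a sketch under assumptions: the paths 'doc1.txt' and 'doc2.txt' and all alias names are hypothetical, and TOKENIZE is used with its default delimiters.

```pig
-- Hypothetical input paths; replace them with the real document locations.
doc1 = LOAD 'doc1.txt' AS (line:chararray);
doc2 = LOAD 'doc2.txt' AS (line:chararray);

-- Split every line into one word per row.
words1 = FOREACH doc1 GENERATE FLATTEN(TOKENIZE(line)) AS word1;
words2 = FOREACH doc2 GENERATE FLATTEN(TOKENIZE(line)) AS word2;

-- Optional: de-duplicate the filter list to shrink the join input.
filter_words = DISTINCT words1;

-- Every word of doc2 is kept; word1 is null when doc1 has no match.
joined = JOIN words2 BY word2 LEFT OUTER, filter_words BY word1;

-- Keep only the doc2 words that did not match anything in doc1.
kept = FILTER joined BY word1 IS NULL;
remaining = FOREACH kept GENERATE word2;

DUMP remaining;
```

A word that occurs several times in doc2 and never in doc1 appears once per occurrence in the output; adding a DISTINCT on remaining would collapse those repeats if only the unique words are wanted.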