Pig: matching against an external file

Date: 2014-03-17 13:29:53

Tags: apache-pig

I have a file containing all the tweets (in relation A):

today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever
...

I have another file (in relation B) containing the words to filter on:

    sick
    viral fever
    feeling
    ...

My code:

-- loads all the tweets
A = load 'tweets' as tweets;
-- loads all the words to be filtered
B = load 'filter_list' as filter_list;

Expected output:

(sick,1)
(viral fever,2)
(feeling,1)
...

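How can I achieve this in Pig using a join?

2 answers:

Answer 0 (score: 1)

Edited solution

The basic concept I provided earlier works, but it needs a UDF to generate NGram pairs from the tweets. You then union the NGram pairs with the tokenized tweets and run a word count over the combined dataset.

I have tested the code below and it works well against the data provided. If a record in filter_list has more than 2 words in the string (e.g. "I feel bad"), you will need to recompile the ngram-udf with the appropriate count (or, ideally, turn the count into a variable and set it at run time).

You can get the source code for the NGramGenerator UDF here: Github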



ngrams.pig

REGISTER ngram-udf.jar;
DEFINE NGGen org.apache.pig.tutorial.NGramGenerator;

--Load the initial data
A = LOAD 'tweets.txt' as (tweet:chararray);

--Create NGram tuple with a size limit of 2 from the tweets
B = FOREACH A GENERATE FLATTEN(NGGen(tweet)) as ngram; 
--Tokenize the tweets into single word tuples
C = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)tweet)) as ngram;

--Union the Ngram and word tuples
D = UNION B,C;
--Group similar tuples together
E = GROUP D BY ngram;
--For each unique ngram, generate the ngram name and a count
F = FOREACH E GENERATE group, COUNT(D);


--Load the wordlist for joining
Z = LOAD 'wordlist.txt' as (word:chararray);

--Perform the innerjoin of the ngrams and the wordlist
Y = JOIN F BY group, Z BY word;

--For each intersecting record, store the ngram and count
X = FOREACH Y GENERATE $0,$1;


DUMP X;



Result/output

(feeling,1)
(viral fever,2)



tweets.txt

today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever



wordlist.txt

sick
viral fever
feeling





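Original solution

I don't currently have access to my Hadoop system to test this answer, so the code may be slightly off. The logic should be sound, though. A simple solution would be:

  1. Run the classic wordcount program against the tweets dataset
  2. Perform an inner join of the word list and the word counts from the tweets
  3. Generate the data again to drop the duplicated word from each tuple
  4. Dump/store the join result

Sample code: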

    A = LOAD 'tweets.txt';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) as word;
    C = GROUP B BY word;
    D = FOREACH C GENERATE group, COUNT(B);
    
    Z = LOAD 'wordlist.txt' as (word:chararray);
    Y = JOIN D BY group, Z BY word;
    X = FOREACH Y GENERATE $0, $1;
    DUMP X;
    

Answer 1 (score: 0)

As far as I know, this is not possible using a join.

You can use CROSS followed by a FILTER with regular-expression matching.
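
A minimal, untested sketch of the CROSS plus FILTER idea. It assumes the file names tweets.txt and wordlist.txt from the other answer, and that the phrases contain no regex metacharacters (they are concatenated into the pattern as-is). The aliases tweets, words, pairs, hits, grouped and counts are illustrative names only. It counts how many tweets contain each phrase as a substring, so "feeling" would also match "feelings".

```pig
-- Load the tweets and the filter phrases, one per line
tweets = LOAD 'tweets.txt' AS (tweet:chararray);
words  = LOAD 'wordlist.txt' AS (word:chararray);

-- Pair every tweet with every phrase (cartesian product)
pairs = CROSS tweets, words;

-- Keep only the pairs where the tweet contains the phrase;
-- the pattern is built per row as .*<word>.*
hits = FILTER pairs BY tweet MATCHES CONCAT('.*', CONCAT(word, '.*'));

-- Count the matching tweets for each phrase
grouped = GROUP hits BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(hits) AS n;

DUMP counts;
```

Since CROSS builds every tweet/phrase combination, this is likely to be practical only when the phrase list is small. Phrases that match no tweet will not appear in the output.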