I have a file containing all the tweets (in relation A):
today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever
...
I have another file (in relation B) containing the words to be filtered:
sick
viral fever
feeling
...
My code:
--loads all the tweets
A = LOAD 'tweets' AS (tweet:chararray);
--loads all the words to be filtered
B = LOAD 'filter_list' AS (filter_word:chararray);
Expected output:
(sick,1)
(viral fever,2)
(feeling,1)
...
How can I achieve this in Pig using a join?
Answer 0 (score: 1)
The basic concept I provided earlier will work, but it requires adding a UDF to generate the n-gram pairs from the tweets. You then union the n-gram pairs with the tokenized tweets and run a word count over that combined data set.
I have tested the code below and it works well against the provided data. If a record in filter_list has more than two words in it (e.g. "i am not feeling well"), you will need to recompile the ngram-udf with the appropriate count (or, ideally, turn it into a variable and set the n-gram size at run time).
You can get the source code for the NGramGenerator UDF here: Github
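As a sketch of the "turn it into a variable" idea: if you modify the UDF so that its constructor takes the n-gram size (the stock tutorial class does not take constructor arguments as far as I know, so the '3' below is hypothetical), you could pick the size per script without recompiling:

REGISTER ngram-udf.jar
-- hypothetical: assumes the UDF has been edited to take the n-gram size as a constructor argument
DEFINE NGGen3 org.apache.pig.tutorial.NGramGenerator('3');
NG3 = FOREACH A GENERATE FLATTEN(NGGen3(tweet)) AS ngram;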
ngrams.pig
REGISTER ngram-udf.jar
DEFINE NGGen org.apache.pig.tutorial.NGramGenerator;
--Load the initial data
A = LOAD 'tweets.txt' as (tweet:chararray);
--Create NGram tuple with a size limit of 2 from the tweets
B = FOREACH A GENERATE FLATTEN(NGGen(tweet)) as ngram;
--Tokenize the tweets into single word tuples
C = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)tweet)) as ngram;
--Union the Ngram and word tuples
D = UNION B,C;
--Group similar tuples together
E = GROUP D BY ngram;
--For each unique ngram, generate the ngram name and a count
F = FOREACH E GENERATE group, COUNT(D);
--Load the wordlist for joining
Z = LOAD 'wordlist.txt' as (word:chararray);
--Perform the inner join of the ngrams and the wordlist
Y = JOIN F BY group, Z BY word;
--For each intersecting record, store the ngram and count
X = FOREACH Y GENERATE $0,$1;
DUMP X;
Results / Output
(feeling,1)
(viral fever,2)
(sick gets no row here: the JOIN is an inner join, and no sample tweet contains "sick".)
tweets.txt
today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever
wordlist.txt
sick
viral fever
feeling
I don't currently have access to my Hadoop system to test this answer, so the code may be slightly off. The logic, however, should be sound. A simple solution would be:
Example code:
--Load the tweets; each record is one tweet
A = LOAD 'tweets.txt';
--Tokenize each tweet into single-word records
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) as word;
--Group identical words and count them
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
--Load the word list and keep only the counted words that appear in it
Z = LOAD 'wordlist.txt' as (word:chararray);
Y = JOIN D BY group, Z BY word;
--Emit (word, count): after the join, $0 is the word and $1 is its count
X = FOREACH Y GENERATE $0, $1;
DUMP X;
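Since I can't test this right now, one way to sanity-check the positional references is to print the schema the join produces; the $0/$1 above assume the layout noted in the comment:

-- show the schema of the join result so the positional fields above can be verified
DESCRIBE Y;
-- $0 should be the grouped word (chararray), $1 the count (long), $2 the wordlist word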
Answer 1 (score: 0)
As far as I know, this is not possible using a join alone. You could use a CROSS followed by a FILTER with regular-expression matching; see the sketch below.
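A minimal, untested sketch of that approach, reusing tweets.txt and wordlist.txt from above; I've used the INDEXOF built-in instead of a regular expression to sidestep pattern-escaping, so this is plain substring matching ("feeling" would also match "feelings"):

A = LOAD 'tweets.txt' AS (tweet:chararray);
Z = LOAD 'wordlist.txt' AS (word:chararray);
-- pair every tweet with every filter word; note that CROSS is expensive on large inputs
C = CROSS A, Z;
-- keep the pairs where the filter word occurs somewhere in the tweet
F = FILTER C BY INDEXOF(tweet, word, 0) >= 0;
-- count, per filter word, how many tweets contained it
G = GROUP F BY word;
R = FOREACH G GENERATE group, COUNT(F);
DUMP R;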