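I am trying to implement Twitter sentiment analysis. I need to get all the positive tweets and the negative tweets and store them in specific text files.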
sample.json
{"id": 252479809098223616, "created_at": "Wed Apr 12 08:23:20 +0000 2016", "text": "google is a good company", "user_id": 450990391}{"id": 252479809098223616, "created_at": "Wed Apr 12 08:23:20 +0000 2016", "text": "facebook is a bad company","user_id": 450990391}
dictionary.text contains the list of positive and negative keywords
weaksubj 1 bad adj n negative
strongsubj 1 good adj n positive
Pig script:
tweets = load 'new.json' using JsonLoader('id:chararray,text:chararray,user_id:chararray,created_at:chararray');
dictionary = load 'dictionary.text' AS (type:chararray,length:chararray,word:chararray,pos:chararray,stemmed:chararray,polarity:chararray);
words = foreach tweets generate FLATTEN( TOKENIZE(text) ) AS word,id,text,user_id,created_at;
sentiment = join words by word left outer, dictionary by word;
senti2 = foreach sentiment generate words::id as id,words::created_at as created_at,words::text as text,words::user_id as user_id,dictionary::polarity as polarity;
res = FILTER senti2 BY polarity MATCHES '.*possitive.*';
describe res:
res: {id: chararray,created_at: chararray,text: chararray,user_id: chararray,polarity: chararray}
But when I dump res, I see no output, even though the script runs fine without any errors.
What mistake am I making here?
Please advise.
Mohan.V
Answer 0 (score: 0)
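I see 2 mistakes here.
First mistake: the dictionary is not loaded correctly.
Solution: use PigStorage() and specify the appropriate delimiter. As written, the load falls back to the default tab delimiter, so each whole line lands in the first field and the remaining fields are empty:
dictionary = load 'dictionary.text' AS (type:chararray,length:chararray,word:chararray,pos:chararray,stemmed:chararray,polarity:chararray);
DUMP dictionary;
(weaksubj 1 bad adj n negative,,,,,)
(strongsubj 1 good adj n positive,,,,,)
Second mistake: line 6: correct the spelling of "positive"! Use something like
res = FILTER senti2 BY UPPER(polarity) MATCHES '.*POSITIVE.*';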
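For the first mistake, the load with an explicit delimiter might look like the sketch below. It assumes the fields in dictionary.text are separated by a single space, as the sample lines in the question suggest; if the real file uses another separator, the argument to PigStorage would need to change.

```pig
-- Assumption: fields in dictionary.text are separated by a single space.
dictionary = LOAD 'dictionary.text' USING PigStorage(' ')
    AS (type:chararray, length:chararray, word:chararray,
        pos:chararray, stemmed:chararray, polarity:chararray);

-- Each column should now be populated, instead of the whole line
-- landing in the first field.
DUMP dictionary;
```

With the word and polarity columns filled in, the join on word in the question has something to match against.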
Answer 1 (score: 0)
I see a spelling mistake:
res = FILTER senti2 BY polarity MATCHES '.*possitive.*';
Shouldn't it be '.*positive.*'?
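Since each dictionary row carries a single polarity keyword, an exact comparison may be enough, and SPLIT can produce both sets in one pass. This is only a sketch: it assumes the senti2 relation from the question, that polarity values are exactly 'positive' and 'negative', and the relation names positive_rows and negative_rows are made up for illustration.

```pig
-- Route each row of senti2 by its polarity. Rows whose polarity is null
-- (words not found in the dictionary) match neither condition and are dropped.
SPLIT senti2 INTO
    positive_rows IF polarity == 'positive',
    negative_rows IF polarity == 'negative';

DUMP positive_rows;
```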
Answer 2 (score: 0)
My suggestion is to use a custom UDF to solve your problem. For now you can use elephant-bird-pig-4.1.jar and json-simple-1.1.1.jar. Also, if you want to look at examples, you can use this Sentiment Analysis Tutorial. If you need code, you can refer to the following, which is based on the tutorial and my own code:
REGISTER '/usr/local/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/usr/local/elephant-bird-pig-4.1.jar';
REGISTER '/usr/local/json-simple-1.1.1.jar';
load_tweets = LOAD '/user/new.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
extract_details = FOREACH load_tweets GENERATE myMap#'id' as id,myMap#'text' as text;
tokens = foreach extract_details generate id,text, FLATTEN(TOKENIZE(text)) As word;
dictionary = load '/user/dictionary.text' AS (type:chararray,length:chararray,word:chararray,pos:chararray,stemmed:chararray,polarity:chararray);
word_rating = join tokens by word left outer, dictionary by word using 'replicated';
describe word_rating;
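-- The dictionary schema above has no rating field, so polarity is mapped to a score here (assumption: positive = 1, negative = -1, anything else = 0).
rating = foreach word_rating generate tokens::id as id,tokens::text as text, (dictionary::polarity == 'positive' ? 1 : (dictionary::polarity == 'negative' ? -1 : 0)) as rate;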
word_group = group rating by (id,text);
avg_rate = foreach word_group generate group, AVG(rating.rate) as tweet_rating;
positive_tweets = filter avg_rate by tweet_rating>=0;
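The question also asks for the positive and negative tweets to end up in separate files. A minimal sketch of that last step, continuing from the avg_rate and positive_tweets relations above; the output directories are hypothetical, and treating an average below zero as negative is an assumption that mirrors the >= 0 filter.

```pig
-- Tweets whose average score is below zero are treated as negative.
negative_tweets = filter avg_rate by tweet_rating < 0;

-- Hypothetical output directories; Pig writes part files inside each one.
store positive_tweets into '/user/output/positive_tweets' using PigStorage('\t');
store negative_tweets into '/user/output/negative_tweets' using PigStorage('\t');
```

Note that store fails if the target directory already exists, so each run needs fresh output paths or a cleanup step first.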