I am trying to run a Python UDF in Hive to do some sentiment analysis on Twitter data captured with Flume.
My tweets table definition:
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING,name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  lang STRING,
  retweet_count INT,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'hdfs://192.168.0.73:8020/user/flume/tweets'
My Python code:
import hashlib
import sys

# Hive TRANSFORM feeds one row per line on stdin, columns separated by tabs.
for line in sys.stdin:
    line = line.strip()
    (lang, text) = line.split('\t')
    positive = set(["love", "good", "great", "happy", "cool", "best", "awesome", "nice", "helpful", "enjoyed"])
    negative = set(["hate", "bad", "stupid", "terrible", "unhappy"])
    words = text.split()
    word_count = len(words)
    # +1 for every positive word, -1 for every negative word
    positive_matches = [1 for word in words if word in positive]
    negative_matches = [-1 for word in words if word in negative]
    st = sum(positive_matches) + sum(negative_matches)
    if st > 0:
        print ('\t'.join([lang, text, 'positive', str(word_count)]))
    elif st < 0:
        print ('\t'.join([lang, text, 'negative', str(word_count)]))
    else:
        print ('\t'.join([lang, text, 'neutral', str(word_count)]))
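To rule out the scoring logic itself, the same rules can be pulled into a function and checked locally, without Hive. This is only a sketch: the name `classify` and the constants `POSITIVE` and `NEGATIVE` are hypothetical and not part of the script above; the word lists are copied from it.

```python
# Hypothetical helper (not in the original script): same scoring rules,
# wrapped in a function so they can be exercised without Hive.
POSITIVE = {"love", "good", "great", "happy", "cool", "best",
            "awesome", "nice", "helpful", "enjoyed"}
NEGATIVE = {"hate", "bad", "stupid", "terrible", "unhappy"}


def classify(text):
    """Return (sentiment, word_count) for a whitespace-tokenised text."""
    words = text.split()
    # +1 per positive word, -1 per negative word
    score = (sum(1 for w in words if w in POSITIVE)
             - sum(1 for w in words if w in NEGATIVE))
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return sentiment, len(words)


if __name__ == "__main__":
    for sample in ["I love this great phone", "bad stupid day", ""]:
        print(repr(sample), classify(sample))
```

Since this part only looks at the text, it should behave the same no matter how Hive delivers the rows.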
And finally my Hive query:
ADD JAR /tmp/json-serde-1.3.9-SNAPSHOT-jar-with-dependencies.jar;
ADD FILE /tmp/my_py_udf.py;
SELECT
TRANSFORM (lang, text)
USING 'python my_py_udf.py'
AS (lang, text, sentiment, word_count)
FROM tweets
With this query I get an error while closing operators.
If I use only one variable in the Python UDF, the query runs successfully:
text = line.replace('\n',' ')
Could the problem come from the SerDe, or from the split('\t')?
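Here is a small standalone sketch of what I suspect is going on. It assumes (I have not confirmed this against my data) that some rows reach the script with a number of tab-separated fields other than two, for example when the tweet text itself contains a tab or a newline, or when the text is empty. `strict_parse` mirrors the unpacking in my script; `parse_row` is a hypothetical, more tolerant alternative and not anything provided by Hive.

```python
def strict_parse(line):
    """Mirror of the script's parsing: needs exactly two tab-separated fields."""
    lang, text = line.strip().split("\t")
    return lang, text


def parse_row(line):
    """Hypothetical tolerant parser: return (lang, text), or None if malformed.

    Splits on the first tab only, so tabs inside the text do not break the
    unpacking, and strips only the trailing newline so an empty text survives.
    """
    parts = line.rstrip("\n").split("\t", 1)
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


if __name__ == "__main__":
    samples = ["en\tgood day\n", "en\tgood\tday\n", "rest of a tweet\n", "en\t\n"]
    for sample in samples:
        try:
            strict = strict_parse(sample)
        except ValueError as exc:
            strict = "ValueError: %s" % exc
        print(repr(sample), "strict:", strict, "tolerant:", parse_row(sample))
```

If rows like the last three exist in my table, the `(lang, text) = line.split('\t')` line would raise a ValueError and the script would exit early, which might be what Hive reports when closing the operator.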
Can anyone help? I have been stuck on this for the past 10 days and it is getting frustrating...