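I'm trying to count the number of tweets with a specific hashtag over a period of time, but I get an error when I try to use the built-in SUM function.
Example: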
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto by count;
X = FOREACH NBLNabilVoto GENERATE group, SUM(data.count);
Error:
<line 22, column 47> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
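Answer 0: (score: 0)
First load the data, then filter it down to the time interval you want to process. Group the records by hashtag, then use the COUNT() function to count the tweets for each hashtag.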
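The steps above could be sketched roughly as follows. This is only an illustration: the schema is copied from the question, the relation names (in_range, by_tag, per_tag) are made up, and the year/month values in the FILTER are hypothetical placeholders for whatever interval you need. It also assumes the count column holds a per-row tweet count, which the question does not state.

```pig
-- Sketch only: schema copied from the question; the interval below is hypothetical.
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float, hashtag:chararray, count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);

-- Keep only the rows inside the time interval of interest.
in_range = FILTER data BY year == 2015 AND month == 6;

-- One group per hashtag; each group carries a bag named in_range.
by_tag = GROUP in_range BY hashtag;

-- COUNT counts the rows in each bag (rows whose first field is null are skipped);
-- SUM adds up the count column of the rows in each bag.
per_tag = FOREACH by_tag GENERATE group AS hashtag, COUNT(in_range) AS num_rows, SUM(in_range.count) AS total;

DUMP per_tag;
```

Whether you want COUNT or SUM depends on whether each row is a single tweet or an already aggregated count.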
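Answer 1: (score: 0)
I'm not sure the code does what you think or want it to do, but the error you're getting is because you are calling SUM on the wrong thing. You need to do this instead (note that the FOREACH also has to iterate over NBLNabilVoto_group, since no relation named NBLNabilVoto is defined):
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);
NBLNabilVoto_count is the name of the bag of tuples inside each group.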
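If it is unclear which bag name is available inside the FOREACH, one way to check is to DESCRIBE the grouped relation. A minimal sketch, reusing the relation names and schema from the question:

```pig
-- Sketch only: same LOAD, FILTER and GROUP as in the question.
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float, hashtag:chararray, count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto_count BY count;

-- Prints the schema of the grouped relation: the group key plus a bag
-- named after the relation that was grouped (NBLNabilVoto_count).
DESCRIBE NBLNabilVoto_group;
```

The bag name shown in that schema is the one to pass to SUM, which is why SUM(data.count) cannot be resolved.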
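Answer 2: (score: 0)
I think you are passing the wrong thing to SUM; you should SUM over NBLNabilVoto_count rather than data. One question, though: why are you using count here?
If you want a total over all of your tweets with the hashtag NBLNabilVoto,
I think the code should look like this: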
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto_count ALL;
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);