I am trying to determine the top 10 hashtags in a text file containing tweets in the following format:
USER_79321756 2010-03-05T04:48:05 ÜT: 47.528139,-122.197916 47.528139 -122.197916 Just talkin too for real. Ha.
USER_79321756 2010-03-05T20:25:56 ÜT: 47.528139,-122.197916 47.528139 -122.197916 RT @USER_620cd4b9: @USER_79321756 hey now! Leave me, and my big eyes alone LOL>>lol NO! :*
USER_4659ef22 2010-03-06T05:50:54 ÜT: 40.816206,-73.894429 40.816206 -73.894429 But where's @USER_55e0f4ff?? Hmmm shawty where u at?
USER_064b120e 2010-03-03T18:56:49 ÜT: 34.223957,-118.600448 34.223957 -118.600448 @USER_4a4d09c2 the ludacris one . have you heard it , he got off on that one .
I came up with the following code snippet.
Code:
a = load '/user/lab/pig/full_text_small.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
b = foreach a generate tweet, FLATTEN(TOKENIZE(LOWER(tweet))) as tokens;
c = filter b by STARTSWITH(tokens,'#');
d = group c by tokens;
e = foreach d generate group as tokens, COUNT(c) as cnt;
f = order e by cnt desc;
g = limit f 10;
dump g;
This gives the result shown below.
Result:
(#ff, 55)
(#inhighschool, 25)
...
...
...
...
...
...
(#random, 9)
(#mewithoutyouislike, 7)
I have also included an image of the output.
[Image: output showing the top 10 hashtags]
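However, if I open the text file containing the tweets (full_text_small.txt) in a text editor and search for the hashtag "#ff" (case-insensitive), I get a total of 61 occurrences rather than 55. Similarly, the counts for all the other hashtags in the output differ from the counts obtained with Pig.
In addition, when I use another matching technique (shown below), I get a slightly different result.
Code: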
a = load '/user/lab/pig/full_text_small.txt' AS (id:chararray, ts:chararray, location:chararray, lat:float, lon:float, tweet:chararray);
b = foreach a generate tweet, FLATTEN(TOKENIZE(LOWER(tweet))) as tokens;
c = filter b by tokens MATCHES '#\\s*(\\w+)';
d = group c by tokens;
e = foreach d generate group as tokens, COUNT(c) as cnt;
f = order e by cnt desc;
g = limit f 10;
dump g;
Result:
(#ff, 55)
(#inhighschool, 25)
...
...
...
...
...
...
(#random, 9)
(#realgrandmas, 7)
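Output image of the second snippet:
All hashtags in the output of the two snippets are the same except for the last one.
My questions are as follows: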
Answer 0 (score: 1)
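Here is my theory:
First, the last hashtag probably differs between the two runs because of a tie: both #mewithoutyouislike and #realgrandmas have a count of 7, and it is not determined which hashtag gets higher priority in the sort and the subsequent LIMIT.
Second, you are using TOKENIZE followed by STARTSWITH, so you expect each hashtag to be preceded by a space. When you searched in your text editor, your search probably also matched "#ff" hashtags that have no space in front of them.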
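To make both points concrete, here is a minimal sketch in Python (not Pig) that imitates the pipeline on a few made-up tweets. The delimiter set is only my approximation of Pig's default TOKENIZE delimiters, and the names `count_hashtag_tokens`, `count_substring` and `top_n` are hypothetical helpers, not part of any library. It shows why a token-based count can be lower than an editor's substring search, and how a secondary sort key makes ties deterministic.

```python
import re
from collections import Counter

# Assumption: approximates Pig's default TOKENIZE delimiters
# (space, double quote, comma, parentheses, star).
DELIMITERS = r'[ ",()*]+'


def count_hashtag_tokens(tweets):
    """Count tokens that start with '#', like TOKENIZE followed by STARTSWITH."""
    counts = Counter()
    for tweet in tweets:
        for token in re.split(DELIMITERS, tweet.lower()):
            if token.startswith('#'):
                counts[token] += 1
    return counts


def count_substring(tweets, needle):
    """Count case-insensitive substring hits, like a text editor's search."""
    return sum(tweet.lower().count(needle.lower()) for tweet in tweets)


def top_n(counts, n):
    """Sort by count descending, then by tag, so ties are broken the same way every time."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up sample tweets (not taken from full_text_small.txt).
tweets = [
    "#FF @USER_a thanks",
    "so tired#ff today",   # no space before '#': the token is 'tired#ff'
    "happy friday #ff",
    "#random thoughts",
    "#realgrandmas bake",
]

counts = count_hashtag_tokens(tweets)
print(counts['#ff'])                    # token-based count: 2
print(count_substring(tweets, '#ff'))   # substring count: 3
print(top_n(counts, 2))                 # [('#ff', 2), ('#random', 1)]
```

In the sample, "tired#ff" is found by the substring search but never becomes a token starting with '#', which is the kind of gap that could explain 61 versus 55. The two tags with a count of 1 are tied, and only the secondary key on the tag name fixes which one makes the cut; in Pig, adding a second key to the ORDER statement should have the same effect.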