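I'm trying to adapt the Spark word count example so that word counts are aggregated by some other value (for example, in the case below, words and counts per person, where the person is "VI" or "MO").
I have an RDD that is a list of tuples whose values are lists of tuples: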
from operator import add
reduced_tokens = tokenized.reduceByKey(add)
reduced_tokens.take(2)
This gives me:
[(u'VI', [(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
(u'MO',
[(u'word4', 1),
(u'word4', 1),
(u'word5', 1),
(u'word8', 1),
(u'word10', 1),
(u'word1', 1),
(u'word4', 1),
(u'word6', 1),
(u'word9', 1),
...
)]
I want something like:
[
('VI',
[(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
('MO',
[(u'word4', 58), (u'word8', 2), (u'word9', 23), ...])
]
As in the word count example, I would like to be able to filter out words whose counts fall below some threshold for certain people. Thanks!
Answer 0 (score: 0)
The key you are trying to reduce by is the (name, word) pair, not just the name. So you need a .map step to fix up your data:
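from operator import add

def key_by_name_word(record):
    # assumes each record looks like (name, (word, count))
    name, (word, count) = record
    return (name, word), count

tokenized_by_name_word = tokenized.map(key_by_name_word)
# reduceByKey sums the counts per (name, word) key; a plain reduce would collapse the whole RDD into one value
counts_by_name_word = tokenized_by_name_word.reduceByKey(add)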
This should give you:
[
(('VI', 'word1'), 1),
(('VI', 'word2'), 1),
(('VI', 'word3'), 1),
(('MO', 'word4'), 58),
...
]
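As a Spark-free way to check the idea, here is a minimal sketch that mimics keying by (name, word) and summing per key on a plain Python list. The sample records are made up, and it assumes each record has the shape (name, (word, count)); collections.Counter stands in for the per-key sum.

```python
from collections import Counter

# Hypothetical sample records shaped like (name, (word, count)).
records = [
    ("VI", ("word1", 1)),
    ("MO", ("word4", 1)),
    ("MO", ("word4", 1)),
    ("MO", ("word8", 1)),
]

def key_by_name_word(record):
    # Move the word into the key so counts are summed per (name, word).
    name, (word, count) = record
    return (name, word), count

# Summing the counts for each distinct key is the per-key behaviour
# that a reduce-by-key with addition provides.
counts_by_name_word = Counter()
for key, count in map(key_by_name_word, records):
    counts_by_name_word[key] += count

print(sorted(counts_by_name_word.items()))
```

Because the word is part of the key, the two ("MO", "word4") records are merged while ("MO", "word8") stays separate.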
To get it into exactly the format you mentioned, you can do the following:
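def key_by_name(record):
    # roughly the inverse of key_by_name_word; the pair is wrapped in a list
    # so that reduceByKey(add) concatenates lists instead of flattening tuples
    (name, word), count = record
    return name, [(word, count)]

output = counts_by_name_word.map(key_by_name).reduceByKey(add)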
But it may actually be easier to work with the data in the flat format that counts_by_name_word is in.
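One detail worth checking when regrouping by name: add on two tuples concatenates them into a single flat tuple, while add on two lists keeps each (word, count) pair intact. The sketch below shows the difference on plain Python values and then regroups some made-up flat counts with a defaultdict; it is only an illustration, not Spark code.

```python
from collections import defaultdict
from operator import add

# Hypothetical flat counts keyed by (name, word).
flat_counts = [
    (("VI", "word1"), 1),
    (("MO", "word4"), 58),
    (("MO", "word8"), 2),
]

# Adding bare tuples merges the pairs into one flat tuple.
flattened = add(("word4", 58), ("word8", 2))

# Adding single-element lists keeps each pair as its own item.
paired = add([("word4", 58)], [("word8", 2)])

# Regroup the flat counts into one list of (word, count) pairs per name.
grouped = defaultdict(list)
for (name, word), count in flat_counts:
    grouped[name] += [(word, count)]

print(flattened)
print(paired)
print(dict(grouped))
```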
Answer 1 (score: 0)
For completeness, here is how I solved each part of the question:
Ask 1: Aggregate word counts by some key
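import re

def restructure_data(name_and_freetext):
    name = name_and_freetext[0]
    # replace punctuation and digits with spaces, then split on whitespace
    tokens = re.sub(r'[&|/|\d{4}|\.|\,|\:|\-|\(|\)|\+|\$|\!]', ' ', name_and_freetext[1]).split()
    # emit ((name, token), 1) so the counts can later be reduced by the (name, token) key
    return [((name, token), 1) for token in tokens]

# data is assumed to be a DataFrame with flag, name and item columns
filtered_data = data.filter((data.flag==1)).select('name', 'item')
tokenized = filtered_data.rdd.flatMap(restructure_data)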
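To see what the tokenizer produces without a Spark session, here is a small sketch that runs the same pattern over two made-up rows. Note that the pattern is a character class, so the | characters and {4} inside the brackets are literal: it removes every digit, not only four-digit runs. The rows and the list comprehension standing in for flatMap are assumptions for illustration.

```python
import re

# Pattern copied from the answer, written as a raw string.
PATTERN = r'[&|/|\d{4}|\.|\,|\:|\-|\(|\)|\+|\$|\!]'

def restructure_data(name_and_freetext):
    name, freetext = name_and_freetext
    # Replace the listed punctuation and all digits with spaces, then split.
    tokens = re.sub(PATTERN, ' ', freetext).split()
    return [((name, token), 1) for token in tokens]

# Hypothetical rows standing in for the (name, item) records.
rows = [
    ("VI", "apples & pears, 2019"),
    ("MO", "fresh-cut (organic)!"),
]

# Apply the function to each row and flatten the resulting lists,
# which is the shape a flat-map step would produce.
tokenized = [pair for row in rows for pair in restructure_data(row)]

print(tokenized)
```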
Ask 2: Filter out word counts below some threshold:
from operator import add

# keep words which have counts >= 5
counts_by_name_word = tokenized.reduceByKey(add).filter(lambda x: x[1] >= 5)
# map filtered word counts into a list by key so we can sort them
restruct = counts_by_name_word.map(lambda x: (x[0][0], [(x[0][1], x[1])]))
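The question asks for a threshold that applies only to some people. One possible variation, sketched here on plain Python data, is to look the minimum count up per name. The min_count_by_name dict, the keep function and the sample counts are all hypothetical.

```python
# Hypothetical flat counts keyed by (name, word).
counts = [
    (("VI", "word1"), 1),
    (("MO", "word4"), 58),
    (("MO", "word8"), 2),
    (("MO", "word9"), 23),
]

# Hypothetical per-name minimum counts; names not listed keep every word.
min_count_by_name = {"MO": 5}

def keep(record):
    # Keep the record if its count reaches the threshold for its name.
    (name, _word), count = record
    return count >= min_count_by_name.get(name, 0)

filtered = [record for record in counts if keep(record)]

print(filtered)
```

A predicate of this shape could presumably be passed to the RDD's filter in place of the fixed lambda, provided the threshold dict is small enough to ship with the closure.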
Bonus: sort the words from most frequent to least frequent
# sort the word counts from most frequent to least frequent words
output = restruct.reduceByKey(add).map(lambda x: (x[0], sorted(x[1], key=lambda y: y[1], reverse=True))).collect()
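The sort itself can be checked locally. This sketch sorts made-up per-name lists by count in descending order; it additionally breaks ties alphabetically by word, which is an assumption added here so the order is deterministic and is not part of the answer's code.

```python
# Hypothetical grouped counts, one list of (word, count) pairs per name.
grouped = {
    "VI": [("word1", 1), ("word3", 1), ("word2", 1)],
    "MO": [("word8", 2), ("word4", 58), ("word9", 23)],
}

def sort_pairs(pairs):
    # Highest count first; equal counts fall back to alphabetical word order.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

output = [(name, sort_pairs(pairs)) for name, pairs in grouped.items()]

print(output)
```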