Reduce a list of (word, count) tuples into aggregates by key

Date: 2017-09-29 16:27:13

Tags: python apache-spark pyspark rdd

I'm trying to take the Spark word-count example and aggregate the word counts by some other value (for example, words and counts by person, where the persons are 'VI' and 'MO' in the case below).

I have an RDD that is a list of tuples whose values are lists of tuples:

from operator import add
reduced_tokens = tokenized.reduceByKey(add)
reduced_tokens.take(2)

This gives me:

[(u'VI', [(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
 (u'MO',
  [(u'word4', 1),
   (u'word4', 1),
   (u'word5', 1),
   (u'word8', 1),
   (u'word10', 1),
   (u'word1', 1),
   (u'word4', 1),
   (u'word6', 1),
   (u'word9', 1),
   ...
 )]

I would like something like:

[
 ('VI',
    [(u'word1', 1), (u'word2', 1), (u'word3', 1)]),
 ('MO',
    [(u'word4', 58), (u'word8', 2), (u'word9', 23), ...])
]

Similar to the word count example here, I would also like to be able to filter out words that fall below a certain count threshold for some of the people. Thanks!

2 answers:

Answer 0 (score: 0)

The key you're trying to reduce by is the (name, word) pair, not just the name, so you need a .map step to fix up the data:

from operator import add

def key_by_name_word(record):
    # re-key each record from (name, (word, count)) to ((name, word), count)
    name, (word, count) = record
    return (name, word), count

tokenized_by_name_word = tokenized.map(key_by_name_word)
counts_by_name_word = tokenized_by_name_word.reduceByKey(add)

This should give you:

[
  (('VI', 'word1'), 1),
  (('VI', 'word2'), 1),
  (('VI', 'word3'), 1),
  (('MO', 'word4'), 58),
  ...
]

To get it into exactly the format you mentioned, you can then do the following:

def key_by_name(record):
    # this is the inverse of key_by_name_word, except the value is wrapped
    # in a single-element list so that reduceByKey(add) concatenates lists
    # of (word, count) pairs rather than flattening tuples together
    (name, word), count = record
    return name, [(word, count)]

output = counts_by_name_word.map(key_by_name).reduceByKey(add)

But it may actually be easier to work with the data in the flat format that counts_by_name_word is in.
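
In that flat ((name, word), count) format, the threshold filtering the question asks about is a single filter step. A minimal sketch, with a threshold of 5 assumed purely for illustration:

# keep only the (name, word) pairs whose count meets the assumed threshold
frequent_counts = counts_by_name_word.filter(lambda x: x[1] >= 5)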

Answer 1 (score: 0)

For completeness, here is how I solved each part of the question:

询问1:通过某个键汇总字数

import re

def restructure_data(name_and_freetext):
    # strip punctuation and digits from the free text, split it into
    # tokens, and emit one ((name, token), 1) record per token
    name = name_and_freetext[0]
    tokens = re.sub(r'[&|/\d.,:\-()+$!]', ' ', name_and_freetext[1]).split()
    return [((name, token), 1) for token in tokens]

# keep only flagged rows, then explode each (name, item) row into tokens
filtered_data = data.filter(data.flag == 1).select('name', 'item')
tokenized = filtered_data.rdd.flatMap(restructure_data)
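
As a quick sanity check, applying restructure_data to a single hypothetical (name, freetext) pair (the sample values here are made up) shows the record shape the later steps rely on:

restructure_data(('MO', 'red, red & blue!'))
# -> [(('MO', 'red'), 1), (('MO', 'red'), 1), (('MO', 'blue'), 1)]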

Ask 2: filter out words whose counts fall below some threshold:

from operator import add

# keep words which have counts >= 5
counts_by_name_word = tokenized.reduceByKey(add).filter(lambda x: x[1] >= 5)

# re-key the filtered word counts by name, with each (word, count) pair
# wrapped in a single-element list so they can be combined and sorted
restruct = counts_by_name_word.map(lambda x: (x[0][0], [(x[0][1], x[1])]))
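
Wrapping each (word, count) pair in a one-element list is what makes the reduceByKey(add) in the bonus step below concatenate rather than sum: in plain Python, add([('word4', 58)], [('word8', 2)]) returns [('word4', 58), ('word8', 2)].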

Bonus: sort the words from most frequent to least frequent

# sort each key's word counts from most frequent to least frequent
output = restruct.reduceByKey(add).map(lambda x: (x[0], sorted(x[1], key=lambda y: y[1], reverse=True))).collect()
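
Putting the pieces together, here is a minimal end-to-end sketch against a toy RDD. The SparkContext setup and the sample records are assumptions for illustration, and the count >= 5 filter is omitted so the small toy counts survive:

from operator import add
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# toy records already in the ((name, word), 1) shape produced by Ask 1
tokenized = sc.parallelize([
    (('VI', 'word1'), 1),
    (('MO', 'word4'), 1),
    (('MO', 'word4'), 1),
    (('MO', 'word5'), 1),
])

counts_by_name_word = tokenized.reduceByKey(add)
restruct = counts_by_name_word.map(lambda x: (x[0][0], [(x[0][1], x[1])]))
output = restruct.reduceByKey(add).map(
    lambda x: (x[0], sorted(x[1], key=lambda y: y[1], reverse=True))).collect()
# e.g. [('VI', [('word1', 1)]), ('MO', [('word4', 2), ('word5', 1)])]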