I have a Python dictionary:
diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'}
I created an RDD like this:
docNameToText = sc.parallelize(diction)
I need to find the top 2 most frequent words in each document. So the result should look like this:
1.csv, test, is
2.txt, test, that
I'm new to pyspark. I know the algorithm, but I don't know how to do it in pyspark. I need to:
- convert the file-to-string mapping => file-to-wordFreq
- arrange wordFreq in non-increasing order; if two words have the same frequency, arrange them in alphabetical order
- display the top 2
How can I implement this?
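One pitfall with the RDD creation above: iterating a Python dict yields only its keys, so `sc.parallelize(diction)` would distribute the filenames without their text. A quick pure-Python check (no Spark needed) shows the difference between passing the dict and passing `diction.items()`:

```python
diction = {'1.csv': 'this is is a test test test ',
           '2.txt': 'that that was a test test test'}

# Iterating a dict gives keys only -- this is what
# sc.parallelize(diction) would distribute.
print(list(diction))  # ['1.csv', '2.txt']

# diction.items() gives (filename, text) pairs, which is
# what the per-document word count actually needs.
print(list(diction.items()))
```

This is why the answer below parallelizes `diction.items()` rather than the dict itself.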
Answer 0: (score: 0)
Just use Counter:
from collections import Counter
(sc
 .parallelize(diction.items())
 # Split each document's text by whitespace
 .mapValues(lambda s: s.split())
 # Count word frequencies per document
 .mapValues(Counter)
 # Take the two most common words
 .mapValues(lambda c: [x for (x, _) in c.most_common(2)]))
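One caveat: `Counter.most_common` does not break frequency ties alphabetically, so it doesn't quite satisfy the "equal frequencies in alphabetical order" requirement. An explicit sort handles that. Below is a sketch of a per-document helper (`top_two` is a name chosen here for illustration); in Spark it would be applied with `.mapValues(top_two)` in place of the last two `.mapValues` calls above:

```python
from collections import Counter

def top_two(text):
    # Count word frequencies for one document.
    counts = Counter(text.split())
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:2]]

print(top_two('this is is a test test test '))    # ['test', 'is']
print(top_two('that that was a test test test'))  # ['test', 'that']
```

Negating the count in the sort key lets a single ascending `sorted` call give descending frequency and ascending alphabetical order at the same time.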