PySpark - Top n words from multiple files

Asked: 2017-04-17 06:27:17

Tags: python apache-spark pyspark spark-streaming

I have a Python dictionary:

diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'}

I created an RDD like this:

docNameToText = sc.parallelize(diction)

I need to find the top 2 most frequent words in each document. So the result should look like this:

1.csv, test, is
2.txt, test, that

I am new to PySpark. I know the algorithm, but not how to express it in PySpark. I need to:

- convert the file-to-string mapping => file-to-wordFreq
- arrange wordFreq in non-increasing order of frequency - if two words have the same frequency, arrange them in alphabetical order
- display the top 2

How can I implement this?

1 Answer:

Answer 0 (score: 0)

Just use Counter:

from collections import Counter 

(sc
    .parallelize(diction.items())
    # Split each document's text on whitespace
    .mapValues(lambda s: s.split())
    # Count word occurrences per document
    .mapValues(Counter)
    # Keep the two most common words per document
    .mapValues(lambda c: [x for (x, _) in c.most_common(2)]))
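
Note that Counter.most_common does not break ties alphabetically, which the question asks for. A minimal sketch that sorts by descending count and then by word (the name top2 is just for illustration):

from collections import Counter

# Sort each document's counts by (-frequency, word) so that ties
# are resolved alphabetically, then keep the top two words.
top2 = (sc
    .parallelize(diction.items())
    .mapValues(lambda s: Counter(s.split()))
    .mapValues(lambda c: [w for (w, _) in
                          sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:2]]))

top2.collect()
# [('1.csv', ['test', 'is']), ('2.txt', ['test', 'that'])]
# (the order of the documents themselves may vary)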