PySpark - Top n words from multiple files

Asked: 2017-04-17 06:27:17

Tags: python apache-spark pyspark spark-streaming

I have a Python dictionary:

diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'}

I created an RDD like this:

docNameToText = sc.parallelize(diction)

I need to find the top 2 most frequent words in each document. So the result should look like this:

1.csv, test, is
2.txt, test, that

I am new to PySpark. I know the algorithm, but not how to express it in PySpark. I need to:

- convert the file-to-string mapping => file-to-wordFreq
- arrange wordFreq in non-increasing order of frequency - if two words have the same frequency, arrange them in alphabetical order
- display the top 2

How can I implement this?

1 Answer:

Answer 0 (score: 0)

Just use Counter:

from collections import Counter 

(sc
    .parallelize(diction.items())
    # Split each document's text on whitespace
    .mapValues(lambda s: s.split())
    # Count word occurrences per document
    .mapValues(Counter)
    # Keep the two most common words per document
    .mapValues(lambda c: [x for (x, _) in c.most_common(2)]))
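
Note that Counter.most_common does not break ties alphabetically, which the question asks for. A minimal sketch that sorts by descending count and then by word (the name top2 is just for illustration):

from collections import Counter

# Sort each document's counts by (-frequency, word) so that ties
# are resolved alphabetically, then keep the top two words.
top2 = (sc
    .parallelize(diction.items())
    .mapValues(lambda s: Counter(s.split()))
    .mapValues(lambda c: [w for (w, _) in
                          sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:2]]))

top2.collect()
# [('1.csv', ['test', 'is']), ('2.txt', ['test', 'that'])]
# (the order of the documents themselves may vary)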