Question

I am trying to count words using a Python program.

from pyspark import SparkContext

sc = SparkContext(appName="Words")
lines = sc.textFile(sys.argv[1], 1)
counts=dict()
words = lines.split(" ")
for word in words:
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

output = counts.collect()
for (word, count) in output:
    print "%s: %i" % (word, count)

sc.stop()

This does not give me the output I want. Can this code be improved in any way?

Answer 1

It looks like you are mixing up Python and Spark.

When you read a file with pyspark.SparkContext.textFile(), you get an RDD of strings. Quoting myself from an answer to a different question:

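All of the operations you want to perform are on the contents of the RDD, the elements of the file. Calling split() on the RDD doesn't make sense, because split() is a string function. What you want to do instead is call split() and the other operations on each record (line in the file) of the RDD. This is exactly what map() does.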
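To make the distinction concrete, here is a minimal plain-Python sketch that uses a list of strings as a stand-in for the RDD. The sample lines are made up for illustration, and no Spark is involved. The collection itself has no split(); only its elements do, which is why the operation has to be applied per record:

```python
# A plain list stands in for the RDD of lines (sample data, not from the question).
lines = ["the quick brown fox", "jumps over the lazy dog"]

# The collection has no split() method; only each string element does.
collection_has_split = hasattr(lines, "split")
element_has_split = hasattr(lines[0], "split")

# Applying split() to every record is what map() does.
# With a real RDD this would be: lines.map(lambda line: line.split(" "))
words_per_line = list(map(lambda line: line.split(" "), lines))

print(collection_has_split, element_has_split)
print(words_per_line)
```

Note that the result still has one entry per line, each entry being a list of words.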

Here is how you can modify your code to count word frequencies using PySpark.

First, we map each word w in each line to a tuple of the form (w, 1). Then we call reduceByKey() and add up the counts for each word.

For example, if the line is "The quick brown fox jumps over the lazy dog", the map step would convert this line into:

[('The', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumps', 1), ('over', 1),
 ('the', 1), ('lazy', 1), ('dog', 1)]

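Since this returns a list of tuples, we call flatMap() so that each tuple is treated as its own record. Another thing to consider here is whether you want the counts to be case-sensitive, and whether there are any punctuation marks or special characters to remove.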
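The difference between mapping and flat-mapping can be sketched in plain Python, with a list standing in for the RDD. The helper name to_pairs and the sample lines are assumptions made for this illustration only:

```python
from itertools import chain

# Sample lines standing in for the RDD (made up for illustration).
lines = ["The quick brown fox", "jumps over"]

def to_pairs(text):
    """Turn one line into a list of (word, 1) tuples."""
    return [(w, 1) for w in text.split()]

# map-like behaviour: one output record per input line,
# so each record is a whole list of tuples.
mapped = [to_pairs(text) for text in lines]

# flatMap-like behaviour: the per-line lists are concatenated,
# so each (word, 1) tuple becomes its own record.
flat_mapped = list(chain.from_iterable(to_pairs(text) for text in lines))

print(len(mapped))       # one record per line
print(len(flat_mapped))  # one record per word
```

With the flattened form, every record is a (key, value) pair, which is the shape reduceByKey() expects.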

After flatMap(), we can call reduceByKey(), which gathers all tuples with the same key (in this case the word) and applies a reduce function to the values (in this case operator.add()).
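As a rough mental model, the effect of reduceByKey() can be imitated in plain Python. The function reduce_by_key below is a hypothetical stand-in written for this sketch, not Spark's implementation; Spark combines values pairwise within and across partitions, which is why the reduce function should be associative and commutative:

```python
from functools import reduce
from operator import add

# (word, 1) records as they would look after the flatMap step (sample data).
pairs = [("the", 1), ("quick", 1), ("the", 1), ("dog", 1), ("the", 1)]

def reduce_by_key(records, func):
    """Group values by key, then fold each group's values with func."""
    grouped = {}
    for key, value in records:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

counts = reduce_by_key(pairs, add)
print(counts)
```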

import sys
from operator import add

from pyspark import SparkContext

sc = SparkContext(appName="Words")
lines = sc.textFile(sys.argv[1], 1)  # this is an RDD of strings, one per line

# counts is an RDD of records of the form (word, count);
# lower() makes the count case-insensitive
counts = lines.flatMap(lambda x: [(w.lower(), 1) for w in x.split()]).reduceByKey(add)

# collect() brings the results into a list in local memory
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

sc.stop()  # stop the Spark context
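If punctuation should not count as part of a word, one option is to replace x.split() with a small tokenizer. The sketch below is an assumption about what "a word" means (lowercase letters, digits and apostrophes); the names WORD_RE and tokenize are hypothetical and should be adjusted to your data:

```python
import re

# Hypothetical definition of a word: a run of letters, digits or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")

def tokenize(line):
    """Lowercase the line and return its words with punctuation stripped."""
    return WORD_RE.findall(line.lower())

tokens = tokenize("The quick, brown fox; the LAZY dog!")
print(tokens)
```

In the PySpark code, the flatMap step would then read lines.flatMap(lambda x: [(w, 1) for w in tokenize(x)]).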

