Question

我是IT领域的新手。我试图使用python在spark shell（pyspark）上执行简单的功能。该文件格式为.txt，其中只包含数字。我用行

上传了它

numbers_rdd = sc.textFile("/users/hadoop/numbers.txt")

我尝试使用

获得平均值

numbers_rdd.mean()

我相信这是因为我需要将textfile中的数据从字符串转换为数字，但我不确定。请告知如何继续。

numbers_rdd.take（5）output

Answer 1

您正在从文本文件中加载数字，其中每行数字都在不同的rdd元素中，因此：

拆分每一行，强制转换为int并展平结果列表，这样你就可以将每个数字作为一个单独的元素：

numbers_rdd.flatMap(lambda x: [int(y) for y in x.split()]).mean()