How to compute the mean of data imported from a text file into a dense vector in Python Spark

Asked: 2015-10-24 11:56:52

Tags: python apache-spark

I wrote the following code to compute the mean.

My input data looks like this:

empname,age,salary
a,10,100
b,20,200
c,30,300
d,40,400
e,50,500

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

SparkContext.setSystemProperty('spark.executor.memory', '2g')
sc = SparkContext("local", "Simple App")
text_file = sc.textFile("hdfs://localhost:9000/user/trial/emp_data.txt")
parts = text_file.map(lambda l: l.split(","))
p1 = parts.map(lambda p: int(p[2]))
rdd = sc.parallelize([Vectors.dense(p1.collect())])
cStats = Statistics.colStats(rdd)
cStats.mean()

I built an RDD for the salary column and passed it as [Vectors.dense(p1.collect())], but the output is [100.0, 200.0, 300.0, 400.0, 500.0], whereas it should be 300.
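Statistics.colStats computes one summary value per column of the input vectors. Collecting all five salaries into a single dense vector produces one row with five columns, so each "column mean" is just that single value, which is why the result is [100.0, 200.0, 300.0, 400.0, 500.0]. Below is a minimal sketch of one way to get 300 with colStats, keeping the same file path and variable names as the question and assuming any header row is filtered out first:

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

sc = SparkContext("local", "Simple App")
text_file = sc.textFile("hdfs://localhost:9000/user/trial/emp_data.txt")
parts = text_file.map(lambda l: l.split(","))

# Keep only numeric salary values (this also drops a header row, if any).
salaries = parts.filter(lambda p: p[2].isdigit()).map(lambda p: int(p[2]))

# One single-element vector per row, so "salary" is a single column for colStats.
rows = salaries.map(lambda s: Vectors.dense([s]))
cStats = Statistics.colStats(rows)
print(cStats.mean())  # [ 300.]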

1 Answer:

Answer 0 (score: 0)

import re
from statistics import mean

data = '''
a,10,100
b,20,200
c,30,300
d,40,400
e,50,500
'''

# Capture the number after the final comma on each line (the salary column).
li = re.findall(r',(\d+)$', data, re.MULTILINE)

print(mean(map(int,li)))

300.0
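This answer computes the mean outside Spark entirely, by regex-extracting the last number on each line of the raw text. If the result should stay inside the asker's PySpark pipeline, the numeric RDD of salaries can also be averaged directly (a sketch, assuming p1 holds only the numeric salary values):

print(p1.mean())  # 300.0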