I wrote the following code to compute a mean.
My input data looks like:
empname,age,salary
A,10,100
B,20,200
C,30,300
D,40,400
E,50,500
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics
SparkContext.setSystemProperty('spark.executor.memory', '2g')
sc = SparkContext("local", "Simple App")
text_file = sc.textFile("hdfs://localhost:9000/user/trial/emp_data.txt")
parts = text_file.map(lambda l: l.split(","))
p1 = parts.map(lambda p: int(p[2]))
rdd = sc.parallelize([Vectors.dense(p1.collect())])
cStats = Statistics.colStats(rdd)
cStats.mean()
I built an RDD from the salary column and passed it to [Vectors.dense(p1.collect())], but my output is [100.0, 200.0, 300.0, 400.0, 500.0]. It should be 300.
Answer 0 (score: 0)
import re
from statistics import mean
data = '''
a,10,100
b,20,200
c,30,300
d,40,400
e,50,500
'''
li = re.findall(r',(\d+)$',data,re.MULTILINE)
print(mean(map(int,li)))
300.0