Making concise Python lambda code in PySpark easier to understand

Asked: 2014-03-04 04:31:25

Tags: python, lambda

I've been able to run the linear regression example with pyspark under Anaconda on a test cluster. That's cool.

My next step is to make the code more template-like for our analysts. Specifically, I'd like to rewrite the following lambda function as a regular function so that it better matches our team's current skill level in Python. I've made several attempts, but the combination of map, lambda, and numpy.array all at once is confusing.

data = sc.textFile("hdfs://nameservice1:8020/spark_input/linear_regression/lpsa.data")
parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')]))
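To see what that lambda is doing, here is a minimal, Spark-free sketch of the same transformation applied to a single line; the sample values are illustrative, assuming each line of lpsa.data holds a label, a comma, and then space-separated features:

from numpy import array

# illustrative line, not copied from the real file
sample_line = "-0.43,-1.64 -2.01 -1.86"
parsed = array([float(x) for x in sample_line.replace(',', ' ').split(' ')])
# parsed is a 1-D numpy array: [-0.43, -1.64, -2.01, -1.86]
# element 0 is the label; the remaining elements are the features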

The whole program is below. Any help is appreciated.

#!/opt/tools/anaconda/bin/python

from pyspark import SparkConf, SparkContext
from pyspark.mllib.regression import LinearRegressionWithSGD
from numpy import array

conf = SparkConf()
conf.setMaster("local")
conf.setAppName("Python - Linear Regression Test")
conf.set("spark.executor.memory", "1g")

sc = SparkContext(conf = conf)

# Load and parse the data

data = sc.textFile("hdfs://nameservice1:8020/spark_input/linear_regression/lpsa.data")
parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')]))

# Build the model
numIterations = 50
model = LinearRegressionWithSGD.train(parsedData, numIterations)

# Evaluate model on training examples and compute training error
valuesAndPreds = parsedData.map(lambda point: (point.item(0), model.predict(point.take(range(1, point.size)))))

MSE = valuesAndPreds.map(lambda (v, p): (v-p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("training Mean Squared Error = " + str(MSE))
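(A side note for anyone running this today: the form lambda (v, p): (v-p)**2 relies on Python 2's tuple parameter unpacking, which Python 3 removed (PEP 3113). A sketch of an equivalent that works on both, indexing the tuple instead of unpacking it:)

MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()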

2 Answers:

Answer 0 (score: 1)

def line_to_float_array(line):
    # turn "label,f1 f2 ..." into one space-separated string
    space_separated_line = line.replace(',', ' ')
    string_array = space_separated_line.split(' ')
    # convert every field to a float, then wrap in a numpy array
    float_array = map(float, string_array)
    return array(float_array)

parsedData = map(line_to_float_array, data)

Or, equivalently:

def line_to_float_array(line):
    space_separated_line = line.replace(',', ' ')
    string_array = space_separated_line.split(' ')
    # list comprehension instead of map()
    float_array = [float(x) for x in string_array]
    return array(float_array)

parsedData = [line_to_float_array(line) for line in data]

Answer 1 (score: 0)

Amadan's answer is correct within the scope of Python itself, which is what the original question asked about. However, when working with RDDs (Resilient Distributed Datasets) in Spark, the implementation looks slightly different, because Spark's map method is used rather than Python's:

# Declare functions at startup:
if __name__ == "__main__":
    def line_to_float_array(line):
        string_array = line.replace(',', ' ').split(' ')
        float_array = map(float, string_array)
        return array(float_array)

# (SparkConf setup omitted; same as in the question)
sc = SparkContext(conf = conf)

# Load and parse the data
data = sc.textFile("hdfs://nameservice1:8020/sparkjeb/lpsa.data")
parsedData = data.map(line_to_float_array)
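Neither answer covers the remaining lambdas in the original program, but the same named-function pattern extends to them. The sketch below reuses model and parsedData from the question; the function names point_to_value_and_prediction and squared_error are made up for illustration:

def point_to_value_and_prediction(point):
    # element 0 is the label; the rest are the features fed to the model
    actual = point.item(0)
    predicted = model.predict(point.take(range(1, point.size)))
    return (actual, predicted)

def squared_error(value_and_prediction):
    v, p = value_and_prediction
    return (v - p) ** 2

valuesAndPreds = parsedData.map(point_to_value_and_prediction)
MSE = valuesAndPreds.map(squared_error).reduce(lambda x, y: x + y) / valuesAndPreds.count()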