I've been able to run the linear regression example with pyspark under Anaconda on a test cluster. That's cool.
My next step is to make the code more template-like for our analysts. Specifically, I'd like to rewrite the following lambda function as a regular function, so it better matches our current skill level in Python. I've made a number of attempts, but the combination of map, lambda, and numpy.array all at once is confusing.
data = sc.textFile("hdfs://nameservice1:8020/spark_input/linear_regression/lpsa.data")
parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')]))
The entire program is below. Any help is appreciated.
#!/opt/tools/anaconda/bin/python
from pyspark import SparkConf, SparkContext
from pyspark.mllib.regression import LinearRegressionWithSGD
from numpy import array
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("Python - Linear Regression Test")
conf.set("spark.executor.memory", "1g")
sc = SparkContext(conf = conf)
# Load and parse the data
data = sc.textFile("hdfs://nameservice1:8020/spark_input/linear_regression/lpsa.data")
parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')]))
# Build the model
numIterations = 50
model = LinearRegressionWithSGD.train(parsedData, numIterations)
# Evaluate model on training examples and compute training error
valuesAndPreds = parsedData.map(lambda point: (point.item(0), model.predict(point.take(range(1, point.size)))))
MSE = valuesAndPreds.map(lambda (v, p): (v-p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("training Mean Squared Error = " + str(MSE))
Answer 0 (score: 1)
def line_to_float_array(line):
    # swap the comma between label and features for a space, split on spaces,
    # convert each field to float, and wrap the result in a numpy array
    space_separated_line = line.replace(',', ' ')
    string_array = space_separated_line.split(' ')
    float_array = map(float, string_array)   # Python 2: map returns a list
    return array(float_array)

parsedData = map(line_to_float_array, data)
Or, equivalently:
def line_to_float_array(line):
    space_separated_line = line.replace(',', ' ')
    string_array = space_separated_line.split(' ')
    float_array = [float(x) for x in string_array]
    return array(float_array)

parsedData = [line_to_float_array(line) for line in data]
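As a quick illustration (using a made-up input line, not an actual record from lpsa.data), the helper turns one "label,feature feature ..." string into a numpy array:

line_to_float_array("1.0,2.0 3.0")   # -> array([ 1.,  2.,  3.])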
Answer 1 (score: 0)
Amadan's answer is correct within the scope of Python itself, which is what was originally asked. However, when working with an RDD (Resilient Distributed Dataset) in Spark, the implementation looks slightly different, because it is Spark's map function that gets used rather than Python's:
# Declare functions at startup:
if __name__ == "__main__":
    def line_to_float_array(line):
        string_array = line.replace(',', ' ').split(' ')
        float_array = map(float, string_array)
        return array(float_array)

    sc = SparkContext(conf = conf)

    # Load and parse the data
    data = sc.textFile("hdfs://nameservice1:8020/sparkjeb/lpsa.data")
    parsedData = data.map(line_to_float_array)
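Following the same pattern, the remaining lambdas in the original program (the prediction step and the squared-error step) can also be pulled out into named functions and passed to the RDD methods. The sketch below just rearranges the asker's existing code; it hasn't been run against a cluster, and the names label_and_prediction and squared_error are made up here for illustration:

from operator import add

def label_and_prediction(point):
    # first element of the array is the label, the rest are the features
    label = point.item(0)
    features = point.take(range(1, point.size))
    return (label, model.predict(features))

def squared_error(value_and_pred):
    v, p = value_and_pred
    return (v - p) ** 2

valuesAndPreds = parsedData.map(label_and_prediction)
MSE = valuesAndPreds.map(squared_error).reduce(add) / valuesAndPreds.count()

Using operator.add for the reduce step removes the last remaining lambda as well.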