Error when running random_forest_example.py in Spark with Python

Asked: 2015-07-16 09:25:30

Tags: python apache-spark

I am new to Spark, so I have been working through some of the examples provided in the Spark examples folder. When I try random_forest_example.py, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 3, localhost): java.net.SocketException: Connection reset by peer: socket write error
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(Unknown Source)
    at java.net.SocketOutputStream.write(Unknown Source)
    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.io.BufferedOutputStream.flush(Unknown Source)
    at java.io.DataOutputStream.flush(Unknown Source)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:251)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)

The code I am running is:

from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer
from pyspark.ml.regression import RandomForestRegressor
from pyspark.mllib.evaluation import MulticlassMetrics, RegressionMetrics
from pyspark.mllib.util import MLUtils
from pyspark.sql import Row, SQLContext

"""
A simple example demonstrating a RandomForest Classification/Regression Pipeline.
Run with:
  bin/spark-submit examples/src/main/python/ml/random_forest_example.py
"""


def testClassification(train, test):
    # Train a RandomForest model.
    # Setting featureSubsetStrategy="auto" lets the algorithm choose.
    # Note: Use larger numTrees in practice.

    rf = RandomForestClassifier(labelCol="indexedLabel", numTrees=3, maxDepth=4)

    model = rf.fit(train)
    predictionAndLabels = model.transform(test).select("prediction", "indexedLabel") \
        .map(lambda x: (x.prediction, x.indexedLabel))

    metrics = MulticlassMetrics(predictionAndLabels)
    print("weighted f-measure %.3f" % metrics.weightedFMeasure())
    print("precision %s" % metrics.precision())
    print("recall %s" % metrics.recall())


def testRegression(train, test):
    # Train a RandomForest model.
    # Note: Use larger numTrees in practice.

    rf = RandomForestRegressor(labelCol="indexedLabel", numTrees=3, maxDepth=4)

    model = rf.fit(train)
    predictionAndLabels = model.transform(test).select("prediction", "indexedLabel") \
        .map(lambda x: (x.prediction, x.indexedLabel))

    metrics = RegressionMetrics(predictionAndLabels)
    print("rmse %.3f" % metrics.rootMeanSquaredError)
    print("r2 %.3f" % metrics.r2)
    print("mae %.3f" % metrics.meanAbsoluteError)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print("Usage: random_forest_example", file=sys.stderr)
        exit(1)
    sc = SparkContext(appName="PythonRandomForestExample")
    sqlContext = SQLContext(sc)

    # Load and parse the data file into a dataframe.
    df = MLUtils.loadLibSVMFile(sc, "D:\spark-1.4.0\examples\src\main\python\ml\sample_libsvm_data.txt").toDF()

    # Map labels into an indexed column of labels in [0, numLabels)
    stringIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
    si_model = stringIndexer.fit(df)
    td = si_model.transform(df)
    [train, test] = td.randomSplit([0.7, 0.3])
    testClassification(train, test)
    testRegression(train, test)
    sc.stop()

Checking line by line, I found that the error is produced by this line:

df = MLUtils.loadLibSVMFile(sc, "D:\spark-1.4.0\examples\src\main\python\ml\sample_libsvm_data.txt").toDF()

It seems that something goes wrong in the .toDF() method, but I don't know what is causing it. Can anyone help me figure this out?
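In case it helps, here is a minimal sketch (assuming the same data path, written as a raw string to avoid backslash escapes) that separates the two calls, so the failure can be pinned to either loadLibSVMFile or .toDF():

from pyspark import SparkContext
from pyspark.mllib.util import MLUtils
from pyspark.sql import SQLContext

sc = SparkContext(appName="IsolateToDF")
sqlContext = SQLContext(sc)  # creating an SQLContext is what makes .toDF() available on RDDs
rdd = MLUtils.loadLibSVMFile(sc, r"D:\spark-1.4.0\examples\src\main\python\ml\sample_libsvm_data.txt")
print(rdd.take(1))  # if this works, loading the file is fine
df = rdd.toDF()     # and the failure is in the RDD-to-DataFrame conversion
df.show(1)
sc.stop()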

1 Answer:

Answer (score: 1)

Are you running a Spark instance in the background? If not, you should deploy Spark locally in your script (the setMaster("local") part of the configuration). This snippet is from the official Spark documentation:

from pyspark import SparkConf, SparkContext
conf = (SparkConf()
         .setMaster("local")
         .setAppName("My app")
         .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
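Applied to the example script, the SparkContext creation would become something like this (a sketch; local[2] runs with two worker threads, and the executor memory setting is optional):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[2]")
        .setAppName("PythonRandomForestExample")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

Alternatively, you can leave the script unchanged and pass the master on the command line when submitting:

bin/spark-submit --master local[2] examples/src/main/python/ml/random_forest_example.py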