pyspark:ERROR

Date: 2016-12-02 20:51:01

Tags: python apache-spark pyspark apache-spark-mllib apache-zeppelin

I'm using Zeppelin Notebook with Apache Spark, and I frequently get the following error:

org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:249)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:233)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:269)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:279)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

If I try to run the same code again (just ignoring the error), I get this (top line only):

java.net.SocketException: Broken pipe (Write failed)

Then, if I try to run it a third time (or any time after that), I get this error:

java.net.ConnectException: Connection refused (Connection refused)

If I restart the interpreter in Zeppelin Notebook, then it works (initially), but eventually I end up getting this error again.

This error has occurred at various steps in my process (data cleaning, vectorization, etc.), but the time it comes up most often (by far) is when I fit the model. To give you a better idea of what I'm actually doing and when it usually happens, I'll walk you through my process:

I'm using Apache Spark ML and have done some standard vectorization, weighting, etc. (CountVectorizer, IDF), and I'm building a model on that data.

I use VectorAssembler to create my feature vector, convert it to a dense vector, and convert that back into a DataFrame:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import DenseVector  # Spark 2.x ML vector type
from pyspark.sql import Row

assembler = VectorAssembler(inputCols=["fileSize", "hour", "day", "month", "punct_title", "cap_title", "punct_excerpt", "title_tfidf", "ct_tfidf", "excerpt_tfidf", "regex_tfidf"], outputCol="features")

vector_train = assembler.transform(train_raw).select("Target", "features")
vector_test = assembler.transform(test_raw).select("Target", "features")

# Densify the assembled (mostly sparse) vectors and rename the columns
train_final = vector_train.rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
test_final = vector_test.rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))

train_final_df = sqlContext.createDataFrame(train_final)
test_final_df = sqlContext.createDataFrame(test_final)
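One thing worth flagging about the step above: CountVectorizer/IDF produce sparse vectors, and GBTClassifier accepts them as-is, so the DenseVector conversion may not be needed at all. With ~15k columns, densifying every row multiplies memory use dramatically, and an out-of-memory interpreter process is one plausible cause of the TTransportException (Zeppelin losing contact with a dead interpreter). A rough back-of-the-envelope sketch in plain Python (the row/column counts come from the question; the 1% non-zero fraction is an assumption typical of TF-IDF features):

```python
# Back-of-the-envelope memory estimate (plain Python).
# n_rows and n_cols come from the question; nonzero_frac is an assumption.
n_rows = 5000         # downsampled training examples
n_cols = 15000        # full feature width reported in the question
nonzero_frac = 0.01   # assumed: TF-IDF vectors are typically ~99% zeros

dense_bytes = n_rows * n_cols * 8                       # 8 bytes per float64
# A sparse vector stores roughly 12 bytes per non-zero entry
# (4-byte index + 8-byte value), ignoring object overhead.
sparse_bytes = int(n_rows * n_cols * nonzero_frac * 12)

print(f"dense:  {dense_bytes / 1e6:.0f} MB")   # → dense:  600 MB
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")  # → sparse: 9 MB
```

That is only the raw payload for one copy of the training set; Spark caches, shuffles, and Python/JVM serialization add multiples on top, so skipping the densification is a cheap thing to try.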

So the training set that goes into the model looks like this (the actual dataset has ~15k columns, and my downsampling to ~5k examples was just an attempt to get it to run):

[Row(features=DenseVector([7016.0, 9.0, 16.0, 2.0, 2.0, 4.0, 5.0, 0.0, 0.0, 0.0, ..., 1.315, 0.0, 0.0, ..., 7.235, 0.0, 0.0, ..., 0.0, 0.0]), label=0)]

The next step is fitting the model, which is where the error usually pops up. I've tried both fitting a single model and running CV (with a ParamGrid):

Single model:

from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

gbt = GBTClassifier(labelCol="label", featuresCol="features", maxDepth=8, maxBins=16, maxIter=40)
GBT_model = gbt.fit(train_final_df)

predictions_GBT = GBT_model.transform(test_final_df)
predictions_GBT.cache()
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
auroc = evaluator.evaluate(predictions_GBT, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions_GBT, {evaluator.metricName: "areaUnderPR"})
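A side note on the evaluation step, separate from the crash: BinaryClassificationEvaluator's rawPredictionCol defaults to "rawPrediction", which holds continuous scores; pointing it at the hard 0/1 "prediction" column still runs, but an ROC built from thresholded labels has only one operating point, so the reported AUC is generally lower and less informative. A small pure-Python illustration (the rank-based AUC helper and the toy data here are mine, not from the question):

```python
def auc(labels, scores):
    """Rank-based AUC: probability that a random positive outranks a
    random negative, counting ties as 0.5 (Mann-Whitney U / (n_pos*n_neg))."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]    # continuous model scores
hard = [1 if s >= 0.5 else 0 for s in scores]    # thresholded predictions

print(auc(labels, scores))  # ≈ 0.889 (scores preserve the full ranking)
print(auc(labels, hard))    # ≈ 0.833 (thresholding discards information)
```

With GBTClassifier output, leaving rawPredictionCol at its default (or using the probability column) keeps the ranking information.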

With CV / ParamGrid:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import GBTClassifier

GBT_model = GBTClassifier()

paramGrid = ParamGridBuilder() \
    .addGrid(GBT_model.maxDepth, [2,4]) \
    .addGrid(GBT_model.maxBins, [2,4]) \
    .addGrid(GBT_model.maxIter, [10,20]) \
    .build()

evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", metricName="areaUnderPR")

crossval = CrossValidator(estimator=GBT_model, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5) 

cvModel = crossval.fit(train_final_df)
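It may also help to spell out how much work the CV run above does: every grid combination is trained once per fold, plus one final refit on the full training set with the best combination. So if a single fit is already borderline on memory, cross-validation hits the same wall sooner. A quick count, mirroring the grid above:

```python
# Number of model fits CrossValidator performs for the grid above:
# each parameter combination is trained once per fold, plus one final
# refit on the full training set with the best combination.
grid_sizes = {"maxDepth": 2, "maxBins": 2, "maxIter": 2}  # 2 values each
num_folds = 5

combos = 1
for n in grid_sizes.values():
    combos *= n          # 2 * 2 * 2 = 8 combinations

total_fits = combos * num_folds + 1  # 8 * 5 + 1
print(total_fits)  # → 41
```

Getting a single gbt.fit() to complete reliably before scaling up to 41 fits is probably the cheaper debugging path.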

I know this has something to do with the interpreter, but I can't figure out (a) what I'm doing wrong or (b) how to fix this glitch.

UPDATE: I was asked in the SO Apache Spark chat to provide versions and memory configurations, so I figured I'd post an update here.

Versions:

  • Spark:2.0.1
  • Zeppelin:0.6.2

Memory configuration:

  • I'm running an EMR cluster with a c1.xlarge (7 GiB) instance as my master and an r3.8xlarge (244 GiB) as my core node
  • In Zeppelin, I went in and changed spark.driver.memory to 4g and spark.executor.memory to 128g

After I went in and set these memory configurations in Zeppelin, I ran my code again and still got the same error.
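For what it's worth, a 128g executor request can be rejected or capped by YARN's per-container limit (yarn.scheduler.maximum-allocation-mb) on EMR, and a single giant executor is usually less robust than several medium ones; a small driver also matters if large results are collected back to it. A sketch of the kind of settings one might try in spark-defaults.conf or the Zeppelin Spark interpreter config; every value here is an illustrative assumption to be checked against the cluster's actual YARN limits, not a verified recommendation:

```
# spark-defaults.conf (illustrative values, assuming one large r3.8xlarge
# core node; verify against yarn.scheduler.maximum-allocation-mb on EMR)
spark.driver.memory                 8g
spark.driver.maxResultSize          4g
spark.executor.memory               32g
spark.executor.cores                5
spark.executor.instances            6
spark.yarn.executor.memoryOverhead  4096
```

The YARN ResourceManager UI (or the EMR console) shows whether containers of the requested size were actually granted.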

I've only just started using Spark. Are there other memory configurations I need to set? Are these memory configurations unreasonable?

0 Answers:

No answers yet