I'm using Zeppelin Notebooks / Apache Spark, and I frequently get the following error:
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:249)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:233)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:269)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:279)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
If I try to run the same code again (just ignoring the error), I get this (top line only):
java.net.SocketException: Broken pipe (Write failed)
Then, if I try to run it a third time (or any time after that), I get this error:
java.net.ConnectException: Connection refused (Connection refused)
If I restart the interpreter in Zeppelin Notebooks, it works again (at first), but eventually I end up hitting the same error.
This error has occurred at various steps in my process (data cleaning, vectorization, etc.), but the most common point (by far) is when I fit the model. To give you a better idea of what I'm actually doing and when it usually happens, I'll walk you through my process:
I'm using Apache Spark ML, doing some standard vectorization and weighting (CountVectorizer, IDF), and then building a model on that data.
I use VectorAssembler to create my feature vector, convert it to a dense vector, and turn that into a DataFrame:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import DenseVector
from pyspark.sql import Row

assembler = VectorAssembler(inputCols=["fileSize", "hour", "day", "month", "punct_title", "cap_title", "punct_excerpt", "title_tfidf", "ct_tfidf", "excerpt_tfidf", "regex_tfidf"], outputCol="features")
vector_train = assembler.transform(train_raw).select("Target", "features")
vector_test = assembler.transform(test_raw).select("Target", "features")
train_final = vector_train.rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
test_final = vector_test.rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
train_final_df = sqlContext.createDataFrame(train_final)
test_final_df = sqlContext.createDataFrame(test_final)
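One thing I've wondered about the step above (this is my own back-of-the-envelope arithmetic, with a made-up row count, not anything Spark reported): calling toArray() turns the mostly-zero TF-IDF output into DenseVectors, which at ~15k columns inflates memory a lot compared to keeping the assembler's sparse output:

```python
# Rough memory estimate for dense vs. sparse feature storage.
# The row count and density below are hypothetical, for illustration only.
n_cols = 15_000            # ~15k feature columns (from the real dataset)
n_rows = 100_000           # hypothetical row count
bytes_per_double = 8       # each DenseVector element is a 64-bit double

dense_bytes = n_cols * n_rows * bytes_per_double
print(f"dense:  {dense_bytes / 1e9:.1f} GB")   # 12.0 GB for the values alone

# TF-IDF output is mostly zeros; a sparse vector stores only non-zero
# entries (roughly a 4-byte index + an 8-byte value per entry).
nnz_per_row = 150          # hypothetical ~1% density
bytes_per_entry = 12
sparse_bytes = nnz_per_row * n_rows * bytes_per_entry
print(f"sparse: {sparse_bytes / 1e9:.2f} GB")  # 0.18 GB
```

So the densification step alone could plausibly be what's pushing the interpreter process over its memory limit.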
So the training set going into the model looks like this (the actual dataset has ~15k columns; my downsampling to ~5k examples was just an attempt to get it to run):
[Row(features=DenseVector([7016.0, 9.0, 16.0, 2.0, 2.0, 4.0, 5.0, 0.0, 0.0, 0.0, ..., 1.315, 0.0, ..., 7.235, 0.0, ..., 0.0, 0.0]), label=0)]
The next step is fitting the model, which is where the error usually pops up. I've tried both fitting a single model and running CV (w/ ParamGrid):
Single model:
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxDepth=8, maxBins=16, maxIter=40)
GBT_model = gbt.fit(train_final_df)
predictions_GBT = GBT_model.transform(test_final_df)
predictions_GBT.cache()
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
auroc = evaluator.evaluate(predictions_GBT, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions_GBT, {evaluator.metricName: "areaUnderPR"})
With CV / ParamGrid:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import GBTClassifier
GBT_model = GBTClassifier()
paramGrid = ParamGridBuilder() \
.addGrid(GBT_model.maxDepth, [2,4]) \
.addGrid(GBT_model.maxBins, [2,4]) \
.addGrid(GBT_model.maxIter, [10,20]) \
.build()
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", metricName="areaUnderPR")
crossval = CrossValidator(estimator=GBT_model, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
cvModel = crossval.fit(train_final_df)
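For a sense of how heavy that CV call is (again, just my own counting, nothing Spark-specific): the grid above has 2 × 2 × 2 parameter combinations, and with 5 folds that means many separate GBT fits before the final refit:

```python
# Count the model fits CrossValidator will run for the grid above.
max_depth_vals = [2, 4]
max_bins_vals = [2, 4]
max_iter_vals = [10, 20]
num_folds = 5

combos = len(max_depth_vals) * len(max_bins_vals) * len(max_iter_vals)
fits = combos * num_folds
print(combos)  # 8 parameter combinations
print(fits)    # 40 GBT fits, plus 1 final refit on the full training set
```

That's 40+ tree-ensemble fits in one paragraph, which is consistent with the error showing up most often at this step.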
I know this has something to do with the interpreter, but I can't figure out (a) what I'm doing wrong or (b) how to fix this glitch.
UPDATE: I was asked in the SO Apache Spark chat for my versions and memory configurations, so I figured I'd provide an update here.
Versions:
Memory configurations:
After I went in and set these Zeppelin memory configurations, I ran my code again and still hit the same error.
I've only just started using Spark. Are there other memory configurations I still need to set? Are these memory configurations unreasonable?
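For clarity, these are the kinds of settings I mean (the values below are illustrative placeholders, not the ones I actually used and not recommendations). In zeppelin-env.sh, the Zeppelin server heap, the interpreter process heap, and the Spark driver/executor memory are controlled separately:

```shell
# zeppelin-env.sh -- illustrative values only
export ZEPPELIN_MEM="-Xmx4g"        # Zeppelin server JVM heap
export ZEPPELIN_INTP_MEM="-Xmx8g"   # interpreter process JVM heap (the one that crashes)
export SPARK_SUBMIT_OPTIONS="--driver-memory 8g --executor-memory 8g"
```

My understanding is that the interpreter process heap (ZEPPELIN_INTP_MEM) is the one that would matter for the TTransportException, since that's the process the Thrift connection talks to.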