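I'm struggling to get my code running in Zeppelin on EMR (emr-5.10.0, Zeppelin 0.7.3, Spark 2.2.0).
The code is simple: it fits a CrossValidator with a RandomForestClassifier on a training DataFrame of roughly 400K samples (about 40K positives and 360K negatives).
When I do a simple training run (e.g. 100 trees with a max depth of 15), everything works, but when I run heavier tests with more values in the ParamGridBuilder, I get an org.apache.thrift.transport.TTransportException, and I have no idea how to track down its cause.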
I'm using a cluster of three c3.8xlarge machines, with the following Spark interpreter settings in Zeppelin:
spark.executor.memory = 15g
spark.yarn.executor.memoryOverhead = 2048
spark.executor.cores = 10
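As a rough sanity check on these settings, the sketch below works out how large each executor's YARN container request is and how many such executors could fit on one node. The node figures (32 vCPUs and 60 GiB of RAM for a c3.8xlarge) are an assumption taken from the published instance specs, not something measured on this cluster, and YARN is normally given less than the full node memory, so the result is an upper bound.

```scala
object ExecutorFootprint {
  // Values from the interpreter settings listed above.
  val executorMemoryMiB: Int = 15 * 1024 // spark.executor.memory = 15g
  val overheadMiB: Int = 2048            // spark.yarn.executor.memoryOverhead = 2048
  val executorCores: Int = 10            // spark.executor.cores = 10

  // Assumption: published c3.8xlarge specs; YARN usually gets less than this.
  val nodeVcpus: Int = 32
  val nodeMemoryMiB: Int = 60 * 1024

  // Memory YARN must reserve per executor: heap plus off-heap overhead.
  val containerMiB: Int = executorMemoryMiB + overheadMiB

  // Executors per node are capped by whichever resource runs out first.
  val byCores: Int = nodeVcpus / executorCores
  val byMemory: Int = nodeMemoryMiB / containerMiB
  val executorsPerNode: Int = math.min(byCores, byMemory)
}

println(s"container request: ${ExecutorFootprint.containerMiB} MiB")
println(s"executors per node (upper bound): ${ExecutorFootprint.executorsPerNode}")
```

Note that none of the three settings touches the driver, whose memory is configured separately.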
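I played with spark.memory.fraction without success. I also tried changing the number of executors by adjusting the three settings above, again without success.
I have a feeling this is a Zeppelin issue, but I can't track down the cause of the exception. I looked through the logs and found no exception other than the TTransportException, which is not helpful by itself.
Any help or hints on how to track down the cause of the exception would be highly appreciated.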
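One thing that can be checked from inside the notebook is how much heap the interpreter JVM actually has. In the stack trace, the failing frames are the Zeppelin server's Thrift client (RemoteInterpreterService$Client.recv_interpret) waiting for a reply from the remote interpreter process, so the state of that process is worth inspecting. The sketch below is only a diagnostic aid; it assumes the Spark driver runs inside the interpreter JVM (as in yarn-client mode), and the `JvmHeap` helper is a hypothetical name, not part of Zeppelin or Spark. It does not confirm that memory is the cause.

```scala
object JvmHeap {
  // Convert a byte count to whole mebibytes.
  def mib(bytes: Long): Long = bytes / (1024L * 1024L)

  // Summarize the heap limits of the JVM this code runs in.
  def report(): String = {
    val rt = Runtime.getRuntime
    s"max=${mib(rt.maxMemory)} MiB, " +
      s"allocated=${mib(rt.totalMemory)} MiB, " +
      s"free=${mib(rt.freeMemory)} MiB"
  }
}

// Run in a Zeppelin paragraph before the heavy cv.fit call.
println(JvmHeap.report())
```

Running this in a paragraph before the fit shows whether the JVM's maximum heap is small compared with the 15g given to each executor.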
Here is the code:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.linalg.Vectors

// Index the categorical "genre" column, skipping rows with unseen labels.
val genreIndexer = new StringIndexer()
  .setInputCol("genre")
  .setOutputCol("genreIndex")
  .setHandleInvalid("skip")

// One-hot encode the genre index.
val genreEncoder = new OneHotEncoder()
  .setInputCol(genreIndexer.getOutputCol)
  .setOutputCol("genreVec")

// Assemble all inputs into a single feature vector.
val featuresAssembler = new VectorAssembler()
  .setInputCols(Array("hourOfDay", "dayOfWeek_number", "dayOfMonth", "genreVec"))
  .setOutputCol("features")

val classifier = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Hyperparameter grid: 2 x 2 = 4 candidate settings.
val paramGrid = new ParamGridBuilder()
  .addGrid(classifier.numTrees, Array(200, 400))
  .addGrid(classifier.maxDepth, Array(10, 20))
  .build()

val pipeline = new Pipeline().setStages(Array(genreIndexer, genreEncoder, featuresAssembler, classifier))

// 3-fold cross-validation over the whole pipeline.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(train_df)
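To put the size of this search in perspective, the sketch below counts how many pipeline fits and trees the grid implies. It uses only the values from the code above and plain Scala (no Spark needed); the node count is a theoretical upper bound for a full binary tree of the given depth, and real trees are usually far smaller. The `GridCost` object is a hypothetical helper for illustration.

```scala
object GridCost {
  // Values copied from the ParamGridBuilder and CrossValidator settings.
  val numTreesGrid: Seq[Int] = Seq(200, 400)
  val maxDepthGrid: Seq[Int] = Seq(10, 20)
  val numFolds: Int = 3

  // Every (numTrees, maxDepth) pair is one candidate setting.
  val combos: Seq[(Int, Int)] =
    for (t <- numTreesGrid; d <- maxDepthGrid) yield (t, d)

  // CrossValidator fits each candidate once per fold,
  // then refits the best candidate on the full training set.
  val cvFits: Int = combos.size * numFolds
  val totalFits: Int = cvFits + 1

  // Trees grown during the cross-validation fits alone
  // (the final refit adds another 200 or 400).
  val cvTrees: Int = combos.map(_._1).sum * numFolds

  // Upper bound on nodes in one tree: a full binary tree of that depth.
  def maxNodes(depth: Int): Long = (1L << (depth + 1)) - 1
}

println(s"pipeline fits: ${GridCost.totalFits}, trees during CV: ${GridCost.cvTrees}")
println(s"max nodes per tree at depth 15: ${GridCost.maxNodes(15)}")
println(s"max nodes per tree at depth 20: ${GridCost.maxNodes(20)}")
```

For comparison, the run that succeeded used 100 trees with a max depth of 15, so the failing grid grows many more trees and allows each of them to be much larger.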
Here is the exception I see in the logs and in Zeppelin:
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:266)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:250)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:373)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:97)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:406)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)