org.apache.thrift.transport.TTransportException when reading a large JSON file in Zeppelin with Scala

Time: 2016-04-25 08:10:40

Tags: json scala apache-spark apache-zeppelin

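I am trying to read a large JSON file (1.5 GB) using Zeppelin and Scala.

Zeppelin is running on Spark in local mode, installed on Ubuntu in a VM with 10 GB of RAM. I have allocated 8 GB to spark.executor.memory.

My code is as follows: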

val inputFileWeather="/home/shashi/incubator-zeppelin-master/data/ai/weather.json"
val temp=sqlContext.read.json(inputFileWeather)

I get the following error:

org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:241)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:225)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:229)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:229)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

2 answers:

Answer 0 (score: 7)

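The error you are getting is caused by a problem while running the Spark interpreter, so Zeppelin cannot connect to the interpreter process.

You have to check the logs located in /PATH/TO/ZEPPELIN/logs/*.out to find out exactly what happened. You may well see an OOM (OutOfMemoryError) in the interpreter log.

I think 8 GB of executor memory on a 10 GB VM is unreasonable (and how many executors are you launching?). You also have to account for the driver memory.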

Answer 1 (score: 0)

Increase the driver memory in the pyspark interpreter settings, i.e. spark.driver.memory. By default it is 1G.
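
Since both answers point at memory settings, a quick way to see whether a change actually took effect is to ask the interpreter JVM how much heap it may use. The following is only a minimal sketch: the object name HeapCheck is hypothetical, and it relies on plain JDK calls rather than any Zeppelin or Spark API. It assumes that in local mode the driver and the executor share one JVM, so the number printed is roughly the heap available to the whole job.

```scala
// Hypothetical helper (not part of Zeppelin or Spark): reports the maximum
// heap the current JVM may use. Uses only java.lang.Runtime.
object HeapCheck {
  // Maximum heap in megabytes, as reported by the JVM.
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"JVM max heap: $maxHeapMb MB")
}
```

In a Zeppelin Scala paragraph, defining the object and then evaluating HeapCheck.maxHeapMb should be enough. If the value stays near 1 GB after raising spark.driver.memory, the setting has probably not reached the interpreter process, and restarting the interpreter is a reasonable next thing to try.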