Scala: Error when adding a new column to an existing Spark DataFrame?

Asked: 2016-09-13 07:51:50

Tags: scala apache-spark dataframe emr apache-zeppelin

I have a DataFrame, df:

          |---itemId----|----Country------------|
          |     11      |     US                |
          |     13      |     France            | 
          |     101     |     France            |   

How can I add a Type column with these values to the same DataFrame:

          |---itemId----|----Country------------|----Type-----|
          |     11      |     US                |    NA       |  
          |     13      |     France            |    EU       |  
          |     101     |     France            |    EU       |
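
For illustration, with the mapping hardcoded the desired output can be produced with when/otherwise (a sketch only; in practice the mapping comes from a lookup map, as in the attempt below):

    import org.apache.spark.sql.functions.{col, when}

    // Hardcoded rule, just to show the target shape of the transformation
    val withType = df.withColumn("Type",
      when(col("Country") === "France", "EU").otherwise("NA"))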

Here is what I tried:

    df: org.apache.spark.sql.DataFrame = [itemId: string, Country: string]

    testMap: scala.collection.Map[String, com.model.PeopleInfo]

    // Broadcast the lookup map to the executors
    val peopleMap = sc.broadcast(testMap)

    // Return the type for a country, falling back to "Unknown Type" when it is blank
    val getTypeFunc: String => String = (country: String) => {
      if (StringUtils.isNotBlank(peopleMap.value(country).getType)) {
        peopleMap.value(country).getType
      } else {
        "Unknown Type"
      }
    }

    val typeFunc = udf(getTypeFunc)

    val newDF = df.withColumn("Type", typeFunc(col("Country")))

But I keep getting this error:

    org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:220)
        at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:205)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:211)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
        at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:207)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
        at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:304)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at

I am using Spark 1.6 on EMR emr-4.3.0 with Zeppelin-Sandbox 0.5.5:

Cluster size = 30, instance type = r3.8xlarge

spark.executor.instances         170
spark.executor.cores             5
spark.driver.memory              219695M
spark.yarn.driver.memoryOverhead 21969
spark.executor.memory            38G
spark.yarn.executor.memoryOverhead 21969
spark.default.parallelism        1856
spark.kryoserializer.buffer.max  512m
spark.sql.hive.convertMetastoreParquet false
spark.hadoop.mapreduce.input.fileinputformat.split.maxsize 33554432

Am I doing something wrong here?
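
An equivalent result can also be produced without a UDF by flattening the map into a small DataFrame and joining on it. A minimal sketch, assuming testMap is keyed by country name (the snippet above looks it up with the Country column) and that sqlContext implicits are available, as they are in Zeppelin:

    import org.apache.spark.sql.functions.{coalesce, col, lit}
    import sqlContext.implicits._

    // Flatten the country -> PeopleInfo map into a two-column lookup DataFrame
    val lookupDF = testMap.toSeq
      .map { case (country, info) => (country, info.getType) }
      .toDF("Country", "Type")

    // A left join keeps every row of df; rows with no match fall back to "Unknown Type"
    val joinedDF = df.join(lookupDF, Seq("Country"), "left_outer")
      .withColumn("Type", coalesce(col("Type"), lit("Unknown Type")))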

0 Answers