I have a DataFrame, df:
|---itemId----|----Country------------|
|     11      |          US           |
|     13      |        France         |
|    101      |        France         |
How can I add a Type column to the same DataFrame, like this:
|---itemId----|----Country------------|----Type-----|
|     11      |          US           |     NA      |
|     13      |        France         |     EU      |
|    101      |        France         |     EU      |
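For reference, the input can be rebuilt in a few lines (a minimal sketch against the Spark 1.6 API; sc and sqlContext are the ones the Zeppelin spark interpreter provides):

// Minimal reproduction of the input DataFrame; both columns are strings
import sqlContext.implicits._

val df = sc.parallelize(Seq(
  ("11", "US"),
  ("13", "France"),
  ("101", "France")
)).toDF("itemId", "Country")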
Here is what I tried:
df: org.apache.spark.sql.DataFrame = [itemId: string, Country: string]
testMap: scala.collection.Map[String,com.model.PeopleInfo]

import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.functions.{col, udf}

// Broadcast the lookup map (keyed by country name) so each executor gets one read-only copy
val peopleMap = sc.broadcast(testMap)

// Resolve a country to its type, falling back when the key is missing or the type is blank
val getTypeFunc: (String => String) = (country: String) => {
  peopleMap.value.get(country) match {
    case Some(info) if StringUtils.isNotBlank(info.getType) => info.getType
    case _ => "Unknown Type"
  }
}

val typefunc = udf(getTypeFunc)
val newDF = df.withColumn("Type", typefunc(col("Country")))
But I keep getting this error:
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:220)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:205)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:211)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:207)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:304)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at
I am using Spark 1.6 on EMR (emr-4.3.0) with Zeppelin-Sandbox 0.5.5:
Cluster size = 30, instance type = r3.8xlarge
spark.executor.instances 170
spark.executor.cores 5
spark.driver.memory 219695M
spark.yarn.driver.memoryOverhead 21969
spark.executor.memory 38G
spark.yarn.executor.memoryOverhead 21969
spark.default.parallelism 1856
spark.kryoserializer.buffer.max 512m
spark.sql.hive.convertMetastoreParquet false
spark.hadoop.mapreduce.input.fileinputformat.split.maxsize 33554432
Am I doing something wrong here?
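For comparison, here is the join-based route I could take instead of a UDF (a rough sketch; it assumes, as above, that testMap is keyed by country name and that getType returns the region string):

// Build a small (Country, Type) lookup DataFrame from the map and
// left-join it onto df, filling misses with the same fallback value
import sqlContext.implicits._

val typeDF = sc.parallelize(
  testMap.toSeq.map { case (country, info) => (country, info.getType) }
).toDF("Country", "Type")

val newDF = df
  .join(typeDF, Seq("Country"), "left_outer")
  .na.fill("Unknown Type", Seq("Type"))

With a left outer join, rows with no matching country come back with a null Type, which na.fill then maps to "Unknown Type", matching the UDF's fallback.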