I have a few lines of code that preprocess a dataset:
// Map the placeholder status "aktuell nicht ermittelbar" to "normales Verkehrsaufkommen"
val clean_data = resultDf.na.replace("verkehrsstatus", Map("aktuell nicht ermittelbar" -> "normales Verkehrsaufkommen"))
// Derive a "state" column from the cleaned status via a UDF
val datawithudf = clean_data.withColumn("state", udfState()($"verkehrsstatus"))
// Keep the relevant columns and rename "state" back to "verkehrsstatus"
val finaldata = datawithudf.select($"auswertezeit", $"strecke_id", $"state", $"geschwindigkeit", $"coordinates").withColumnRenamed("state", "verkehrsstatus")
finaldata.printSchema()
finaldata.take(2).foreach(println)
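
The definition of udfState is not shown above; for context, here is a minimal sketch of what such a UDF might look like, assuming it simply maps the German status strings to a short label (the cases below are hypothetical):

import org.apache.spark.sql.functions.udf

// Hypothetical sketch only: the real udfState is not shown in the question.
// Assumes it maps the German traffic-status text to a short state label.
def udfState() = udf((verkehrsstatus: String) => verkehrsstatus match {
  case "normales Verkehrsaufkommen" => "normal"
  case _                            => "other"
})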
When I then try to display some sample records from the final DataFrame, I get this error message:
WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 192.168.56.102, executor 0): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
	at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2287)
	at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1417)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2293)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)