I am using the following expression in Scala to transpose rows into columns of a DataFrame:
// Needed when not running in spark-shell (assumes a SparkSession named `spark`):
import org.apache.spark.sql.functions.first
import spark.implicits._

val df = Seq(
("ID-1", "First Name", "Jolly"),
("ID-1", "Middle Name", "Jr"),
("ID-1", "Last Name", "Hudson"),
("ID-2", "First Name", "Kathy"),
("ID-2", "Last Name", "Oliver"),
("ID-3", "Last Name", "Short"),
("ID-3", "Middle Name", "M"),
("ID-4", "First Name", "Denver")
).toDF("ID", "Title", "Values")
df.filter($"Title" isin ("First Name", "Last Name", "Middle Name")).
groupBy("ID").pivot("Title").agg(first($"Values")).
select( $"ID", $"First Name", $"Last Name", $"Middle Name").
show(false)
// +----+----------+---------+-----------+
// |ID |First Name|Last Name|Middle Name|
// +----+----------+---------+-----------+
// |ID-1|Jolly |Hudson |Jr |
// |ID-3|null |Short |M |
// |ID-4|Denver |null |null |
// |ID-2|Kathy |Oliver |null |
// +----+----------+---------+-----------+
The output is as expected, but at the end the job fails with the following exception:
java.lang.IllegalArgumentException: Field "null" does not exist.
Along with the expected output, please help me understand what is causing this exception (a minimal sketch of how this message can be produced follows the error log below).
Here is the error log:
2018-09-12 12:09:54 [Executor task launch worker-1] ERROR o.a.s.e.Executor - Exception in task 15.0 in stage 69.0 (TID 4453)
java.lang.IllegalArgumentException: Field "null" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:233)
at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:233)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:58)
at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:232)
at org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:213)
at gbam.refdata.dataquality.utils.DataQualityRule$class.getColumn(DataQualityRule.scala:147)
at gbam.refdata.dataquality_rules2.VendorpartyAddress.getColumn(VendorpartyAddress.scala:27)
at gbam.refdata.dataquality.utils.DataQualityRule$$anonfun$getMissing$1$1.apply(DataQualityRule.scala:153)
at gbam.refdata.dataquality.utils.DataQualityRule$$anonfun$getMissing$1$1.apply(DataQualityRule.scala:153)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at gbam.refdata.dataquality.utils.DataQualityRule$class.getMissing$1(DataQualityRule.scala:152)
at gbam.refdata.dataquality.utils.DataQualityRule$class.getBreaks(DataQualityRule.scala:156)
at gbam.refdata.dataquality_rules2.VendorpartyAddress.getBreaks(VendorpartyAddress.scala:27)
at gbam.refdata.dataquality_rules2.VendorpartyAddress$$anonfun$4.apply(VendorpartyAddress.scala:103)
at gbam.refdata.dataquality_rules2.VendorpartyAddress$$anonfun$4.apply(VendorpartyAddress.scala:103)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
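From the stack trace, the exception does not come from the pivot itself: it is raised by GenericRowWithSchema.fieldIndex, called from getColumn in DataQualityRule.scala. For reference, here is a minimal sketch of how the same message can be produced; the schema mirrors the pivoted output above, and the idea that a null column name (stringified as "null") reaches Row.fieldIndex is only an assumption about what getColumn might be doing:

import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema of the pivoted DataFrame shown above
val pivotedSchema = StructType(Seq(
  StructField("ID", StringType),
  StructField("First Name", StringType),
  StructField("Last Name", StringType),
  StructField("Middle Name", StringType)
))

// A row like the one for ID-4, where the missing titles become nulls
val row = new GenericRowWithSchema(Array[Any]("ID-4", "Denver", null, null), pivotedSchema)

// fieldIndex resolves a column name against the schema; asking for a name
// that is not in the schema throws IllegalArgumentException. If the name
// itself is null and has been turned into the string "null", the message
// matches the one in the log exactly.
val nullName = s"${null}"   // the string "null"
row.fieldIndex(nullName)    // java.lang.IllegalArgumentException: Field "null" does not exist.

Whether something like this is actually happening inside getColumn is what I am trying to confirm.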