I have some very simple queries that were running fine:
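import org.apache.spark.sql.functions.{col, desc, explode}
// Produce one row per genre, then take the 10 most frequent genre names.
val toplevel_genre = airingAusDF.withColumn("toplevel_genre", explode(col("program.genres")))
val toplevel = toplevel_genre.groupBy("toplevel_genre.name").count().sort(desc("count")).take(10)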
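After migrating the input to Parquet format, I am getting a strange stack trace. The queries still seem to work with the English-language data, but for data in other languages I get this error: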
org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 2023.0 failed 4 times, most recent failure: Lost task 10.3 in stage 2023.0 (TID 17158, ip-172-31-12-157.us-west-2.compute.internal): java.lang.ClassCastException: optional binary element (UTF8) is not a group
at org.apache.parquet.schema.Type.asGroupType(Type.java:202)
at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:131)