I have two parquet files, one containing an integer field myField and the other containing a double field myField. When trying to read both files at once:
val basePath = "/path/to/file/"
val fileWithInt = basePath + "intFile.snappy.parquet"
val fileWithDouble = basePath + "doubleFile.snappy.parquet"
val result = spark.sqlContext.read
  .option("mergeSchema", true)
  .option("basePath", basePath)
  .parquet(Seq(fileWithInt, fileWithDouble): _*)
  .select("myField")
I get the following error:
Caused by: org.apache.spark.SparkException: Failed to merge fields 'myField' and 'myField'. Failed to merge incompatible data types IntegerType and DoubleType
When passing an explicit schema:
val schema = StructType(Seq(new StructField("myField", IntegerType)))
val result = spark.sqlContext.read
  .schema(schema)
  .option("mergeSchema", true)
  .option("basePath", basePath)
  .parquet(Seq(fileWithInt, fileWithDouble): _*)
  .select("myField")
it fails with the following:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
When casting to a double instead:
val schema = StructType(Seq(new StructField("myField", DoubleType)))
I get:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToDouble(Dictionary.java:60)
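A quick sanity check (using the paths defined above) confirms that the two files really do carry different physical types for the column:

spark.read.parquet(fileWithInt).printSchema()    // myField: integer (nullable = true)
spark.read.parquet(fileWithDouble).printSchema() // myField: double (nullable = true)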
Does anyone know a way around this problem, other than reprocessing the source data?
Answer (score: 2)
Depending on the number of files you have to read, you can use one of two approaches:
This one is best for a smaller number of parquet files:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.DoubleType

def merge(spark: SparkSession, paths: Seq[String]): DataFrame = {
  import spark.implicits._
  // Read each file with its own schema, cast the conflicting column
  // to double, then chain the frames together with union.
  paths.par.map { path =>
    spark.read.parquet(path).withColumn("myField", $"myField".cast(DoubleType))
  }.reduce(_.union(_))
}
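A minimal usage sketch, reusing the fileWithInt and fileWithDouble paths from the question:

val merged = merge(spark, Seq(fileWithInt, fileWithDouble))
merged.select("myField").show()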
This approach is better for handling a large number of files, since it keeps the lineage short:
def merge2(spark: SparkSession, paths: Seq[String]): DataFrame = {
  import spark.implicits._
  // Drop down to RDDs and union them in a single step, which keeps the
  // plan flat no matter how many files are involved. Note this assumes
  // myField is the only column in each file (required by .as[Double]).
  spark.sparkContext.union(paths.par.map { path =>
    spark.read.parquet(path)
      .withColumn("myField", $"myField".cast(DoubleType))
      .as[Double]
      .rdd
  }.toList).toDF("myField")
}
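The difference between the two: merge chains binary union calls, so the logical plan grows with the number of files, while merge2 builds a single UnionRDD via SparkContext.union, keeping the lineage short. The trailing .toDF("myField") restores the column name, since converting an RDD[Double] back to a DataFrame would otherwise name the column "value".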