I am trying to read and write Parquet files as RDDs using Spark. I can't use the Spark-Sql-Context in my current application (it needs the Parquet schema as a StructType, and converting that from the Avro schema gives me a cast exception in a few cases).
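For reference, the StructType conversion I mean is along these lines (a rough sketch using spark-avro's SchemaConverters; this is not necessarily the exact code I use, and the schema file path is just an example):

import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
import com.databricks.spark.avro.SchemaConverters

val avroSchema: Schema = new Schema.Parser().parse(new java.io.File("/home/abc.avsc"))
// toSqlType returns a SchemaType; only for a top-level record schema is its dataType a StructType
val structType = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]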
So I am trying to implement and save the Parquet file myself, by configuring AvroParquetOutputFormat/AvroWriteSupport and handing ParquetOutputFormat to Hadoop, in the following way:
def saveAsParquetFile[T <: IndexedRecord](records: RDD[T], path: String)(implicit m: ClassTag[T]) = {
  val keyedRecords: RDD[(Void, T)] = records.map(record => (null, record))
  spark.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)
  val job = Job.getInstance(spark.hadoopConfiguration)
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  AvroParquetOutputFormat.setSchema(job, m.runtimeClass.newInstance().asInstanceOf[IndexedRecord].getSchema())
  keyedRecords.saveAsNewAPIHadoopFile(
    path,
    classOf[Void],
    m.runtimeClass.asInstanceOf[Class[T]],
    classOf[ParquetOutputFormat[T]],
    job.getConfiguration
  )
}
This is the error I get:
Exception in thread "main" java.lang.InstantiationException: org.apache.avro.generic.GenericRecord
I am calling the function as follows:
val file1: RDD[GenericRecord] = sc.parquetFile[GenericRecord]("/home/abc.parquet")
sc.saveAsParquetFile(file1, "/home/abc/")