I'm trying to write a simple Scala program that dumps data to a Parquet file on HDFS.
I create an Avro schema, initialize a ParquetWriter with that schema,
map my records to GenericRecords according to the defined schema,
and then try to write them with the Parquet writer.
Unfortunately, I get the following exception when I run the program:
java.lang.ClassCastException: parquet.io.MessageColumnIO cannot be cast to parquet.io.PrimitiveColumnIO
at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.getColumnWriter(MessageColumnIO.java:339)
at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:376)
at parquet.io.ValidatingRecordConsumer.addBinary(ValidatingRecordConsumer.java:211)
at parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:260)
at parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
at parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:142)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:116)
at parquet.hadoop.ParquetWriter.write(ParquetWriter.java:324)
Schema definition:
val avroSchema: Schema = SchemaBuilder.record("event_snapshots").fields()
.requiredString("userid")
.requiredString("event")
.requiredString("firstevent")
.requiredString("lastevent")
.requiredInt("count")
.endRecord()
val parquetSchema = new AvroSchemaConverter().convert(avroSchema)
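For reference, both schema representations can be printed to check that the conversion looks sane (just a small inspection sketch, not part of the actual job):

// Pretty-print the Avro record schema as JSON and the converted Parquet MessageType
println(avroSchema.toString(true))
println(parquetSchema)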
The writer:
val writeSupport = new AvroWriteSupport[GenericRecord](parquetSchema, avroSchema, null)
val blockSize = 256 * 1024 * 1024
val pageSize = 64 * 1024
val writer = new ParquetWriter[GenericRecord](outputDir, writeSupport,
CompressionCodecName.SNAPPY, blockSize,
pageSize, pageSize, false, true, configuration)
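(As a side note, I believe the same setup can also be expressed with AvroParquetWriter, which wires up AvroWriteSupport and the schema conversion internally; a minimal sketch assuming the same parquet-avro version, though I have not verified whether it avoids the exception:)

import parquet.avro.AvroParquetWriter

// Same output path, Avro schema, codec, block size and page size as above;
// AvroParquetWriter builds the AvroWriteSupport internally.
val altWriter = new AvroParquetWriter[GenericRecord](
  outputDir, avroSchema, CompressionCodecName.SNAPPY, blockSize, pageSize)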
Record building and writing:
val recordBuilder = new GenericRecordBuilder(avroSchema)
recordBuilder.set(avroSchema.getField("userid"), userKey)
recordBuilder.set(avroSchema.getField("event"), eventKey)
recordBuilder.set(avroSchema.getField("firstevent"),
dateTimeDateFormat.format(firstEvent))
recordBuilder.set(avroSchema.getField("lastevent"),
dateTimeDateFormat.format(lastEvent))
recordBuilder.set(avroSchema.getField("count"), event.count)
val record = recordBuilder.build()
writer.write(record)
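For completeness, a sketch of how the build/write step is driven and how the writer is closed at the end (the names events and buildRecord are hypothetical; the per-record code is the builder snippet shown above):

// Hypothetical outer loop: build and write one GenericRecord per event,
// then close the writer so the Parquet footer is flushed.
events.foreach { event =>
  val record = buildRecord(event) // the GenericRecordBuilder code shown above
  writer.write(record)
}
writer.close()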
Any ideas?