Question: Object not serializable

Could you please take a look at how to overcome this issue? I am able to read and print the records correctly, but when writing the records to Parquet I get:

Object not serializable

Caused by: java.io.NotSerializableException: parquet.avro.AvroParquetWriter
Serialization stack:
- object not serializable (class: parquet.avro.AvroParquetWriter, value: parquet.avro.AvroParquetWriter@658e7ead)

Please take a look and let me know whether this is the best approach.

Code: converting Avro records to Parquet
// Read the Avro records as an RDD
val records = sc.newAPIHadoopRDD(conf.getConfiguration,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable]).map(x => x._1.datum) // transforms the PairRDD to an RDD

// Build a schema
val schema = SchemaBuilder
  .record("x").namespace("x")
  .fields
  .name("x").`type`().stringType().noDefault()
  .endRecord

// Writer and record builder are created on the driver...
val parquetWriter = new AvroParquetWriter[GenericRecord](new Path(outPath), schema)
val parquet = new GenericRecordBuilder(schema)

// ...but referenced inside the foreach closure, which runs on the executors,
// so Spark tries (and fails) to serialize the AvroParquetWriter
records.foreach { keyVal =>
  val x = keyVal._1.datum().get("xyz") // field
  parquet.set("x", x)
  parquetWriter.write(parquet.build())
}
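The exception arises because the AvroParquetWriter (and the GenericRecordBuilder) are created on the driver but captured by the foreach closure, which Spark has to serialize and ship to the executors. A common workaround, not taken from the answers below, is to open one writer per partition inside foreachPartition so nothing non-serializable is captured; this is only a sketch, and the per-partition file naming and re-parsing of the schema from a string are assumptions:

// Sketch: the writer lives entirely inside the partition, so it is never serialized.
// The Avro Schema itself may not be serializable in older Avro versions, so it is
// shipped as a string and re-parsed on the executor.
val schemaString = schema.toString
records.foreachPartition { iter =>
  val partSchema = new org.apache.avro.Schema.Parser().parse(schemaString)
  val writer = new AvroParquetWriter[GenericRecord](
    new Path(s"$outPath/part-${java.util.UUID.randomUUID()}.parquet"), partSchema)
  try {
    val builder = new GenericRecordBuilder(partSchema)
    iter.foreach { keyVal =>
      builder.set("x", keyVal._1.datum().get("xyz"))
      writer.write(builder.build())
    }
  } finally {
    writer.close()
  }
}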
Answer 0 (score: 1)
You can start here to read Avro into a DataFrame: https://github.com/databricks/spark-avro
import org.apache.spark.sql.SQLContext
// import needed for the .avro method to be added
import com.databricks.spark.avro._

val sqlContext = new SQLContext(sc)

// The Avro records get converted to Spark types
val df = sqlContext.read.avro("src/test/resources/episodes.avro")
df.registerTempTable("tempTable")

// Flatten / select what you need here, e.g. with a lateral view explode query
val sat = sqlContext.sql("SELECT * FROM tempTable")
sat.write.parquet("/tmp/output")
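Note that the .avro reader above comes from the databricks spark-avro package, which has to be on the classpath; a possible sbt dependency line is shown below, where the version is an assumption and should be the release matching your Spark and Scala versions:

// build.sbt -- version is an assumption; see the spark-avro README for the right release
libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0"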
Answer 1 (score: 0)
I'm not sure why you are taking this approach, but I would suggest a different one. If you have the Avro file loaded into an RDD, which it looks like you do, you can create a schema, convert the RDD to a DataFrame, and then write the DataFrame out as Parquet (see the sketch after the snippet below).
var avroDF = sqlContext.createDataFrame(avroRDD, avroSchema)
avroDF
  .write
  .mode(SaveMode.Overwrite)
  .parquet("parquet directory to write file")
Answer 2 (score: 0)
For some of my complex JSON with nested structures and arrays, I use Hive QL lateral view explode. Here is an example of flattening a complex JSON. It starts out at 10 rows; for some traces I can get 60 rows, for others fewer than 5. It just depends on how it explodes.
val tenj = sqlContext.read.json("file:///home/marksmith/hive/Tenfile.json")
scala> tenj.printSchema
root
|-- DDIVersion: string (nullable = true)
|-- EndTimestamp: string (nullable = true)
|-- Stalls: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Stall: long (nullable = true)
| | |-- StallType: string (nullable = true)
| | |-- TraceTypes: struct (nullable = true)
| | | |-- ActiveTicket: struct (nullable = true)
| | | | |-- Category: string (nullable = true)
| | | | |-- Traces: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- EndTime: string (nullable = true)
| | | | | | |-- ID: string (nullable = true)
| | | | | | |-- Source: string (nullable = true)
| | | | | | |-- StartPayload: struct (nullable = true)
| | | | | | | |-- SubticketID: string (nullable = true)
| | | | | | | |-- TicketID: string (nullable = true)
| | | | | | | |-- TicketState: long (nullable = true)
| | | | | | |-- StartTime: string (nullable = true)
tenj.registerTempTable("ddis")
val sat = sqlContext.sql("""
  select DDIVersion, StallsExp.stall, StallsExp.StallType, at.EndTime, at.ID,
    at.Source, at.StartPayload.SubTicketId, at.StartPayload.TicketID,
    at.StartPayload.TicketState, at.StartTime
  from ddis
  lateral view explode(Stalls) st as StallsExp
  lateral view explode(StallsExp.TraceTypes.ActiveTicket.Traces) at1 as at""")
sat: org.apache.spark.sql.DataFrame = [DDIVersion: string, stall: bigint, StallType: string, EndTime: string, ID: string, Source: string, SubTicketId: string, TicketID: string, TicketState: bigint, StartTime: string]
sat.count
res22: Long = 10
sat.show
+----------+-----+---------+--------------------+---+------+-----------+--------+-----------+--------------------+
|DDIVersion|stall|StallType| EndTime| ID|Source|SubTicketId|TicketID|TicketState| StartTime|
+----------+-----+---------+--------------------+---+------+-----------+--------+-----------+--------------------+
| 5.3.1.11| 15| POPS4|2016-06-08T20:07:...| | STALL| 0| 777| 1|2016-06-08T20:07:...|
| 5.3.1.11| 14| POPS4|2016-06-08T20:07:...| | STALL| 0| 384| 1|2016-06-08T20:06:...|
| 5.3.1.11| 13| POPS4|2016-06-08T20:07:...| | STALL| 0| 135792| 1|2016-06-08T20:06:...|
| 5.0.0.28| 26| POPS4|2016-06-08T20:06:...| | STALL| 0| 774| 2|2016-06-08T20:03:...|
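Once the nested JSON has been flattened like this, the result can be written out as Parquet exactly as in the earlier answer; the output path below is just a placeholder:

// write the flattened DataFrame as Parquet (path is a placeholder)
sat.write.parquet("/tmp/flattened_output")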