Spark Avro to Parquet writer

Posted: 2016-07-19 15:28:28

Tags: hadoop apache-spark hdfs avro parquet

Problem: object not serializable

Could you please take a look at how to overcome this issue? I am able to read the records and print them correctly, but when writing the records to Parquet I get:

object not serializable

  Caused by: java.io.NotSerializableException: parquet.avro.AvroParquetWriter
  Serialization stack:
    - object not serializable (class: parquet.avro.AvroParquetWriter, value: parquet.avro.AvroParquetWriter@658e7ead)

Please take a look and let me know whether this is the best approach.

Code: converting Avro records to Parquet

  val records = sc.newAPIHadoopRDD(conf.getConfiguration,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable]).map(x => x._1.datum) // transforms the PairRDD to an RDD

  // Build a schema
  val schema = SchemaBuilder
  .record("x").namespace("x")
  .fields
  .name("x").`type`().stringType().noDefault()
  .endRecord

val parquetWriter = new AvroParquetWriter[GenericRecord](new Path(outPath), schema)

val parquet = new GenericRecordBuilder(schema)

records.foreach { record =>
  val x = record.get("xyz") // the "xyz" field of the Avro record
  parquet.set("x", x)
  parquetWriter.write(parquet.build())
}
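The exception occurs because `AvroParquetWriter` is created on the driver and captured by the `foreach` closure, which Spark then tries to serialize and ship to the executors. One common workaround (a sketch, not tested against this exact setup) is `foreachPartition`, creating the writer on the executor instead; the schema is passed as a JSON string, since `Schema` itself may not be serializable in older Avro versions. The field names (`"x"`, `"xyz"`) and the per-partition part-file naming scheme are assumptions carried over from the question:

```scala
// Sketch: create the writer per partition on the executor so it is never
// serialized. outPath, "x", and "xyz" follow the question; the part-file
// naming scheme is a made-up convention to avoid collisions.
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericRecord, GenericRecordBuilder}
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter

val schemaJson = schema.toString // ship the schema as a plain string

records.foreachPartition { partition =>
  // Everything below runs on the executor, so no writer object has to
  // survive Java serialization.
  val localSchema = new Schema.Parser().parse(schemaJson)
  val writer = new AvroParquetWriter[GenericRecord](
    new Path(s"$outPath/part-${java.util.UUID.randomUUID}.parquet"), localSchema)
  try {
    partition.foreach { record =>
      val builder = new GenericRecordBuilder(localSchema)
      builder.set("x", record.get("xyz"))
      writer.write(builder.build())
    }
  } finally {
    writer.close() // Parquet writes its footer on close
  }
}
```

Closing the writer in `finally` matters: Parquet files are unreadable without the footer written on close.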

3 Answers:

Answer 0 (score: 1)

You can start by reading the Avro into a DataFrame, see https://github.com/databricks/spark-avro

// import needed for the .avro method to be added
import com.databricks.spark.avro._

val sqlContext = new SQLContext(sc)

// The Avro records get converted to Spark types
val df = sqlContext.read.avro("src/test/resources/episodes.avro")

df.registerTempTable("tempTable")
val sat = sqlContext.sql("...") // use lateral view explode here if needed
sat.write.parquet("/tmp/output")

Answer 1 (score: 0)

I'm not sure why you are taking this approach, but I would suggest a different one. If you have the Avro file in an RDD, which it looks like you do, you can create a schema, convert the RDD to a DataFrame, and then write the DataFrame out as Parquet.

var avroDF = sqlContext.createDataFrame(avroRDD,avroSchema)
avroDF
    .write
    .mode(SaveMode.Overwrite)
    .parquet("parquet directory to write file")
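The `avroRDD` and `avroSchema` in this answer are not defined above; a minimal sketch of what they could look like, assuming the single string field `x` from the question (the output path and names are illustrative only):

```scala
// Sketch: build an explicit schema and turn an RDD of Avro GenericRecords
// into a DataFrame. "records" is the RDD from the question; the column
// name "x" and the output path are assumptions.
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical one-column schema matching the "x" field from the question.
val avroSchema = StructType(Seq(StructField("x", StringType, nullable = true)))

// createDataFrame expects an RDD[Row], so map each GenericRecord to a Row.
val avroRDD = records.map(r => Row(String.valueOf(r.get("x"))))

val avroDF = sqlContext.createDataFrame(avroRDD, avroSchema)
avroDF.write.mode(SaveMode.Overwrite).parquet("/tmp/avro-as-parquet") // example path
```

This sidesteps the serialization problem entirely, since the Parquet writing is handled by Spark's own data source machinery rather than a writer object captured in a closure.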

Answer 2 (score: 0)

For some of my complex JSON with nested structures and arrays, I use Hive QL's lateral view explode. Here is an example of flattening complex JSON. It starts out as 10 rows; for some traces I can get 60 rows, for others fewer than 5. It just depends on how it explodes.

val tenj = sqlContext.read.json("file:///home/marksmith/hive/Tenfile.json")

scala> tenj.printSchema
root
 |-- DDIVersion: string (nullable = true)
 |-- EndTimestamp: string (nullable = true)
 |-- Stalls: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Stall: long (nullable = true)
 |    |    |-- StallType: string (nullable = true)
 |    |    |-- TraceTypes: struct (nullable = true)
 |    |    |    |-- ActiveTicket: struct (nullable = true)
 |    |    |    |    |-- Category: string (nullable = true)
 |    |    |    |    |-- Traces: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- EndTime: string (nullable = true)
 |    |    |    |    |    |    |-- ID: string (nullable = true)
 |    |    |    |    |    |    |-- Source: string (nullable = true)
 |    |    |    |    |    |    |-- StartPayload: struct (nullable = true)
 |    |    |    |    |    |    |    |-- SubticketID: string (nullable = true)
 |    |    |    |    |    |    |    |-- TicketID: string (nullable = true)
 |    |    |    |    |    |    |    |-- TicketState: long (nullable = true)
 |    |    |    |    |    |    |-- StartTime: string (nullable = true)

tenj.registerTempTable("ddis")


val sat = sqlContext.sql("""
    select DDIVersion, StallsExp.stall, StallsExp.StallType, at.EndTime, at.ID,
       at.Source, at.StartPayload.SubTicketId, at.StartPayload.TicketID,
       at.StartPayload.TicketState, at.StartTime
    from ddis
      lateral view explode(Stalls) st as StallsExp
      lateral view explode(StallsExp.TraceTypes.ActiveTicket.Traces) at1 as at""")
sat: org.apache.spark.sql.DataFrame = [DDIVersion: string, stall: bigint, StallType: string, EndTime: string, ID: string, Source: string, SubTicketId: string, TicketID: string, TicketState: bigint, StartTime: string]

sat.count
res22: Long = 10

sat.show
+----------+-----+---------+--------------------+---+------+-----------+--------+-----------+--------------------+
|DDIVersion|stall|StallType|             EndTime| ID|Source|SubTicketId|TicketID|TicketState|           StartTime|
+----------+-----+---------+--------------------+---+------+-----------+--------+-----------+--------------------+
|  5.3.1.11|   15|    POPS4|2016-06-08T20:07:...|   | STALL|          0|     777|          1|2016-06-08T20:07:...|
|  5.3.1.11|   14|    POPS4|2016-06-08T20:07:...|   | STALL|          0|     384|          1|2016-06-08T20:06:...|
|  5.3.1.11|   13|    POPS4|2016-06-08T20:07:...|   | STALL|          0|  135792|          1|2016-06-08T20:06:...|
|  5.0.0.28|   26|    POPS4|2016-06-08T20:06:...|   | STALL|          0|     774|          2|2016-06-08T20:03:...|