spark-avro unable to parse custom avro file

Time: 2015-02-13 13:32:24

Tags: scala apache-spark avro

I am trying to use spark-avro_2.10-0.1 with Spark 1.2.0 to parse a custom avro file compressed with snappy.

I start the spark shell as follows:

bin/spark-shell --driver-memory 16g --driver-cores 8 --executor-memory 32g --executor-cores 16 --jars libs/spark-avro_2.10-0.1.jar,libs/spark-csv-assembly-0.1.1.jar

I can run the example code from spark-avro, but when I try to read my custom avro file I get java.lang.ClassCastException: org.apache.avro.mapred.Pair cannot be cast to org.apache.avro.generic.GenericData$Record, even for a simple operation like count.

I have verified that the avro file itself is fine by running:

java -jar avro-tools-1.7.3.jar getschema part-00011.avro

java -jar avro-tools-1.7.3.jar tojson part-00011.avro 

which both give me what I expect.
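The same check can also be done programmatically with the plain Avro Java API; a minimal sketch (the local file path is a placeholder):

    import java.io.File
    import org.apache.avro.file.DataFileReader
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

    // Open the container file with a generic datum reader; snappy-compressed
    // blocks are handled transparently as long as snappy-java is on the classpath.
    val reader = new DataFileReader[GenericRecord](
      new File("part-00011.avro"), new GenericDatumReader[GenericRecord]())

    println(reader.getSchema.toString(true)) // pretty-printed schema, like getschema
    while (reader.hasNext) {
      println(reader.next())                 // one record per line, like tojson
    }
    reader.close()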

Here are the details of the error I am running into:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_55)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/13 08:25:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc.

scala> // example code, works fine

scala> 

scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext

scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@479d0c7

scala> import com.databricks.spark.avro._
import com.databricks.spark.avro._

scala> val episodes = sqlContext.avroFile("../ptest/episodes.avro")
episodes: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[0] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
PhysicalRDD [title#0,air_date#1,doctor#2], MappedRDD[2] at map at AvroRelation.scala:64

scala> import sqlContext._
import sqlContext._

scala> episodes.select('title).collect()
res0: Array[org.apache.spark.sql.Row] = Array([The Eleventh Hour], [The Doctor's Wife], [Horror of Fang Rock], [An Unearthly Child], [The Mysterious Planet], [Rose], [The Power of the Daleks], [Castrolava])

scala> 

scala> // avro file that does not work

scala> val events = sqlContext.avroFile("../ptest/part-00011.avro")
events: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[6] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
PhysicalRDD [key#3,value#4], MappedRDD[8] at map at AvroRelation.scala:64

scala> events.count
[Stage 1:>                                                                                                            (0 + 0) / 2]15/02/13 08:25:35 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
java.lang.ClassCastException: org.apache.avro.mapred.Pair cannot be cast to org.apache.avro.generic.GenericData$Record
    at com.databricks.spark.avro.AvroRelation$$anonfun$com$databricks$spark$avro$AvroRelation$$createConverter$5.apply(AvroRelation.scala:132)
    at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1.apply(AvroRelation.scala:64)
    at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1.apply(AvroRelation.scala:64)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Any ideas on next steps for troubleshooting, or on other solutions, would be greatly appreciated.

One thing to note is that the schema contains a nested record, as shown below:

{
  "type" : "record",
  "name" : "Pair",
  "namespace" : "org.apache.avro.mapred",
  "fields" : [ {
    "name" : "key",
    "type" : "string",
    "doc" : ""
  }, {
    "name" : "value",
    "type" : {
      "type" : "record",
      "name" : "Event",
      "namespace" : "com.dummy.avro",
      "fields" : [
      { ... rest of file omitted 
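The Pair wrapper in the schema suggests the file was written by an Avro MapReduce job, which would explain the exception: spark-avro's converter expects a plain GenericData.Record at the top level. A possible workaround (untested against this exact file; the path and record types are assumptions) is to bypass spark-avro and read the file through the old-API mapred AvroInputFormat, then unwrap the nested value record:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable

    // Each element is an AvroWrapper around the top-level record
    // (here: the org.apache.avro.mapred.Pair record).
    val raw = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable,
                            AvroInputFormat[GenericRecord]]("../ptest/part-00011.avro")

    // Unwrap the Pair and pull out the nested "value" record (the Event).
    // Note: the input format reuses record objects, so copy fields out
    // before caching or collecting whole records.
    val events = raw.map { case (wrapper, _) =>
      wrapper.datum().get("value").asInstanceOf[GenericRecord]
    }

    events.count()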

0 Answers:

There are no answers.