Creating a DataFrame from XML parsed by scalaxb

Time: 2015-09-02 19:07:17

Tags: scala apache-spark scalaxb

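Using the Spark Streaming fileStream method, I can successfully parse XML data that is dropped into a directory, and I can write the resulting RDDs out to text files: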

val fStream = {
  ssc.fileStream[LongWritable, Text, XmlInputFormat](
    WATCHDIR, xmlFilter _, newFilesOnly = false, conf = hadoopConf)
}
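
The snippet above relies on xmlFilter and hadoopConf, which are not shown in the question. For readers following along, here is a minimal sketch of what they might look like. Everything in it is an assumption: the object name StreamSetup, the ".xml" suffix check, and the "<music>" record tags are placeholders, and the two configuration keys are the ones read by the Mahout-style XmlInputFormat, which may or may not be the class used here.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object StreamSetup {
  // Hypothetical filter: accept only files whose name ends in ".xml"
  // (case-insensitive). The real filter in the question is not shown.
  def xmlFilter(path: Path): Boolean =
    path.getName.toLowerCase.endsWith(".xml")

  // Assumption: XmlInputFormat is a Mahout-style input format that reads
  // its record delimiters from these two keys. The tag values are
  // placeholders for whatever element wraps one record in the input.
  val hadoopConf: Configuration = {
    val conf = new Configuration()
    conf.set("xmlinput.start", "<music>")
    conf.set("xmlinput.end", "</music>")
    conf
  }
}
```

With import StreamSetup._ in scope, the fileStream call above should compile as written, provided the assumptions hold.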


fStream.foreachRDD(rdd =>
  if (rdd.count() == 0) {
    logger.info("No files..")
  })

val dStream = fStream.map{ case(x, y) =>
  logger.info("Hello from the dStream")
  logger.info(y.toString)
  scalaxb.fromXML[Music](scala.xml.XML.loadString(y.toString))
}

dStream.foreachRDD(rdd => rdd.saveAsTextFile("file:///tmp/xmlout"))

The problem is that I want to convert the RDDs to DataFrames so that I can register them as temporary tables or call saveAsParquetFile.

This code:

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
dStream.foreachRDD(rdd => rdd.distinct().toDF().printSchema())

results in this error:

java.lang.UnsupportedOperationException: Schema for type scalaxb.DataRecord[scala.Any] is not supported

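I had assumed that, because scalaxb generates case classes for my records, it would be simple for Spark to infer the schema using reflection. That does appear to be what Spark attempts, except that it does not support the scalaxb.DataRecord type. Do any Spark or scalaxb experts have ideas on how to make the scalaxb-generated case classes compatible with Spark?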

BTW, these are the classes generated by scalaxb:

package generated

case class Song(attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
  lazy val title = attributes.get("@title") map { _.as[String] }
  lazy val length = attributes.get("@length") map { _.as[String] }
}

case class Album(song: Seq[generated.Song] = Nil,
  description: String,
  attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
  lazy val title = attributes.get("@title") map { _.as[String] }
}

case class Artist(album: Seq[generated.Album] = Nil,
  attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
  lazy val name = attributes.get("@name") map { _.as[String] }
}

case class Music(artist: Seq[generated.Artist] = Nil)
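
The error message points at the attributes fields, whose type is Map[String, scalaxb.DataRecord[Any]]; Spark's reflection-based inference has no schema for that type. One possible workaround, offered as an untested sketch rather than a confirmed fix, is to copy each parsed Music value into a plain case class that holds only String and Option[String] fields before calling toDF(). SongRow and fromMusic below are hypothetical names and are not part of the generated code; the sketch assumes the generated package and scalaxb are on the classpath.

```scala
import generated.Music

// Hypothetical flat row type (not generated by scalaxb). It contains only
// String and Option[String] fields, which Spark can map to a schema.
// It should be defined at the top level, not inside a method, so that
// Spark can obtain a TypeTag for it.
case class SongRow(
  artistName: Option[String],
  albumTitle: Option[String],
  albumDescription: String,
  songTitle: Option[String],
  songLength: Option[String])

object SongRow {
  // Flatten one parsed Music document into one row per song, reading the
  // XML attributes through the lazy accessors that scalaxb generated.
  def fromMusic(music: Music): Seq[SongRow] =
    for {
      artist <- music.artist
      album  <- artist.album
      song   <- album.song
    } yield SongRow(
      artist.name,
      album.title,
      album.description,
      song.title,
      song.length)
}
```

Inside foreachRDD, something like rdd.flatMap(SongRow.fromMusic).toDF() should then give Spark a type it can infer a schema for, with one row per song. The trade-off is that the nesting of artists, albums, and songs is flattened, and an album without songs produces no rows.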

0 answers:

No answers