Parsing XML with Spark - need advice on issues with the different approaches?

Asked: 2017-08-29 09:40:13

Tags: xml scala apache-spark

I have XML that I am trying to parse with Spark code. I have two approaches:

  1. Use com.databricks.spark.xml.XmlReader (a minimal sketch is shown right after this list)
  2. Use HiveContext - build a DataFrame and walk the XML with XPath
  3. If there are any other ways, please share them. I will share my own answers below.
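
A minimal sketch of approach 1, assuming the spark-xml XmlReader API of that era (the app name, path, and row tag below are only placeholders):

import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object XmlReaderApproach extends App {

  // Local placeholder setup; adjust the master, path and row tag for your data.
  val sc = new SparkContext(new SparkConf().setAppName("XmlReaderApproach").setMaster("local[*]"))
  val hiveContext = new HiveContext(sc)

  // Let spark-xml infer the schema: every <book> element becomes one row.
  val booksDf = new XmlReader()
    .withRowTag("book")
    .xmlFile(hiveContext, "C:\\data\\xml")

  booksDf.printSchema()
  booksDf.show()
}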

3 Answers:

Answer 0 (score: 0)

Below is the sample code I used, but the problem is that whenever the schema changes, the code has to change as well.

import Util.XmlFileUtil
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object xPathReader extends App{

    System.setProperty("hadoop.home.dir","C:\\hadoop\\winutils")

    val sparkConf = new SparkConf().setAppName("MstarXmlIngestion").setMaster("local[5]")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)
    val fscXmlPath = "C:\\data\\xml"

    // Read every <book>...</book> element as one raw XML string.
    val xmlRddList = XmlFileUtil.withCharset(sc, fscXmlPath, "UTF-8", "book")

    import hiveContext.implicits._

    // One-column DataFrame holding the raw XML, queried with Hive's xpath UDFs.
    val xmlDf = xmlRddList.toDF("xml")
    xmlDf.registerTempTable("TEMPTABLE")

    // xpath_string extracts single values; xpath returns arrays, which are
    // exploded so that each price/currency value becomes its own row.
    hiveContext.sql("select xpath_string(xml,\"/book/@_id\") as BookId, xpath_string(xml,\"/book/description\") as description, CurrID, Price from TEMPTABLE " +
      "LATERAL VIEW OUTER explode(xpath(xml,\"/book/market/price/text()\")) PRICE_LIST as Price " +
      "LATERAL VIEW OUTER explode(xpath(xml,\"/book/currency/price/text()\")) CurrID_LIST as CurrID").show()
}


And here is the XmlFileUtil class:

===================================================
import com.databricks.spark.xml.util.XmlFile;
import org.apache.spark.SparkContext;
import org.apache.spark.rdd.RDD;

/**
 * Utility class to access private `XmlFile` scala object from spark xml package
 */
public class XmlFileUtil {
    public static RDD<String> withCharset(SparkContext context, String location, String charset, String rowTag) {
        return XmlFile.withCharset(context, location, charset, rowTag);
    }
}
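
For context on the two Hive UDFs used in the query: xpath_string returns the text of the first matching node, while xpath returns an array of strings, which is why it is wrapped in explode to get one row per match. A quick illustration that could be run against the same hiveContext (the inline XML literals are made up):

    // Illustration only: xpath_string -> one string, xpath -> array<string>.
    hiveContext.sql(
      "select xpath_string('<book id=\"1\"><description>d</description></book>', '/book/description') as firstMatch, " +
        "xpath('<book><market><price>10</price><price>20</price></market></book>', '/book/market/price/text()') as allMatches"
    ).show()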

Answer 1 (score: 0)

Sadly, there is no generic solution that works for every XML file. If your XML files change, the code will have to change too.

In my code I used the XML Data Source for Spark SQL from Databricks; you can find it here.
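
For illustration, this is roughly how that data source is used through the DataFrame reader API (assuming a HiveContext named hiveContext, as in the other answers; the path and row tag are placeholders):

// Load the XML straight into a DataFrame; spark-xml infers the schema.
val booksDf = hiveContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("C:\\data\\xml")

booksDf.printSchema()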

Answer 2 (score: 0)

As mentioned above, here is the other approach:

import Util.XmlFileUtil
import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object XmlSecondApproach extends App{

  System.setProperty("hadoop.home.dir","C:\\hadoop\\winutils")

  val sparkConf = new SparkConf().setAppName("Second Approach").setMaster("local[5]")
  val sc = new SparkContext(sparkConf)
  val hiveContext = new HiveContext(sc)
  val rawXmlPath = "C:\\data\\xml\\RawXML"

  val objXMLDR = new XmlDataReader

  // Parse the raw XML into a DataFrame whose schema is inferred by spark-xml.
  val rawDf = objXMLDR.getRawXmlDataframe(sc, hiveContext, rawXmlPath, "book")

  rawDf.registerTempTable("tempBookTable")

  rawDf.printSchema()

  // Custom UDF (class not shown in this post) registered for handling wrapped-array columns.
  hiveContext.udf.register("customFunction", UDFWrappedArrayConverter.checkIfWrappedArray)

  // When a parent node contains repeated children (arrays), the repeated
  // node has to be handled with explode.
  hiveContext.sql("SELECT book.currency.price as curPrice, " +
    "markET.price as mktPrice from tempBookTable " +
    "LATERAL VIEW OUTER explode(book.market) Mrkt_List as markET").show()
}

// Define another scala class. The SQLContext is passed in explicitly here,
// since it is not otherwise in scope inside the class.
class XmlDataReader {

  def getRawXmlDataframe(sc: SparkContext, sqlContext: SQLContext, xmlPath: String, rowTag: String): DataFrame = {

    // Read each <rowTag> element as a raw XML string, then hand the RDD to XmlReader.
    val xmlRddList = XmlFileUtil.withCharset(sc, xmlPath, "UTF-8", rowTag)
    val xmlReader = new XmlReader()
    xmlReader.withAttributePrefix(Constant.ATTR_PREFIX)
      .withValueTag(Constant.VALUE_TAG)
      .withRowTag(rowTag.toLowerCase)
      .xmlRdd(sqlContext, xmlRddList)
  }
}
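
The Constant object referenced above is not included in the post; a plausible definition, assuming it simply mirrors spark-xml's default attribute prefix and value tag, would be:

// Assumed definition (not part of the original post): "_" and "_VALUE" are
// spark-xml's defaults for attributePrefix and valueTag.
object Constant {
  val ATTR_PREFIX = "_"
  val VALUE_TAG = "_VALUE"
}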