I have XML that I am trying to parse with Spark code. I have two approaches; if there is any other way, please share it. I will post my answers below.
Answer 0 (score: 0)
Here is the sample code I used. The drawback is that whenever the XML schema changes, the code has to change as well.
import Util.XmlFileUtil
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object xPathReader extends App {

  System.setProperty("hadoop.home.dir", "C:\\hadoop\\winutils")

  val sparkConf = new SparkConf().setAppName("MstarXmlIngestion").setMaster("local[5]")
  val sc = new SparkContext(sparkConf)
  val hiveContext = new HiveContext(sc)
  val fscXmlPath = "C:\\data\\xml"

  // Read the XML files as an RDD[String] with one entry per <book> element
  val xmlRddList = XmlFileUtil.withCharset(sc, fscXmlPath, "UTF-8", "book")

  import hiveContext.implicits._
  val xmlDf = xmlRddList.toDF("xml")
  xmlDf.registerTempTable("TEMPTABLE")

  // Pull fields out of the raw XML with Hive's xpath UDFs; the repeated
  // price nodes are flattened one-per-row via LATERAL VIEW OUTER explode
  hiveContext.sql("select xpath_string(xml,\"/book/@_id\") as BookId, xpath_string(xml,\"/book/description\") as description, CurrID, Price from TEMPTABLE " +
    "LATERAL VIEW OUTER explode(xpath(xml,\"/book/market/price/text()\")) PRICE_LIST as Price " +
    "LATERAL VIEW OUTER explode(xpath(xml,\"/book/currency/price/text()\")) CurrID_LIST as CurrID").show()
}
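For reference, the XPath expressions above assume input shaped roughly like this (a hypothetical sample; tag and attribute names are inferred from the queries, and the values are made up):

<book _id="bk101">
    <description>A sample description</description>
    <market>
        <price>42.50</price>
        <price>39.99</price>
    </market>
    <currency>
        <price>USD</price>
    </currency>
</book>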
The XmlFileUtil helper class (Java)
===================================================
import com.databricks.spark.xml.util.XmlFile;
import org.apache.spark.SparkContext;
import org.apache.spark.rdd.RDD;

/**
 * Utility class to access the private `XmlFile` Scala object from the spark-xml package.
 */
public class XmlFileUtil {
    public static RDD<String> withCharset(SparkContext context, String location, String charset, String rowTag) {
        return XmlFile.withCharset(context, location, charset, rowTag);
    }
}
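The wrapper works because `XmlFile` is only package-private at the Scala level (`private[xml]`); that restriction is not enforced in the compiled bytecode, so a plain Java class can call it even though Scala code outside that package cannot.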
Answer 1 (score: 0)
Sadly, there is no generic solution for all XML files: if your XML schema changes, the code has to change with it.
In my code I used the spark-xml data source for Spark SQL from Databricks; you can find it here.
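For reference, a minimal sketch of loading XML through that data source directly (assuming the spark-xml package is on the classpath; the path and rowTag are placeholders):

// Assuming sc and hiveContext as created in the answers above
// spark-xml infers a schema and loads each <book> element as one row
val booksDf = hiveContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("C:\\data\\xml")
booksDf.printSchema()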
Answer 2 (score: 0)
As mentioned in the question, here is the second approach:
import Util.XmlFileUtil
import com.databricks.spark.xml.XmlReader
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

object XmlSecondApproach extends App {

  System.setProperty("hadoop.home.dir", "C:\\hadoop\\winutils")

  val sparkConf = new SparkConf().setAppName("Second Approach").setMaster("local[5]")
  val sc = new SparkContext(sparkConf)
  val hiveContext = new HiveContext(sc)
  val rawXmlPath = "C:\\data\\xml\\RawXML"

  val objXMLDR = new XmlDataReader
  val rawDf = objXMLDR.getRawXmlDataframe(sc, hiveContext, rawXmlPath, "book")
  rawDf.registerTempTable("tempBookTable")
  rawDf.printSchema()

  hiveContext.udf.register("customFunction", UDFWrappedArrayConverter.checkIfWrappedArray)

  // When a parent node repeats, it is inferred as an array, so the repeated
  // node has to be flattened with LATERAL VIEW OUTER explode
  hiveContext.sql("SELECT book.currency.price as curPrice, " +
    "markET.price as mktPrice from tempBookTable " +
    "LATERAL VIEW OUTER explode(book.market) Mrkt_List as markET").show()
}
// The reader class used above; the HiveContext is passed in explicitly
// because it is not otherwise in scope here
class XmlDataReader {

  def getRawXmlDataframe(sc: SparkContext, hiveContext: HiveContext, xmlPath: String, rowTag: String): DataFrame = {
    // One XML string per row-tag element
    val xmlRddList = XmlFileUtil.withCharset(sc, xmlPath, "UTF-8", rowTag)
    // Have spark-xml infer the schema from the RDD of XML strings
    new XmlReader()
      .withAttributePrefix(Constant.ATTR_PREFIX)
      .withValueTag(Constant.VALUE_TAG)
      .withRowTag(rowTag.toLowerCase)
      .xmlRdd(hiveContext, xmlRddList)
  }
}
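The code above references a Constant object and a UDFWrappedArrayConverter helper that were not included in the answer. A minimal sketch of what they might look like (the two constants mirror spark-xml's default attribute prefix and value tag; the UDF body is a guess based on its name):

import scala.collection.mutable.WrappedArray

object Constant {
  val ATTR_PREFIX = "_"     // prefix spark-xml puts on columns derived from XML attributes
  val VALUE_TAG = "_VALUE"  // column name spark-xml uses for an element's text value
}

object UDFWrappedArrayConverter {
  // True when a column value arrives as a repeated node (an array) rather than a single struct
  val checkIfWrappedArray: Any => Boolean = _.isInstanceOf[WrappedArray[_]]
}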