Question

我有XML，我尝试使用 Scala XML API 。我有XPath查询从XML标签中检索数据。我想从<price>检索<market>代码值，但使用两个属性_id和type。我想用&&写一个条件，以便我为每个价格标签获取一个唯一值，例如其中MARKET _ID = 1 && TYPE = "A"。

供参考查找以下XML：

<publisher>
    <book _id = "0"> 
        <author _id="0">Dev</author>
        <publish_date>24 Feb 1995</publish_date>
        <description>Data Structure - C</description>
        <market _id="0" type="A">
            <price>45.95</price>            
        </market>
        <market _id="0" type="B">
            <price>55.95</price>
        </market>
    </book>
    <book _id="1"> 
        <author _id = "1">Ram</author>
        <publish_date>02 Jul 1999</publish_date>
        <description>Data Structure - Java</description>
        <market _id="1" type="A">
            <price>145.95</price>           
        </market>   
        <market _id="1" type="B">
            <price>155.95</price>           
        </market>
    </book>
</publisher>

以下代码正常运行

import scala.xml._

object XMLtoCSV extends App {

  val xmlLoad = XML.loadFile("C:/Users/sharprao/Desktop/FirstTry.xml")  

  val price = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "0")}) \ "market" filter { _ \ "@_id" exists (_.text == "0")}) \ "price").text  //45.95
  val price1 = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "1")}) \ "market" filter { _ \ "@_id" exists (_.text == "1")}) \ "price").text  //155.95

  println("price = " + price)
  println("price1 = " + price1)
}

输出结果为：

price = 45.9555.95
price1 = 145.95155.95

我上面的代码给了我两个值，因为我无法放入＆amp;＆amp;条件。

除了过滤我可以使用的SCALA功能之外，请提供建议。
也让我知道如何获取所有属性名称。
如果可能，请告诉我从哪里可以阅读所有这些API。

先谢谢。

Answer 1

您可以编写自定义谓词来检查多个属性：

def checkMarket(marketId: String, marketType: String)(node: Node): Boolean = {
  node.attribute("_id").exists(_.text == marketId) &&
  node.attribute("type").exists(_.text == marketType)
}

然后将其用作过滤器：

val price1 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "0"))) \ "market" filter checkMarket("0", "A")) \ "price").text
// 45.95

val price2 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "1"))) \ "market" filter checkMarket("1", "B")) \ "price").text
// 155.95

Answer 2

如果您有兴趣获取数据的CSV文件，那么这就是编写它的方法：

(xmlload \ "book").flatMap { bk =>
  (bk \ "market").flatMap { mkt =>
    (mkt \ "price").map { p =>
      Seq(
        bk \@ "_id",
        mkt \@ "_id",
        mkt \@ "type",
        p.text.toFloat
      )
    }
  }
}.map { cols =>
  cols.mkString("\t")
}.foreach { 
  println
}

它将输出以下内容：

0       0       A       45.95
0       0       B       55.95
1       1       A       145.95
1       1       B       155.95

在编写Scala时要识别的常见模式：大多数flatMap flatMap ... map是否可以重写为for - 理解：

for {
    book <- xmlload \ "book"
    market <- book \ "market"
    price <- market \ "price"
} yield {
  val cols = Seq(
    book \@ "_id",
    market \@ "_id",
    market \@ "type",
    price.text.toFloat
  )
  println(cols.mkString("\t"))
}

Answer 3

我使用Spark和hiveContext，我能够解析xPath。

object xPathReader extends App{

    System.setProperty("hadoop.home.dir","D:\\IBM\\DB\\Hadoop\\winutils")   // Path for my winutils.exe

    val sparkConf = new SparkConf().setAppName("XMLParcing").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)
    val myXmlPath = "D:\\IBM\\DB\\xml"
    val xmlRDDList = XmlFileUtil.withCharset(sc, myXmlPath, "UTF-8", "publisher") //XmlFileUtil - this is a private class in scala hence I created a Java class to use it.

    import hiveContext.implicits._

    val xmlDf = xmlRDDList.toDF("tempXMLTable")
    xmlDf.registerTempTable("tempTable")

    hiveContext.sql("select xpath_string(tempXMLTable,\"/book/@_id\") as BookId, xpath_float(tempXMLTable,\"/book/market[@_id='1' and @type='B']/price\") as Price from tempTable").show()      

    /*  Output
        +------+------+
        |BookId| Price|
        +------+------+
        |     0| 55.95|
        |     1|155.95|
        +------+------+
    */
}

使用具有多个属性的scala-xml API进行解析

3 个答案: