How do I "cut and paste" elements between XML nodes in Scala?

Time: 2015-11-16 23:07:34

Tags: xml scala

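I want to take all of the <media> elements from an older file (the 'good' file) and use them to replace the <media> elements in a newer file (the 'bad' file). The two files have the same content (apart from minor editorial changes) and the same format.

I'm new to Scala and am having some trouble working out how to do this swap.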

import scala.xml.{NodeSeq, XML, Node}
import scala.xml.transform._

object cloneImages {
  def main(args: Array[String]) = {

    val articleImageNodes = getImageNodes("data/articles_en-ca_20151022.xml")   // Seq[(id: String, nodes: NodeSeq)]
    val articleIds = articleImageNodes.map{ case (id: String, nodes ) => id }
    val badXml = XML.load("data/articles_en-ca_20151116.xml")     

    // produce 'goodXML' node by 
        // 1) removing all <media> child nodes
        // 2) inserting corresponding <media> node from articleImageNodes

    println("done")
  }

  /**
    * Pulls correct image nodes (with id string) from xml file.  (There 
    * may be multiple image nodes.)
    * @return
    */
  def getImageNodes( file: String): Seq[(String, NodeSeq)] = {

    val goodXML = XML.load( file )
    val articles = goodXML \ "contentitem"
    for (
      a <- articles;
      id <- a.attribute("id").map { _.toString() };
      imgNodes <- Option( a \\ "media" )
    ) yield {
      (id,imgNodes)
    }

  }
}

The XML follows this general format:

<content>
  <contentitem type="article" id="ST00427">
    <metadata>
      <media photographer="" src="https://..." />
      <canonicalurl>http://...</canonicalurl>
      ...
    </metadata>
    <article>...</article>
    ...
  </contentitem>
  ...
</content>

1 Answer:

Answer 0 (score: 2)

I did this with rewrite rules. Suppose we have the bad XML:

val oldXml = <content>
  <contentitem type="article" id="ST00427">
    <metadata>
      <media photographer="" src="https://..." />
      <media photographer="" src="https://..." />
      <media photographer="" src="https://..." />
      <media photographer="" src="https://..." />
      <canonicalurl>http://...</canonicalurl>
    </metadata>
    <article></article>
  </contentitem>
</content>

and the good part:

val goodPart = <metadata>
  <media photographer="" src="https://1" />
  <media photographer="" src="https://2" />
  <media photographer="" src="https://3" />
  <media photographer="" src="https://4" />
  <canonicalurl>http://5</canonicalurl>
</metadata>

Then I wrote two rewrite rules.

A rule that removes all <media> tags:

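import scala.xml.{Elem, Node, NodeSeq}
import scala.xml.transform.{RewriteRule, RuleTransformer}

// Drops every <media> element; all other nodes pass through unchanged.
private def removeMedia() = new RewriteRule {
  override def transform(n: Node): Seq[Node] = n match {
    case e: Elem if e.label == "media" => NodeSeq.Empty
    case v => v
  }
}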

A rule that inserts the new <media> tags:

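// Prepends the good <media> nodes to the children of every <metadata> element.
private def insertNewMedia(goodMedia: NodeSeq) = new RewriteRule {
  override def transform(n: Node): Seq[Node] = n match {
    case Elem(pref, "metadata", attrs, scope, child @ _*) =>
      // minimizeEmpty = true; goodMedia ++: child puts goodMedia before the existing children
      Elem(pref, "metadata", attrs, scope, true, goodMedia ++: child : _*)
    case v => v
  }
}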
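
As an aside, the same insertion can probably be written with Elem.copy, which avoids restating the prefix, attributes and scope. This is only a minimal sketch, assuming scala-xml is on the classpath; the names PrependMediaSketch and prependMedia are hypothetical and not part of the answer's code.

```scala
import scala.xml.{Elem, Node, NodeSeq}
import scala.xml.transform.{RewriteRule, RuleTransformer}

object PrependMediaSketch {
  // Hypothetical variant of insertNewMedia built on Elem.copy.
  def prependMedia(goodMedia: NodeSeq): RewriteRule = new RewriteRule {
    override def transform(n: Node): Seq[Node] = n match {
      case e: Elem if e.label == "metadata" =>
        // New children: the good <media> nodes first, then the existing children.
        e.copy(child = goodMedia ++ e.child)
      case other => other
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data for illustration only.
    val doc = <metadata><canonicalurl>http://...</canonicalurl></metadata>
    val media = <media photographer="" src="https://1"/>
    val out = new RuleTransformer(prependMedia(media)).transform(doc)
    println(out.mkString)
  }
}
```

Elem.copy keeps every field that is not named explicitly, so only the child sequence changes.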

The last step is to apply the rewrite rules using RuleTransformer:

val cleanXml = new RuleTransformer(removeMedia()).transform(oldXml)
val goodXml = new RuleTransformer(insertNewMedia(goodPart \ "media")).transform(cleanXml)

The result is:

<content>
  <contentitem type="article" id="ST00427">
    <metadata><media photographer="" src="https://1"/><media photographer="" src="https://2"/><media photographer="" src="https://3"/><media photographer="" src="https://4"/>                                        
      <canonicalurl>http://...</canonicalurl>
    </metadata>
    <article></article>
  </contentitem>
</content>
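
The example above handles a single <metadata> block. The question has many <contentitem> elements, each with its own id, so the good <media> nodes would need to be matched by id. The following is a rough sketch of one way to do that in a single pass, assuming the good nodes have been collected into a Map keyed by id (for instance by calling .toMap on the result of the question's getImageNodes). The names SwapMediaById, swapMedia, isMedia and withMedia are hypothetical, and the sample XML in main is made up for illustration.

```scala
import scala.xml.{Elem, Node, NodeSeq}
import scala.xml.transform.{RewriteRule, RuleTransformer}

object SwapMediaById {
  // Hypothetical helper: true only for <media> elements.
  private def isMedia(n: Node): Boolean = n match {
    case e: Elem => e.label == "media"
    case _       => false
  }

  // Drops the existing <media> children of one <metadata> element
  // and puts the good ones in front of the remaining children.
  private def withMedia(metadata: Elem, good: NodeSeq): Elem =
    metadata.copy(child = good ++ metadata.child.filterNot(isMedia))

  // For each <contentitem> whose id is a key of goodById, swap the media
  // in its direct <metadata> children; everything else is left untouched.
  def swapMedia(goodById: Map[String, NodeSeq]): RewriteRule = new RewriteRule {
    override def transform(n: Node): Seq[Node] = n match {
      case item: Elem if item.label == "contentitem" =>
        goodById.get((item \ "@id").text) match {
          case Some(good) =>
            item.copy(child = item.child.map {
              case m: Elem if m.label == "metadata" => withMedia(m, good)
              case other                            => other
            })
          case None => item
        }
      case other => other
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data; real input would come from XML.load as in the question.
    val bad =
      <content>
        <contentitem type="article" id="ST00427">
          <metadata>
            <media photographer="" src="https://..." />
            <canonicalurl>http://...</canonicalurl>
          </metadata>
        </contentitem>
      </content>
    val goodById: Map[String, NodeSeq] =
      Map("ST00427" -> (<metadata><media photographer="" src="https://1" /></metadata> \ "media"))

    val fixed = new RuleTransformer(swapMedia(goodById)).transform(bad)
    println(fixed.mkString)
  }
}
```

Items whose id is missing from the map keep their current <media> nodes, which seems the safer default when the two files do not line up exactly.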