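I want to take all the <media> elements from an older file (the 'good' file) and use them to replace the corresponding elements in a newer file (the 'bad' file). The two files have the same content (apart from minor editorial changes) and the same format.
I am new to Scala and am having some trouble working out how to do this swap.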
import scala.xml.{NodeSeq, XML, Node}
import scala.xml.transform._

object cloneImages {
  def main(args: Array[String]) = {
    val articleImageNodes = getImageNodes("data/articles_en-ca_20151022.xml") // Seq[(id: String, nodes: NodeSeq)]
    val articleIds = articleImageNodes.map{ case (id: String, nodes ) => id }
    val badXml = XML.load("data/articles_en-ca_20151116.xml")
    // produce 'goodXML' node by
    // 1) removing all <media> child nodes
    // 2) inserting corresponding <media> node from articleImageNodes
    println("done")
  }

  /**
   * Pulls correct image nodes (with id string) from xml file. (There
   * may be multiple image nodes.)
   * @return
   */
  def getImageNodes( file: String): Seq[(String, NodeSeq)] = {
    val goodXML = XML.load( file )
    val articles = goodXML \ "contentitem"
    for (
      a <- articles;
      id <- a.attribute("id").map { _.toString() };
      imgNodes <- Option( a \\ "media" )
    ) yield {
      (id,imgNodes)
    }
  }
}
The XML follows this general format:
<content>
  <contentitem type="article" id="ST00427">
    <metadata>
      <media photographer="" src="https://..." />
      <canonicalurl>http://...</canonicalurl>
      ...
    </metadata>
    <article>...</article>
    ...
  </contentitem>
  ...
</content>
Answer 0 (score: 2)
I did this with rewrite rules. So, we have the bad XML:
val oldXml = <content>
  <contentitem type="article" id="ST00427">
    <metadata>
      <media photographer="" src="https://..." />
      <media photographer="" src="https://..." />
      <media photographer="" src="https://..." />
      <media photographer="" src="https://..." />
      <canonicalurl>http://...</canonicalurl>
    </metadata>
    <article></article>
  </contentitem>
</content>
and the good part:
val goodPart = <metadata>
  <media photographer="" src="https://1" />
  <media photographer="" src="https://2" />
  <media photographer="" src="https://3" />
  <media photographer="" src="https://4" />
  <canonicalurl>http://5</canonicalurl>
</metadata>
Then I wrote two rewrite rules.
A rule that removes all media tags:
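// Imports needed by both rules (Elem is not imported in the question's code)
import scala.xml.{Elem, Node, NodeSeq}
import scala.xml.transform.{RewriteRule, RuleTransformer}

private def removeMedia() = new RewriteRule {
  override def transform(n: Node): Seq[Node] = n match {
    case e: Elem if e.label == "media" => NodeSeq.Empty // drop every <media> element
    case v => v                                         // leave all other nodes unchanged
  }
}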
A rule that inserts the new media tags:
private def insertNewMedia(goodMedia: NodeSeq) = new RewriteRule {
  override def transform(n: Node): Seq[Node] = n match {
    case Elem(pref, "metadata", attrs, scope, child @ _*) =>
      Elem(pref, "metadata", attrs, scope, true, goodMedia ++: child : _*)
    case v => v
  }
}
The last step is to apply both rules with RuleTransformer:
val cleanXml = new RuleTransformer(removeMedia()).transform(oldXml)
val goodXml = new RuleTransformer(insertNewMedia(goodPart \ "media")).transform(cleanXml)
The result is:
<content>
  <contentitem type="article" id="ST00427">
    <metadata><media photographer="" src="https://1"/><media photographer="" src="https://2"/><media photographer="" src="https://3"/><media photographer="" src="https://4"/>
      <canonicalurl>http://...</canonicalurl>
    </metadata>
    <article></article>
  </contentitem>
</content>
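The example above swaps the media of a single metadata block, whereas the question has many contentitem elements, each with its own id. A minimal sketch of one way to extend the same rewrite-rule idea to that case follows. It assumes Scala 2 with the scala-xml library on the classpath and assumes the good media have already been collected into a map keyed by article id (for instance from the output of getImageNodes). The names ReplaceMediaById, replaceMedia, replaceById, goodById and demo, and the ids A1 and A2, are hypothetical and used only for illustration. It also folds the remove and insert steps into one rule, which is a variation on the answer rather than the answer's own code.

```scala
import scala.xml.{Elem, Node, NodeSeq}
import scala.xml.transform.{RewriteRule, RuleTransformer}

object ReplaceMediaById {

  // Hypothetical helper: in every <metadata> element, drop the existing
  // <media> children and put `goodMedia` in front of the remaining children.
  def replaceMedia(goodMedia: NodeSeq): RewriteRule = new RewriteRule {
    override def transform(n: Node): Seq[Node] = n match {
      case e: Elem if e.label == "metadata" =>
        val kept = e.child.filterNot(_.label == "media")
        e.copy(child = goodMedia ++ kept)
      case other => other
    }
  }

  // Hypothetical helper: for each <contentitem> whose id has an entry in
  // `goodById`, rewrite that article with its own media; leave others as-is.
  def replaceById(goodById: Map[String, NodeSeq]): RewriteRule = new RewriteRule {
    override def transform(n: Node): Seq[Node] = n match {
      case e: Elem if e.label == "contentitem" =>
        goodById.get(e \@ "id") match {
          case Some(media) => new RuleTransformer(replaceMedia(media)).transform(e)
          case None        => e
        }
      case other => other
    }
  }

  // Small made-up input: only article A1 has replacement media.
  def demo(): List[String] = {
    val bad =
      <content>
        <contentitem type="article" id="A1">
          <metadata><media src="bad-1"/><canonicalurl>u1</canonicalurl></metadata>
        </contentitem>
        <contentitem type="article" id="A2">
          <metadata><media src="bad-2"/></metadata>
        </contentitem>
      </content>
    val goodById: Map[String, NodeSeq] = Map("A1" -> <media src="good-1"/>)
    val fixed = NodeSeq.fromSeq(new RuleTransformer(replaceById(goodById)).transform(bad))
    // Collect the src attribute of every <media> element, in document order.
    (fixed \\ "media").map(_ \@ "src").toList
  }

  def main(args: Array[String]): Unit =
    println(demo())
}
```

With the data from the question, goodById could plausibly be built as getImageNodes(...).toMap, and the transformer would be run on badXml instead of the made-up input. Articles whose id is missing from the map are passed through unchanged, which may or may not be the behaviour you want.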