Best way to extract a node and all of its children using scala.xml.pull?

Time: 2012-12-02 16:11:39

Tags: scala xmlpullparser scala-xml

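I'm using scala.xml.pull to parse some largish XML files. This works great for event processing, but what I want is for my parser to cough up a small document for specific nodes, and I don't see an easy way to do that, or at least not a "Scala" way.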

I'm thinking of building a seek function like this, which uses the iterator to find an EvElemStart event that matches my tag:

def seek(tag: String) = {
  while (it.hasNext) {
    it.next match {
      case EvElemStart(_, `tag`, _, _) => 

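After that I'm not so sure. Is there an easy way to capture all of the children of this tag into a document, without having to walk through every event that the XMLEventReader pops out?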

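What I'm ultimately looking for is a process that scans the file and emits an XML element (an Elem?) for each instance of a specific tag or set of tags, which I can then handle with ordinary Scala XML processing.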

2 Answers:

Answer 0: (score: 2)

This is what I ended up doing. slurp(tag) seeks the next instance of the tag and returns the complete node tree for that tag.

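import scala.xml.{Elem, MetaData, Node, Text, TopScope}
import scala.xml.pull.{EvElemEnd, EvElemStart, EvText}

// `it` is the XMLEventReader being consumed; it is assumed to be in scope.

// Advance to the next start event for `tag` and return that element's whole subtree.
def slurp(tag: String): Option[Node] = {
  while (it.hasNext) {
    it.next match {
      case EvElemStart(_, `tag`, attrs, _) => return Some(subTree(tag, attrs))
      case _ => // skip everything outside the wanted element
    }
  }
  None
}

// Consume events up to the matching end tag, rebuilding nested elements recursively.
def subTree(tag: String, attrs: MetaData): Node = {
  var children = List[Node]()

  while (it.hasNext) {
    it.next match {
      case EvElemStart(_, t, a, _) =>
        children = children :+ subTree(t, a)
      case EvText(t) =>
        children = children :+ Text(t)
      case EvElemEnd(_, _) =>
        return new Elem(null, tag, attrs, TopScope, children: _*)
      case _ => // comments, entity references and processing instructions are dropped
    }
  }
  null   // only reached if the input ends before the closing tag; shouldn't happen with good XML
}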
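
To get every instance of a tag rather than only the next one, the same idea can be wrapped in an iterator. The following is a minimal sketch, not part of the original answer: `TagIterator` and `TagIteratorDemo` are hypothetical names, the input is read from a string for brevity, and it assumes a Scala / scala-xml version that still ships `scala.xml.pull`.

```scala
import scala.collection.mutable.ListBuffer
import scala.io.Source
import scala.xml.{Elem, MetaData, Node, Text, TopScope}
import scala.xml.pull.{EvElemEnd, EvElemStart, EvText, XMLEventReader}

// Hypothetical helper (not part of scala.xml): yields the subtree of every <tag> element.
class TagIterator(source: Source, tag: String) extends Iterator[Node] {
  private val events = new XMLEventReader(source)
  private var pending: Option[Node] = None // next matching subtree, if already built

  // Rebuild one element from the events that follow its start tag.
  private def subTree(label: String, attrs: MetaData): Node = {
    val children = ListBuffer[Node]()
    while (events.hasNext) {
      events.next() match {
        case EvElemStart(_, l, a, _) => children += subTree(l, a)
        case EvText(t)               => children += Text(t)
        case EvElemEnd(_, _) =>
          return new Elem(null, label, attrs, TopScope, true, children.toList: _*)
        case _ => // other event types are ignored in this sketch
      }
    }
    throw new IllegalStateException("input ended inside <" + label + ">")
  }

  // Scan forward until a matching start tag has been turned into a subtree.
  def hasNext: Boolean = {
    while (pending.isEmpty && events.hasNext) {
      events.next() match {
        case EvElemStart(_, `tag`, attrs, _) => pending = Some(subTree(tag, attrs))
        case _ => // not the tag we are looking for
      }
    }
    pending.isDefined
  }

  def next(): Node = {
    if (!hasNext) throw new NoSuchElementException("no more <" + tag + "> elements")
    val node = pending.get
    pending = None
    node
  }
}

object TagIteratorDemo extends App {
  val xml = "<books><book><title>A</title></book><book><title>B</title></book></books>"
  // Each yielded Node can be handled with the ordinary scala.xml API.
  val titles = new TagIterator(Source.fromString(xml), "book").map(b => (b \ "title").text).toList
  println(titles)
}
```

Buffering the next subtree in `pending` keeps `hasNext` honest: it only reports true once a matching element has actually been found and built.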

Answer 1: (score: 2)

Building on Jim Baldwin's answer, I created an iterator that yields the nodes at a particular level (rather than the nodes with a particular tag):

import scala.io.Source
import scala.xml.parsing.FatalError
import scala.xml.{Elem, MetaData, Node, Text, TopScope}
import scala.xml.pull.{EvElemEnd, EvElemStart, EvText, XMLEventReader}


/**
  * Streaming XML parser which yields Scala XML Nodes.
  *
  * Usage:
  *
  * val it = new StreamingXMLParser(pathToXML, 1)
  *
  * Will give you all book-nodes of
  *
  * <?xml version="1.0" encoding="UTF-8"?>
  * <books>
  *     <book>
  *         <title>A book title</title>
  *     </book>
  *     <book>
  *         <title>Another book title</title>
  *     </book>
  * </books>
  *
  */
class StreamingXMLParser(filename: String, wantedNodeLevel: Int) extends Iterator[Node] {
    val file = Source.fromFile(filename)
    val it = new XMLEventReader(file)
    var currentLevel = 0
    var nextEvent = it.next // peek into next event

    def getNext() = {
        val currentEvent = nextEvent
        nextEvent = it.next
        currentEvent
    }

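    def hasNext = {
        // Skip ahead until the peeked event opens an element at the wanted level,
        // so that hasNext is only true when next can really return a node.
        while (it.hasNext && !(nextEvent.isInstanceOf[EvElemStart] && currentLevel == wantedNodeLevel)) {
            getNext() match {
                case EvElemStart(_, _, _, _) => {
                    currentLevel += 1
                }
                case EvElemEnd(_, _) => {
                    currentLevel -= 1
                }
                case _ => // noop
            }
        }
        it.hasNext
    }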

    def next: Node = {
        if (!hasNext) throw new NoSuchElementException

        getNext() match {
            case EvElemStart(pre, tag, attrs, _) => {
                if (currentLevel == wantedNodeLevel) {
                    currentLevel += 1
                    getElemWithChildren(tag, attrs)
                }
                else {
                    currentLevel += 1
                    next
                }
            }
            case EvElemEnd(_, _) => {
                currentLevel -= 1
                next
            }
            case _ => next
        }
    }

    def getElemWithChildren(tag: String, attrs: MetaData): Node = {
        var children = List[Node]()

        while (it.hasNext) {
            getNext() match {
                case EvElemStart(_, t, a, _) => {
                    currentLevel += 1
                    children = children :+ getElemWithChildren(t, a)
                }
                case EvText(t) => {
                    children = children :+ Text(t)
                }
                case EvElemEnd(_, _) => {
                    currentLevel -= 1
                    return new Elem(null, tag, attrs, TopScope, true, children: _*)
                }
                case _ =>
            }
        }
        throw new FatalError("Failed to parse XML.")
    }
}
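
To pick a value for wantedNodeLevel, it can help to list the level of each element first. The following is a small sketch under the same convention as the class above (the root element is level 0, its children are level 1); `LevelDemo` and `levels` are hypothetical names, and the XML is read from a string rather than a file.

```scala
import scala.io.Source
import scala.xml.pull.{EvElemEnd, EvElemStart, XMLEventReader}

object LevelDemo extends App {
  // Hypothetical helper: pairs every element label with its nesting level (root = 0).
  def levels(source: Source): List[(String, Int)] = {
    var depth = 0
    val found = List.newBuilder[(String, Int)]
    for (event <- new XMLEventReader(source)) event match {
      case EvElemStart(_, label, _, _) =>
        found += ((label, depth)) // record the level before descending into the element
        depth += 1
      case EvElemEnd(_, _) =>
        depth -= 1
      case _ => // text and other events do not change the level
    }
    found.result()
  }

  val xml = "<books><book><title>A book title</title></book></books>"
  println(levels(Source.fromString(xml)))
}
```

For the books example this puts book at level 1, which matches the usage shown in the class comment.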