Go: Decoding only one XML node at a time

Time: 2016-01-23 00:29:00

Tags: xml go

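Looking at the source of the encoding/xml package, all of the unmarshalling logic (which decodes the actual XML nodes and types them) lives in unmarshal, and the only way to invoke it is by calling DecodeElement. However, the unmarshalling logic also inherently searches for the next EndElement. The primary reason seems to be validation. This looks like a major design flaw: what if I have a huge XML file, I am confident in its structure, and I just want to decode a single node at a time so that I can filter the data efficiently? The RawToken() call can be used to get the current tag, which is great, but when you then call DecodeElement() on it, you get an error, because the inevitable unmarshal() call runs into the node in a way that it considers unbalanced.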
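A minimal sketch of the failure described above, assuming a hypothetical `item` element and a hypothetical `decodeFirst` helper (neither is part of the original post). The package documentation says RawToken does not verify that start and end elements match, so the decoder presumably has no record of the open element when DecodeElement later meets its end tag; after Token the same call is expected to succeed.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// item is a hypothetical element type used only for this illustration.
type item struct {
	Name string `xml:"name,attr"`
}

// decodeFirst reads tokens with either RawToken or Token until it sees the
// first start element, then hands that element to DecodeElement and returns
// the decoded value together with whatever error resulted.
func decodeFirst(doc string, raw bool) (item, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		var (
			t   xml.Token
			err error
		)
		if raw {
			t, err = d.RawToken()
		} else {
			t, err = d.Token()
		}
		if err != nil {
			return item{}, err
		}
		if se, ok := t.(xml.StartElement); ok {
			var it item
			err := d.DecodeElement(&it, &se)
			return it, err
		}
	}
}

func main() {
	const doc = `<item name="a">text</item>`

	_, err := decodeFirst(doc, true)
	fmt.Println("RawToken then DecodeElement:", err)

	it, err := decodeFirst(doc, false)
	fmt.Println("Token then DecodeElement:", it.Name, err)
}
```

The only difference between the two runs is which call produced the start element, which suggests the problem lies in mixing RawToken with DecodeElement rather than in DecodeElement itself.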

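In theory I could come across a token that I want to decode, capture the offset, decode the element, seek back to the previous position, and loop, but that would still result in a massive amount of unnecessary processing.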

Isn't there a way to parse just one node at a time?

1 Answer:

Answer 0: (score: 2)

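What you are describing is called XML stream parsing, as done by any SAX parser, for example. Good news: encoding/xml supports it, although it is a bit hidden.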

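What you actually have to do is create an instance of xml.Decoder, passing it an io.Reader. You then use Decoder.Token() to read the input stream until the next valid XML token is found. From there, you can decide what to do next.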

Here is a small example, also available as a gist; you can also run it on the Playground.

package main

import (
    "bytes"
    "encoding/xml"
    "fmt"
)

const (
    book = `<?xml version="1.0" encoding="UTF-8"?>
<book>
  <preface>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</preface>
  <chapter num="1" title="Foo">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</chapter>
  <chapter num="2" title="Bar">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</chapter>
</book>`
)

type Chapter struct {
    Num     int    `xml:"num,attr"`
    Title   string `xml:"title,attr"`
    Content string `xml:",chardata"`
}

func main() {

    // We emulate a file or network stream
    b := bytes.NewBufferString(book)

    // And set up a decoder
    d := xml.NewDecoder(b)

    for {

        // We look for the next token
        // Note that this only reads until the next positively identified
        // XML token in the stream
        t, err := d.Token()

        // Token returns io.EOF at the end of the input; any other
        // error also ends the loop here.
        if err != nil {
            break
        }

        switch et := t.(type) {

        case xml.StartElement:
            // Check whether we are interested in this element;
            // otherwise we simply advance to the next token.
            if et.Name.Local == "chapter" {
                // Most often/likely element first

                var c Chapter

                // Decode the element into c. This consumes the stream up to
                // and including the matching end element, so only this
                // element's subtree is read. Malformed or mismatched
                // content results in an error.
                if err := d.DecodeElement(&c, &et); err != nil {
                    panic(err)
                }

                // We have found what we are interested in, so we print it
                fmt.Printf("%d: %s\n", c.Num, c.Title)

            } else if et.Name.Local == "book" {
                fmt.Println("Book begins!")
            }

        case xml.EndElement:

            if et.Name.Local != "book" {
                continue
            }

            fmt.Println("Finished processing book!")
        }
    }
}