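Parsing an XML file while it is being written (in Python)

Time: 2017-06-06 16:16:33

Tags: python-2.7 python-3.x xml-parsing stream on-the-fly

I have to monitor an XML file written by a tool that runs all day long. The XML file is only properly completed and closed at the end of the day.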

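The constraints are the same as for XML stream processing:

  1. Parse the incomplete XML file on the fly and trigger actions
  2. Keep track of the last position in the file to avoid processing it again from the beginning

In the answer to "Need to read XML files as a stream using BeautifulSoup in Python", slezica suggests xml.sax, xml.etree.ElementTree and cElementTree. However, my attempts with xml.etree.ElementTree and cElementTree were unsuccessful. There are also xml.dom, xml.parsers.expat and lxml, but I do not see any support for "on-the-fly parsing".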

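I need more obvious examples...

I currently use Python 2.7 on Linux, but I will migrate to Python 3.x => please also provide tips on Python 3.x new features. I also use watchdog to detect XML file modifications => optionally, reuse the watchdog mechanism. Optionally, support Windows as well.

Please provide an easy to understand/maintain solution. If it is too complex, I may just use tell()/seek() to move within the file, a dumb text search in the raw XML, and finally a basic regex to extract the values.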
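The tell()/seek() idea also covers the "keep track of the last position" constraint by itself, whatever parser ends up consuming the data. A minimal sketch, assuming the writing tool only ever appends to the file; `read_new_data` is a hypothetical helper name, not part of any library:

```python
import os
import tempfile


def read_new_data(path, offset):
    """Return (data, new_offset) for the bytes appended to `path` since `offset`."""
    with open(path, 'rb') as f:
        f.seek(offset)          # skip everything that was already processed
        data = f.read()         # read only what was appended since then
        return data, f.tell()   # remember where to resume on the next call


if __name__ == '__main__':
    # Simulate a writer appending to the file between two polls.
    fd, path = tempfile.mkstemp(suffix='.xml')
    os.close(fd)
    try:
        with open(path, 'ab') as out:
            out.write(b"<dfxml><creator>")
        first, offset = read_new_data(path, 0)
        with open(path, 'ab') as out:
            out.write(b"<program>TCPFLOW</program>")
        second, offset = read_new_data(path, offset)
        print(first)
        print(second)
    finally:
        os.remove(path)
```

The file is opened in binary mode so that the offset is a plain byte count; the returned data can then be handed to an incremental parser or to a text search.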

XML example:

    <dfxml xmloutputversion='1.0'>
       <creator version='1.0'>
         <program>TCPFLOW</program>
         <version>1.4.6</version>
       </creator>
       <configuration>
         <fileobject>
           <filename>file1</filename>
           <filesize>288</filesize>
           <tcpflow packets='12' srcport='1111' dstport='2222' family='2' />
         </fileobject>
         <fileobject>
           <filename>file2</filename>
           <filesize>352</filesize>
           <tcpflow packets='12' srcport='3333' dstport='4444' family='2' />
         </fileobject>
         <fileobject>
           <filename>file3</filename>
           <filesize>456</filesize>
           ...
           ...
    

My first test using SAX failed:

    import xml.sax
    
    class StreamHandler(xml.sax.handler.ContentHandler):
        def startElement(self, name, attrs):
            print 'start: name=', name
        def endElement(self, name):
            print 'end:   name=', name
            if name == 'root':
                raise StopIteration
    
    if __name__ == '__main__':
        parser = xml.sax.make_parser()
        parser.setContentHandler(StreamHandler())
        with open('f.xml') as f:
            parser.parse(f)
    

Shell:

    $ while read line; do echo $line; sleep 1; done <i.xml >f.xml &
    ...
    $ ./test-using-sax.py
    start: name= dfxml
    start: name= creator
    start: name= program
    end:   name= program
    start: name= version
    end:   name= version
    Traceback (most recent call last):
      File "./test-using-sax.py", line 17, in <module>
        parser.parse(f)
      File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 107, in parse
        xmlreader.IncrementalParser.parse(self, source)
      File "/usr/lib64/python2.7/xml/sax/xmlreader.py", line 125, in parse
        self.close()
      File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 220, in close
        self.feed("", isFinal = 1)
      File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 214, in feed
        self._err_handler.fatalError(exc)
      File "/usr/lib64/python2.7/xml/sax/handler.py", line 38, in fatalError
        raise exception
    xml.sax._exceptions.SAXParseException: report.xml:15:0: no element found
    

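2 answers:

Answer 0 (score: 1)

Three hours after posting my question, I had not received any answer. But I have finally implemented the simple example I was looking for.

My inspiration comes from saaj's answer, and the example is based on xml.sax and watchdog.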

from __future__ import print_function, division
import time
import watchdog.events
import watchdog.observers
import xml.sax

class XmlStreamHandler(xml.sax.handler.ContentHandler):
  def startElement(self, tag, attributes):
    print(tag, 'attributes=', attributes.items())
    self.tag = tag
  def characters(self, content):
    print(self.tag, 'content=', content)

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
  def __init__(self):
    watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
    self.file = None
    self.parser = xml.sax.make_parser()
    self.parser.setContentHandler(XmlStreamHandler())
  def on_modified(self, event):
    if not self.file:
      self.file = open(event.src_path)
    self.parser.feed(self.file.read())

if __name__ == '__main__':
  observer = watchdog.observers.Observer()
  event_handler = XmlFileEventHandler()
  observer.schedule(event_handler, path='.')
  try:
    observer.start()
    while True:
      time.sleep(10)
  finally:
    observer.stop()
    observer.join()

While the script is running, do not forget to touch an XML file, or simulate on-the-fly writing with the following command:

while read line; do echo $line; sleep 1; done <in.xml >out.xml &
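The key point is that feed() can be called repeatedly with whatever data is available, whereas parse() calls close() when it reaches the end of the file (visible in the traceback in the question), and close() fails on an unfinished document. A minimal sketch of that behaviour without watchdog, assuming the data arrives as text chunks; `CollectingHandler` and `feed_chunks` are illustrative names only:

```python
import xml.sax
import xml.sax.handler


class CollectingHandler(xml.sax.handler.ContentHandler):
    """Records the name of every element as soon as its end tag is seen."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.closed = []

    def endElement(self, name):
        self.closed.append(name)


def feed_chunks(chunks):
    """Feed chunks to an incremental SAX parser without ever calling close()."""
    handler = CollectingHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        # A chunk may stop in the middle of a tag; the parser keeps the
        # incomplete part and resumes when the next chunk arrives.
        parser.feed(chunk)
    return handler.closed


if __name__ == '__main__':
    # The document is deliberately left unfinished, like the monitored file.
    chunks = [
        "<dfxml><creator><prog",
        "ram>TCPFLOW</program><version>1.4.6</ver",
        "sion></creator><configuration><fileobject>",
    ]
    print(feed_chunks(chunks))
```

No exception is raised even though dfxml is never closed, because the parser is never told that the input is complete.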

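Answer 1 (score: 1)

Since yesterday, I have found Peter Gibson's answer about the undocumented xml.etree.ElementTree.XMLTreeBuilder._parser.EndElementHandler.

This example is similar to the other one but uses xml.etree.ElementTree (and watchdog).

It does not work when ElementTree is replaced by cElementTree :-/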

import time
import watchdog.events
import watchdog.observers
import xml.etree.ElementTree

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.xml_file = None
        self.parser = xml.etree.ElementTree.XMLTreeBuilder()
        def end_tag_event(tag):
            node = self.parser._end(tag)
            print 'tag=', tag, 'node=', node
        self.parser._parser.EndElementHandler = end_tag_event

    def on_modified(self, event):
        if not self.xml_file:
            self.xml_file = open(event.src_path)
        buffer = self.xml_file.read()
        if buffer:
            self.parser.feed(buffer)

if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()

While the script is running, do not forget to touch an XML file, or simulate on-the-fly writing with this one-liner:

while read line; do echo $line; sleep 1; done <in.xml >out.xml &

For information, xml.etree.ElementTree.iterparse does not seem to support a file that is still being written. My test code:

from __future__ import print_function, division
import xml.etree.ElementTree

if __name__ == '__main__':
    context = xml.etree.ElementTree.iterparse('f.xml', events=('end',))
    for action, elem in context:
        print(action, elem.tag)

My output:

end program
end version
end creator
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
Traceback (most recent call last):
  File "./iter.py", line 9, in <module>
    for action, elem in context:
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1281, in next
    self._root = self._parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 20, column 0
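
As a hint for the migration to Python 3.x: Python 3.4 added xml.etree.ElementTree.XMLPullParser, a documented non-blocking counterpart of iterparse that is fed explicitly and never needs the end of the document. A minimal sketch, assuming Python 3.4 or later and data arriving as text chunks; `closed_tags` is an illustrative name only:

```python
import xml.etree.ElementTree as ET


def closed_tags(chunks):
    """Feed chunks of an unfinished document and list the elements closed so far."""
    parser = ET.XMLPullParser(events=('end',))
    tags = []
    for chunk in chunks:
        parser.feed(chunk)
        # read_events() only yields the events produced by the data fed so far.
        for event, elem in parser.read_events():
            tags.append(elem.tag)
            elem.clear()  # drop the element's text and children once handled
    return tags


if __name__ == '__main__':
    # The document is deliberately left unfinished, like the monitored file.
    chunks = [
        "<dfxml><configuration><fileobject><filename>file1</file",
        "name><filesize>288</filesize></fileobject><fileobject><filename>fi",
    ]
    print(closed_tags(chunks))
```

In the watchdog handler above, the same two calls would presumably go into on_modified: feed() with the newly read data, then a loop over read_events(). With recent Expat versions the events for a chunk that ends in the middle of a tag may only show up after the next feed().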