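I have to monitor an XML file that is written by a tool running all day long. But the XML file is only properly completed and closed at the end of the day.
The constraints are the same as for XML stream processing.
In an answer to "Need to read XML files as a stream using BeautifulSoup in Python", slezica suggests xml.sax, xml.etree.ElementTree and cElementTree. But my attempts with xml.etree.ElementTree and cElementTree were unsuccessful. There are also xml.dom, xml.parsers.expat and lxml, but I do not see any support for on-the-fly parsing.
I need more obvious examples...
I am currently using Python 2.7 on Linux, but I will migrate to Python 3.x => please also provide tips on new Python 3.x features. I also use watchdog to detect XML file modifications => optionally, reuse the watchdog mechanism. Optionally, support Windows as well.
Please provide a solution that is easy to understand and maintain. If it is too complex, I may just use tell() / seek() to move around in the file, a dumb text search in the raw XML, and finally basic regular expressions to extract the values.
XML example: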
<dfxml xmloutputversion='1.0'>
  <creator version='1.0'>
    <program>TCPFLOW</program>
    <version>1.4.6</version>
  </creator>
  <configuration>
    <fileobject>
      <filename>file1</filename>
      <filesize>288</filesize>
      <tcpflow packets='12' srcport='1111' dstport='2222' family='2' />
    </fileobject>
    <fileobject>
      <filename>file2</filename>
      <filesize>352</filesize>
      <tcpflow packets='12' srcport='3333' dstport='4444' family='2' />
    </fileobject>
    <fileobject>
      <filename>file3</filename>
      <filesize>456</filesize>
      ...
...
My first test, using SAX, fails:
import xml.sax


class StreamHandler(xml.sax.handler.ContentHandler):

    def startElement(self, name, attrs):
        print 'start: name=', name

    def endElement(self, name):
        print 'end: name=', name
        if name == 'root':
            raise StopIteration


if __name__ == '__main__':
    parser = xml.sax.make_parser()
    parser.setContentHandler(StreamHandler())
    with open('f.xml') as f:
        parser.parse(f)
Shell:
$ while read line; do echo $line; sleep 1; done <i.xml >f.xml &
...
$ ./test-using-sax.py
start: name= dfxml
start: name= creator
start: name= program
end: name= program
start: name= version
end: name= version
Traceback (most recent call last):
  File "./test-using-sax.py", line 17, in <module>
    parser.parse(f)
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib64/python2.7/xml/sax/xmlreader.py", line 125, in parse
    self.close()
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 220, in close
    self.feed("", isFinal = 1)
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 214, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib64/python2.7/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: report.xml:15:0: no element found
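The traceback shows where the failure comes from: parse() reads until the file object reports end-of-file and then calls close(), and close() requires a complete document. The sketch below contrasts that with calling feed() and never calling close(). It assumes the parser returned by xml.sax.make_parser() is the default expat-based incremental parser; the EventRecorder class, the two helper functions and the truncated sample are made up for illustration.

```python
import io
import xml.sax
import xml.sax.handler

# Truncated document, standing in for a file that is still being written.
PARTIAL = b"<dfxml><creator><program>TCPFLOW</program>"


class EventRecorder(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records start/end events in a list."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def parse_whole(data):
    """parse() reads to EOF, then close() rejects the unfinished document."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(EventRecorder())
    try:
        parser.parse(io.BytesIO(data))
    except xml.sax.SAXParseException:
        return False
    return True


def feed_partial(data):
    """feed() without close() only consumes what is available so far."""
    handler = EventRecorder()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.feed(data)
    return handler.events


if __name__ == "__main__":
    print(parse_whole(PARTIAL))
    print(feed_partial(PARTIAL))
```

The same feed() call can be repeated every time more data shows up in the file, which is the property an on-the-fly reader needs.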
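Answer 0 (score: 1)
Three hours after posting my question, I had not received any answer. But I finally implemented the simple example I was looking for.
My inspiration comes from saaj's answer; it is based on xml.sax and watchdog.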
from __future__ import print_function, division
import time
import watchdog.events
import watchdog.observers
import xml.sax


class XmlStreamHandler(xml.sax.handler.ContentHandler):

    def startElement(self, tag, attributes):
        print(tag, 'attributes=', attributes.items())
        self.tag = tag

    def characters(self, content):
        print(self.tag, 'content=', content)


class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):

    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.file = None
        self.parser = xml.sax.make_parser()
        self.parser.setContentHandler(XmlStreamHandler())

    def on_modified(self, event):
        # The file is opened on the first event only; later events keep
        # reading from the current position, so only new data is fed.
        if not self.file:
            self.file = open(event.src_path)
        self.parser.feed(self.file.read())


if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()
While the script is running, do not forget to touch an XML file, or simulate on-the-fly writing with the following command:
while read line; do echo $line; sleep 1; done <in.xml >out.xml &
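The handler above only prints events. To turn them into usable values, the content handler can collect the text of each element and emit one record per completed fileobject. The sketch below assumes the element layout of the XML example in the question; FileObjectHandler and make_stream_parser are hypothetical names, not part of xml.sax.

```python
import xml.sax
import xml.sax.handler


class FileObjectHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler: builds one dict per completed <fileobject>."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.records = []     # completed <fileobject> elements
        self._current = None  # dict being filled, or None outside <fileobject>
        self._text = []       # text pieces of the element being read

    def startElement(self, name, attrs):
        self._text = []
        if name == "fileobject":
            self._current = {}
        elif name == "tcpflow" and self._current is not None:
            # Attribute values arrive as strings.
            self._current["tcpflow"] = dict(attrs.items())

    def characters(self, content):
        # May be called several times for a single text node.
        self._text.append(content)

    def endElement(self, name):
        if self._current is None:
            return
        if name in ("filename", "filesize"):
            self._current[name] = "".join(self._text).strip()
        elif name == "fileobject":
            self.records.append(self._current)
            self._current = None


def make_stream_parser():
    """Return (parser, handler); call parser.feed() with each new chunk."""
    handler = FileObjectHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    return parser, handler


if __name__ == "__main__":
    lines = [
        "<dfxml xmloutputversion='1.0'>\n",
        "<configuration>\n",
        "<fileobject>\n",
        "<filename>file1</filename>\n",
        "<filesize>288</filesize>\n",
        "<tcpflow packets='12' srcport='1111' dstport='2222' family='2' />\n",
        "</fileobject>\n",
        "<fileobject>\n",
        "<filename>file2</filename>\n",
    ]
    parser, handler = make_stream_parser()
    for line in lines:  # one line at a time, like the shell loop
        parser.feed(line)
    print(handler.records)
```

In the watchdog version, on_modified() would call parser.feed() with the newly read data and then process whatever has been appended to handler.records since the previous event.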
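Answer 1 (score: 1)
Since yesterday, I have found Peter Gibson's answer about the undocumented xml.etree.ElementTree.XMLTreeBuilder._parser.EndElementHandler.
This example is similar to the other one, but uses xml.etree.ElementTree (and watchdog).
It does not work when ElementTree is replaced by cElementTree :-/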
import time
import watchdog.events
import watchdog.observers
import xml.etree.ElementTree


class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):

    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.xml_file = None
        self.parser = xml.etree.ElementTree.XMLTreeBuilder()

        # Relies on undocumented private attributes (_end, _parser) of the
        # pure-Python ElementTree implementation shipped with Python 2.7.
        def end_tag_event(tag):
            node = self.parser._end(tag)
            print 'tag=', tag, 'node=', node

        self.parser._parser.EndElementHandler = end_tag_event

    def on_modified(self, event):
        # Open once, then read only what was appended since the last event.
        if not self.xml_file:
            self.xml_file = open(event.src_path)
        data = self.xml_file.read()
        if data:
            self.parser.feed(data)


if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()
While the script is running, do not forget to touch an XML file, or simulate on-the-fly writing with this one-liner:
while read line; do echo $line; sleep 1; done <in.xml >out.xml &
For information, xml.etree.ElementTree.iterparse does not seem to support a file that is still being written. My test code:
from __future__ import print_function, division
import xml.etree.ElementTree

if __name__ == '__main__':
    context = xml.etree.ElementTree.iterparse('f.xml', events=('end',))
    for action, elem in context:
        print(action, elem.tag)
My output:
end program
end version
end creator
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
Traceback (most recent call last):
  File "./iter.py", line 9, in <module>
    for action, elem in context:
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1281, in next
    self._root = self._parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 20, column 0
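For the Python 3.x part of the question: since Python 3.4 the standard library documents xml.etree.ElementTree.XMLPullParser, which takes data through feed() and hands back completed elements through read_events(), so no private attribute is needed. Below is a minimal sketch, assuming Python 3.4 or later and the fileobject layout of the XML example; the helper names completed_elements and summarize are made up for illustration.

```python
import xml.etree.ElementTree as ET


def completed_elements(parser, chunk, tag="fileobject"):
    """Feed one chunk and return the elements with `tag` that it completed."""
    parser.feed(chunk)
    done = []
    for event, elem in parser.read_events():
        if event == "end" and elem.tag == tag:
            done.append(elem)
    return done


def summarize(elem):
    """Flatten one <fileobject> element into a plain dict."""
    flow = elem.find("tcpflow")
    return {
        "filename": elem.findtext("filename"),
        "filesize": int(elem.findtext("filesize")),
        "srcport": flow.get("srcport") if flow is not None else None,
    }


if __name__ == "__main__":
    parser = ET.XMLPullParser(events=("end",))
    chunks = [
        "<dfxml xmloutputversion='1.0'>\n<configuration>\n<fileobject>\n",
        "<filename>file1</filename>\n<filesize>288</filesize>\n",
        "<tcpflow packets='12' srcport='1111' dstport='2222' family='2' />\n",
        "</fileobject>\n<fileobject>\n<filename>file2</filename>\n",
    ]
    for chunk in chunks:
        for elem in completed_elements(parser, chunk):
            print(summarize(elem))
            elem.clear()  # free children that are no longer needed
```

In a watchdog handler, each on_modified() call would pass the newly read data to completed_elements(). Calling elem.clear() after handling a record keeps memory use down over a full day. With Expat 2.6.0 or later, events for a chunk that ends in the middle of a token may be delayed until more data arrives; Python 3.13 adds XMLPullParser.flush() for that case.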