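I have a log file that contains XML envelopes (two kinds of XML structure: requests and responses). I need to parse this file, extract the XML documents, and put them as strings into two arrays (the first for requests, the second for responses) so that I can parse them later.
Any ideas on how to do this in Python?
A fragment of the log file to be parsed: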
2014-10-31 12:27:33,600 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] Sending BILL request
2014-10-31 12:27:33,601 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] <?xml version="1.0" encoding="UTF-8"?>
<request xmlns="XXX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<transactionheader>
<username>XXX</username>
<password>XXX</password>
<time>31/10/2014 12:27:33</time>
<clientreferencenumber>123</clientreferencenumber>
<numberrequests>3</numberrequests>
<information>Description</information>
<postbackurl>http://localhost/status</postbackurl>
</transactionheader>
<transactiondetails>
<items>
<item id="1" client="XXX1" keyword="test"/>
<item id="2" client="XXX2" keyword="test"/>
<item id="3" client="XXX3" keyword="test"/>
</items>
</transactiondetails>
</request>
2014-10-31 12:27:34,487 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] Response code 200 for bill request
2014-10-31 12:27:34,489 INFO Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] <?xml version="1.0" encoding="UTF-8"?>
<response xmlns="XXX" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<serverreferencenumber>XXX123XXX</serverreferencenumber>
<clientreferencenumber>123</clientreferencenumber>
<information>Queued for Processing</information>
<status>OK</status>
</response>
Thank you very much for any replies!
Regards, Robert
Answer 0 (score: 2)
As @Paco and @Lord_Gestalter suggested, you can use xml.etree
and strip the non-XML content from the file, like this:
# re is used to strip the non-XML parts of the log
import re
# ElementTree is used as the XML parser
import xml.etree.ElementTree as ET
# read the whole file into the string 's'
with open('yourfilehere', 'r') as f:
    s = f.read()
# keep only the '<...>' part of each line, which drops the log prefixes,
# then remove the <?xml ...?> declarations, since the file holds several XML documents
s = re.sub(r'<\?xml.*?>', '', ''.join(re.findall(r'<.*>', s)))
# wrap s in a single root element
s = '<root>' + s + '</root>'
# parse s with ElementTree
tree = ET.fromstring(s)
tree
<Element 'root' at 0x7f2ab877e190>
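To get from the parsed tree to the two lists of strings the question asks for, one option is to serialize each child of the root back to text. This is only a minimal sketch: `split_envelopes` and the shortened `sample` string are hypothetical names made up for illustration, and it assumes the wrapped string contains only request and response elements directly under the root.

```python
import xml.etree.ElementTree as ET

def split_envelopes(wrapped_xml):
    """Return (requests, responses) as lists of XML strings."""
    root = ET.fromstring(wrapped_xml)
    requests, responses = [], []
    for child in root:
        # Namespaced tags look like '{XXX}request'; keep only the local name.
        local_name = child.tag.rsplit('}', 1)[-1]
        xml_text = ET.tostring(child, encoding='unicode')
        if local_name == 'request':
            requests.append(xml_text)
        elif local_name == 'response':
            responses.append(xml_text)
    return requests, responses

# Hypothetical, shortened stand-in for the cleaned-up log content.
sample = ('<root>'
          '<request xmlns="XXX"><numberrequests>3</numberrequests></request>'
          '<response xmlns="XXX"><status>OK</status></response>'
          '</root>')
requests, responses = split_envelopes(sample)
print(len(requests), len(responses))
```

Note that ElementTree re-serializes namespaced elements with generated prefixes such as `ns0:`, so the resulting strings are equivalent XML but not character-for-character copies of what was logged.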
If you don't need an XML parser and just want the 'request' and 'response' strings, use re.search:
import re
with open('yourfilehere', 'r') as f:
    s = f.read()
# '.' does not match newlines, so first join the '<...>' part of each line into one string
s = ''.join(re.findall(r'<.*>', s))
# put the request and response strings into 'req' and 'res'
# (these greedy patterns assume one request and one response; use a better pattern if there are several)
req = [re.search(r'<request.*/request>', s).group()]
res = [re.search(r'<response.*/response>', s).group()]
req
['<request xmlns="XXX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><transactionheader><username>XXX</username><password>XXX</password><time>31/10/2014 12:27:33</time><clientreferencenumber>123</clientreferencenumber><numberrequests>3</numberrequests><information>Description</information><postbackurl>http://localhost/status</postbackurl></transactionheader><transactiondetails><items><item id="1" client="XXX1" keyword="test"/><item id="2" client="XXX2" keyword="test"/><item id="3" client="XXX3" keyword="test"/></items></transactiondetails></request>']
res
['<response xmlns="XXX" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><serverreferencenumber>XXX123XXX</serverreferencenumber><clientreferencenumber>123</clientreferencenumber><information>Queued for Processing</information><status>OK</status></response>']
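When the log holds several requests and responses, a non-greedy pattern with `re.DOTALL` applied to the raw text can collect every envelope in one pass. The following is a sketch under the assumption that each envelope starts with `<request` or `<response` and ends with the matching closing tag, as in the sample log; `extract_envelopes` and the shortened `log_text` are hypothetical names used only for illustration.

```python
import re

# Non-greedy match from an opening tag to the first matching closing tag;
# re.DOTALL lets '.' cross line breaks.
REQUEST_RE = re.compile(r'<request\b.*?</request>', re.DOTALL)
RESPONSE_RE = re.compile(r'<response\b.*?</response>', re.DOTALL)

def extract_envelopes(log_text):
    """Return (requests, responses) as lists of raw XML strings."""
    return REQUEST_RE.findall(log_text), RESPONSE_RE.findall(log_text)

# Hypothetical, shortened log with two requests and one response.
log_text = """2014-10-31 12:27:33,601 INFO [] <?xml version="1.0" encoding="UTF-8"?>
<request xmlns="XXX">
<numberrequests>1</numberrequests>
</request>
2014-10-31 12:27:34,489 INFO [] <?xml version="1.0" encoding="UTF-8"?>
<response xmlns="XXX">
<status>OK</status>
</response>
2014-10-31 12:27:35,000 INFO [] <?xml version="1.0" encoding="UTF-8"?>
<request xmlns="XXX">
<numberrequests>2</numberrequests>
</request>
"""
req, res = extract_envelopes(log_text)
print(len(req), len(res))  # prints: 2 1
```

The extracted strings keep their original line breaks and omit the `<?xml ...?>` declaration, so each one can later be passed to `xml.etree.ElementTree.fromstring` on its own.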