非常糟糕的XML试图用Python解析

时间:2013-11-20 22:26:26

标签: python xml beautifulsoup

我在购买域名后尝试使用python来解析xml输出。到目前为止,我有:

#!/usr/bin/python

import sys
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup

file = sys.argv[1]
xml = open(file).read()
soup = BeautifulStoneSoup(xml)
response = soup.find('ApiResponse')

print response

我正在使用的XML输出格式错误,绝对需要清理。

ok: [162.243.95.241] => {"cache_control": "private", "changed": false, "content": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<ApiResponse Status=\"OK\" xmlns=\"http://api.namecheap.com/xml.response\">\r\n  <Errors />\r\n  <Warnings />\r\n  <RequestedCommand>namecheap.domains.create</RequestedCommand>\r\n  <CommandResponse Type=\"namecheap.domains.create\">\r\n    <DomainCreateResult Domain=\"123er321test.com\" Registered=\"true\" ChargedAmount=\"8.1800\" DomainID=\"33404\" OrderID=\"414562\" TransactionID=\"679462\" WhoisguardEnable=\"false\" FreePositiveSSL=\"false\" NonRealTimeDomain=\"false\" />\r\n  </CommandResponse>\r\n  <Server>WEB1-SANDBOX1</Server>\r\n  <GMTTimeDifference>--5:00</GMTTimeDifference>\r\n  <ExecutionTime>9.008</ExecutionTime>\r\n</ApiResponse>", "content_length": "647", "content_location": "https://api.sandbox.namecheap.com/xml.response", "content_type": "text/xml; charset=utf-8", "date": "Thu, 21 Nov 2013 03:23:51 GMT", "item": "", "redirected": false, "server": "Microsoft-IIS/7.0", "status": 200, "x_aspnet_version": "4.0.30319", "x_powered_by": "ASP.NET"}

这里是pastebin上的“xml”。

我正在尝试找到ApiResponse StatusERROR

OK

1 个答案:

答案 0 :(得分:4)

那里的XML绝对没有错。

问题是XML嵌入在JSON中,JSON本身嵌入了某些我无法立即识别的对象中。 (我怀疑你刚从你用来提出请求的任何框架中抛弃了repr某种对象,这是一件很愚蠢的事情......)

因此,无论采用何种格式,都要以适当的方式解析顶级事物。 (如果您不知道它来自哪里,看起来您可以轻松地执行.partition('=>')[-1]。)然后使用json.loads解析JSON。然后获取生成的dict的['content'],这是XML,您可以使用BeautifulSoup进行解析。然后你就完成了。

换句话说:

>>> thingy = r''' ok: [162.243.95.241] => {"cache_control": "private", "changed": false, "content": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<ApiResponse Status=\"OK\" xmlns=\"http://api.namecheap.com/xml.response\">\r\n  <Errors />\r\n  <Warnings />\r\n  <RequestedCommand>namecheap.domains.create</RequestedCommand>\r\n  <CommandResponse Type=\"namecheap.domains.create\">\r\n    <DomainCreateResult Domain=\"123er321test.com\" Registered=\"true\" ChargedAmount=\"8.1800\" DomainID=\"33404\" OrderID=\"414562\" TransactionID=\"679462\" WhoisguardEnable=\"false\" FreePositiveSSL=\"false\" NonRealTimeDomain=\"false\" />\r\n  </CommandResponse>\r\n  <Server>WEB1-SANDBOX1</Server>\r\n  <GMTTimeDifference>--5:00</GMTTimeDifference>\r\n  <ExecutionTime>9.008</ExecutionTime>\r\n</ApiResponse>", "content_length": "647", "content_location": "https://api.sandbox.namecheap.com/xml.response", "content_type": "text/xml; charset=utf-8", "date": "Thu, 21 Nov 2013 03:23:51 GMT", "item": "", "redirected": false, "server": "Microsoft-IIS/7.0", "status": 200, "x_aspnet_version": "4.0.30319", "x_powered_by": "ASP.NET"}'''
>>> j = thingy.partition('=>')[-1]
>>> obj = json.loads(j)
>>> xml = obj['content']
>>> soup = BeautifulSoup(xml)
>>> soup
<?xml version="1.0" encoding="utf-8"?>
<apiresponse status="OK" xmlns="http://api.namecheap.com/xml.response">
<errors></errors>
<warnings></warnings>
<requestedcommand>namecheap.domains.create</requestedcommand>
<commandresponse type="namecheap.domains.create">
<domaincreateresult chargedamount="8.1800" domain="123er321test.com" domainid="33404" freepositivessl="false" nonrealtimedomain="false" orderid="414562" registered="true" transactionid="679462" whoisguardenable="false"></domaincreateresult>
</commandresponse>
<server>WEB1-SANDBOX1</server>
<gmttimedifference>--5:00</gmttimedifference>
<executiontime>9.008</executiontime>
</apiresponse>
>>> soup.find('apiresponse')['status']
'OK'