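The UK National Gas system publishes a large amount of data that can be accessed from a SOAP server; an example of the returned data (for LNG) is shown below. I have written code to generate the request and handle the response, but I am struggling to extract the returned information. The goal is to load the data into a back-end database or a Pandas DataFrame.
In previous code I simply used XPath to walk the XML, then iterated over the tags and pulled out the child data. So I am hoping to extract: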
GetPublicationDataWMResult, ApplicableAt, ApplicableFor, Value, ...
LNG Stock Level,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 7050.42286, ...
LNG Capacity,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 6515042480, ...
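Trying to use XPath to walk the children (/Envelope/Body/GetPublicationDataWMResponse/GetPublicationDataWMResult/) fails.
If I clean up the response by adding a series of string removals, the logic works, but this is suboptimal and bound to break in the future.
Sample code: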
import requests
from lxml import objectify

def getXML():
    toDate = "2016-03-12"
    fromDate = "2016-03-12"
    dateType = "gasday"
    url = "http://marketinformation.natgrid.co.uk/MIPIws-public/public/publicwebservice.asmx"
    headers = {'content-type': 'application/soap+xml; charset=utf-8'}
    # SOAP 1.2 request body; the three %s placeholders are filled in below
    body = """<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
<soap12:Body>
<GetPublicationDataWM xmlns="http://www.NationalGrid.com/MIPI/">
<reqObject>
<LatestFlag>Y</LatestFlag>
<ApplicableForFlag>Y</ApplicableForFlag>
<ToDate>%s</ToDate>
<FromDate>%s</FromDate>
<DateType>%s</DateType>
<PublicationObjectNameList>
<string>LNG Stock Level</string>
<string>LNG, Daily Aggregated Available Capacity, D+1</string>
</PublicationObjectNameList>
</reqObject>
</GetPublicationDataWM>
</soap12:Body>
</soap12:Envelope>""" % (toDate, fromDate, dateType)
    response = requests.post(url, data=body, headers=headers)
    return response.content

root = objectify.fromstring(getXML())
Returned XML:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
    xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <soap:Body>
    <GetPublicationDataWMResponse
        xmlns="http://www.NationalGrid.com/MIPI/">
      <GetPublicationDataWMResult>
        <CLSMIPIPublicationObjectBE>
          <PublicationObjectName>LNG Stock Level</PublicationObjectName>
          <PublicationObjectData>
            <CLSPublicationObjectDataBE>
              <ApplicableAt>2016-03-13T15:00:07Z</ApplicableAt>
              <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
              <Value>7050.42286</Value>
              <GeneratedTimeStamp>2016-03-13T15:56:00Z</GeneratedTimeStamp>
              <QualityIndicator></QualityIndicator>
              <Substituted>N</Substituted>
              <CreatedDate>2016-03-13T15:56:28Z</CreatedDate>
            </CLSPublicationObjectDataBE>
          </PublicationObjectData>
        </CLSMIPIPublicationObjectBE>
        <CLSMIPIPublicationObjectBE>
          <PublicationObjectName>LNG Capacity</PublicationObjectName>
          <PublicationObjectData>
            <CLSPublicationObjectDataBE>
              <ApplicableAt>2016-03-12T15:30:00Z</ApplicableAt>
              <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
              <Value>6515042480</Value>
              <GeneratedTimeStamp>2016-03-12T16:00:00Z</GeneratedTimeStamp>
              <QualityIndicator></QualityIndicator>
              <Substituted>N</Substituted>
              <CreatedDate>2016-03-12T16:00:20Z</CreatedDate>
            </CLSPublicationObjectDataBE>
          </PublicationObjectData>
        </CLSMIPIPublicationObjectBE>
      </GetPublicationDataWMResult>
    </GetPublicationDataWMResponse>
  </soap:Body>
</soap:Envelope>
Answer 0 (score: 1)
Using your existing code, I just added this:
from bs4 import BeautifulSoup

res = getXML()
# html.parser lowercases tag names, hence st.lower() in the lookup below
soup = BeautifulSoup(res, 'html.parser')
searchTerms = ['PublicationObjectName', 'ApplicableAt', 'ApplicableFor', 'Value']
# LNG Stock Level,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 7050.42286, ...
for st in searchTerms:
    # print the tag name and the text of its first occurrence, tab-separated
    print(st, soup.find(st.lower()).contents[0], sep='\t')
Output:
PublicationObjectName LNG Stock Level
ApplicableAt 2016-03-13T15:00:07Z
ApplicableFor 2016-03-12T00:00:00Z
Value 7050.42286
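Note that `soup.find` returns only the first match, so the loop above prints the first record only. A minimal sketch of covering every record with `find_all`, assuming the response has the structure shown in the question; the `SAMPLE` string is a trimmed, hypothetical stand-in for the value returned by `getXML()`, and the `records` list is just an illustrative container:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for getXML() (assumption: same structure as the real response)
SAMPLE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Body>
<GetPublicationDataWMResponse xmlns="http://www.NationalGrid.com/MIPI/">
<GetPublicationDataWMResult>
<CLSMIPIPublicationObjectBE>
<PublicationObjectName>LNG Stock Level</PublicationObjectName>
<PublicationObjectData><CLSPublicationObjectDataBE>
<ApplicableAt>2016-03-13T15:00:07Z</ApplicableAt>
<ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
<Value>7050.42286</Value>
</CLSPublicationObjectDataBE></PublicationObjectData>
</CLSMIPIPublicationObjectBE>
<CLSMIPIPublicationObjectBE>
<PublicationObjectName>LNG Capacity</PublicationObjectName>
<PublicationObjectData><CLSPublicationObjectDataBE>
<ApplicableAt>2016-03-12T15:30:00Z</ApplicableAt>
<ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
<Value>6515042480</Value>
</CLSPublicationObjectDataBE></PublicationObjectData>
</CLSMIPIPublicationObjectBE>
</GetPublicationDataWMResult>
</GetPublicationDataWMResponse>
</soap:Body>
</soap:Envelope>"""

search_terms = ['PublicationObjectName', 'ApplicableAt', 'ApplicableFor', 'Value']

soup = BeautifulSoup(SAMPLE, 'html.parser')

records = []
# html.parser lowercases tag names, so search with lowercase names
for obj in soup.find_all('clsmipipublicationobjectbe'):
    # look up each field inside this publication object only
    records.append([obj.find(st.lower()).get_text(strip=True) for st in search_terms])

for record in records:
    print(', '.join(record))
```

Each entry in `records` is one row in the order given by `search_terms`, which is close to the comma-separated layout the question asks for.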
Answer 1 (score: 1)
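This is a frequently asked question in the XML + XPath topic, involving XML with a default namespace.
The XML element that declares a default namespace, and its descendant elements without a prefix, implicitly belong to that same default namespace. In an XPath expression, to reference an element in a namespace you need to use a prefix that has been mapped to the corresponding namespace URI. With lxml the code would look like this: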
from lxml import etree

root = etree.fromstring(getXML())
# map prefix 'd' to the default namespace URI
ns = {'d': 'http://www.NationalGrid.com/MIPI/'}
publication_objects = root.xpath('//d:CLSMIPIPublicationObjectBE', namespaces=ns)
for obj in publication_objects:
    name = obj.find('d:PublicationObjectName', ns).text
    data = obj.find('d:PublicationObjectData/d:CLSPublicationObjectDataBE', ns)
    applicable_at = data.find('d:ApplicableAt', ns).text
    applicable_for = data.find('d:ApplicableFor', ns).text
    # todo: extract other relevant data and process as needed
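To carry this through to the tabular output the question asks for, here is a minimal sketch that collects one row per data entry. It is an illustration under assumptions, not a definitive implementation: it uses the standard library's `xml.etree.ElementTree` rather than lxml (its `iterfind` and `findtext` accept the same prefix-to-URI mapping), the helper name `extract_rows` and the trimmed `SAMPLE` string are hypothetical, and it assumes a publication object may hold more than one `CLSPublicationObjectDataBE` entry.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from the question (assumption: the live
# service returns the same structure)
SAMPLE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Body>
<GetPublicationDataWMResponse xmlns="http://www.NationalGrid.com/MIPI/">
<GetPublicationDataWMResult>
<CLSMIPIPublicationObjectBE>
<PublicationObjectName>LNG Stock Level</PublicationObjectName>
<PublicationObjectData><CLSPublicationObjectDataBE>
<ApplicableAt>2016-03-13T15:00:07Z</ApplicableAt>
<ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
<Value>7050.42286</Value>
</CLSPublicationObjectDataBE></PublicationObjectData>
</CLSMIPIPublicationObjectBE>
<CLSMIPIPublicationObjectBE>
<PublicationObjectName>LNG Capacity</PublicationObjectName>
<PublicationObjectData><CLSPublicationObjectDataBE>
<ApplicableAt>2016-03-12T15:30:00Z</ApplicableAt>
<ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
<Value>6515042480</Value>
</CLSPublicationObjectDataBE></PublicationObjectData>
</CLSMIPIPublicationObjectBE>
</GetPublicationDataWMResult>
</GetPublicationDataWMResponse>
</soap:Body>
</soap:Envelope>"""

# map prefix 'd' to the default namespace URI, as in the answer above
NS = {'d': 'http://www.NationalGrid.com/MIPI/'}
FIELDS = ('ApplicableAt', 'ApplicableFor', 'Value')


def extract_rows(xml_text):
    """Return one dict per CLSPublicationObjectDataBE element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for obj in root.iterfind('.//d:CLSMIPIPublicationObjectBE', NS):
        name = obj.findtext('d:PublicationObjectName', namespaces=NS)
        path = 'd:PublicationObjectData/d:CLSPublicationObjectDataBE'
        for data in obj.iterfind(path, NS):
            row = {'PublicationObjectName': name}
            for field in FIELDS:
                # findtext returns None if the child element is missing
                row[field] = data.findtext('d:' + field, namespaces=NS)
            rows.append(row)
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Each row is a plain dict, so `pandas.DataFrame(rows)` should give the DataFrame mentioned in the question. The values are still strings at this point and would need converting, for example with `pandas.to_datetime` and `pandas.to_numeric`.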