I am using Python 3 and Beautiful Soup to scrape a website. This is the XML data I get back:
<?xml version="1.0" encoding="utf-8"?>
<title type="text">MATERIALSET('R100100100')</title>
<updated>2018-05-11T04:28:47Z</updated>
<category term="ZPOC_BOT_PUR_GRP_SRV.MATERIAL" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"/>
<link href="MATERIALSET('R100100100')" rel="self" title="MATERIAL"/>
<content type="application/xml">
<m:properties>
<d:MATNR>R100100100</d:MATNR>
<d:WERKS>Z100</d:WERKS>
<d:MENGE> 1.000</d:MENGE>
<d:EEIND>29.06.2018</d:EEIND>
<d:BANFN>5000000041</d:BANFN>
</m:properties>
</content>
</entry>
I only want to extract the data in d:BANFN. If I write soup.select('d:BANFN') directly, it fails with an error mentioning 'nth_child_of_type'. I went through some questions on Stack Overflow ("Getting the nth element using BeautifulSoup" and "selecting second child in beautiful soup with soup.select?"), but none of them helped. Please help.
Answer 0 (score: 1)
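The XML file should contain the opening tag of the entry element (the data in the question only has the closing </entry>); only then can the XML file be parsed: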
<!-- sample.xml contains the following data: -->
<?xml version="1.0" encoding="utf-8"?>
<entry>
<title type="text">MATERIALSET('R100100100')</title>
<updated>2018-05-11T04:28:47Z</updated>
<category term="ZPOC_BOT_PUR_GRP_SRV.MATERIAL" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"/>
<link href="MATERIALSET('R100100100')" rel="self" title="MATERIAL"/>
<content type="application/xml">
<m:properties>
<d:MATNR>R100100100</d:MATNR>
<d:WERKS>Z100</d:WERKS>
<d:MENGE> 1.000</d:MENGE>
<d:EEIND>29.06.2018</d:EEIND>
<d:BANFN>5000000041</d:BANFN>
</m:properties>
</content>
</entry>
from bs4 import BeautifulSoup

with open("sample.xml", "r", encoding="utf-8") as f:  # open the XML file, decoding it as UTF-8
    content = f.read()  # XML content stored in this variable

# Parse with the lxml HTML parser, which lowercases tag names (hence "d:banfn" below)
soup = BeautifulSoup(content, 'lxml')

# find() matches on the full tag name, prefix included
print("BANFN value : {}".format(soup.find("d:banfn").text))  # required result
#output:
BANFN value : 5000000041
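If you would rather not depend on the HTML parser lowercasing the tag names, a namespace-aware lookup with the standard library is another option. The following is only a sketch: it assumes the real feed declares the usual OData namespaces for the d: and m: prefixes on <entry> (the snippet in the question does not show them), and SAMPLE and extract_banfn are hypothetical names used for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the full feed binds the d: and m: prefixes to these OData
# namespace URIs on <entry>; the snippet in the question omits the declarations.
NS = {
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

# Trimmed-down stand-in for sample.xml, with the assumed declarations added.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<entry xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
       xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <content type="application/xml">
    <m:properties>
      <d:MATNR>R100100100</d:MATNR>
      <d:BANFN>5000000041</d:BANFN>
    </m:properties>
  </content>
</entry>"""


def extract_banfn(xml_text):
    """Return the text of the first d:BANFN element, or None if there is none."""
    # Parse from bytes so the parser applies the encoding declaration itself.
    root = ET.fromstring(xml_text.encode("utf-8"))
    # ".//" searches all descendants; NS maps the "d" prefix to its URI.
    node = root.find(".//d:BANFN", NS)
    return node.text if node is not None else None


print("BANFN value : {}".format(extract_banfn(SAMPLE)))
```

Here the lookup is case-sensitive and goes by namespace URI rather than by the literal prefix, so it keeps working if the server picks a different prefix. It will raise a parse error if the prefixes are not declared or the opening <entry> tag is missing.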