Question

I want to retrieve the data provided in an SDMX file (such as https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx&mode=its). I tried using BeautifulSoup, but it does not seem to see the tags. The following code

import urllib2
from bs4 import BeautifulSoup 
url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx"
html_source = urllib2.urlopen(url).read()
soup = BeautifulSoup(html_source, 'lxml')
ts_series = soup.findAll("bbk:Series")

gives me an empty result.

Is BS4 the wrong tool, or (more likely) am I doing something wrong? Thanks in advance.

Answer 1

soup.findAll("bbk:series") will return results.

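In this case, even though you pass lxml as the parser, BeautifulSoup still parses the document as HTML. HTML tag names are case-insensitive, so BeautifulSoup converts every tag name to lowercase, which is why soup.findAll("bbk:series") works while soup.findAll("bbk:Series") finds nothing. See "Other parser problems" in the official documentation.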
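To see the lowercasing without downloading anything, here is a minimal sketch. The SAMPLE string is a made-up, heavily simplified SDMX-like fragment (the namespace URI and values are placeholders, not the real Bundesbank file), and it assumes bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified SDMX-like fragment; the namespace URI and the
# values are placeholders, not taken from the real Bundesbank file.
SAMPLE = """<MessageGroup xmlns:bbk="http://example.org/bbk">
  <bbk:Series FREQ="M">
    <bbk:Obs TIME_PERIOD="2017-01" OBS_VALUE="1.5"/>
    <bbk:Obs TIME_PERIOD="2017-02" OBS_VALUE="1.7"/>
  </bbk:Series>
</MessageGroup>"""

# "lxml" selects lxml's HTML parser, so tag names are lowercased.
soup = BeautifulSoup(SAMPLE, "lxml")

# find_all is the current spelling of findAll.
mixed_case = soup.find_all("bbk:Series")  # original casing: no match
lower_case = soup.find_all("bbk:series")  # lowercased name: matches

print(len(mixed_case), len(lower_case))
print([tag.name for tag in lower_case])
```

The first lookup comes back empty and the second finds the series element, which is the same behaviour as in the question.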

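If you want to parse it as XML, use soup = BeautifulSoup(html_source, 'xml') instead. This also uses lxml, because lxml is the only XML parser BeautifulSoup supports. You can then get the results with ts_series = soup.findAll("Series"), because BeautifulSoup strips the namespace prefix bbk from the tag name.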
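A minimal sketch of the XML route, again on a made-up fragment rather than the real file. The element and attribute names used here (Series, Obs, FREQ, TIME_PERIOD, OBS_VALUE) are assumptions for illustration; check them against the actual download before relying on them. It assumes bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified SDMX-like fragment; names and values are
# placeholders and may differ from the real Bundesbank file.
SAMPLE = """<MessageGroup xmlns:bbk="http://example.org/bbk">
  <bbk:Series FREQ="M">
    <bbk:Obs TIME_PERIOD="2017-01" OBS_VALUE="1.5"/>
    <bbk:Obs TIME_PERIOD="2017-02" OBS_VALUE="1.7"/>
  </bbk:Series>
</MessageGroup>"""

# "xml" selects lxml's XML parser, which keeps the original casing.
soup = BeautifulSoup(SAMPLE, "xml")

# Search by local name, without the bbk prefix.
series_list = soup.find_all("Series")

rows = []
for series in series_list:
    for obs in series.find_all("Obs"):
        # Attribute names keep their case under the XML parser.
        rows.append((series["FREQ"], obs["TIME_PERIOD"], float(obs["OBS_VALUE"])))

for row in rows:
    print(row)
```

With the real file you would pass the downloaded bytes (html_source in the question) in place of SAMPLE.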

Python BS4 with SDMX