How to use Python BeautifulSoup

Time: 2017-01-04 11:54:44

Tags: python python-3.x parsing xml-parsing beautifulsoup

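I have thousands of entries in an XML file, and each entry has a namespace-prefixed name. A concise example of what I want to parse is shown below.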


<d:entry d:title="Buddism" class="entry">
<span class="ps"> noun </span>
<span class="pinyin"> fojiao </span>
</d:entry>
<d:entry d:title="hew" class="entry">
<span class="ps"> verb </span>
<span class="pinyin"> jue </span>
</d:entry>
<d:entry d:title="roost" class="entry">
<span class="ps"> noun </span>
<span class="pinyin"> qixidi </span>
</d:entry>


I tried to parse it with BeautifulSoup4 using the steps below, but find() returns nothing.

➜  ~  python3
Python 3.5.2 (default, Jul 28 2016, 21:28:00)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> xmlstr = """
... <d:entry d:title="Buddism" class="entry"><span class="ps"> noun </span><span class="pinyin"> fojiao </span></d:entry><d:entry d:title="hew" class="entry"><span class="ps"> verb </span><span class="pinyin"> jue </span></d:entry><d:entry d:title="roost" class="entry"><span class="ps"> noun </span><span class="pinyin"> qixidi </span></d:entry>"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(xmlstr, "xml")
>>> t = soup.find(r'd:title="hew"')
>>> t
>>> t.ps
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'ps'
>>> type(t)
<class 'NoneType'>

How can I parse this with BeautifulSoup or a similar tool? I do not want to parse it by hand with regular expressions.

1 answer:

Answer 0: (score: 1)

import bs4

soup = bs4.BeautifulSoup(xmlstr, 'lxml')
soup.find(attrs={'d:title':'hew'}).find(class_='ps')

Out:

<span class="ps"> verb </span>
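  1. First, I recommend using 'lxml'.
  2. Second, what you are searching for is an attribute, not a tag name, so it has to be passed through attrs; it cannot be given to find() as if it were a tag name.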
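
Since the question mentions thousands of entries, the same attribute lookup can be applied to all of them at once. The following is only a sketch under a few assumptions: the helper names make_soup and parse_entries are made up for illustration, the fallback to 'html.parser' when lxml is not installed is my own addition rather than part of the answer, and every entry is assumed to carry a d:title attribute.

```python
from bs4 import BeautifulSoup, FeatureNotFound

xmlstr = """
<d:entry d:title="Buddism" class="entry"><span class="ps"> noun </span><span class="pinyin"> fojiao </span></d:entry>
<d:entry d:title="hew" class="entry"><span class="ps"> verb </span><span class="pinyin"> jue </span></d:entry>
<d:entry d:title="roost" class="entry"><span class="ps"> noun </span><span class="pinyin"> qixidi </span></d:entry>
"""


def make_soup(markup):
    """Hypothetical helper: prefer 'lxml', fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Assumption: 'html.parser' is acceptable when lxml is not installed.
        return BeautifulSoup(markup, "html.parser")


def parse_entries(markup):
    """Hypothetical helper: map each d:title value to its ps and pinyin text."""
    soup = make_soup(markup)
    entries = {}
    # attrs={"d:title": True} matches every tag that has a d:title attribute,
    # whatever its value, so the tag name itself never has to be spelled out.
    for entry in soup.find_all(attrs={"d:title": True}):
        ps = entry.find(class_="ps")
        pinyin = entry.find(class_="pinyin")
        entries[entry["d:title"]] = {
            "ps": ps.get_text(strip=True) if ps else None,
            "pinyin": pinyin.get_text(strip=True) if pinyin else None,
        }
    return entries


entries = parse_entries(xmlstr)
print(entries["hew"])
```

Matching on the attribute rather than on the tag name sidesteps the question of how a given parser treats the d: prefix. get_text(strip=True) drops the padding spaces around values such as " verb ".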