Question

While parsing http://en.wikipedia.org/wiki/Israel I came across an H2 tag that contains text, yet Beautiful Soup returns a None type for it:

$ python
Python 2.7.3 (default, Apr 10 2013, 05:13:16)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>> import requests
>>> from pprint import pprint
>>> response = requests.get('http://en.wikipedia.org/wiki/Israel')
>>> soup = bs4.BeautifulSoup(response.content)
>>> for h in soup.find_all('h2'):
...     pprint(str(type(h)))
...     pprint(h)
...     pprint(str(type(h.string)))
...     pprint(h.string)
...     print('--')
...                     
"<class 'bs4.element.Tag'>"
<h2>Contents</h2>    
"<class 'bs4.element.NavigableString'>"
u'Contents'          
--                   
"<class 'bs4.element.Tag'>"
<h2><span class="mw-headline" id="Etymology"><span id="Etymology"></span> Etymology</span></h2>
"<type 'NoneType'>"  
None                 
--                   
"<class 'bs4.element.Tag'>"
<h2><span class="mw-headline" id="History">History</span></h2>
"<class 'bs4.element.NavigableString'>"
u'History'           
--

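Note that this is not a parsing problem; Beautiful Soup parses the document just fine. Why does the second H2 element return a None type? Is it because of the leading " " (space) in the string? How can I work around this? This is with Beautiful Soup 4 on Python 2.7, Kubuntu Linux 12.10.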

Answer 1

I'll start by answering the first half: what went wrong...

Quoting the documentation of bs4: "If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None".
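
The behaviour can be reproduced without fetching the page. The following is a minimal sketch, assuming the three headings from the question are pasted into a literal string and parsed with the built-in "html.parser" (the question does not say which parser was used). The names html, soup, strings and inner are illustrative only.

```python
from bs4 import BeautifulSoup

# The three headings shown in the question's output, reduced to a literal.
html = (
    '<h2>Contents</h2>'
    '<h2><span class="mw-headline" id="Etymology">'
    '<span id="Etymology"></span> Etymology</span></h2>'
    '<h2><span class="mw-headline" id="History">History</span></h2>'
)

# Assumption: "html.parser" is used here only to avoid extra dependencies.
soup = BeautifulSoup(html, "html.parser")

# .string follows a chain of single children; it gives up (None) as soon
# as some tag on the way down has more than one child.
strings = [h.string for h in soup.find_all("h2")]
print(strings)

# The outer span of the second heading has two children:
# an empty <span> and the text " Etymology".
inner = soup.find_all("h2")[1].span
print(len(inner.contents))
```

The third heading also wraps its text in a span, but that span has a single child, so .string can still resolve it. The leading space is not the cause; the empty inner span is what adds the second child.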

Now for the other half: how to fix it.

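Quoting the same source again: "If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator". Better still, use the .stripped_strings generator and join the results; I think that will give you what you want.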
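
Joining the stripped strings might look like the sketch below. The helper name heading_text is hypothetical, the headings are again a literal copy of the ones in the question, and the choice of a single space as the separator is an assumption about what the asker wants.

```python
from bs4 import BeautifulSoup


def heading_text(tag):
    """Join every text fragment under tag, each stripped of whitespace."""
    # stripped_strings skips whitespace-only fragments and strips the rest.
    return " ".join(tag.stripped_strings)


html = (
    '<h2>Contents</h2>'
    '<h2><span class="mw-headline" id="Etymology">'
    '<span id="Etymology"></span> Etymology</span></h2>'
    '<h2><span class="mw-headline" id="History">History</span></h2>'
)

# Assumption: "html.parser" is used here only to avoid extra dependencies.
soup = BeautifulSoup(html, "html.parser")

titles = [heading_text(h) for h in soup.find_all("h2")]
print(titles)
```

Unlike .string, this returns text for every heading regardless of how many children it has, so the loop no longer needs a special case for None.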

Answer 2

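I think it's because the second h2 has no text of its own, but instead has a span as a child (and that span has another span as its own child, making it the h2's grandchild).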

For this kind of parsing, use the generator-based attributes such as .stripped_strings and .strings.

>>> s.find_all('h2')
[<h2>Contents</h2>, <h2><span class="mw-headline" id="Etymology"><span id="Etymology"></span> Etymology</span></h2>]
>>> list(s.find_all('h2')[-1].stripped_strings)
[u'Etymology']

Beautiful Soup does not find the string
