Question

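I'm trying to use BeautifulSoup to get the birthdays of people from Wikipedia. For example, the birthday for http://en.wikipedia.org/wiki/Ezra_Taft_Benson is August 4, 1899. To get to the bday, I use the following code: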

bday = url.find("span", class_="bday")

However, it is picking up instances in the HTML where bday appears as one of several class values on another tag, i.e. <span class="bday dtstart published updated">1985-11-10 </span>.

Is there a way to match only tags whose class attribute is exactly bday?

I hope the question is clear. At the moment my bday comes out as 1985-11-10, which is not the correct date.

Answer 1

When all of BeautifulSoup's other matching methods fail, you can pass a function that takes a single argument (the tag):

>>> url.find(lambda tag: tag.name == 'span' and tag.get('class', []) == ['bday'])
<span class="bday">1899-08-04</span>

The above searches for a span tag whose class attribute is a list with a single element ('bday').
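Here is a minimal, self-contained sketch of the same idea, assuming the bs4 package (BeautifulSoup 4) and a made-up two-line HTML snippet standing in for the Wikipedia page. The helper name has_only_bday_class is just an illustrative choice, not something from the library:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the two kinds of span found on the page.
html = """
<span class="bday dtstart published updated">1985-11-10</span>
<span class="bday">1899-08-04</span>
"""

soup = BeautifulSoup(html, "html.parser")


def has_only_bday_class(tag):
    """Return True for <span> tags whose class list is exactly ['bday']."""
    return tag.name == "span" and tag.get("class", []) == ["bday"]


# class_="bday" matches any span that has bday among its classes,
# so the multi-class span is found first.
loose = soup.find("span", class_="bday")

# The function filter compares the whole class list, so only the
# span with the single class bday qualifies.
exact = soup.find(has_only_bday_class)

print(loose.get_text())  # 1985-11-10
print(exact.get_text())  # 1899-08-04
```

A named function behaves the same as the lambda; it is only easier to reuse and to document.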

Answer 2

I would do it this way:

import urllib
from BeautifulSoup import BeautifulSoup

url = 'http://en.wikipedia.org/wiki/Ezra_Taft_Benson'
file_pointer = urllib.urlopen(url)
html_object = BeautifulSoup(file_pointer)

bday = html_object('span',{'class':'bday'})[0].contents[0]

This returns 1899-08-04 as the value of bday.

Answer 3

Try using lxml with the BeautifulSoup parser. The following finds the <span> tags whose class is only bday (on this page there is just one):

>>> from lxml.html.soupparser import fromstring
>>> root = fromstring(open('Ezra_Taft_Benson'))
>>> span_bday_nodes = root.findall('.//span[@class="bday"]')
>>> span_bday_nodes
[<Element span at 0x1be9290>]
>>> span_bday_nodes[0].text
'1899-08-04'
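
For comparison, here is a small sketch that runs without a saved copy of the page. It assumes lxml is installed and uses lxml.html.fromstring on a made-up snippet instead of the soupparser, so treat it as an illustration of the path expression rather than of the exact setup above:

```python
from lxml import html

# Hypothetical snippet that mimics the two kinds of span found on the page.
page = html.fromstring(
    "<div>"
    '<span class="bday dtstart published updated">1985-11-10</span>'
    '<span class="bday">1899-08-04</span>'
    "</div>"
)

# @class="bday" compares the entire attribute string, so the
# multi-class span does not match.
exact_nodes = page.findall('.//span[@class="bday"]')

# contains() is a substring test, so it matches both spans.
loose_nodes = page.xpath('.//span[contains(@class, "bday")]')

print([node.text for node in exact_nodes])  # ['1899-08-04']
print([node.text for node in loose_nodes])  # ['1985-11-10', '1899-08-04']
```

The exact comparison works here because the attribute value is the plain string "bday"; it would miss a tag written with extra whitespace such as class=" bday".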
