Question

I have the following soup:

<a href="some_url">next</a>
<span class="class">...</span>

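From this I want to extract the href, "some_url".

I can do this if I have only one tag, but here there are two tags. I can also get the text 'next', but that's not what I want.

Also, is there a good description of the API somewhere, with examples? I'm using the standard documentation, but I'm looking for something a little more organized.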

Answer 1

You can use find_all in the following way to find every a element that has an href attribute, and print each one:

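from bs4 import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

# href=True matches only <a> tags that actually carry an href attribute
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])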

The output would be:

Found the URL: some_url
Found the URL: another_url
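
If only the first link is wanted, as with the single "some_url" in the question, a minimal sketch could use find together with Tag.get. This assumes BeautifulSoup 4 (the bs4 package) and the built-in html.parser; the variable names first_link and first_href are only illustrative.

```python
from bs4 import BeautifulSoup

# The markup from the question: one link followed by a span
html = '''<a href="some_url">next</a>
<span class="class">...</span>'''

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None when nothing matches
first_link = soup.find("a", href=True)

# Tag.get returns None instead of raising KeyError if the attribute is absent
first_href = first_link.get("href") if first_link is not None else None
print(first_href)
```

Indexing with first_link['href'] works as well, but it raises KeyError when the attribute is missing, so get is the more forgiving choice when the markup is not under your control.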

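Note that if you're using an older version of BeautifulSoup (before version 4), the name of this method is findAll. In version 4, BeautifulSoup's method names were changed to be PEP 8 compliant, so you should use find_all instead.

If you want all tags with an href, you can omit the name parameter: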

href_tags = soup.find_all(href=True)
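
As a rough illustration of what omitting the name changes, the sketch below runs the same call on markup that also contains a link element. The extra <link> tag and the name pairs are made up for this example, and bs4 with the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <link> tag is added here only to show a non-<a> match
html = '''<a href="some_url">next</a>
<link href="style.css" rel="stylesheet">
<span class="class"><a href="another_url">later</a></span>'''

soup = BeautifulSoup(html, "html.parser")

# With no tag name given, every element carrying an href attribute is returned
href_tags = soup.find_all(href=True)

# Pair each tag's name with its href value, in document order
pairs = [(tag.name, tag["href"]) for tag in href_tags]
print(pairs)
```

Here the span is skipped because it has no href of its own, while the a nested inside it is still found.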
