Question

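I am trying to scrape a website. I learned scraping from two resources: one used tag.get('href') to get the href of an a tag, and the other used tag['href'] to get the same value. As far as I understand, they both do the same thing. But when I tried this code: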

link_list = [l.get('href') for l in soup.find_all('a')]

it works with the .get method, but not with dictionary-style access:

link_list = [l['href'] for l in soup.find_all('a')]

This throws a KeyError. I am very new to scraping, so please excuse me if this is a silly question.

Edit: both approaches work with the find method, but not with find_all.

Answer 1

Maybe some tag in the HTML string has no "href" attribute? For example:

from bs4 import BeautifulSoup


doc_html = """<a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>"""
soup = BeautifulSoup(doc_html, 'html.parser')
ahref = soup.find('a')
ahref.get('href')

This prints nothing, because .get() returns None when the attribute is missing, but

ahref['href']

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sergey/.virtualenvs/soup_example/lib/python3.5/site-packages/bs4/element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'href'
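
This may also explain the edit in the question: find returns only the first matching tag, so if the first a tag happens to have an href, both access styles succeed, while find_all eventually reaches a tag without one. A minimal sketch, assuming a made-up two-link document (the markup and variable names here are illustrative, not taken from the asker's site):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <a> has an href, the second does not.
doc_html = """
<a href="/questions">Questions</a>
<a class="vote-up-off" title="up vote">up vote</a>
"""
soup = BeautifulSoup(doc_html, "html.parser")

# find() returns only the first <a>, which has an href, so both styles work.
first = soup.find("a")
first_href = first["href"]
first_href_get = first.get("href")

# .get() returns None for the tag that lacks the attribute.
hrefs_with_get = [a.get("href") for a in soup.find_all("a")]

# Dictionary-style access raises KeyError on the tag that lacks the attribute.
try:
    hrefs_with_index = [a["href"] for a in soup.find_all("a")]
except KeyError as exc:
    hrefs_with_index = None
    missing_key = exc.args[0]
```

Here hrefs_with_get contains a None entry for the second link, whereas the dictionary-style comprehension never finishes building its list.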

Answer 2

You can have BeautifulSoup find only the links that actually have an href attribute. There are two common ways to do this.

With find_all():

link_list = [a['href'] for a in soup.find_all('a', href=True)]

Or, with a CSS selector:

link_list = [a['href'] for a in soup.select('a[href]')]
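
As a rough check that the two approaches agree, here is a small sketch on a made-up document (the markup and variable names are assumptions for illustration). It also shows a third option that keeps .get() and filters out the None values afterwards:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle <a> has no href attribute.
doc_html = """
<a href="/questions">Questions</a>
<a class="vote-up-off">up vote</a>
<a href="https://example.com/tags">Tags</a>
"""
soup = BeautifulSoup(doc_html, "html.parser")

# Only match <a> tags that carry an href attribute.
via_find_all = [a["href"] for a in soup.find_all("a", href=True)]

# Same filter expressed as a CSS attribute selector.
via_select = [a["href"] for a in soup.select("a[href]")]

# Alternative: fetch every <a>, then drop the ones where .get() gave None.
via_get = [
    href
    for href in (a.get("href") for a in soup.find_all("a"))
    if href is not None
]
```

All three lists should contain the same two URLs; filtering at query time simply avoids touching tags that would raise KeyError.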
