Question

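I am building a web scraper that downloads articles into txt files. I made the soup with bs4 and pulled out the specific piece of HTML that contains the URL of the article I want to download: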

>>> prevLink = soup2.select('.previous_post')
>>> prevLink
[<span class="previous_post">Previous Post: <a href="http://www.mrmoneymustache.com/2018/11/08/honey-badger-entrepreneur/" rel="prev">An Interview With The Man Who Never Needed a Real Job</a></span>]

So far (I think) so good. Then I try to pull out the link with .get('href'), but it returns None.

>>> print(prevLink[0].get('href'))
None

However, when I select the class with .get('class'), it seems to work.

>>> print(prevLink[0].get('class'))
['previous_post']

I don't understand why .get('class') behaves differently from .get('href'). Thanks for taking a look.

Answer 1

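prevLink[0] is not actually the link; it is the span element that wraps it. The span has a class attribute but no href attribute, which is why .get('class') returns a value and .get('href') returns None.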

Just make the selector go one level deeper, to the a element:

prevLink = soup2.select_one('.previous_post > a')
print(prevLink.get('href'))
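
To see the difference side by side, here is a minimal, self-contained sketch. It assumes the page contains exactly the span shown in the question; the html string and the names span, link, span_href, span_class and link_href are made up for illustration, and "html.parser" is simply the standard-library parser, not necessarily the one the original script used.

```python
from bs4 import BeautifulSoup

# Assumed input: the span from the question, pasted in as a string.
html = (
    '<span class="previous_post">Previous Post: '
    '<a href="http://www.mrmoneymustache.com/2018/11/08/honey-badger-entrepreneur/" '
    'rel="prev">An Interview With The Man Who Never Needed a Real Job</a></span>'
)
soup2 = BeautifulSoup(html, "html.parser")

# The span itself: it has a class attribute but no href attribute.
span = soup2.select_one(".previous_post")
span_href = span.get("href")    # attribute missing on the span, so None
span_class = span.get("class")  # class is multi-valued, so a list comes back

# The child <a> element is the one that carries href.
link = soup2.select_one(".previous_post > a")
link_href = link.get("href") if link is not None else None

print(span_href)
print(span_class)
print(link_href)
```

select_one returns None when nothing matches, so the guard on link avoids an AttributeError on pages that have no previous post. If you already hold the span, span.a.get('href') reaches the same element without a second selector.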
