Question

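I'm an absolute beginner in Python programming. I'm practicing web scraping on a few websites using the bs4 module.

Here I want to collect the links from a website and then iterate over them, because each link on the site opens a new page, and from that page I want to extract the agent names. There are a lot of links, so I tried to extract them into a list first and then loop through them. But my list comes back empty. Please tell me what I'm doing wrong and how I should do this.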

from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

res = requests.get('https://www.mcgrath.com.au/offices', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

links = [item['href'] for item in soup.select('.align w-1140 p-none a')]
print(links)

Answer 1

You are using the wrong selector. Use .align.w-1140.p-none > a instead, like this:

links = [item['href'] for item in soup.select('.align.w-1140.p-none > a') if item['href'] != '/']

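This is because an element such as <div class="align w-1140"> carries several classes at once, and it is matched by chaining the class selectors with no spaces between them. When the parts are separated by spaces, the selector looks for nested descendant elements instead, and w-1140 and p-none are read as tag names, so nothing matches.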
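To see the difference without hitting the network, here is a small sketch run against made-up markup. The HTML below is an assumption that only loosely imitates the structure described above; it is not copied from the real page, and html.parser is used instead of lxml just to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the structure the answer describes.
html = """
<div class="align w-1140 p-none">
  <a href="/">Home</a>
  <a href="/offices/example-office">Example office</a>
  <p><a href="/nested">Nested link</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Spaces mean "descendant": this looks for an <a> inside a <p-none> tag inside
# a <w-1140> tag inside an element with class "align". No such tags exist.
spaced = soup.select(".align w-1140 p-none a")

# Chained classes: one element carrying all three classes, then only its
# direct <a> children (the link nested in <p> is skipped).
chained = soup.select(".align.w-1140.p-none > a")

spaced_hrefs = [a["href"] for a in spaced]
chained_hrefs = [a["href"] for a in chained if a["href"] != "/"]

print(spaced_hrefs)
print(chained_hrefs)
```

The first list should come out empty, which mirrors the empty result in the question, while the second keeps only the direct child link that is not "/".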

Then, to get the agents' emails, you can do something like this:

res = requests.get('https://www.mcgrath.com.au/offices/178-annerley-yeronga', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')
agents_mails = [item['href'] for item in soup.select('.agent a[href^=mailto]')]
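The snippet above handles a single office page. To cover the iteration part of the question, a loop over the collected links might look roughly like the sketch below. The helper names (absolute_urls, agent_emails, scrape_offices), the timeout value, and the stripping of the mailto: prefix are my own assumptions, and the office links are assumed to be relative paths that need the site's base URL prepended.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.mcgrath.com.au"
HEADERS = {"User-agent": "Super Bot 9000"}


def absolute_urls(hrefs, base=BASE_URL):
    """Turn relative hrefs into full URLs, keeping order and dropping duplicates."""
    urls = []
    for href in hrefs:
        url = urljoin(base, href)
        if url not in urls:
            urls.append(url)
    return urls


def agent_emails(html):
    """Return the addresses from mailto: links found inside .agent elements."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        a["href"].replace("mailto:", "", 1)
        for a in soup.select(".agent a[href^=mailto]")
    ]


def scrape_offices(hrefs):
    """Fetch every office page and map its URL to the agent emails found there."""
    results = {}
    for url in absolute_urls(hrefs):
        res = requests.get(url, headers=HEADERS, timeout=10)
        res.raise_for_status()  # stop early on 4xx/5xx responses
        results[url] = agent_emails(res.text)
    return results
```

Calling scrape_offices(links) with the list built earlier would then return a dictionary keyed by office URL. When requesting many pages in a row, it is probably worth adding a short pause between requests.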

Get page links from a website and iterate through those links to get more information
