Question

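I'm trying to scrape the links from a website, and after scraping them I also want to check whether each link I scraped is just an article or contains more links; if it does, I want to scrape those links too. I'm trying to do this with BeautifulSoup 4, and this is the code I have so far: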

import requests
from bs4 import BeautifulSoup
url = 'https://www.lbbusinessjournal.com/'
user_agent = 'Mozilla/5.0'  # placeholder; the post does not show how user_agent was defined
try:
    r = requests.get(url, headers={'User-Agent': user_agent})
    soup = BeautifulSoup(r.text, 'html.parser')
    for post in soup.find_all(['h3', 'li'], class_=['entry-title td-module-title', 'menu-item']):
        link = post.find('a').get('href')
        print(link)
        r = requests.get(link, headers={'User-Agent': user_agent})
        soup1 = BeautifulSoup(r.text, 'html.parser')
        for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
            link1 = post1.find('a').get('href')
            print(link1)
except Exception as e:
    print(e)

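I want the links on the page https://www.lbbusinessjournal.com/, and I also want to look for further links inside the pages I get from it (for example https://www.lbbusinessjournal.com/news/), so I want the links in https://www.lbbusinessjournal.com/news/ as well. So far I only get the links from the home page.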

Answer 1

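Try using raise e in your except clause, and you will see the error message

AttributeError: 'NoneType' object has no attribute 'get'

on the line link1 = post1.find('a').get('href'), where post1.find('a') returns None. This happens because at least one of the HTML h3 elements you retrieved has no a element; in fact, it looks like the link has been commented out in the HTML.

Instead, you should split the post1.find('a').get('href') call into two steps, and check that the element returned by post1.find('a') is not None before trying to get its 'href' attribute, i.e.:

for post1 in soup1.find_all('h3', class_='entry-title td-module-title'):
    element = post1.find('a')
    if element is not None:
        link1 = element.get('href')
        print(link1)

With this change, the code runs without raising the AttributeError.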
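
To see the None check and the nested scraping working together without touching the real site, here is a minimal sketch that runs on inline HTML. Everything apart from the BeautifulSoup calls is an assumption made for illustration: the names extract_links, crawl, PAGES and max_depth are hypothetical, and the example.com pages are stand-ins, one of which has its link commented out as described above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in pages so the sketch runs without network access.
PAGES = {
    "https://example.com/": """
        <h3 class="entry-title td-module-title"><a href="/news/">News</a></h3>
        <h3 class="entry-title td-module-title"><!-- <a href="/hidden/">Hidden</a> --></h3>
    """,
    "https://example.com/news/": """
        <h3 class="entry-title td-module-title"><a href="/news/story-1/">Story 1</a></h3>
        <h3 class="entry-title td-module-title"><a href="/">Home</a></h3>
    """,
}


def extract_links(html, base_url):
    """Return absolute hrefs from the title headings, skipping headings without an <a>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.find_all("h3", class_="entry-title td-module-title"):
        anchor = heading.find("a")  # None when the link is missing or commented out
        if anchor is None:
            continue
        href = anchor.get("href")
        if href:
            # Resolve relative hrefs against the page they were found on.
            links.append(urljoin(base_url, href))
    return links


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first: collect links from start_url and from pages up to max_depth hops away."""
    seen = set()      # pages already fetched, so no page is fetched twice
    found = []        # every distinct link, in discovery order
    queue = [(start_url, 0)]
    while queue:
        url, depth = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:  # page unavailable; skip it
            continue
        for link in extract_links(html, url):
            if link not in found:
                found.append(link)
            if depth < max_depth:
                queue.append((link, depth + 1))
    return found


# PAGES.get plays the role of the HTTP request in this sketch.
links = crawl("https://example.com/", PAGES.get, max_depth=1)
print(links)
```

Against a real site, the fetch argument could be a small function that wraps requests.get(url, headers=...) and returns r.text. The seen set and the depth limit are design choices of this sketch, added so that pages linking back to each other do not cause endless requests.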
