Question

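I am trying to build a web scraper with Beautiful Soup that prints the most popular post on reddit, but I keep running into an error. If possible, please explain in simple terms. Here is the code: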

import requests
from bs4 import BeautifulSoup
url = 'https://www.reddit.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
article = soup.find('div', attrs={"class": "y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE"})
headline = article.a.h3.text
print(headline)

Error:

AttributeError: 'NoneType' object has no attribute 'a'

Answer 1

"If possible, please explain in simple terms."

AttributeError:

“There was an error related to an attribute.”

'NoneType' object

“It happened because something in your program was the special None object,”

has no attribute 'a'

“and you tried to do .a with it, which is not possible.”

headline = article.a.h3.text
                  ^^

This is where you tried to get .a from something, which means article was None.

article = soup.find('div', attrs={"class": "y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE"})

This is how article got its value, which means soup.find returned None.

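You then read the documentation and learn that this means BeautifulSoup could not find a <div> tag with that class attribute value in the HTML. So of course you cannot find a nested <a> tag, because there is no tag for it to be nested in.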

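Chances are the server generates the class names randomly, so you will need to look at something else in the HTML to work out which class name you actually need, rather than relying on what was there the one time you looked at the page source.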
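A defensive version of the lookup makes this failure explicit instead of crashing. The following is only a minimal sketch: the `sample_html` string and the `first_headline` helper are made up for illustration and are not Reddit's real markup. It checks each step for None before going further.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not Reddit's real markup.
sample_html = """
<div class="post"><a href="/r/example"><h3>First headline</h3></a></div>
"""


def first_headline(html, class_name):
    """Return the first headline text under a div with class_name, or None."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("div", attrs={"class": class_name})
    if article is None:
        # soup.find returns None when nothing matches.
        return None
    link = article.find("a")
    if link is None or link.h3 is None:
        return None
    return link.h3.text


print(first_headline(sample_html, "post"))
print(first_headline(sample_html, "no-such-class"))
```

With a class name that does not exist in the HTML, the helper returns None rather than raising AttributeError, so the caller can decide what to do next.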

Answer 2

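You can use the "old" version of reddit to get the information (the new version uses JavaScript, so BeautifulSoup will not parse some of the elements you see in the browser):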

import requests
from bs4 import BeautifulSoup


url = 'https://old.reddit.com/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

print(soup.select_one('.entry a.title').text)

Prints:

Megathread: President Donald Trump announces he has tested positive for Coronavirus

Or: append .json to the URL

import json
import requests


url = 'https://reddit.com/.json'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
data = requests.get(url, headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

print(data['data']['children'][0]['data']['title'])

Note: Reddit also has an API, so you do not have to use BeautifulSoup.
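To print more than the first title, the same listing can be looped over. This is a sketch only: the `sample` dictionary is a hand-made stand-in that copies just the keys used above (`data`, `children`, `title`), and `top_titles` is a hypothetical helper. In a real script, `sample` would be replaced by the parsed response.

```python
# Hand-made stand-in for the parsed JSON; only the keys used above are included.
sample = {
    "data": {
        "children": [
            {"data": {"title": "First post"}},
            {"data": {"title": "Second post"}},
            {"data": {"title": "Third post"}},
        ]
    }
}


def top_titles(listing, limit=2):
    """Return up to `limit` post titles from a listing-shaped dict."""
    children = listing.get("data", {}).get("children", [])
    return [child["data"]["title"] for child in children[:limit]]


for title in top_titles(sample):
    print(title)
```

Using `.get` with defaults for the outer keys means an unexpected or empty response yields an empty list instead of a KeyError.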

Answer 3

Adding a user agent might help. Like this:

headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6'}

response = requests.get(url, headers=headers)

You can find user agents here: https://webscraping.com/blog/User-agents/
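One detail worth checking: the second positional parameter of requests.get is params (the query string), so the dictionary has to be passed with the headers= keyword to be sent as headers. The sketch below builds both variants without sending them; the `example-agent` value is a placeholder chosen for this illustration.

```python
import requests

url = "https://old.reddit.com/"
# Placeholder value for illustration; use a real browser string in practice.
headers = {"User-Agent": "example-agent"}

# Build the requests without sending them, so nothing touches the network.
as_params = requests.Request("GET", url, params=headers).prepare()
as_headers = requests.Request("GET", url, headers=headers).prepare()

# Passed as params, the dict ends up in the query string of the URL.
print(as_params.url)
# Passed as headers, the URL is unchanged and the header is set.
print(as_headers.url)
print(as_headers.headers["User-Agent"])
```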
