Question

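I'm trying to scrape some content from a site using Python. For example, the view count for this video (URL) always comes back as None. What am I doing wrong? Here is the code: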

from bs4 import BeautifulSoup
import requests

url = 'https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'lxml')
views = soup.body.find(class_='view-count style-scope ytd-video-view-count-renderer')
print(views)

Thanks! (By the way, when I try the code shown in the video, it works fine.)

Answer 1

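The page is loaded dynamically, and requests does not execute JavaScript, so the element you are looking for is not in the HTML it returns. The data is, however, embedded in the page as JSON, and you can use the re and json modules to extract it.

For example, to get the view count: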

import re
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Locate the embedded JSON with a regular expression.
# re.search needs a string, so convert the soup back to text first.
data = re.search(r"var ytInitialData = ({.*?});", str(soup)).group(1)
data = json.loads(data)

print(
    data["contents"]["twoColumnWatchNextResults"]["results"]["results"]["contents"][0][
        "videoPrimaryInfoRenderer"
    ]["viewCount"]["videoViewCountRenderer"]["viewCount"]["simpleText"]
)

Output:

124 views
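
The non-greedy pattern above stops at the first `};` it meets, so it could cut the JSON short if that sequence ever appears inside a string value. A minimal sketch of an alternative, assuming the page still contains the `var ytInitialData = ` marker: let `json.JSONDecoder.raw_decode` work out where the object ends. The helper name `extract_initial_data` and the stand-in `sample_html` are made up for illustration.

```python
import json

MARKER = "var ytInitialData = "


def extract_initial_data(html, marker=MARKER):
    """Return the JSON object that follows `marker` in `html`, or None if absent."""
    start = html.find(marker)
    if start == -1:
        return None  # marker missing: the page layout may have changed
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the rest of the script).
    obj, _end = json.JSONDecoder().raw_decode(html, start + len(marker))
    return obj


# Stand-in for response.text; this is not real YouTube markup.
sample_html = '<script>var ytInitialData = {"note": "a }; inside", "views": "124 views"};</script>'

data = extract_initial_data(sample_html)
print(data["views"])
```

Returning None when the marker is missing lets the caller handle a changed page layout instead of failing on `.group(1)` of a failed match.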

The variable data now holds all of the data as a Python dictionary (dict). To print everything available, you can use:

print(json.dumps(data, indent=4))

Output (truncated):

{
    "responseContext": {
        "serviceTrackingParams": [
            {
                "service": "CSI",
                "params": [
                    {
                        "key": "c",
                        "value": "WEB"
                    },
                    {
                        "key": "cver",
                        "value": "2.20210701.07.00"
                    },
                    {
                        "key": "yt_li",
                        "value": "0"
                    },
                    {
                        "key": "GetWatchNext_rid",
                        "value": "0x1d62a299beac9e1f"
                    }
                ]
            },
            {
                "service": "GFEEDBACK",
                "params": [
                    {
                        "key": "logged_in",
                        "value": "0"
                    },
                    {
                        "key": "e",
                        "value": "24037443,24058293,24058128,24003103,24042870,23882685,24023960,23944779,24027649,24046896,24059898,24049577,23983296,23966208,24056265,23891346,1714258,24049575,24045412,24003105,23999405,24051884,23891344,23986022,24049573,24056839,24053866,24058240,23744176,23998056,24010336,24037586,23934970,23974595,23735348,23857950,24036947,24051353,24038425,23990875,24052245,24063702,24058380,23983813,24058812,24026834,23996830,23946420,24001373,24049820,24030040,24062848,23968386,24027689,24004644,23804281,24049569,23973490,24044110,23884386,24012512,24044124,24059521,23918597,24007246,24049567,24022729,24037794"
                    }
                ]
            },
            {
                "service": "GUIDED_HELP",
                "params": [
                    {
                        "key": "logged_in",
                        "value": "0"
                    }
                ]
            },
            {
                "service": "ECATCHER",
                "params": [
                    {
                        "key": "client.version",
                        "value": "2.20210701"
                    },
                    {
                        "key": "client.name",
                        "value": "WEB"
                    }
                ]
            }
        ],
        "mainAppWebResponseContext": {
            "loggedOut": true
        },
        "webResponseContextExtensionData": {
            "ytConfigData": {
                "visitorData": "CgtoanprT1pPbmtWTSjYk46HBg%3D%3D",
                "rootVisualElementType": 3832
            },
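
Because the dictionary is large and deeply nested, working out the key path by eye can be tedious. A small sketch of a helper that walks the structure and lists every path ending at a given key; `find_key_paths` is a hypothetical name, and `sample` only mimics the nesting used in the view-count lookup above.

```python
def find_key_paths(obj, target, path=()):
    """Yield each path (a tuple of keys and list indexes) that ends at `target`."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == target:
                yield path + (key,)
            # Keep descending: the same key may also appear deeper down.
            yield from find_key_paths(value, target, path + (key,))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from find_key_paths(item, target, path + (index,))


# Tiny stand-in, not real YouTube data.
sample = {
    "contents": [
        {"videoPrimaryInfoRenderer": {"viewCount": {"simpleText": "124 views"}}}
    ]
}

for found in find_key_paths(sample, "simpleText"):
    print(found)
```

Running it against the real `data` with a key such as "simpleText" would likely return many paths, so filtering on a second key like "viewCount" may be needed.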

Answer 2

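When a site loads dynamically, I usually start by looking at the API requests (in the Network tab of the developer tools). I have had success with this on sites such as Udemy and Skillshare, but not on YouTube. So in this case I would use the official YouTube API. It is very easy to use, and there are plenty of code examples on GitHub. With it you simply request your data and get a JSON response, which you can convert to a dictionary with response.json().

Another option is Selenium, which is not a solution I like, and it is very resource- and time-consuming. Requesting from an API is faster than scraping or any other solution on earth. Scraping is what you fall back on when something does not provide an API.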
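
A minimal sketch of the official-API route, assuming the YouTube Data API v3 `videos` endpoint with `part=statistics` and an API key you have created yourself (`YOUR_API_KEY` is a placeholder). The helper names are made up for illustration, the standard library's urllib stands in for requests, and the network call is left commented out so the snippet runs without a key.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import urlopen

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def video_id_from_url(watch_url):
    """Pull the `v` query parameter out of a watch URL."""
    return parse_qs(urlparse(watch_url).query)["v"][0]


def build_stats_url(video_id, api_key):
    """Build the request URL for a video's statistics."""
    query = urlencode({"part": "statistics", "id": video_id, "key": api_key})
    return f"{API_ENDPOINT}?{query}"


def view_count(payload):
    """Read the view count from a decoded response; None if no video matched."""
    items = payload.get("items", [])
    if not items:
        return None
    # The API reports counts as strings, so convert to int.
    return int(items[0]["statistics"]["viewCount"])


def fetch_view_count(watch_url, api_key):
    """Call the API (needs network access and a valid key)."""
    url = build_stats_url(video_id_from_url(watch_url), api_key)
    with urlopen(url) as response:
        return view_count(json.load(response))


watch_url = "https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer"
print(build_stats_url(video_id_from_url(watch_url), "YOUR_API_KEY"))
# print(fetch_view_count(watch_url, "YOUR_API_KEY"))  # uncomment once you have a key
```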
