Web scraping in Python returns "None"

Time: 2021-07-05 21:54:01

Tags: python beautifulsoup

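I'm trying to scrape some content from a site with Python. For example, the view count for this video (URL) always comes back as None. What am I doing wrong? Here is the code: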

from bs4 import BeautifulSoup
import requests

url = 'https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'lxml')
views = soup.body.find(class_='view-count style-scope ytd-video-view-count-renderer')
print(views)

Thanks! (By the way, when I try the code shown in the video, it works fine.)

2 answers:

Answer 0: (score: 1)

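The page is loaded dynamically, and requests does not execute JavaScript, so the element you are looking for is not in the HTML it returns. However, the data is available in the page as JSON, and you can extract it with the re and json modules.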

For example, to get the view count:

import re
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Locate the embedded JSON with a regular expression.
# re.search needs a string, so convert the soup object with str() first.
data = re.search(r"var ytInitialData = ({.*?});", str(soup)).group(1)
data = json.loads(data)

print(
    data["contents"]["twoColumnWatchNextResults"]["results"]["results"]["contents"][0][
        "videoPrimaryInfoRenderer"
    ]["viewCount"]["videoViewCountRenderer"]["viewCount"]["simpleText"]
)

Output:

124 views
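
The non-greedy pattern above stops at the first `};` it meets, which could in principle occur inside a JSON string value. As a minimal sketch of an alternative, assuming the page still embeds the data as `var ytInitialData = {...};`, you can find the marker and let `json.JSONDecoder.raw_decode` parse exactly one JSON value. The function name `extract_initial_data` and the sample HTML are hypothetical, made up for illustration.

```python
import json

MARKER = "var ytInitialData = "


def extract_initial_data(html, marker=MARKER):
    """Return the JSON object that follows `marker` in `html` as a dict."""
    start = html.find(marker)
    if start == -1:
        raise ValueError("marker not found; the page layout may have changed")
    # raw_decode parses one JSON value from the start of the string and
    # ignores whatever follows it, so no closing pattern is needed.
    data, _end = json.JSONDecoder().raw_decode(html[start + len(marker):])
    return data


# Hypothetical, heavily shortened stand-in for the real page source.
sample_html = (
    '<script>var ytInitialData = {"viewCount": {"simpleText": "124 views"}, '
    '"note": "a }; inside a string"};</script>'
)

sample_data = extract_initial_data(sample_html)
print(sample_data["viewCount"]["simpleText"])
```

With the real page you would pass `requests.get(url).text` instead of `sample_html`.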

The variable data now holds all of the data as a Python dictionary (dict). To print everything that is available:

print(json.dumps(data, indent=4))

Output (truncated):

{
    "responseContext": {
        "serviceTrackingParams": [
            {
                "service": "CSI",
                "params": [
                    {
                        "key": "c",
                        "value": "WEB"
                    },
                    {
                        "key": "cver",
                        "value": "2.20210701.07.00"
                    },
                    {
                        "key": "yt_li",
                        "value": "0"
                    },
                    {
                        "key": "GetWatchNext_rid",
                        "value": "0x1d62a299beac9e1f"
                    }
                ]
            },
            {
                "service": "GFEEDBACK",
                "params": [
                    {
                        "key": "logged_in",
                        "value": "0"
                    },
                    {
                        "key": "e",
                        "value": "24037443,24058293,24058128,24003103,24042870,23882685,24023960,23944779,24027649,24046896,24059898,24049577,23983296,23966208,24056265,23891346,1714258,24049575,24045412,24003105,23999405,24051884,23891344,23986022,24049573,24056839,24053866,24058240,23744176,23998056,24010336,24037586,23934970,23974595,23735348,23857950,24036947,24051353,24038425,23990875,24052245,24063702,24058380,23983813,24058812,24026834,23996830,23946420,24001373,24049820,24030040,24062848,23968386,24027689,24004644,23804281,24049569,23973490,24044110,23884386,24012512,24044124,24059521,23918597,24007246,24049567,24022729,24037794"
                    }
                ]
            },
            {
                "service": "GUIDED_HELP",
                "params": [
                    {
                        "key": "logged_in",
                        "value": "0"
                    }
                ]
            },
            {
                "service": "ECATCHER",
                "params": [
                    {
                        "key": "client.version",
                        "value": "2.20210701"
                    },
                    {
                        "key": "client.name",
                        "value": "WEB"
                    }
                ]
            }
        ],
        "mainAppWebResponseContext": {
            "loggedOut": true
        },
        "webResponseContextExtensionData": {
            "ytConfigData": {
                "visitorData": "CgtoanprT1pPbmtWTSjYk46HBg%3D%3D",
                "rootVisualElementType": 3832
            },
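
Because the full dump is very large, it can help to search it for a key instead of reading it top to bottom. The following is a rough sketch, not part of the original answer: `find_key_paths` is a hypothetical helper, and `sample` is a made-up miniature of the real structure.

```python
def find_key_paths(node, target, path=()):
    """Yield (path, value) for every dict key equal to `target`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == target:
                yield path + (key,), value
            # Keep descending: the same key may also appear deeper down.
            yield from find_key_paths(value, target, path + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_key_paths(item, target, path + (index,))


# Hypothetical miniature of the structure returned by json.loads.
sample = {
    "contents": [
        {"videoPrimaryInfoRenderer": {"viewCount": {"simpleText": "124 views"}}}
    ]
}

for path, value in find_key_paths(sample, "simpleText"):
    print(path, value)
```

Each printed path lists the keys and list indexes to chain together, in the same way as the long subscript expression used earlier to reach the view count.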

Answer 1: (score: 0)

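When a website loads its content dynamically, I usually look at the API requests it makes (in the Network tab of the browser dev tools). That has worked for me on sites such as Udemy and Skillshare, but not on YouTube. So in this case I would use the official YouTube API. It is very easy to use, and there are plenty of code examples on GitHub. You simply request the data and get a JSON response, which response.json() converts to a dictionary. Another option is Selenium, which I don't like as a solution because it is very resource- and time-intensive. Requesting data from an API is faster than scraping or any other approach; scrape only when a site does not offer an API.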
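
A minimal sketch of the official-API route, assuming the YouTube Data API v3 `videos.list` endpoint with `part=statistics` and an API key you create yourself. The helper names are hypothetical, and the sample response is hand-written in the shape the API documents. It uses `urllib` from the standard library rather than requests only to stay dependency-free.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def build_stats_url(video_id, api_key):
    """Build a videos.list request URL that asks only for statistics."""
    query = urlencode({"part": "statistics", "id": video_id, "key": api_key})
    return f"{API_ENDPOINT}?{query}"


def parse_view_count(payload):
    """Return the view count from a videos.list response, or None if absent."""
    items = payload.get("items", [])
    if not items:
        return None  # no video matched the given ID
    # The API returns the count as a string, so convert it to int.
    return int(items[0]["statistics"]["viewCount"])


def fetch_view_count(video_id, api_key):
    """Perform the request (needs network access and a valid API key)."""
    with urlopen(build_stats_url(video_id, api_key)) as response:
        return parse_view_count(json.load(response))


# Offline check with a hand-written (assumed) response body.
sample_payload = {"items": [{"id": "1OfK8UmLMl0", "statistics": {"viewCount": "124"}}]}
print(parse_view_count(sample_payload))
```

To query the live API you would call `fetch_view_count("1OfK8UmLMl0", api_key)` with your own key.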