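I'm trying to scrape some content from a site using Python. For example, the view count for this video (URL) always comes back as None. What am I doing wrong? Here is the code: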
from bs4 import BeautifulSoup
import requests
url = 'https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
views = soup.body.find(class_='view-count style-scope ytd-video-view-count-renderer')
print(views)
Thanks! (By the way, when I try the code shown in the video, it works fine.)
Answer 0 (score: 1)
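The page is loaded dynamically. requests only downloads the initial HTML and does not execute JavaScript, so elements rendered in the browser are not in the response. However, the data is embedded in the page as JSON, and you can use the re and json modules to extract it.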
For example, to get the view count:
import re
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# We locate the JSON data using a regular-expression pattern
data = re.search(r"var ytInitialData = ({.*?});", str(soup)).group(1)  # re.search needs a string, not a BeautifulSoup object
data = json.loads(data)
print(
    data["contents"]["twoColumnWatchNextResults"]["results"]["results"]["contents"][0][
        "videoPrimaryInfoRenderer"
    ]["viewCount"]["videoViewCountRenderer"]["viewCount"]["simpleText"]
)
Output:
124 views
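The non-greedy pattern above stops at the first `});` it meets, which could cut the JSON short if that sequence ever appears inside a string value. Here is a minimal sketch of a more tolerant alternative, assuming the page still embeds the data as `var ytInitialData = {...};`. It locates the assignment with a regex and then lets `json.JSONDecoder.raw_decode` parse exactly one JSON value from that position. The helper name `extract_initial_data` and the sample HTML are made up for illustration.

```python
import json
import re


def extract_initial_data(html, var_name="ytInitialData"):
    """Return the JSON object assigned to `var_name` in `html`, or None.

    Hypothetical helper: it assumes the page contains `var <name> = {...}`.
    """
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the trailing ";" and more script).
    try:
        data, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return data


# Made-up sample standing in for response.text
sample_html = '<script>var ytInitialData = {"note": "a});b", "views": "124 views"};</script>'
print(extract_initial_data(sample_html))
```

With the real page you would pass `response.text` instead of `sample_html`. Returning None on failure is just one possible design; raising an exception may suit a script that cannot continue without the data.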
The variable data holds all of the data as a Python dictionary (dict). To print everything it contains, you can use:
print(json.dumps(data, indent=4))
Output (truncated):
{
"responseContext": {
"serviceTrackingParams": [
{
"service": "CSI",
"params": [
{
"key": "c",
"value": "WEB"
},
{
"key": "cver",
"value": "2.20210701.07.00"
},
{
"key": "yt_li",
"value": "0"
},
{
"key": "GetWatchNext_rid",
"value": "0x1d62a299beac9e1f"
}
]
},
{
"service": "GFEEDBACK",
"params": [
{
"key": "logged_in",
"value": "0"
},
{
"key": "e",
"value": "24037443,24058293,24058128,24003103,24042870,23882685,24023960,23944779,24027649,24046896,24059898,24049577,23983296,23966208,24056265,23891346,1714258,24049575,24045412,24003105,23999405,24051884,23891344,23986022,24049573,24056839,24053866,24058240,23744176,23998056,24010336,24037586,23934970,23974595,23735348,23857950,24036947,24051353,24038425,23990875,24052245,24063702,24058380,23983813,24058812,24026834,23996830,23946420,24001373,24049820,24030040,24062848,23968386,24027689,24004644,23804281,24049569,23973490,24044110,23884386,24012512,24044124,24059521,23918597,24007246,24049567,24022729,24037794"
}
]
},
{
"service": "GUIDED_HELP",
"params": [
{
"key": "logged_in",
"value": "0"
}
]
},
{
"service": "ECATCHER",
"params": [
{
"key": "client.version",
"value": "2.20210701"
},
{
"key": "client.name",
"value": "WEB"
}
]
}
],
"mainAppWebResponseContext": {
"loggedOut": true
},
"webResponseContextExtensionData": {
"ytConfigData": {
"visitorData": "CgtoanprT1pPbmtWTSjYk46HBg%3D%3D",
"rootVisualElementType": 3832
},
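Because the dumped structure is large and deeply nested, finding the path to one field by eye is tedious. Below is a small sketch of a recursive search that reports every path at which a given key occurs. `find_key` is a hypothetical helper, not part of any library, and `sample` is a made-up miniature of the real `data`.

```python
def find_key(obj, target, path=()):
    """Yield (path, value) for every occurrence of `target` in nested dicts/lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == target:
                yield path + (key,), value
            # Keep descending: the same key may appear deeper down as well.
            yield from find_key(value, target, path + (key,))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from find_key(item, target, path + (index,))


# Made-up miniature of the structure; the real `data` is far larger.
sample = {
    "contents": [
        {"videoPrimaryInfoRenderer": {"viewCount": {"simpleText": "124 views"}}}
    ]
}
for found_path, value in find_key(sample, "simpleText"):
    print(found_path, value)
```

Running the same loop over the real `data` should list candidate paths, which can then be written out as explicit subscripts like the ones in the answer.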
Answer 1 (score: 0)
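When a site loads its content dynamically, I usually start by looking at the API requests it makes (in the Network tab of the browser's developer tools). That has worked for me on sites such as Udemy and Skillshare, but not on YouTube. So in this case I would use the official YouTube API. It is very easy to use, and there are plenty of code examples on GitHub. You simply request the data you need and get a JSON response, which response.json() converts to a dictionary. Another option is Selenium, but it is not a solution I like, since it is very resource- and time-intensive. Requesting data from an API is faster than scraping or pretty much any other approach. Scraping is what you fall back on when a site does not provide an API.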
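As a rough illustration of the API route, here is a sketch that assumes the YouTube Data API v3 `videos` endpoint with `part=statistics` and an API key you have created yourself in the Google Cloud console. The helper names are made up. It uses only the standard library so it runs without extra installs; with requests, the same call would be `requests.get(API_URL, params={...}).json()`.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_url(video_id, api_key):
    """Build a videos request URL asking only for the statistics part."""
    query = urllib.parse.urlencode({"part": "statistics", "id": video_id, "key": api_key})
    return f"{API_URL}?{query}"


def view_count(payload):
    """Pull the view count out of a decoded API response, or return None."""
    items = payload.get("items", [])
    if not items:
        return None  # no video matched the given ID
    count = items[0].get("statistics", {}).get("viewCount")
    # The API reports counts as strings, so convert before returning.
    return int(count) if count is not None else None


def fetch_view_count(video_id, api_key):
    """Perform the network call; requires a valid API key."""
    with urllib.request.urlopen(build_url(video_id, api_key)) as response:
        return view_count(json.load(response))


# Usage (needs a real key, so it is left commented out):
# print(fetch_view_count("1OfK8UmLMl0", "YOUR_API_KEY"))
```

Keeping the parsing in `view_count` separate from the network call makes it possible to check the logic against a saved response without spending API quota.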