Question

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request("https://www.twitch.tv/directory/game/League%20of%20Legends/clips")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "html.parser")

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))

print(links)

This is the code I have so far. I'm not sure how to modify it to get the clip links from Twitch.

Answer 1

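The URLs are created dynamically, so just loading the HTML is not enough. If you look at the requests the browser makes to fetch the data, you will see that it comes back in a JSON object.

You either need to use something like selenium to automate a browser and collect all the URLs, or request the JSON yourself, like this: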

import requests

url = "https://gql.twitch.tv/gql"
json_req = """[{"query":"query ClipsCards__Game($gameName: String!, $limit: Int, $cursor: Cursor, $criteria: GameClipsInput) { game(name: $gameName) { id clips(first: $limit, after: $cursor, criteria: $criteria) { pageInfo { hasNextPage __typename } edges { cursor node { id slug url embedURL title viewCount language curator { id login displayName __typename } game { id name boxArtURL(width: 52, height: 72) __typename } broadcaster { id login displayName __typename } thumbnailURL createdAt durationSeconds __typename } __typename } __typename } __typename } } ","variables":{"gameName":"League of Legends","limit":100,"criteria":{"languages":[],"filter":"LAST_DAY"},"cursor":"MjA="},"operationName":"ClipsCards__Game"}]"""
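# The client-id header is copied from the request the browser sends.
r = requests.post(url, data=json_req, headers={"client-id":"kimne78kx3ncx6brgo4mv6wki5h1ko"})
r_json = r.json()

# The request body is a one-element list, so the response is a list as well.
edges = r_json[0]['data']['game']['clips']['edges']
urls = [edge['node']['url'] for edge in edges]

for clip_url in urls:
    print(clip_url)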

This will give you the first 100 URLs, starting with:

https://clips.twitch.tv/CourageousOnerousChoughWOOP
https://clips.twitch.tv/PhilanthropicAssiduousSwordHassaanChop
https://clips.twitch.tv/MistyThoughtfulLardPRChase
https://clips.twitch.tv/HotGoldenAmazonSSSsss
https://clips.twitch.tv/RelievedViscousPangolinOSsloth
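
To go past the first 100 results, the query above already exposes what is needed: each edge carries a cursor and pageInfo has hasNextPage. Below is a minimal sketch of how that paging could be driven. It is not an official client. The trimmed query, the helper names (build_payload, iter_clip_urls, twitch_post) and the max_pages cap are all assumptions made for illustration, and the endpoint is undocumented, so it may change or reject this request. The hard-coded cursor "MjA=" in the answer is base64 for "20", which suggests the cursor encodes an offset, but that is a guess too; the sketch simply passes back whatever cursor the last edge returned.

```python
# Assumption: a trimmed version of the query above, keeping only the fields the loop needs.
QUERY = (
    "query ClipsCards__Game($gameName: String!, $limit: Int, $cursor: Cursor, "
    "$criteria: GameClipsInput) { game(name: $gameName) { "
    "clips(first: $limit, after: $cursor, criteria: $criteria) { "
    "pageInfo { hasNextPage } edges { cursor node { url } } } } }"
)


def build_payload(game_name, cursor=None, limit=100, period="LAST_DAY"):
    """Build the request body; the cursor is left out for the first page."""
    variables = {
        "gameName": game_name,
        "limit": limit,
        "criteria": {"languages": [], "filter": period},
    }
    if cursor is not None:
        variables["cursor"] = cursor
    return [{"query": QUERY, "variables": variables,
             "operationName": "ClipsCards__Game"}]


def iter_clip_urls(post, game_name, max_pages=5):
    """Yield clip URLs page by page.

    `post` is any callable that takes a payload and returns the parsed JSON
    response, which keeps the paging logic independent of the HTTP library.
    """
    cursor = None
    for _ in range(max_pages):
        clips = post(build_payload(game_name, cursor))[0]["data"]["game"]["clips"]
        edges = clips["edges"]
        for edge in edges:
            yield edge["node"]["url"]
        # Stop when the server reports no further page or returns nothing.
        if not edges or not clips["pageInfo"]["hasNextPage"]:
            break
        # Continue after the last clip of the page just received.
        cursor = edges[-1]["cursor"]


def twitch_post(payload):
    """Send the payload to the same endpoint and client-id used in the answer."""
    import requests  # third-party; imported here so the helpers above work without it

    r = requests.post(
        "https://gql.twitch.tv/gql",
        json=payload,
        headers={"client-id": "kimne78kx3ncx6brgo4mv6wki5h1ko"},
        timeout=10,
    )
    r.raise_for_status()
    return r.json()
```

To use it, loop over iter_clip_urls(twitch_post, "League of Legends") and print each URL. Passing the HTTP call in as a function is a design choice that lets you test the paging with canned responses before hitting the real endpoint.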
