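I need some help with a parser for a website, for example: https://filmy.plus/kategoria/Horror
I wrote some code and everything works, but only for the first 12 videos (the initial page load):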
    from requests import get
    from bs4 import BeautifulSoup

    url = 'https://filmy.plus/kategoria/Horror'
    response = get(url)
    #print(response.text[:6000])
    html_soup = BeautifulSoup(response.text, 'lxml')

    # Each movie tile is a div carrying both of these classes
    movie_containers = html_soup.find_all('div', class_='movie-box-3 movie-box-search')
    print(url, "\nLiczba Filmów: ", len(movie_containers), "\n")

    # Print the title and the absolute link of every movie found
    for movie in movie_containers:
        print(movie.a.h2.text)
        print('https://filmy.plus' + movie.a['href'] + '\n')
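But how do I load all the videos? Do I need to click "Pokaż więcej" 3-4 times to load every video on the site? I don't know how to get around that and collect all the links from the URL.
Thanks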
Answer 0 (score: 1)
The hard way
Use Selenium WebDriver to simulate clicking through to the next page.
The easy way
Use the website's API:
https://filmy.plus/jquery_kategorie_pokaz_wiecej.php?kategoria=Horror&strona=1
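To make the "easy way" a bit more concrete, here is a minimal sketch of paging through that endpoint. Only the URL and its two query parameters (kategoria, strona) come from the link above. The helper names page_url and iter_pages, the injected fetch callable, and the rule "stop when a page comes back empty" are my own assumptions, not documented behaviour of the site.

```python
from urllib.parse import urlencode

API_URL = 'https://filmy.plus/jquery_kategorie_pokaz_wiecej.php'


def page_url(category, page):
    """Build the 'show more' URL for one page of a category."""
    return API_URL + '?' + urlencode({'kategoria': category, 'strona': page})


def iter_pages(fetch, category, max_pages=100):
    """Yield one payload per page until fetch returns something empty.

    `fetch` is a hypothetical callable (URL in, payload out) supplied by
    the caller; stopping on an empty payload is an assumption.
    """
    for page in range(1, max_pages + 1):
        payload = fetch(page_url(category, page))
        if not payload:
            break
        yield payload


# Offline demo: a dict stands in for the network
fake_site = {
    page_url('Horror', 1): ['film A', 'film B'],
    page_url('Horror', 2): ['film C'],
}
print(page_url('Horror', 1))
print(list(iter_pages(fake_site.get, 'Horror')))
```

In real use, fetch would wrap an HTTP call, for example a small function around requests.get that returns the decoded JSON. Passing it in as a parameter keeps the paging logic testable without hitting the site.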
Answer 1 (score: 1)
As @DivideBy0 mentioned, you can scrape all the data using the API:
    import re
    import requests

    result = {}
    # Request pages 1..100 of the "show more" endpoint
    for i in range(100):
        response = requests.get('https://filmy.plus/jquery_kategorie_pokaz_wiecej.php?kategoria=Horror&strona={}'.format(i+1))
        # Each entry in 'wynik' is an HTML fragment describing one film
        for film in response.json()['wynik']:
            title = re.findall('title="(.*)"', film)[0]
            link = 'https://filmy.plus' + re.findall('href="(.*)" ', film)[0]
            # Keyed by title, so films that show up twice are stored once
            result[title] = link

    print('Videos found: {}'.format(len(result)))
    for i, (title, link) in enumerate(result.items(), 1):
        print('{}. {} {}'.format(i, title, link))
You will get this output:
Videos found: 66
1. Anakondy: Polowanie na Krwawą Orchideę https://filmy.plus/film2/Anakondy.Polowanie.Na.Krwawa.Orchidee
2. Uciec przeznaczeniu https://filmy.plus/film/Uciec+przeznaczeniu-2009-378067
3. Jad https://filmy.plus/film/Jad-1981-11436
4. Venom https://filmy.plus/film/Venom-1971-37749
5. Zakonnica https://filmy.plus/film/Zakonnica-2018-777024
etc.
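If the regular expressions turn out to be fragile (a greedy `.*` can run past the attribute when several quoted values share a line), the fragments can be read with an HTML parser instead. This is only a sketch using the standard library's html.parser. It assumes each fragment contains an `<a>` tag that carries both the href and title attributes, which I have not verified against the real responses. LinkExtractor, extract_links and the sample fragment are made up for illustration.

```python
from html.parser import HTMLParser

BASE_URL = 'https://filmy.plus'


class LinkExtractor(HTMLParser):
    """Collect (title, absolute link) pairs from <a> tags with both attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        # Skip anchors that lack either attribute
        if attrs.get('title') and attrs.get('href'):
            self.links.append((attrs['title'], BASE_URL + attrs['href']))


def extract_links(fragment):
    """Return every (title, link) pair found in one HTML fragment."""
    parser = LinkExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.links


# Made-up fragment; the real markup returned by the site may differ
sample = ('<div class="movie-box-3">'
          '<a href="/film/Jad-1981-11436" title="Jad"><h2>Jad</h2></a>'
          '</div>')
print(extract_links(sample))
```

Inside the loop from the answer, extract_links(film) would take the place of the two re.findall calls.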