I can't get the URL of the next page; my code raises a traceback error. Basically, I want to grab "/browse-movies?page=2".
from bs4 import BeautifulSoup
import requests
import re
url = "https://yts.ag/browse-movies?page=1"
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
items = soup.find_all('ul', 'tsc_pagination')[0]
for item in items:
    print(item)
Answer 0 (score: 1)
You can use range(1, 300) to iterate over all the pages:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

for i in range(1, 300):
    url = "https://yts.ag/browse-movies?page=%s" % i
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    # Each movie on the page sits in its own 'browse-movie-wrap' div.
    items = soup.find_all('div', 'browse-movie-wrap')
    for item in items:
        # Defaults, so a missing element does not raise NameError at print time.
        title = year = rating = genre = None
        for val in item.find_all('div', 'browse-movie-bottom'):
            title = item.find_all('a', 'browse-movie-title')[0].text
            year = item.find_all('div', 'browse-movie-year')[0].text
        for val in item.find_all('a', 'browse-movie-link'):
            try:
                # Rating and genre are read from the first two <h4> tags.
                rating = val.find_all('h4')[0].text
                genre = val.find_all('h4')[1].text
            except IndexError:
                # Fewer than two <h4> tags: keep whatever was found so far.
                pass
        print(year, rating, genre, title)
P.S. You may want to add time.sleep(1) between requests to slow things down, in case they block your IP for scraping their pages too aggressively.
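As a rough illustration of the time.sleep(1) suggestion, here is a minimal sketch that pauses between consecutive requests. The helper name fetch_all and its fetch and delay parameters are hypothetical, not part of the original answer; fetch is any function you supply that downloads one URL.

```python
import time


def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls.

    `fetch` is a caller-supplied function (hypothetical here), for example a
    thin wrapper around requests.get.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(delay)
        results.append(fetch(url))
    return results
```

With requests, this could be called as fetch_all(urls, lambda u: requests.get(u, headers=headers).text). The one-second default is only the value mentioned above; a suitable delay depends on the site.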
Edit:
To find the next page URL, you can use a regular expression:
import re
next_page = soup.find('a', text=re.compile(r'.*Next.*'))
print(next_page['href'])
What this does is look for an 'a' tag whose text matches the regular expression '.*Next.*'.
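The href found this way is a relative path such as "/browse-movies?page=2" (the value the question asks for), so it has to be joined with the site URL before it can be requested. The sketch below follows "Next" links until none is left. It is only an outline under assumptions: the names iter_page_urls, get_next_href and max_pages are hypothetical, and get_next_href stands for a function you write that downloads a page and returns the href of its "Next" link, or None on the last page.

```python
from urllib.parse import urljoin  # Python 3 standard library


def iter_page_urls(start_url, get_next_href, max_pages=300):
    """Yield page URLs by following "Next" links until none is found.

    get_next_href(url) is a caller-supplied function (hypothetical here) that
    returns the href of the "Next" link on that page, or None if there is none.
    """
    url = start_url
    for _ in range(max_pages):
        yield url
        href = get_next_href(url)
        if not href:
            break
        # Resolve a relative href such as "/browse-movies?page=2"
        # against the URL of the current page.
        url = urljoin(url, href)
```

With requests and BeautifulSoup, get_next_href could fetch the page, run the soup.find call shown above, and return next_page['href'] when a match exists. The max_pages cap is a safety limit so that a page linking back to itself cannot loop forever.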
Answer 1 (score: 1)
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
# Make a URL list and iterate over it.
urls = ["https://yts.ag/browse-movies?page={}".format(i) for i in range(1, 10)]
for url in urls:
    response = requests.get(url, headers=headers)
    # your parsing code here (it should set year, rating, genre, title)
    print(year, rating, genre, title)
Make a list of URLs and iterate over it. You can change the range to cover more or fewer pages.