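How to scrape the next page URL from pagination.

Time: 2017-02-02 11:23:06

Tags: python pagination beautifulsoup

I can't get the URL of the next page; my code raises a traceback error. Basically I want to grab "/browse-movies?page=2".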

from bs4 import BeautifulSoup
import requests
import re
url = "https://yts.ag/browse-movies?page=1"
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
items = soup.find_all('ul', 'tsc_pagination')[0]
for item in items:
    print(item)

2 answers:

Answer 0: (score: 1)

You can use range(1, 300) to iterate over all the pages:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

for i in range(1, 300):
    url = "https://yts.ag/browse-movies?page=%s" % i

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.find_all('div', 'browse-movie-wrap')
    for item in items:
        rating = genre = ''  # defaults, so values from the previous movie are not reused
        for val in item.find_all('div','browse-movie-bottom'):
            title = item.find_all('a','browse-movie-title')[0].text
            year = item.find_all('div','browse-movie-year')[0].text
        for val in item.find_all('a','browse-movie-link'):
            try:
                rating = val.find_all('h4')[0].text
                genre = val.find_all('h4')[1].text
            except IndexError:
                # fewer than two h4 tags: keep the defaults
                pass

        print("%s %s %s %s" % (year, rating, genre, title))

P.S. You may want to add time.sleep(1) to slow things down, in case they block your IP for scraping their pages too aggressively.
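To make the P.S. concrete, here is a minimal sketch of a throttled loop. The function name crawl_pages and its fetch parameter are made up for illustration; fetch stands in for something like requests.get(url, headers=headers).text, and the one-second default is just the value suggested above, not a limit documented by the site.

```python
import time


def crawl_pages(fetch, first_page, last_page, delay=1.0):
    """Fetch pages first_page..last_page in order, pausing between requests.

    `fetch` is a hypothetical callable that takes a URL and returns the page
    body, e.g. lambda url: requests.get(url, headers=headers).text
    """
    bodies = []
    for page in range(first_page, last_page + 1):
        url = "https://yts.ag/browse-movies?page=%s" % page
        bodies.append(fetch(url))
        if page != last_page:
            # Pause between requests; no need to wait after the last one.
            time.sleep(delay)
    return bodies
```

Passing the fetch function in keeps the delay logic separate from the HTTP call, so you can try the loop with a fake fetch before pointing it at the real site.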

Edit:

Now, to find the next page URL, you can use a regular expression:

import re

next_page = soup.find('a', text=re.compile(r'.*Next.*'))
print(next_page['href'])

What this does is look for an a tag whose text matches the regular expression '.*Next.*'.
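Since the href you get back is relative ("/browse-movies?page=2"), you will probably want to turn it into a full URL before requesting it. Below is a rough sketch, written for Python 3, that wraps the regex lookup in a helper and resolves the link with urljoin. The helper name next_page_url and the sample markup are invented for illustration; the real page may be structured differently. It uses string=, which is the newer spelling of the text= argument in Beautiful Soup.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the 'Next' link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # Find an <a> tag whose text contains "Next".
    link = soup.find("a", string=re.compile(r"Next"))
    if link is None or not link.get("href"):
        return None
    # The href is relative, so resolve it against the page it came from.
    return urljoin(current_url, link["href"])


# Made-up markup for illustration only (an assumption, not the site's HTML).
sample_html = """
<ul class="tsc_pagination">
  <li><a href="/browse-movies?page=1">1</a></li>
  <li><a href="/browse-movies?page=2">2</a></li>
  <li><a href="/browse-movies?page=2">Next &raquo;</a></li>
</ul>
"""

print(next_page_url(sample_html, "https://yts.ag/browse-movies?page=1"))
```

Returning None when there is no "Next" link gives you a natural stopping condition: keep requesting next_page_url(...) until it comes back as None, instead of guessing an upper bound for range().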

Answer 1: (score: 1)

import requests

# make a url list and iterate over it
urls = ["https://yts.ag/browse-movies?page={}".format(i) for i in range(1, 10)]
headers = {'User-Agent': 'Mozilla/5.0'}
for url in urls:
    response = requests.get(url, headers=headers)
    # your code here: parse response.text and set year, rating, genre, title
    print("%s %s %s %s" % (year, rating, genre, title))

Build a list of URLs and iterate over it. You can change the range.