Question

from bs4 import BeautifulSoup
import pandas as pd
import requests

r = requests.get('https://reelgood.com/source/netflix')
soup = BeautifulSoup(r.text, 'html.parser')

title = soup.find_all('tr',attrs={'class':'cM'})

records = []
for t in title:
    movie = t.find(attrs={'class':'cI'}).text
    # The four <td> cells after the 'cJ' element hold, in order:
    # year, rating, score and Rotten Tomatoes score.
    year_td = t.find(attrs={'class':'cJ'}).find_next('td')
    rating_td = year_td.find_next('td')
    score_td = rating_td.find_next('td')
    rotten_td = score_td.find_next('td')
    year, rating, score, rottenTomatoe = (td.text for td in (year_td, rating_td, score_td, rotten_td))
    episodes = t.find(attrs={'class':'c0'}).text[:3]  # first three characters only
    records.append([movie, year, rating, score, rottenTomatoe, episodes])

df = pd.DataFrame(records, columns=['movie', 'year', 'rating', 'score', 'rottenTomatoe', 'episodes'])

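The code above returns 49 records, which is the first page. There are 43 pages I want to scrape. Each time you go to the next page to get the next 50 videos, the URL changes: going from the first page to the second adds "?offset=150", and every page after that increases the offset by 100. For example, this is what the URL of the last page looks like (as you can see, offset=4250): "https://reelgood.com/source/netflix?offset=4250"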
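
To make the pattern concrete, here is a minimal sketch that only builds the list of page URLs as described above. It assumes the offsets really do run 150, 250, and so on up to 4250 with no gaps; the names BASE_URL and page_urls are illustrative and not part of the original code.

```python
BASE_URL = "https://reelgood.com/source/netflix"


def page_urls(first_offset=150, last_offset=4250, step=100):
    """Return the first page URL followed by one URL per offset."""
    urls = [BASE_URL]  # the first page has no offset parameter
    for offset in range(first_offset, last_offset + 1, step):
        urls.append(f"{BASE_URL}?offset={offset}")
    return urls


urls = page_urls()
print(len(urls))  # one entry per page
print(urls[1])    # second page
print(urls[-1])   # last page
```

Each of these URLs could then be fetched and parsed with the same row-extraction code used for the first page.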

Any help on how to get the result set for all pages would be greatly appreciated. Thanks.

Answer 1

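I think the easiest way is to grab the element with class='eH', which contains the link to more content.

It is the only element on the page with that class. When you reach offset=4250, the link is gone.

So the loop would look something like this: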

records = []
keep_looping = True
url = "https://reelgood.com/source/netflix"
while keep_looping:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    # grab your content here and store it and find the next link to visit.
    title = soup.find....
    for t in title:
        ....
        records.append...
    # if the tag does not exist, url_tag will be None;
    # we then tell the while-loop to stop by setting the keep_looping flag to False
    url_tag = soup.find('a', class_='eH')
    # the href is not an absolute URL but a relative one such as "/source/netflix?offset=150"
    if not url_tag:
        keep_looping = False
    else:
        url = "https://www.reelgood.com" + url_tag.get('href')
df = pd.DataFrame...
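
The loop above can be sketched as a self-contained function. This is only one possible arrangement: scrape_all, fetch, parse_rows and max_pages are hypothetical names, the caller supplies how a page is downloaded and how rows are extracted, and the 'eH' class is taken from this answer and may no longer match the live site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all(start_url, fetch, parse_rows, next_class="eH", max_pages=100):
    """Follow the 'more' link page by page and collect rows from each page.

    fetch(url) must return the page HTML as text.
    parse_rows(soup) must return a list of records for that page.
    max_pages is a safety cap so a layout change cannot loop forever.
    """
    records = []
    url = start_url
    pages = 0
    while url and pages < max_pages:
        soup = BeautifulSoup(fetch(url), "html.parser")
        records.extend(parse_rows(soup))
        pages += 1
        # On the last page the link is absent, so find() returns None.
        next_tag = soup.find("a", class_=next_class)
        href = next_tag.get("href") if next_tag else None
        # The href is relative ("/source/netflix?offset=150"), so resolve it
        # against the URL of the page it was found on.
        url = urljoin(url, href) if href else None
    return records
```

With requests, fetch could be as simple as lambda u: requests.get(u).text, and parse_rows would wrap the per-row extraction from the question. Passing both in keeps the paging logic separate from the row parsing, which is the part most likely to change.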

Answer 2

I work at Reelgood. Please note that the class names on https://reelgood.com will change whenever we release an update to our web app.

We would be more than happy to lend a hand with whatever you are trying to accomplish here, so feel free to email me at luigi@reelgood.com.
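
Because the class names can change between releases, one way to depend on them less is to look for the next-page link by its href instead of its class. The sketch below is an assumption rather than anything Reelgood recommends: next_page_href and OFFSET_RE are hypothetical names, and it assumes the next-page link is an ordinary anchor whose href carries an offset parameter larger than the current one.

```python
import re

from bs4 import BeautifulSoup

# Matches "offset=<digits>" in a query string, e.g. "/source/netflix?offset=150".
OFFSET_RE = re.compile(r"[?&]offset=(\d+)")


def next_page_href(html, current_offset=0):
    """Return the href with the smallest offset above current_offset, or None."""
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    for a in soup.find_all("a", href=True):
        match = OFFSET_RE.search(a["href"])
        if match and int(match.group(1)) > current_offset:
            candidates.append((int(match.group(1)), a["href"]))
    # min() compares by offset first, so the nearest following page wins.
    return min(candidates)[1] if candidates else None
```

A None result would then play the same role as the missing link on the last page: it signals that there is nothing further to fetch.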

Web scraping multiple pages when the URL changes and adds 'offset=[# here]'
