Trying again to get some help for a piece of university research. I'm trying to figure out a way to scrape all of the reviews for each movie without manually writing out every URL and iterating over them as a set.
So I'm trying to find the "Next" button and use it to decide how many pages of reviews to collect. In theory it should stop on the last page of reviews, since there is no "Next" button on the last page. So if there are three pages of reviews, it should stop on the third.
For simplicity, this is just some of my code for now, but it only gets the first page of reviews.
import re

import requests
from bs4 import BeautifulSoup

s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
           'Headers': "http://www.imdb.com/"}

count = 0
url = 'http://www.imdb.com/title/tt0182408/reviews?start=' + str(count)
r = s.get(url).content
page = s.get(url)
soup = BeautifulSoup(page.content, "lxml")
soup.prettify()

cj = s.cookies
requests.utils.dict_from_cookiejar(cj)

nv = soup.find("input", value="nv_sr_fn")["value"]
hidden_data = dict(ref_=nv)
s.post(url, data=hidden_data, headers=headers)

# strip out the paragraphs nested inside other divs so only review text is left
important = soup.find("div", id='tn15content')
for div in important.findAll("div"):
    for p in div.findAll("p"):
        p.decompose()

for small in important.findAll("small", text=re.compile("review useful:")):
    div = small.parent
    user_id = div.select_one("a[href^='/user/ur']")["href"].split("r/")[1].rstrip("/")
    rating = div.select_one("img[alt*='/10']")
    print(user_id, rating["alt"] if rating else "N/A")
    print(div.findAll("small"))
    print(div.find_next("h2").text.strip())
    print(div.find_next("a").text.strip())
    print(div.find_next("p").text.strip())

# attempt to detect the Next button and bump the start offset
for td in important.findAll('td'):
    for a in td.findAll('a'):
        for img in a.findAll('img', alt=True):
            if img['alt'] == "[Next]":
                count += 10
            else:
                break
This is the last review I get from the first page:
ur0186755 1/10
[<small>11 out of 20 people found the following review useful:</small>, <small>from South Texas</small>, <small>27 March 1999</small>]
One of the stupidest films ever made...
Before I start to tear apart this movie, mark you--I LOVE THE SCARLET PIMPERNEL. That story is one of the best romantic adventures ever written. The movie staring Jane Grey is very good and the musical on Broadway is the hottest thing there. So, I thought when I heard that this film was coming out that it would be great since it was a BBC film.

To my surprise, it was a weak, totally stupid story that UTTERLY failed in capturing the gorgeous tale.

There were no exciting escapes with daring disguises. There was no deep love that made your heart flutter as Percy left the room and Marguerite sighed as her husband was leaving her again.

All it had was a confusing plot and a lot of out-of-the-blue sex and violence.

Sink me! What a horrible movie!
Any tips on how to collect the reviews from every page without manually putting the URLs into a set and iterating over them? Or do I have to do it that way? Thanks very much.
Answer 0 (score: 2)
First of all, make sure you are not violating any terms of use and stay on the legal side of things. You would be better off using the IMDB API instead of web scraping.
To answer your question, though, I would make an endless loop with an exit condition based on the presence of the Next link:
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
        'Headers': "http://www.imdb.com/"
    }

    page = 0
    while True:
        url = 'http://www.imdb.com/title/tt0182408/reviews?start=' + str(page)
        response = session.get(url)
        soup = BeautifulSoup(response.content, "lxml")

        important = soup.find("div", id='tn15content')
        for title in important.find_all("h2"):
            print(title.get_text())

        # break if no Next button present
        if not soup.find("img", alt="[Next]"):
            break

        page += 10
It prints 30 review titles (10 per page).
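As a side note (not part of the answer above), requests can build the query string for you through its params argument, which avoids the manual string concatenation; a minimal sketch of the same loop written that way:

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    page = 0
    while True:
        # requests encodes this as ?start=<page> on the URL
        response = session.get('http://www.imdb.com/title/tt0182408/reviews',
                               params={'start': page})
        soup = BeautifulSoup(response.content, "lxml")

        for title in soup.find("div", id='tn15content').find_all("h2"):
            print(title.get_text())

        if not soup.find("img", alt="[Next]"):  # last page has no Next button
            break
        page += 10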
Answer 1 (score: 2)
You can keep going until the img with the [Next] alt text is no longer on the page, and you can get the href for the next page by calling .parent on the img tag:
import re

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin


def parse(soup):
    important = soup.find("div", id='tn15content')
    for small in important.find_all("small", text=re.compile("review useful:")):
        div = small.parent
        user_id = div.select_one("a[href^='/user/ur']")["href"].split("ur")[1].rstrip("/")
        rating = div.select_one("img[alt*='/10']")
        yield user_id, rating["alt"] if rating else "N/A"


def get_all_pages(start):
    base = "http://www.imdb.com/title/tt0082158/"
    soup = BeautifulSoup(requests.get(start).content, "lxml")
    for tup in parse(soup):
        yield tup
    # keep fetching pages until find() returns None, i.e. no [Next] image left
    for nxt in iter(lambda: soup.find("img", alt="[Next]"), None):
        soup = BeautifulSoup(requests.get(urljoin(base, nxt.parent["href"])).content, "lxml")
        for tup in parse(soup):
            yield tup


start = "http://www.imdb.com/title/tt0082158/reviews"
for uid, rat in get_all_pages(start):
    print(uid, rat)
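The iter() call in get_all_pages() is the two-argument form of the built-in: iter(callable, sentinel) invokes the callable on every iteration and stops as soon as it returns the sentinel (here, when soup.find() returns None because no [Next] image is left). A tiny self-contained illustration of the idiom:

import random

def roll():
    return random.randint(1, 6)

# iter(callable, sentinel): keep calling roll() until it returns the sentinel 6
for value in iter(roll, 6):
    print(value)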
You might also want to consider adding a sleep between each request, or, better, using IMDbPY again.
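A minimal way to do the former is to wrap the request with a time.sleep() call; the helper name and the half-second delay below are just illustrative choices:

import time

import requests

def polite_get(url, delay=0.5):
    """Fetch a URL, then pause so back-to-back calls are spaced out."""
    response = requests.get(url)
    time.sleep(delay)  # arbitrary pause; tune to whatever the site tolerates
    return response

And a rough sketch of the IMDbPY route, assuming a reasonably recent IMDbPY install; whether the 'reviews' info set and these dictionary keys are available depends on the version, so treat this as an outline rather than a verified recipe:

from imdb import IMDb

ia = IMDb()
movie = ia.get_movie('0182408')      # IMDb ID without the 'tt' prefix
ia.update(movie, info=['reviews'])   # 'reviews' info set: may vary by IMDbPY version
for review in movie.get('reviews', []):
    print(review.get('rating'), review.get('content'))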