Question

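I am writing a scraper to get the full list of movies on Hungama.com.

I am requesting the URL "http://www.hungama.com/all/hungama-picks-54/4470/" to get the response.

When you visit this URL, 12 movies are shown on screen, but as you scroll down, more movies are loaded automatically.

I am parsing the HTML source of the page with the following code: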

response.css('div.movie-block-artist.boxshadow.clearfix1>div>div>a::text').extract()

But I only get 12 items, even though the page has many more movies. How can I get all of them? Please help.

Answer 1

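There appears to be an AJAX request behind the lazy-loading feature, at http://www.hungama.com/all/hungama-picks-54/4470/2/?ajax_call=1&_country=IN, which you can use to scrape the movies.
Change the 2 in that URL to 3 (http://www.hungama.com/all/hungama-picks-54/4470/3/?ajax_call=1&_country=IN) and so on to get the details of the next batch of movies.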
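
As a minimal sketch of that idea, the page number can be turned into a parameter so the URLs are generated rather than edited by hand. The helper name `ajax_page_url` and the `country` parameter are hypothetical; only the URL pattern itself comes from the observed request, and the site may have changed it since.

```python
from urllib.parse import urlencode

# Listing URL from the question; the page number is appended as a path segment.
BASE_URL = "http://www.hungama.com/all/hungama-picks-54/4470/"


def ajax_page_url(page: int, country: str = "IN") -> str:
    """Build the lazy-load URL for a given page number (2, 3, ...)."""
    query = urlencode({"ajax_call": 1, "_country": country})
    return f"{BASE_URL}{page}/?{query}"


# Print the URLs for the first few lazily loaded pages.
for page in range(2, 5):
    print(ajax_page_url(page))
```

Each generated URL can then be requested in turn (for example from a Scrapy callback) and parsed with the same CSS selector used for the first page, assuming the AJAX response uses the same markup.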

Answer 2

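If you watch the XHR category under the Network tab of your browser's dev tools while scrolling down that page, you can see that it generates URLs with a page number appended, such as http://www.hungama.com/all/hungama-picks-54/3632/2/. So, by changing the lines as shown below, you can get all the content from that page.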

import requests
from scrapy import Selector

page = 1
URL = "http://www.hungama.com/all/hungama-picks-54/3632/"

while True:
    page += 1
    res = requests.get(URL)
    # Pass the body as text so Selector does not need a Scrapy response object
    sel = Selector(text=res.text)
    container = sel.css(".leftbox")
    # An empty page means there is nothing left to load
    if not container:
        break

    for item in container:
        title = item.css("#pajax_a::text").extract_first()
        year = item.css(".subttl::text").extract_first()
        print(title, year)

    # Build the URL of the next page
    next_page = "http://www.hungama.com/all/hungama-picks-54/3632/{}/"
    URL = next_page.format(page)

By the way, the URL you provided above is not valid. The one I provided is active right now. Still, you get the logic.
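
The stopping rule in this answer (keep requesting pages until one comes back with no items) can also be separated from the HTTP and parsing details, which makes it testable without hitting the site. This is only a sketch: `collect_all_pages`, `fetch_items`, `fake_fetch`, and the sample data are hypothetical, and the `max_pages` cap is an added safeguard that is not part of the answer.

```python
from typing import Callable, Iterator, List, Tuple

Item = Tuple[str, str]  # (title, year), mirroring what the loop prints


def collect_all_pages(
    fetch_items: Callable[[int], List[Item]],
    start_page: int = 1,
    max_pages: int = 500,
) -> Iterator[Item]:
    """Yield items page by page until a page comes back empty."""
    for page in range(start_page, start_page + max_pages):
        items = fetch_items(page)
        if not items:  # an empty page means we ran past the last one
            return
        yield from items


# Stand-in for the real request and Selector parsing, so the sketch runs offline.
FAKE_SITE = {
    1: [("Movie A", "2016"), ("Movie B", "2017")],
    2: [("Movie C", "2018")],
}


def fake_fetch(page: int) -> List[Item]:
    return FAKE_SITE.get(page, [])


all_items = list(collect_all_pages(fake_fetch))
print(all_items)
```

In real use, `fetch_items` would request the page URL and return the `(title, year)` pairs extracted from each `.leftbox` element.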

How do we get the response for the next loaded page?
