Scraping all IMDB movie data with Scrapy

Date: 2016-03-05 20:20:55

Tags: python python-2.7 scrapy scrapy-spider

I am working on a course project and trying to collect all IMDB movie data (titles, budgets, etc.) up to 2016. I adopted the code from https://github.com/alexwhb/IMDB-spider/blob/master/tutorial/spiders/spider.py.

My idea is: for each year i in range(1874, 2016) (1874 being the earliest year shown on http://www.imdb.com/year/), point the program at the corresponding year's page and crawl the data from that URL.

The problem is that each page for a given year only shows 50 movies, so after crawling those 50 movies, how do I move on to the next page? And after finishing one year, how do I continue to the next year? Here is my code so far for the URL-parsing part, but it can only crawl 50 movies for a particular year.

class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = ["http://www.imdb.com/search/title?year=2014,2014&title_type=feature&sort=moviemeter,asc"]

    def parse(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MianPageUrl'] = "http://imdb.com" + sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request

4 Answers:

Answer 0 (score: 2)

You can use a CrawlSpider to simplify your task. As you can see below, start_requests dynamically generates the list of URLs, while parse_page only extracts the movies to crawl. Finding and following the 'Next' link is done by the rules attribute.

I agree with @Padraic Cunningham that hard-coding values is not a good idea. I have added spider arguments so that you can call: scrapy crawl imdb -a start=1950 -a end=1980 (the scraper will default to 1874-2016 if it gets no arguments).

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from imdbyear.items import MovieItem

class IMDBSpider(CrawlSpider):
    name = 'imdb'
    rules = (
        # extract links at the bottom of the page. note that there are 'Prev' and 'Next'
        # links, so a bit of additional filtering is needed
        Rule(LinkExtractor(restrict_xpaths=('//*[@id="right"]/span/a')),
            process_links=lambda links: filter(lambda l: 'Next' in l.text, links),
            callback='parse_page',
            follow=True),
    )

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2016

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year+1):
            yield scrapy.Request('http://www.imdb.com/search/title?year=%d,%d&title_type=feature&sort=moviemeter,asc' % (year, year))

    def parse_page(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            # note -- you had 'MianPageUrl' as your scrapy field name. I would recommend fixing this typo
            # (you will need to change it in items.py as well)
            item['MainPageUrl']= "http://imdb.com"+sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MainPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
    # make sure that the dynamically generated start_urls are parsed as well
    parse_start_url = parse_page

    # do your magic
    def parseMovieDetails(self, response):
        pass
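For reference, the URLs that start_requests yields can be reproduced with a plain helper. This is a sketch: the URL template is the one used in the spider above, but the function name is illustrative and not part of the original code.

```python
# Sketch of the URL generation performed by start_requests above.
# SEARCH_TEMPLATE comes from the spider; year_search_urls is an
# illustrative helper name, not part of the original code.
SEARCH_TEMPLATE = ("http://www.imdb.com/search/title"
                   "?year=%d,%d&title_type=feature&sort=moviemeter,asc")

def year_search_urls(start_year=1874, end_year=2016):
    """Return one search URL per year, both endpoints included."""
    return [SEARCH_TEMPLATE % (year, year)
            for year in range(start_year, end_year + 1)]
```

Calling `year_search_urls(1950, 1952)` produces three URLs, one per year, matching what `scrapy crawl imdb -a start=1950 -a end=1952` would request first.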

Answer 1 (score: 1)

You can use the piece of code below to follow the 'Next' page link:
# 'a.lister-page-next.next-page::attr(href)' is the selector for the next-page link
next_page = response.css('a.lister-page-next.next-page::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)  # joins the current and next page URLs
    yield scrapy.Request(next_page, callback=self.parse)  # calls parse again for the next page
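response.urljoin resolves the relative href against the current page's URL. Outside Scrapy, the same resolution can be sketched with the standard library (the page and href values below are hypothetical examples):

```python
# Standard-library equivalent of response.urljoin; the page_url and
# next_href values are hypothetical examples for illustration.
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

page_url = "http://www.imdb.com/search/title?year=2014,2014&title_type=feature"
next_href = "/search/title?year=2014,2014&title_type=feature&start=51"
next_page = urljoin(page_url, next_href)
# next_page is now an absolute URL on www.imdb.com
```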

Answer 2 (score: 0)

I figured out a very dumb way to solve this: I put all of the links in start_urls. Better solutions are very much appreciated!

class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = []
    for i in xrange(1874, 2017):
        # the largest number of movies a single year has is 11,400 (2016)
        for j in xrange(1, 11501, 50):
            start_url = "http://www.imdb.com/search/title?sort=moviemeter,asc&start=" + str(j) + "&title_type=feature&year=" + str(i) + "," + str(i)
            start_urls.append(start_url)

    def parse(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            item['MianPageUrl']= "http://imdb.com"+sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
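The start_urls block above enumerates result-page offsets in steps of 50 (one page of results each). The same enumeration can be sketched as a standalone function; the function name is illustrative, and the 11,500 cap is the bound assumed above.

```python
# Illustrative helper mirroring the start_urls loop above; step=50 matches
# the 50 results shown per page, and max_start=11500 is the assumed cap.
def offset_search_urls(year, max_start=11500, step=50):
    template = ("http://www.imdb.com/search/title?sort=moviemeter,asc"
                "&start=%d&title_type=feature&year=%d,%d")
    return [template % (start, year, year)
            for start in range(1, max_start + 1, step)]
```

For one year this yields 230 page URLs (start=1, 51, ..., 11451), so the full 1874-2016 range produces 143 x 230 = 32,890 start URLs.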

Answer 3 (score: 0)

The code provided by @Greg Sadetsky needs one minor change, in the first line of the parse_page method.

Just change the XPath in the for loop from:

    response.xpath("//*[@class='results']/tr/td[3]"):

to:

    response.xpath("//*[contains(@class,'lister-item-content')]/h3"):

This worked like a charm for me!