Fetching data from multiple links with Scrapy

Posted: 2016-03-21 19:02:12

Tags: python scrapy

I am new to Scrapy and Python. I am trying to retrieve data from https://in.bookmyshow.com/movies, since I need information on every movie listed there. But there is a problem with my code, and I would like to know where I went wrong.

rules = (
    Rule(SgmlLinkExtractor(allow=('https://in\.bookmyshow\.com/movies/.*', )),
         callback="parse_items", follow=True),
)


def parse_items(self, response):
    for sel in response.xpath('//div[contains(@class, "movie-card")]'):
        item = Ex1Item()
        item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
        item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
        item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
        item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
        item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
        yield item

1 Answer:

Answer 0 (score: 2)

Your code seems fine. Perhaps the problem lies outside the part you posted.
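One way to rule out the selectors before blaming the crawl rules is to run the relative XPaths against a snippet of markup outside Scrapy. Here is a minimal sketch using only the standard library; ElementTree supports a small XPath subset, so it matches the class attribute exactly instead of using contains(), and the markup is made up rather than taken from the real page:

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for one movie card; not the real bookmyshow markup.
html = """
<html><body>
  <div class="movie-card">
    <a class="__movie-name">Example Movie</a>
    <span class="__release-date">18 Mar, 2016</span>
  </div>
</body></html>
"""

root = ET.fromstring(html)
# ElementTree only supports [@class="..."] (exact match), unlike the
# contains() predicate available in Scrapy's full XPath engine.
cards = root.findall('.//div[@class="movie-card"]')
names = [a.text for card in cards
         for a in card.findall('.//a[@class="__movie-name"]')]
print(names)  # ['Example Movie']
```

Relative, class-based selectors like these keep working when the outer page layout shifts; the long absolute /html/body/... paths used for Language and Synopsis are the lines most likely to break.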

This works for me:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class BookmyshowSpider(CrawlSpider):
    name = "bookmyshow"
    start_urls = ['https://in.bookmyshow.com/movies']
    allowed_domains = ['bookmyshow.com']
    rules = (
        Rule(SgmlLinkExtractor(allow=('https://in\.bookmyshow\.com/movies/.*', )),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        for sel in response.xpath('//div[contains(@class, "movie-card")]'):
            item = Ex1Item()  # the asker's Item subclass; must be imported in this module
            item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
            item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
            item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
            item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
            item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
            yield item
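A side note with no bearing on the fix: in later Scrapy releases (1.0+) SgmlLinkExtractor was deprecated in favour of scrapy.linkextractors.LinkExtractor, which accepts the same allow= argument. That argument is an ordinary regular expression searched against each discovered URL, so the pattern can be sanity-checked with the standard re module (the URLs below are invented for illustration):

```python
import re

# The same pattern the Rule passes as allow=; the link extractor
# applies it with re.search against every URL it discovers.
allow = re.compile(r'https://in\.bookmyshow\.com/movies/.*')

# Invented URLs, purely to illustrate which links the rule would follow.
urls = [
    'https://in.bookmyshow.com/movies/example-film/ET00000001',
    'https://in.bookmyshow.com/offers',
]
followed = [u for u in urls if allow.search(u)]
print(followed)  # only the /movies/ URL survives
```

Any URL the pattern does not match is never requested at all, so if a spider like this yields nothing, the allow= pattern is the first thing worth checking.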

Edit: a version using the standard spider class scrapy.Spider():

import scrapy

class BookmyshowSpider(scrapy.Spider):
    name = "bookmyshow"
    start_urls = ['https://in.bookmyshow.com/movies']
    allowed_domains = ['bookmyshow.com']

    def parse(self, response):
        links = response.xpath('//a/@href').re('movies/[^\/]+\/.*$')
        for url in set(links):
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_movie)

    def parse_movie(self, response):
        for sel in response.xpath('//div[contains(@class, "movie-card")]'):
            item = {}
            item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
            item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
            item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
            item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
            item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
            yield item

parse() extracts all links to movie pages from the start page. parse_movie() serves as the callback for every request to an individual movie page. With this version you of course have much finer control over the spider's behaviour.
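The .re() call in parse() filters the hrefs with a plain regular expression and returns the matched substrings rather than the whole attribute values. That filtering step can be sketched with the standard re module (the hrefs below are made up for illustration):

```python
import re

# The same expression parse() hands to .re(); it keeps only hrefs that
# look like movie detail pages.
pattern = re.compile(r'movies/[^\/]+\/.*$')

hrefs = [
    '/movies/example-film/ET00000001',  # detail page: kept
    '/movies',                          # the listing page itself: dropped
    '/offers/weekend',                  # unrelated link: dropped
]
# Like Selector.re() with a group-less pattern, keep the matched
# substring of each href, not the full string.
links = [pattern.search(h).group(0) for h in hrefs if pattern.search(h)]
print(links)  # ['movies/example-film/ET00000001']
```

Wrapping the result in set() before looping, as parse() does, then de-duplicates links that appear more than once on the listing page.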