Can't move to the next page with Scrapy

Posted: 2017-12-06 12:43:09

Tags: python xpath scrapy

I want to continue to the next page after scraping an Eventbrite listing page, but it isn't working, even after switching to Scrapy's CrawlSpider.

Here is the code that walks through the pages:

    allowed_domains = ["eventbrite.com"]
    start_urls = [
        "https://www.eventbrite.com/d/nigeria--lagos/events/?crt=regular&end_date=01%2F31%2F2018&page=1&sort=best&start_date=12%2F01%2F2017",
    ]

    def parse(self, response):
        events = Selector(response).xpath('//div[@class="list-card-v2 l-mar-top-2 js-d-poster"]')

        for event in events:
            name = event.xpath('a/div[@class="list-card__body"]/div[@class="list-card__title"]/text()').extract()
            venue = event.xpath('a/div[@class="list-card__body"]/div[@class="list-card__venue"]/text()').extract()
            date = event.xpath('a/div[@class="list-card__body"]/time[@class="list-card__date"]/text()').extract()
            event_type = event.xpath('a/div[@class="list-card__header"]/span/text()').extract()
            category = event.xpath('div/div[@class="list-card__tags"]/a/text()').extract()
            image = event.xpath('a/div[@class="list-card__header"]/div/img[@class="js-poster-image"]').extract()
            image_url = event.xpath('a/div[@class="list-card__header"]/div/img[@class="js-poster-image"]/@src').extract()

            name = ''.join(name).replace('\n', '').strip()
            date = ''.join(date).replace('\n', '').strip()
            venue = ''.join(venue).replace('\n', '').strip()

            yield EventsItem(name=name, venue=venue, date=date,
                             event_type=event_type, category=category,
                             image_urls=image_url, images=image)

            next_page = response.xpath('//a[@data-automation="next-page"]/@href').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

Here is an image of the element. I don't know whether the problem is that the href attribute is empty or that my XPath is wrong.

[Image of the next-page HTML element]

Any help is appreciated, thanks.
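One way to narrow down whether the href is empty or the XPath is wrong is to test the expression against a small static snippet offline. This sketch uses only the standard library and a hypothetical, simplified version of the pagination markup (in a real session you would run the same XPath in `scrapy shell <url>` against the live page):

```python
# Offline check of the next-page selector. The snippet below is an
# assumed, simplified version of the pagination markup, not the real
# Eventbrite HTML.
import xml.etree.ElementTree as ET

snippet = """
<div>
  <a data-automation="next-page" href="/d/nigeria--lagos/events/?page=2">Next</a>
</div>
"""

root = ET.fromstring(snippet)
# ElementTree supports this limited XPath subset (tag + attribute predicate).
link = root.find(".//a[@data-automation='next-page']")
href = link.get("href") if link is not None else None
print(href)  # /d/nigeria--lagos/events/?page=2
```

If `href` comes back `None` against the real page source, the attribute is missing or filled in by JavaScript; if it comes back non-empty, the XPath is fine and the problem is elsewhere.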

1 answer:

Answer 0 (score: 0)

Instead of the last line:

yield scrapy.Request(next_page, callback=self.parse)

try this:

yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)
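For context (my understanding of Scrapy's behavior, not part of the original answer): the scheduler deduplicates requests by fingerprint, and `dont_filter=True` bypasses that check. Since the question's code yields the next-page request once per event inside the loop, only the first copy would survive the filter anyway; a rough set-based sketch of the dedup idea:

```python
# Rough sketch (assumption) of request deduplication: keep a set of seen
# URLs and drop repeats unless dont_filter is set. Scrapy actually
# fingerprints the whole request, but the effect on identical URLs is
# the same.
seen = set()

def schedule(url, dont_filter=False):
    """Accept a request unless its URL was already seen (and filtering is on)."""
    if not dont_filter and url in seen:
        return False  # dropped as a duplicate
    seen.add(url)
    return True

print(schedule("https://example.com/?page=2"))                    # True
print(schedule("https://example.com/?page=2"))                    # False, filtered
print(schedule("https://example.com/?page=2", dont_filter=True))  # True, filter bypassed
```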

Note: watch out for the allowed URLs. In some cases they should not contain http/https. In that case, use google.com instead of https://www.google.com.
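One more detail from the question's code: `response.urljoin(next_page)` is what turns a relative href into an absolute URL. As far as I know, Scrapy's `Response.urljoin` delegates to `urllib.parse.urljoin` with the response URL as the base, so the resolution can be checked with the standard library alone (the URLs below are taken from the question):

```python
# Sketch of what response.urljoin does: resolve a relative href against
# the URL of the page currently being parsed.
from urllib.parse import urljoin

base = "https://www.eventbrite.com/d/nigeria--lagos/events/?page=1"
next_page = urljoin(base, "/d/nigeria--lagos/events/?page=2")
print(next_page)  # https://www.eventbrite.com/d/nigeria--lagos/events/?page=2
```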