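I want to move on to the next page after scraping an Eventbrite listing page, but it does not work, even after switching to Scrapy's CrawlSpider.
This is the code that walks through the pages: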
    allowed_domains = ["eventbrite.com"]
    start_urls = [
        "https://www.eventbrite.com/d/nigeria--lagos/events/?crt=regular&end_date=01%2F31%2F2018&page=1&sort=best&start_date=12%2F01%2F2017",
    ]

    def parse(self, response):
        # Each event is rendered as one card on the listing page.
        events = response.xpath('//div[@class="list-card-v2 l-mar-top-2 js-d-poster"]')
        for event in events:
            name = event.xpath('a/div[@class="list-card__body"]/div[@class="list-card__title"]/text()').extract()
            venue = event.xpath('a/div[@class="list-card__body"]/div[@class="list-card__venue"]/text()').extract()
            date = event.xpath('a/div[@class="list-card__body"]/time[@class="list-card__date"]/text()').extract()
            event_type = event.xpath('a/div[@class="list-card__header"]/span/text()').extract()
            category = event.xpath('div/div[@class="list-card__tags"]/a/text()').extract()
            image = event.xpath('a/div[@class="list-card__header"]/div/img[@class="js-poster-image"]').extract()
            image_url = event.xpath('a/div[@class="list-card__header"]/div/img[@class="js-poster-image"]/@src').extract()

            # Join the extracted text fragments and trim newlines and whitespace.
            name = ''.join(name).replace('\n', '').strip()
            date = ''.join(date).replace('\n', '').strip()
            venue = ''.join(venue).replace('\n', '').strip()

            yield EventsItem(name=name, venue=venue, date=date,
                             event_type=event_type, category=category,
                             image_urls=image_url, images=image)

        # Follow the "next page" link, if there is one.
        next_page = response.xpath('//a[@data-automation="next-page"]/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
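Here is a screenshot of the element. I am not sure whether the problem is that the href attribute is empty or that my XPath is wrong.
Any help is welcome, thanks.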
Answer 0 (score: 0)
Instead of the last line:
    yield scrapy.Request(next_page, callback=self.parse)
try this:
    yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)
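One possible reason dont_filter=True changes anything: if the next-page link's href really is empty, as the question suspects, response.urljoin resolves it to the URL of the current page, and Scrapy's duplicate filter drops that request by default. The sketch below uses urllib.parse.urljoin from the standard library (which response.urljoin builds on) to show that resolution. The helper name resolve_next_page is hypothetical, the start URL is shortened, and the "?page=2" href is an illustrative value, not something taken from the site.

```python
from urllib.parse import urljoin


def resolve_next_page(current_url, href):
    """Return the absolute next-page URL, or None if there is nothing new to follow.

    Hypothetical helper: it mirrors what response.urljoin(href) would produce,
    but refuses links that are missing, empty, or point back at the same page.
    """
    if not href:
        return None
    next_url = urljoin(current_url, href)
    if next_url == current_url:
        return None
    return next_url


# Shortened version of the start URL from the question (assumption for brevity).
current = "https://www.eventbrite.com/d/nigeria--lagos/events/?page=1"

# An empty href resolves to the page that is already being parsed.
print(urljoin(current, ""))

# The helper reports "nothing to follow" instead of re-requesting the same page.
print(resolve_next_page(current, ""))

# A relative href that only changes the query string resolves to a new URL.
print(resolve_next_page(current, "?page=2"))
```

If that is what is happening here, dont_filter=True would not reach page 2; it would let the spider request the same page again and again. Printing the extracted href in scrapy shell is a quick way to check which case applies.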
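Note:
Be careful with allowed_domains. Its entries should be domain names, not full URLs, so they should not contain http or https. For example, use google.com instead of https://www.google.com.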
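As a rough illustration of the note above, the following sketch strips the scheme and path from an entry before it goes into allowed_domains. The helper name to_domain is hypothetical and not part of Scrapy; it relies only on urllib.parse from the standard library.

```python
from urllib.parse import urlparse


def to_domain(entry):
    """Reduce a URL or host string to a bare host name (hypothetical helper)."""
    # urlparse only fills in the host when the string has a scheme or starts
    # with "//", so add "//" to entries that are already bare domains.
    parsed = urlparse(entry if "://" in entry else "//" + entry)
    return parsed.hostname or entry


allowed_domains = [to_domain(e) for e in ["https://www.google.com", "eventbrite.com"]]
print(allowed_domains)
```

The helper keeps the www prefix, so the first entry becomes www.google.com, which is narrower than google.com. The allowed_domains value in the question, eventbrite.com, is already a bare domain.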