How can I ignore PDF links when crawling with Scrapy?

Time: 2018-07-26 18:15:00

Tags: python scrapy scrapy-spider

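I am new to Scrapy and am currently building a spider that extracts only event titles and event descriptions from a website. I can get the titles and descriptions, but the spider also tries to extract data from PDF links, which fails with the error: raise NotSupported("Response content isn't text"). How can I prevent the spider from doing this?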

Here is my code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class EventsspiderSpider(CrawlSpider):
    name = 'eventsspider'
    allowed_domains = ['cs.acadiau.ca']
    start_urls = ['https://cs.acadiau.ca/news-events/event-reader/using-dna-to-reverse-engineer-your-family-tree.html']

    rules = (
        Rule(LinkExtractor(allow=('news-events/event-reader/using-dna-to-reverse-engineer-your-family-tree.html', )), callback='parse_item', follow=True),)

    def parse_item(self, response):
        title_list = response.xpath('//*[@id="event-items-15421"]/div[2]/div/h1/text()').extract()
        data_list = response.xpath('//*[@id="event-items-15421"]/div[2]/div/div[1]/p[7]/span/text()').extract()

        # Yield a fresh dict per event; reusing one dict would mutate items already yielded.
        # zip() stops at the shorter list, so mismatched lengths cannot raise IndexError.
        for title, data in zip(title_list, data_list):
            yield {'title': title, 'data': data}
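
One possible way to avoid the error, offered as a minimal sketch and not as a confirmed fix: the exception appears when a selector method such as response.xpath() is called on a response that is not text, so the callback can return early for PDF responses. The helper name looks_like_pdf and the standalone parse_item function below are hypothetical and written for illustration, and the '//h1/text()' XPath is a simplified stand-in for the expressions in the question. The sketch assumes the response object exposes url and headers.get('Content-Type') the way Scrapy responses do, with the header value returned as bytes.

```python
from urllib.parse import urlparse


def looks_like_pdf(url, content_type=None):
    """Return True if the Content-Type header or the URL path suggests a PDF."""
    # Scrapy returns header values as bytes, so accept both bytes and str.
    if isinstance(content_type, bytes):
        content_type = content_type.decode('latin-1')
    if content_type and 'application/pdf' in content_type.lower():
        return True
    # Fall back to the extension of the URL path (the query string is ignored).
    return urlparse(url).path.lower().endswith('.pdf')


def parse_item(response):
    """Hypothetical callback body: skip PDF responses, otherwise extract titles."""
    if looks_like_pdf(response.url, response.headers.get('Content-Type')):
        return  # nothing is yielded for PDF responses
    # Simplified XPath, assumed for illustration only.
    for title in response.xpath('//h1/text()').extract():
        yield {'title': title}
```

Inside the spider, the same check would go at the top of the parse_item method (with self as the first parameter). Checking isinstance(response, TextResponse), with TextResponse imported from scrapy.http, is an alternative that may be more general because it also skips other binary content. Passing deny=(r'\.pdf$',) to LinkExtractor would only help when the PDF URLs actually end in .pdf.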

0 answers:

No answers yet