Including image src in a LinkExtractor for a Scrapy CrawlSpider

Posted: 2017-11-14 10:42:29

Tags: python scrapy

I am crawling a website and using Scrapy's LinkExtractor to extract links and check their response status.

In addition, I also want to use the LinkExtractor to collect image src URLs from the site. The code I have works for page URLs, but I can't seem to get the images: nothing for them is logged to the console.

handle_httpstatus_list = [404,502]
# allowed_domains = ['mydomain']

start_urls = ['http://somedomain.com/']

http_user = '###'
http_pass = '#####'

rules = (
    Rule(LinkExtractor(allow=('domain.com',), canonicalize=True, unique=True),
         process_links='filter_links', follow=False, callback='parse_local_link'),
    Rule(LinkExtractor(allow=('cdn.domain.com',), tags=('img',), attrs=('src',), canonicalize=True, unique=True),
         follow=False, callback='parse_image_link'),
)

def filter_links(self, links):
    # Return the extracted links unchanged; add any filtering logic here
    return links

def parse_local_link(self, response):
    if response.status != 200:
        item = LinkcheckerItem()
        item['url'] = response.url
        item['status'] = response.status
        item['link_type'] = 'local'
        item['referer'] = response.request.headers.get('Referer',None)
        yield item

def parse_image_link(self, response):
    print "Got image link"
    if response.status != 200:
        item = LinkcheckerItem()
        item['url'] = response.url
        item['status'] = response.status
        item['link_type'] = 'img'
        item['referer'] = response.request.headers.get('Referer',None)
        yield item

2 Answers:

Answer 0 (score: 3)

In case anyone is interested in keeping the CrawlSpider with LinkExtractors, just add the kwarg deny_extensions, i.e. replace:

    Rule(LinkExtractor(allow=('cdn.domain.com',), tags=('img',), attrs=('src',), canonicalize=True, unique=True), follow=False, callback='parse_image_link'),

with:

    Rule(LinkExtractor(allow=('cdn.domain.com',), deny_extensions=set(), tags=('img',), attrs=('src',), canonicalize=True, unique=True), follow=False, callback='parse_image_link'),

If this kwarg is not set, it defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which contains jpeg, png and other extensions. That means the link extractor skips any link whose URL ends with one of those extensions, which is why the image rule never fires.
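A minimal sketch of that idea, building on the rule from the question: instead of clearing the filter entirely with an empty set, only the common image extensions are removed from the default ignore list (the exact set of extensions chosen here is illustrative):

    from scrapy.linkextractors import LinkExtractor, IGNORED_EXTENSIONS
    from scrapy.spiders import Rule

    # Keep ignoring other binary extensions, but stop ignoring images so that
    # <img src="..."> links on the CDN are actually extracted.
    deny_except_images = set(IGNORED_EXTENSIONS) - {'jpg', 'jpeg', 'png', 'gif', 'webp'}

    rules = (
        Rule(LinkExtractor(allow=('cdn.domain.com',), deny_extensions=deny_except_images,
                           tags=('img',), attrs=('src',), canonicalize=True, unique=True),
             follow=False, callback='parse_image_link'),
    )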

Answer 1 (score: 1)

I have been using Scrapy for more than 2 years, and I always start crawling URLs with the start_requests() method rather than start_urls and a LinkExtractor.

Don't let the above confuse you; just use this instead:

import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        # Full URLs including the scheme are required by scrapy Request objects
        urls_to_scrape = ["http://abc.com", "http://abc.com2"]
        for url in urls_to_scrape:
            yield Request(url=url, callback=self.my_callback)

    def my_callback(self, response):
        for img in response.css("img"):
            # src attribute of each <img> tag on the page
            image_here = img.css("::attr(src)").extract_first()
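
If the goal is still to record broken image links, as in the question, each extracted src can be turned into its own request and checked in a second callback. A rough sketch continuing the spider above, with a plain dict standing in for the question's LinkcheckerItem and the check_image method name being illustrative:

    import scrapy
    from scrapy import Request


    class MySpider(scrapy.Spider):
        name = "myspider"
        # Let 404/502 responses reach the callback instead of being dropped
        handle_httpstatus_list = [404, 502]

        def my_callback(self, response):
            for img in response.css("img"):
                src = img.css("::attr(src)").extract_first()
                if src:
                    # Resolve relative srcs against the page URL, then check them
                    yield Request(response.urljoin(src), callback=self.check_image)

        def check_image(self, response):
            if response.status != 200:
                yield {
                    'url': response.url,
                    'status': response.status,
                    'link_type': 'img',
                    'referer': response.request.headers.get('Referer', None),
                }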