Scrapy return value: "Ignoring non-200 response"

Time: 2019-07-16 08:14:45

Tags: python scrapy

When I crawl a website, I get this error message:

Ignoring non-200 response

But when I open the same site in a browser, I get 200 OK.

My code is as follows:

[..]
      yield scrapy.Request(url=url['name'], callback=self.parse, errback=self.errbacktest, meta={'websiteId': url['websiteId']})

def errbacktest(self, failure):
    print(failure)

    if failure.check(HttpError):
        # these exceptions come from HttpError spider middleware
        # you can get the non-200 response
        response = failure.value.response
        print('HttpError %s on %s' % (response.status, response.url))

    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        print('DNSLookupError on %s' % request.url)

    elif failure.check(TimeoutError, TCPTimedOutError):
        request = failure.request
        print('TimeoutError on %s' % request.url)


def parse(self, response):

    print(response.status)

What could be the problem in this case?

1 answer:

Answer 0: (score: 0)

Solution:

yield scrapy.Request(url=url['name'], callback=self.parse, errback=self.errbacktest, meta={'websiteId': url['websiteId']}, headers={'User-Agent': 'Mozilla/5.0'})

The site was blocking the scraper. Sending a browser-like User-Agent header with the request solved the problem.