How to scrape just the URL when the redirected URL gives a DNS lookup error?

Date: 2019-07-05 11:52:12

Tags: python scrapy

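I have a list of URLs in shortened form; each one redirects to a website URL when opened. Some of the target websites give a DNS error and some cannot be opened at all, but they still have the weird shortened URL.

I want to get all the redirected URLs, regardless of errors.

I put all the shortened, encoded-looking URLs in a text file and added an errback to handle errors. Here is the spider.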

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class HouzzSpiderSpider(scrapy.Spider):
    name = 'web_uk'

    # Read the shortened URLs, one per line, skipping blank lines.
    with open("urls.txt") as f:
        start_urls = [line.strip() for line in f if line.strip()]

    # Do not retry failed requests in this example spider.
    custom_settings = {
        'RETRY_ENABLED': False
    }

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_web, errback=self.errback_web, dont_filter=True)

    def parse_web(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

        # After redirects, response.request.url is the URL that was finally fetched.
        item = {}
        item['Web Address'] = response.request.url
        yield item

    def errback_web(self, failure):
        # log all failures
        self.logger.error(repr(failure))
        # failure.request is the request that failed, so its URL is still available.
        item = {}
        item['Web Address'] = failure.request.url
        yield item

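With this spider I only get a few of the website URLs. In the output I can see that all the shortened URLs are processed, but for some statuses no item is returned.

This is the output I got: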

    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.as-propertyservices.co.uk/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://as-propertyservices.co.uk/> from <GET http://www.as-propertyservices.co.uk>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://as-propertyservices.co.uk/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://as-propertyservices.co.uk/> (referer: None) ['cached']
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from http://www.shortconstruction.co.uk/
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.shortconstruction.co.uk/>
    {'Web Address': 'http://www.shortconstruction.co.uk/'}
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.baptistbuilding.co.uk/> from <GET https://www.houzz.in/trk/aHR0cDovL3d3dy5iYXB0aXN0YnVpbGRpbmcuY28udWsv/882c7009dd2fe2ce02c78694984d386e/ue/NDQxODIwMTc/496a774dd622696bd65956be0d3809f2>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.baptistbuilding.co.uk/robots.txt> from <GET http://www.baptistbuilding.co.uk/robots.txt>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baptistbuilding.co.uk/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.baptistbuilding.co.uk/> from <GET http://www.baptistbuilding.co.uk/>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baptistbuilding.co.uk/> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.gpsbuilding.com> from <GET https://www.houzz.in/trk/aHR0cDovL3d3dy5ncHNidWlsZGluZy5jb20/2da4248ce7697e1410323e514ea2e333/ue/MzEwNzk5NDM/bf5fdb1c62f115d1790cdfbf1f1414d2>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://gpsbuilding.com/robots.txt> from <GET http://www.gpsbuilding.com/robots.txt>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://gpsbuilding.com/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://gpsbuilding.com/> from <GET http://www.gpsbuilding.com>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://gpsbuilding.com/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.afternic.com/domain/gpsbuilding.com> from <GET http://gpsbuilding.com/>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.afternic.com/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36

    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://london-construction.co.uk/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://london-construction.co.uk/> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://smcbuildersuk.co.uk> from <GET https://www.houzz.in/trk/aHR0cDovL3NtY2J1aWxkZXJzdWsuY28udWs/6fbc4a3ef745843f9e875e443c7cc8d1/ue/NDk4NzE1MzY/89ac8410a67f2aef0e00833d8fb8622a>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://smcbuildersuk.co.uk/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://smcbuildersuk.co.uk> (referer: None) ['cached']
    2019-07-05 09:59:53 [web_uk] ERROR: <twisted.python.failure.Failure scrapy.spidermiddlewares.httperror.HttpError: Ignoring non-200 response>
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <403 http://www.bluestakeconstruction.co.uk>
    {'Web Address': 'http://www.bluestakeconstruction.co.uk'}
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.gsplus.co.uk> from <GET https://www.houzz.in/trk/aHR0cDovL3d3dy5nc3BsdXMuY28udWs/652dc6ee1bf58acf83b8faf65c2b32c5/ue/NTAxMTE0MTc/862de77b7be8dcee0699563bf88a43aa>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.gsplus.co.uk/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.gsplus.co.uk/> from <GET http://www.gsplus.co.uk>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gsplus.co.uk/> (referer: None) ['cached']
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from http://as-propertyservices.co.uk/
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://as-propertyservices.co.uk/>
    {'Web Address': 'http://as-propertyservices.co.uk/'}
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.pbmrefurbishment.co.uk> from <GET https://www.houzz.in/trk/aHR0cDovL3d3dy5wYm1yZWZ1cmJpc2htZW50LmNvLnVr/8aac98e953ff258a456957dcc2fce880/ue/NDc4Mzg5NTM/2aef81ebc680c9aa6c661ab437aff3bb>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.pbmrefurbishment.co.uk/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.pbmrefurbishment.co.uk> (referer: None) ['cached']
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from https://www.baptistbuilding.co.uk/
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.baptistbuilding.co.uk/>
    {'Web Address': 'https://www.baptistbuilding.co.uk/'}
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.arkbuilders.co.uk> from <GET https://www.houzz.in/trk/aHR0cDovL3d3dy5hcmtidWlsZGVycy5jby51aw/83b53493f8deec8129c7263248ccff4b/ue/NDgwNjUzODA/be08230a31a454ac270e12bbb26b27c4>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.arkbuilders.co.uk/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.arkbuilders.co.uk> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.atlantic03.co.uk/> from <GET https://www.houzz.in/trk/aHR0cDovL3d3dy5hdGxhbnRpYzAzLmNvLnVrLw/88bfdfc8bfc1af6cbcd812df501d0ff5/ue/Mjg2NTgzODE/ac392f00019cc3e1117be39eebec2f13>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.atlantic03.co.uk/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.atlantic03.co.uk/> (referer: None) ['cached']
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from https://www.afternic.com/domain/gpsbuilding.com
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.afternic.com/domain/gpsbuilding.com>
    {'Web Address': 'https://www.afternic.com/domain/gpsbuilding.com'}
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
    2019-07-05 09:59:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.surreybuildersuk.com/> from <GET https://www.houzz.in/trk/aHR0cHM6Ly93d3cuc3VycmV5YnVpbGRlcnN1ay5jb20v/1ab2690d7c57da047029d4ea8fe4f537/ue/NDg1NTA3MTk/584c215500b700b2920c5b456a9aebc0>
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.surreybuildersuk.com/robots.txt> (referer: None) ['cached']
    2019-07-05 09:59:53 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
    2019-07-05 09:59:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.surreybuildersuk.com/> (referer: None) ['cached']
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from http://www.h4csltd.co.uk/
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.h4csltd.co.uk/>
    {'Web Address': 'http://www.h4csltd.co.uk/'}
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from http://www.grangecontractors.co.uk
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.grangecontractors.co.uk>
    {'Web Address': 'http://www.grangecontractors.co.uk'}
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from http://www.dgconstruction.co.uk
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.dgconstruction.co.uk>
    {'Web Address': 'http://www.dgconstruction.co.uk'}
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from https://london-construction.co.uk/
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://london-construction.co.uk/>
    {'Web Address': 'https://london-construction.co.uk/'}
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from http://smcbuildersuk.co.uk
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://smcbuildersuk.co.uk>
    {'Web Address': 'http://smcbuildersuk.co.uk'}
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from https://www.gsplus.co.uk/
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.gsplus.co.uk/>
    {'Web Address': 'https://www.gsplus.co.uk/'}
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from http://www.pbmrefurbishment.co.uk
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.pbmrefurbishment.co.uk>
    {'Web Address': 'http://www.pbmrefurbishment.co.uk'}
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from http://www.arkbuilders.co.uk
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.arkbuilders.co.uk>
    {'Web Address': 'http://www.arkbuilders.co.uk'}
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from http://www.atlantic03.co.uk/
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.atlantic03.co.uk/>
    {'Web Address': 'http://www.atlantic03.co.uk/'}
    2019-07-05 09:59:53 [web_uk] INFO: Got successful response from https://www.surreybuildersuk.com/
    2019-07-05 09:59:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.surreybuildersuk.com/>
    {'Web Address': 'https://www.surreybuildersuk.com/'}
    2019-07-05 09:59:56 [web_uk] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.comingsoon.>
    2019-07-05 09:59:56 [scrapy.core.scraper] DEBUG: Scraped from DNS lookup failed: no results for hostname lookup: www.comingsoon.
    {'Web Address': 'http://www.comingsoon'}
    2019-07-05 09:59:57 [web_uk] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.brettevansconstruction.co.uk.>
    2019-07-05 09:59:57 [scrapy.core.scraper] DEBUG: Scraped from DNS lookup failed: no results for hostname lookup: www.brettevansconstruction.co.uk.
    {'Web Address': 'http://www.brettevansconstruction.co.uk'}
    2019-07-05 09:59:57 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://www.cawilsonbuilders.co.uk/robots.txt>: DNS lookup failed: no results for hostname lookup: www.cawilsonbuilders.co.uk.
    Traceback (most recent call last):
      File "/home/ubuntu/scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
        result = result.throwExceptionIntoGenerator(g)
      File "/home/ubuntu/scrapy/local/lib/python2.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/home/ubuntu/scrapy/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
        defer.returnValue((yield download_func(request=request,spider=spider)))
      File "/home/ubuntu/scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/ubuntu/scrapy/local/lib/python2.7/site-packages/twisted/internet/endpoints.py", line 975, in startConnectionAttempts
        "no results for hostname lookup: {}".format(self._hostStr)
    DNSLookupError: DNS lookup failed: no results for hostname lookup: www.cawilsonbuilders.co.uk.
    2019-07-05 09:59:57 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
    2019-07-05 09:59:57 [web_uk] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.gazabuilders.co.uk.>
    2019-07-05 09:59:57 [scrapy.core.scraper] DEBUG: Scraped from DNS lookup failed: no results for hostname lookup: www.gazabuilders.co.uk.
    {'Web Address': 'http://www.gazabuilders.co.uk/'}
    2019-07-05 09:59:58 [web_uk] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.cawilsonbuilders.co.uk.>
    2019-07-05 09:59:58 [scrapy.core.scraper] DEBUG: Scraped from DNS lookup failed: no results for hostname lookup: www.cawilsonbuilders.co.uk.
    {'Web Address': 'http://www.cawilsonbuilders.co.uk'}
    2019-07-05 09:59:58 [web_uk] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: mckjoinersandbuilders.co.uk.>
    2019-07-05 09:59:58 [scrapy.core.scraper] DEBUG: Scraped from DNS lookup failed: no results for hostname lookup: mckjoinersandbuilders.co.uk.
    {'Web Address': 'http://mckjoinersandbuilders.co.uk'}
    2019-07-05 09:59:58 [web_uk] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: jklynch.co.uk.>
    2019-07-05 09:59:58 [scrapy.core.scraper] DEBUG: Scraped from DNS lookup failed: no results for hostname lookup: jklynch.co.uk.
    {'Web Address': 'http://jklynch.co.uk'}
    2019-07-05 09:59:59 [web_uk] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.bartlettbuildingltd.co.uk.>
    2019-07-05 09:59:59 [scrapy.core.scraper] DEBUG: Scraped from DNS lookup failed: no results for hostname lookup: www.bartlettbuildingltd.co.uk.
    {'Web Address': 'http://www.bartlettbuildingltd.co.uk/'}
    2019-07-05 10:00:00 [web_uk] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.knightsbridgeconstruction-prc.com.>
    2019-07-05 10:00:00 [scrapy.core.scraper] DEBUG: Scraped from DNS lookup failed: no results for hostname lookup: www.knightsbridgeconstruction-prc.com.
    {'Web Address': 'http://www.knightsbridgeconstruction-prc.com/index.html'}
    2019-07-05 10:00:52 [scrapy.extensions.logstats] INFO: Crawled 179 pages (at 179 pages/min), scraped 93 items (at 93 items/min)
    2019-07-05 10:01:52 [scrapy.extensions.logstats] INFO: Crawled 179 pages (at 0 pages/min), scraped 93 items (at 0 items/min)
    2019-07-05 10:02:04 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://www.moxoms.co.uk/robots.txt>: TCP connection timed out: 110: Connection timed out.
    Traceback (most recent call last):
      File "/home/ubuntu/scrapy/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
        defer.returnValue((yield download_func(request=request,spider=spider)))
    TCPTimedOutError: TCP connection timed out: 110: Connection timed out.
    2019-07-05 10:02:04 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
    2019-07-05 10:02:52 [scrapy.extensions.logstats] INFO: Crawled 179 pages (at 0 pages/min), scraped 93 items (at 0 items/min)
    2019-07-05 10:03:52 [scrapy.extensions.logstats] INFO: Crawled 179 pages (at 0 pages/min), scraped 93 items (at 0 items/min)
    2019-07-05 10:04:15 [web_uk] ERROR: <twisted.python.failure.Failure twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out.>
    2019-07-05 10:04:15 [scrapy.core.scraper] DEBUG: Scraped from TCP connection timed out: 110: Connection timed out.
    {'Web Address': 'http://www.moxoms.co.uk/'}
    2019-07-05 10:04:15 [scrapy.core.engine] INFO: Closing spider (finished)
    2019-07-05 10:04:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
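
One observation from the log above: in every `houzz.in/trk/...` link shown, the path segment right after `/trk/` looks like the target URL encoded as Base64 with the `=` padding removed. If that holds for the whole list, the target can be recovered without sending any request, so DNS errors and timeouts stop mattering. The sketch below is a minimal illustration under that assumption; `decode_tracking_target` is a hypothetical helper name, and the `/trk/<payload>/...` layout is inferred only from the log lines, not from any documented format.

```python
import base64
from urllib.parse import urlsplit


def decode_tracking_target(tracking_url):
    """Return the URL embedded in a /trk/<payload>/... link, or None."""
    segments = urlsplit(tracking_url).path.split("/")
    # Assumed path layout (inferred from the log): /trk/<payload>/<hash>/...
    if len(segments) < 3 or segments[1] != "trk":
        return None
    payload = segments[2]
    # The payloads in the log lack '=' padding, so restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    try:
        decoded = base64.urlsafe_b64decode(payload).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None
    # Only accept results that actually look like a web address.
    return decoded if decoded.startswith(("http://", "https://")) else None
```

Inside the spider, such a helper could be called on each entry of `start_urls` to yield an item directly, keeping the network request only as a fallback when it returns `None`.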

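1 Answer:

Answer 0: (score: 0)

You are getting DNS errors because those domains are invalid, so their IP addresses cannot be resolved. Also, www is a subdomain, and it can cause problems for some websites. Try requesting the bare domain by stripping www.

Changing this line may reduce DNS problems for valid domains.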

for u in self.start_urls:
    yield scrapy.Request(u.replace('www.', ''), callback=self.parse_web, errback=self.errback_web, dont_filter=True)
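
Note that `str.replace('www.', '')` removes every occurrence of `www.` in the string, including any that appear in the path or query. A more conservative variant, sketched below with the standard library only, touches just a leading `www.` in the hostname. `strip_www` is a hypothetical helper name, and the sketch ignores URLs that carry credentials in front of the host. It only helps when the bare domain really resolves; for domains that do not exist at all, the DNS error will remain.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url):
    """Return url with a leading 'www.' removed from the hostname only."""
    parts = urlsplit(url)
    netloc = parts.netloc
    # Compare case-insensitively, since hostnames are not case-sensitive.
    if netloc.lower().startswith("www."):
        netloc = netloc[len("www."):]
    # SplitResult is a namedtuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(netloc=netloc))
```

In the spider this would be used as `scrapy.Request(strip_www(u), ...)` in place of the plain `replace` call.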