I want to crawl a website through proxies, but the crawler crashes after the third attempt. Here is the code I am using. I have a large database of proxies and I am using the scrapy-rotating-proxies library, so I pass the proxies in as ROTATING_PROXY_LIST. The crawler starts and, after a short while, crashes without moving on to the next proxy and without downloading the page.
import scrapy, sqlite3
from scrapy.crawler import CrawlerProcess
from rotating_proxies.policy import BanDetectionPolicy
from rotating_proxies.middlewares import RotatingProxyMiddleware
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.project import get_project_settings

class TestSpider(CrawlSpider):
    name = 'get_files'
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(TestSpider, self).__init__(*args, **kwargs)
        self.response_list = list()
        self.crowled_hrefs = list()
        self.allowed_domains = [kwargs.get("domain")]
        self.what_to_look = kwargs.get("what_to_look")
        self.start_urls = ["https://www." + kwargs.get("domain")]
    def parse_item(self, response):
        #settings = scrapy.crawler.settings
        #print(settings)
        if response.status == 200:  # or (response.status == 301)
            all_responses = response.css("a::attr(href)").extract()
            print(len(all_responses), "<<<<<<")
            for res in all_responses:
                #print(res)
                if res not in self.response_list:
                    self.response_list.append(res)
                    if res.endswith(self.what_to_look):
                        print(res)
                        print("pdf")
        else:
            return b'banned' in response.body
    def response_is_ban(self, request, response):
        if response not in self.crowled_hrefs:
            self.crowled_hrefs.append(response)
        else:
            print("HELLO")
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        print(request, "THERE", exception)
        return None
if __name__ == "__main__":
    conn = sqlite3.connect("hi.db", check_same_thread=False)
    c = conn.cursor()
    list_of_proxes = list()
    p = c.execute("SELECT proxy from proxies").fetchall()
    for i in p:
        list_of_proxes.append(i[0].rstrip())
    c.close()
    conn.close()
    print(len(list_of_proxes))
    custom_settings = {
        "LOG_ENABLED": True,
        "ROTATING_PROXY_LIST": list_of_proxes,
        "DEPTH_LIMIT": 1,
        #"ROTATING_PROXY_BACKOFF_BASE": 3600,
        #"ROTATING_PROXY_BACKOFF_CAP": 3600,
        "ROTATING_PROXY_PAGE_RETRY_TIMES": 5,
        "DOWNLOAD_TIMEOUT": 3,
        "DOWNLOADER_MIDDLEWARES": {
            "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
            "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        },
    }
    process = CrawlerProcess(custom_settings)
    process.crawl(TestSpider, domain="palaplast.gr", what_to_look=(".pdf", ".img", ".exe"))  #"https://palaplast.gr/katalogos/"
    #process.crawl(TestSpider1)
    process.start()
print("END")
This is the error I get. How can I overcome it so the crawler moves on to the next proxy and keeps downloading the target site?
1761
2020-10-21 11:56:41 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-21 11:56:41 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.19041-SP0
2020-10-21 11:56:41 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-21 11:56:41 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DOWNLOAD_TIMEOUT': 3}
2020-10-21 11:56:41 [scrapy.extensions.telnet] INFO: Telnet Password: 70605a3422c30cb7
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'rotating_proxies.middlewares.RotatingProxyMiddleware',
'rotating_proxies.middlewares.BanDetectionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-21 11:56:41 [scrapy.core.engine] INFO: Spider opened
2020-10-21 11:56:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-21 11:56:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-21 11:56:41 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 1695, reanimated: 0, mean backoff time: 0s)
<GET https://www.palaplast.gr> THERE User timeout caused connection failure: Getting https://www.palaplast.gr took longer than 3.0 seconds..
2020-10-21 11:56:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.palaplast.gr> (failed 1 times): User timeout caused connection failure: Getting https://www.palaplast.gr took longer than 3.0 seconds..
<GET https://www.palaplast.gr> THERE User timeout caused connection failure.
2020-10-21 11:56:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.palaplast.gr> (failed 2 times): User timeout caused connection failure.
<GET https://www.palaplast.gr> THERE Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409, 'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.palaplast.gr> (failed 3 times): Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409, 'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.palaplast.gr>
Traceback (most recent call last):
File "C:\Users\john\AppData\Local\Programs\Python\Python38-32\lib\site-packages\scrapy\core\downloader\middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409, 'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-21 11:56:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
'downloader/request_bytes': 648,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'elapsed_time_seconds': 6.250498,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 21, 8, 56, 47, 867509),
'log_count/DEBUG': 2,
'log_count/ERROR': 2,
'log_count/INFO': 11,
'proxies/mean_backoff': 0.0,
'proxies/reanimated': 0,
'proxies/unchecked': 1695,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.TimeoutError': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2020, 10, 21, 8, 56, 41, 617011)}
2020-10-21 11:56:47 [scrapy.core.engine] INFO: Spider closed (finished)
END
Answer 0 (score: 0)
Sometimes you have to sharpen the axe before felling the tree. ROTATING_PROXY_PAGE_RETRY_TIMES limits how many different proxies scrapy-rotating-proxies will try for a single page before giving up on it. With roughly 1700 unchecked proxies in your pool and a limit of 5, the request is abandoned long before a working proxy is found. Just change

"ROTATING_PROXY_PAGE_RETRY_TIMES": 5

to

"ROTATING_PROXY_PAGE_RETRY_TIMES": 3600

so a page can be retried across up to 3600 proxies.
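A minimal sketch of the adjusted settings dict, assuming the same setup as in the question. The proxy list here is a placeholder for the list_of_proxes loaded from the database, and the larger DOWNLOAD_TIMEOUT is my own suggestion, not part of the answer above: 3 seconds is very tight for slow rotating proxies and causes the timeout errors seen in the log.

```python
# Adjusted settings sketch for the crawl in the question.
# Assumptions: replace the placeholder proxy with list_of_proxes from the DB;
# 3600 roughly matches the size of the proxy pool.
custom_settings = {
    "LOG_ENABLED": True,
    "ROTATING_PROXY_LIST": ["127.0.0.1:8000"],  # placeholder; use list_of_proxes
    "DEPTH_LIMIT": 1,
    # Try many proxies per page instead of 5, so a page is not abandoned
    # just because the first few proxies happened to be dead.
    "ROTATING_PROXY_PAGE_RETRY_TIMES": 3600,
    # A more forgiving timeout (judgment call): slow proxies get a chance
    # to respond instead of being counted as failures after 3 seconds.
    "DOWNLOAD_TIMEOUT": 30,
    "DOWNLOADER_MIDDLEWARES": {
        "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
        "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    },
}
print(custom_settings["ROTATING_PROXY_PAGE_RETRY_TIMES"])
```

With this in place, a request that fails through one proxy is re-scheduled with another instead of exhausting its retries after a handful of dead entries.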