I want to crawl a website through proxies, but the crawler crashes after the third attempt. Here is the code I am using. I have a large database of proxies and I am using the scrapy-rotating-proxies library, so I pass the proxies in as ROTATING_PROXY_LIST. The crawler starts and, after a short while, crashes without moving on to the next proxy and without downloading the page.
import scrapy, sqlite3
from scrapy.crawler import CrawlerProcess
from rotating_proxies.policy import BanDetectionPolicy
from rotating_proxies.middlewares import RotatingProxyMiddleware
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.project import get_project_settings

class TestSpider(CrawlSpider):
    name = 'get_files'
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(TestSpider, self).__init__(*args, **kwargs)
        self.response_list = list()
        self.crowled_hrefs = list()
        self.allowed_domains = [kwargs.get("domain")]
        self.what_to_look = kwargs.get("what_to_look")
        self.start_urls = ["https://www." + kwargs.get("domain")]
    def parse_item(self, response):
        #settings = scrapy.crawler.settings
        #print(settings)
        if response.status == 200:  # or (response.status == 301)
            all_responses = response.css("a::attr(href)").extract()
            print(len(all_responses), "<<<<<<")
            for res in all_responses:
                #print(res)
                if res not in self.response_list:
                    self.response_list.append(res)
                    if res.endswith(self.what_to_look):
                        print(res)
                        print("pdf")
        else:
            return b'banned' in response.body
    def response_is_ban(self, request, response):
        if response not in self.crowled_hrefs:
            self.crowled_hrefs.append(response)
        else:
            print("HELLO")
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        print(request, "THERE", exception)
        return None
if __name__ == "__main__":
    conn = sqlite3.connect("hi.db", check_same_thread=False)
    c = conn.cursor()
    list_of_proxes = list()
    p = c.execute("SELECT proxy from proxies").fetchall()
    for i in p:
        list_of_proxes.append(i[0].rstrip())
    c.close()
    conn.close()
    print(len(list_of_proxes))
    custom_settings = {
        "LOG_ENABLED": True,
        "ROTATING_PROXY_LIST": list_of_proxes,
        "DEPTH_LIMIT": 1,
        #"ROTATING_PROXY_BACKOFF_BASE": 3600,
        #"ROTATING_PROXY_BACKOFF_CAP": 3600,
        "ROTATING_PROXY_PAGE_RETRY_TIMES": 5,
        "DOWNLOAD_TIMEOUT": 3,
        "DOWNLOADER_MIDDLEWARES": {
            "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
            "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        },
    }
    process = CrawlerProcess(custom_settings)
    process.crawl(TestSpider, domain="palaplast.gr", what_to_look=(".pdf", ".img", ".exe"))  #"https://palaplast.gr/katalogos/"
    #process.crawl(TestSpider1)
    process.start()
print("END")
This is the error I get. How can I overcome it so the crawler moves on to the next proxy and keeps downloading the target site?
1761
2020-10-21 11:56:41 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2020-10-21 11:56:41 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.19041-SP0
2020-10-21 11:56:41 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-21 11:56:41 [scrapy.crawler] INFO: Overridden settings:
{'DEPTH_LIMIT': 1, 'DOWNLOAD_TIMEOUT': 3}
2020-10-21 11:56:41 [scrapy.extensions.telnet] INFO: Telnet Password: 70605a3422c30cb7
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'rotating_proxies.middlewares.RotatingProxyMiddleware',
'rotating_proxies.middlewares.BanDetectionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-21 11:56:41 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-21 11:56:41 [scrapy.core.engine] INFO: Spider opened
2020-10-21 11:56:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-21 11:56:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-21 11:56:41 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 1695, reanimated: 0, mean backoff time: 0s)
<GET https://www.palaplast.gr> THERE User timeout caused connection failure: Getting https://www.palaplast.gr took longer than 3.0 seconds..
2020-10-21 11:56:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.palaplast.gr> (failed 1 times): User timeout caused connection failure: Getting https://www.palaplast.gr took longer than 3.0 seconds..
<GET https://www.palaplast.gr> THERE User timeout caused connection failure.
2020-10-21 11:56:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.palaplast.gr> (failed 2 times): User timeout caused connection failure.
<GET https://www.palaplast.gr> THERE Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409, 'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.palaplast.gr> (failed 3 times): Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409, 'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.palaplast.gr>
Traceback (most recent call last):
File "C:\Users\john\AppData\Local\Programs\Python\Python38-32\lib\site-packages\scrapy\core\downloader\middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy 172.67.182.2:80 [{'status': 409, 'reason': b'Conflict'}]
2020-10-21 11:56:47 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-21 11:56:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
'downloader/request_bytes': 648,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'elapsed_time_seconds': 6.250498,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 21, 8, 56, 47, 867509),
'log_count/DEBUG': 2,
'log_count/ERROR': 2,
'log_count/INFO': 11,
'proxies/mean_backoff': 0.0,
'proxies/reanimated': 0,
'proxies/unchecked': 1695,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.TimeoutError': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2020, 10, 21, 8, 56, 41, 617011)}
2020-10-21 11:56:47 [scrapy.core.engine] INFO: Spider closed (finished)
END
Answer 0 (score: 0)
Sometimes you have to sharpen the axe before felling the tree. ROTATING_PROXY_PAGE_RETRY_TIMES limits how many different proxies scrapy-rotating-proxies will try for a single page before giving up on it. With roughly 1700 unchecked proxies in your pool and a limit of 5, the request is abandoned long before a working proxy is found. Just change

"ROTATING_PROXY_PAGE_RETRY_TIMES": 5

to

"ROTATING_PROXY_PAGE_RETRY_TIMES": 3600

so a page can be retried across up to 3600 proxies.
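A minimal sketch of the adjusted settings dict, assuming the same setup as in the question. The proxy list here is a placeholder for the list_of_proxes loaded from the database, and the larger DOWNLOAD_TIMEOUT is my own suggestion, not part of the answer above: 3 seconds is very tight for slow rotating proxies and causes the timeout errors seen in the log.

```python
# Adjusted settings sketch for the crawl in the question.
# Assumptions: replace the placeholder proxy with list_of_proxes from the DB;
# 3600 roughly matches the size of the proxy pool.
custom_settings = {
    "LOG_ENABLED": True,
    "ROTATING_PROXY_LIST": ["127.0.0.1:8000"],  # placeholder; use list_of_proxes
    "DEPTH_LIMIT": 1,
    # Try many proxies per page instead of 5, so a page is not abandoned
    # just because the first few proxies happened to be dead.
    "ROTATING_PROXY_PAGE_RETRY_TIMES": 3600,
    # A more forgiving timeout (judgment call): slow proxies get a chance
    # to respond instead of being counted as failures after 3 seconds.
    "DOWNLOAD_TIMEOUT": 30,
    "DOWNLOADER_MIDDLEWARES": {
        "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
        "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    },
}
print(custom_settings["ROTATING_PROXY_PAGE_RETRY_TIMES"])
```

With this in place, a request that fails through one proxy is re-scheduled with another instead of exhausting its retries after a handful of dead entries.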