I'm using Scrapy to crawl some pages and I get the following error:
twisted.internet.error.ConnectionLost
My command-line output:
2015-05-04 18:40:32+0800 [cnproxy] INFO: Spider opened
2015-05-04 18:40:32+0800 [cnproxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy1.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy3.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy3.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy8.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy8.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy9.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy2.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy9.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy10.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy10.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy7.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy7.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy5.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy5.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy6.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy6.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy4.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy4.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] INFO: Closing spider (finished)
2015-05-04 18:40:35+0800 [cnproxy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 36,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 36,
'downloader/request_bytes': 8121,
'downloader/request_count': 36,
'downloader/request_method_count/GET': 36,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 4, 10, 40, 35, 608377),
'log_count/DEBUG': 38,
'log_count/ERROR': 12,
'log_count/INFO': 7,
'scheduler/dequeued': 36,
'scheduler/dequeued/memory': 36,
'scheduler/enqueued': 36,
'scheduler/enqueued/memory': 36,
'start_time': datetime.datetime(2015, 5, 4, 10, 40, 32, 624695)}
2015-05-04 18:40:35+0800 [cnproxy] INFO: Spider closed (finished)
My settings.py:
SPIDER_MODULES = ['proxy.spiders']
NEWSPIDER_MODULE = 'proxy.spiders'
DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30
ITEM_PIPELINES = {
'proxy.pipelines.ProxyPipeline':100,
}
CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 64
#CONCURRENT_SPIDERS = 128
LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FILE = '/home/hadoop/modules/scrapy/myapp/proxy/proxy.log'
LOG_LEVEL = 'DEBUG'
LOG_STDOUT = False
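Incidentally, the "failed 3 times" lines in the log come from Scrapy's retry middleware, whose defaults allow two retries after the first attempt; the retry budget can be tuned in the same settings.py (the values below are the defaults):

```python
# Scrapy retry-middleware settings (shown at their default values);
# two retries plus the initial attempt yield "failed 3 times" in the log.
RETRY_ENABLED = True
RETRY_TIMES = 2
```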
My spider, proxy_spider.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from proxy.items import ProxyItem
import re

class ProxycrawlerSpider(CrawlSpider):
    name = 'cnproxy'
    allowed_domains = ['www.cnproxy.com']
    indexes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    start_urls = []
    for i in indexes:
        url = 'http://www.cnproxy.com/proxy%s.html' % i
        start_urls.append(url)
    start_urls.append('http://www.cnproxy.com/proxyedu1.html')
    start_urls.append('http://www.cnproxy.com/proxyedu2.html')

    def parse_ip(self, response):
        sel = HtmlXPathSelector(response)
        addresses = sel.select('//tr[position()>1]/td[position()=1]').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
        protocols = sel.select('//tr[position()>1]/td[position()=2]').re(r'<td>(.*)</td>')
        locations = sel.select('//tr[position()>1]/td[position()=4]').re(r'<td>(.*)</td>')
        ports_re = re.compile(r'write\(":"(.*)\)')
        raw_ports = ports_re.findall(response.body)
        port_map = {'z': '3', 'm': '4', 'k': '2', 'l': '9', 'd': '0',
                    'b': '5', 'i': '7', 'w': '6', 'r': '8', 'c': '1', '+': ''}
        ports = []
        for port in raw_ports:
            tmp = port
            for key in port_map:
                tmp = tmp.replace(key, port_map[key])
            ports.append(tmp)
        items = []
        for i in range(len(addresses)):
            item = ProxyItem()
            item['address'] = addresses[i]
            item['protocol'] = protocols[i]
            item['location'] = locations[i]
            item['port'] = ports[i]
            items.append(item)
        return items
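The inner replace loop in the spider decodes cnproxy's obfuscated ports by mapping each letter to a digit and stripping the '+' separators. As a standalone sketch (the sample string below is hypothetical; real strings vary per page):

```python
# Letter-to-digit map used by the spider; '+' is a separator and is dropped.
PORT_MAP = {'z': '3', 'm': '4', 'k': '2', 'l': '9', 'd': '0',
            'b': '5', 'i': '7', 'w': '6', 'r': '8', 'c': '1', '+': ''}

def decode_port(raw):
    # Apply every substitution; order does not matter because the keys
    # (letters and '+') never appear among the replacement digits.
    for key, digit in PORT_MAP.items():
        raw = raw.replace(key, digit)
    return raw

print(decode_port('r+d+r+d'))  # prints 8080
```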
Is there anything wrong with my pipeline or settings? If not, how can I prevent the twisted.internet.error.ConnectionLost error?
I tried scrapy shell:
$scrapy shell http://www.cnproxy.com/proxy1.html
and got the same error as in the title, although I can access the page with Chrome. I tried other pages, such as
$scrapy shell http://stackoverflow.com
and they all worked fine.
Answer (9 votes):
You need to set a user-agent string. Some sites block requests whose user agent does not look like a browser. You can find examples of user-agent strings online.
This article also covers best practices for keeping a spider from getting blocked.
Open settings.py and add the following user agent:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'
You can also try a user-agent randomizer.
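A minimal sketch of such a randomizer as a Scrapy downloader middleware (the class name and the agent strings below are illustrative, not from any particular package):

```python
import random

# Illustrative pool of browser user-agent strings; substitute real,
# current ones in practice.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 '
    '(KHTML, like Gecko) Version/7.0.3 Safari/537.75.14',
]

class RandomUserAgentMiddleware(object):
    """Downloader middleware that assigns a random user agent to each request."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # returning None lets the request proceed normally
```

Enable it in settings.py with something like DOWNLOADER_MIDDLEWARES = {'proxy.middlewares.RandomUserAgentMiddleware': 400} (the module path is an assumption about your project layout).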