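I have been looking around for a decent proxy pooling system for Scrapy, but I can't find anything that does what I need/want.
I am looking for a solution that does the following:
If a proxy times out or is slow, it should be blacklisted according to a set of rules... (Scrapoxy only blacklists based on the number of instances/startups)
If a proxy is slow (takes over x time), it should be marked as Slow, a timestamp should be taken, and a counter should be increased.
If a proxy times out, it should be marked as Fail, a timestamp should be taken, and a counter should be increased.
Does anyone know of any such solution? (The main feature is blacklisting slow/timed-out proxies...)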
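Answer 0 (score: 1)
Since your pooling rules are very specific, you can write your own code. The code below implements some parts of the rules (you will have to implement the rest yourself):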
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import time
from random import shuffle

import pexpect


# Test a single proxy by opening a telnet connection to it.
def test_proxy(ip, port, max_timeout=1):
    child = pexpect.spawn("telnet " + ip + " " + str(port))
    time_send_request = time.time()
    try:
        # max_timeout is in seconds
        i = child.expect(["Connected to", "Connection refused"], timeout=max_timeout)
    except (pexpect.TIMEOUT, pexpect.EOF):
        # No answer in time, or telnet exited without matching either pattern.
        i = -1
    finally:
        # Do not leave telnet processes running.
        child.close()
    if i == 0:
        time_request_ok = time.time()
        return {"status": True, "time_to_answer": time_request_ok - time_send_request}
    else:
        return {"status": False, "time_to_answer": max_timeout}


# Test every proxy in the list, update its status, and apply your custom rules.
def update_proxy_list_status(proxy_list):
    for i, proxy in enumerate(proxy_list):
        print("testing proxy " + str(i) + " " + proxy["ip"] + ":" + str(proxy["port"]))
        proxy_status = test_proxy(proxy["ip"], proxy["port"])
        proxy["status_ok"] = proxy_status["status"]
        print(proxy_status)
        # This is the place to apply your own rules and update the proxy dict:
        #~ If a proxy is slow (takes over x time) it should be marked as Slow and a timestamp should be taken and a counter should be increased.
        #~ If a proxy times out it should be marked as Fail and a timestamp should be taken and a counter should be increased.
        #~ If a proxy has no slows for 15 minutes after receiving its last slow then the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy has no fails for 30 minutes after receiving its last fail then the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy is slow 5 times in 1 hour then it should be removed from the pool for 1 hour.
        #~ If a proxy times out 5 times in 1 hour then it should be blacklisted for 1 hour.
        #~ If a proxy gets blocked twice in 3 hours it should be blacklisted for 12 hours and marked as bad.
        #~ If a proxy gets marked as bad twice in 48 hours then it should notify me (email, push bullet... anything).
        if proxy_status["status"]:
            # Modify the proxy dict with your own rules (adding timestamp, last check time, last down, last up, etc.)
            # ...
            pass
        else:
            # Modify the proxy dict with your own rules (adding timestamp, last check time, last down, last up, etc.)
            # ...
            pass
    return proxy_list


# Select a good proxy and do the job.
def main():
    # First populate a proxy list. These example proxies come from http://free-proxy.cz/en/
    proxy_list = [
        {"ip": "167.99.2.12", "port": 8080},  # bad proxy
        {"ip": "167.99.2.17", "port": 8080},
        {"ip": "66.70.160.171", "port": 1080},
        {"ip": "192.99.220.151", "port": 8080},
        {"ip": "142.44.137.222", "port": 80},
        # [...]
    ]
    # Keep track of the last used proxy (to avoid using the same one twice in a row).
    previous_proxy_ip = ""
    the_job = True
    while the_job:
        # Update the status of every proxy.
        proxy_list = update_proxy_list_status(proxy_list)
        # Keep only the proxies considered OK.
        good_proxy_list = [d for d in proxy_list if d["status_ok"]]
        # Here you can shuffle the list.
        shuffle(good_proxy_list)
        # Select a proxy (not the one used last time).
        # current_proxy stays empty if no other good proxy is available.
        current_proxy = {}
        for proxy in good_proxy_list:
            if proxy["ip"] != previous_proxy_ip:
                previous_proxy_ip = proxy["ip"]
                current_proxy = proxy
                break
        # Use the selected proxy to do the job.
        print("the current proxy is: " + str(current_proxy))
        # UPDATE SCRAPY PROXY
        # DO THE SCRAPY JOB
        print("DO MY SCRAPY JOB with the current proxy settings")
        # Wait a few seconds.
        time.sleep(5)


if __name__ == "__main__":
    main()
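
As a starting point for the parts left to implement, here is a minimal sketch of the Slow/Fail counters and the one-hour ban. It is only one possible way to do it, not a finished solution. The names `new_proxy_state`, `record_result` and `is_usable` are hypothetical helpers, and `SLOW_THRESHOLD` is a made-up value standing in for the "x time" from the question. The sketch assumes results shaped like the dict returned by `test_proxy` (`status` and `time_to_answer`). The "blocked twice in 3 hours" and notification rules are not covered.

```python
import time
from collections import deque

SLOW_THRESHOLD = 0.5   # seconds; assumed value for "x time", tune it yourself
WINDOW = 60 * 60       # count events over the last hour
MAX_EVENTS = 5         # 5 slows or 5 fails within the window trigger a ban
BAN_DURATION = 60 * 60 # ban for one hour
# Quiet period after the last event that resets the counter to a fresh state.
RESET_AFTER = {"slow": 15 * 60, "fail": 30 * 60}


def new_proxy_state():
    """Fresh state: one timestamp queue per event kind, not banned."""
    return {"slow": deque(), "fail": deque(), "banned_until": 0.0}


def record_result(state, result, now=None):
    """Update state from one test result and return "ok", "slow" or "fail"."""
    now = time.time() if now is None else now

    # Reset rule: no new event for long enough after the last one.
    for kind, quiet in RESET_AFTER.items():
        events = state[kind]
        if events and now - events[-1] >= quiet:
            events.clear()

    if not result["status"]:
        kind = "fail"
    elif result["time_to_answer"] > SLOW_THRESHOLD:
        kind = "slow"
    else:
        return "ok"

    events = state[kind]
    events.append(now)  # the timestamp doubles as the counter entry
    # Forget events that fell out of the one-hour window.
    while events and now - events[0] > WINDOW:
        events.popleft()

    if len(events) >= MAX_EVENTS:
        state["banned_until"] = now + BAN_DURATION
        events.clear()  # start counting from zero once the ban is set
    return kind


def is_usable(state, now=None):
    """True if the proxy is not currently banned."""
    now = time.time() if now is None else now
    return now >= state["banned_until"]
```

You could keep one such state per proxy (for example under a `"state"` key in each proxy dict), call `record_result` inside `update_proxy_list_status`, and filter the pool with `is_usable` in addition to `status_ok`. Storing timestamps in a `deque` means the counter is simply the length of the queue, so the counter and the timestamp are reset together.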