I'm working with Scrapy, Privoxy, and Tor. Everything is installed and working, but Tor connects with the same IP every time, so I can easily get banned. Is it possible to tell Tor to reconnect every X seconds or every X connections?
Thanks!
EDIT on configuration: For the user agent pool I followed http://tangww.com/2013/06/UsingRandomAgent/ (I had to add an __init__.py file as mentioned in the comments), and for Privoxy and Tor I followed http://www.andrewwatters.com/privoxy/ (I had to create the private user and private group manually in the terminal). It works :)
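For reference, the idea behind a rotating user-agent middleware like the one in the linked post can be sketched as follows (the class name and the user-agent strings here are illustrative placeholders, not the ones from the blog):

```python
import random

# Illustrative pool; in practice use a larger, up-to-date list of agents.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

class RotateUserAgentMiddleware(object):
    """Set a randomly chosen User-Agent header on every outgoing request."""

    def process_request(self, request, spider):
        request.headers.setdefault(b"User-Agent", random.choice(USER_AGENTS))
```

Scrapy calls `process_request` for every request that passes through the downloader middleware chain, so each request gets a (possibly different) agent from the pool.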
My spider looks like this:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "spider_name"
    start_urls = [
        'https://example.com/listviews/titles.php',
    ]
    allowed_domains = ["example.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="tab7"]/article/header/h2/a/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)
        # Return and go to the next page in div#paginat ul li.next a::attr(href), then begin again
        next_page = response.css('ul.pagin li.presente ~ li a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # For the urls in the list, go inside, and in div#main take div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        # Parsing rules go here
        for each_book in response.css('main#main'):
            yield {
                'editor': each_book.css('header.datos1 > ul > li > h5 > a::text').extract(),
            }
In settings.py I have user agent rotation and Privoxy configured:
DOWNLOADER_MIDDLEWARES = {
    # user agent
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'spider_name.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
    # privoxy
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'spider_name.middlewares.ProxyMiddleware': 100,
}
In middlewares.py I added:
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
I think that's everything…
EDIT II ---
OK, I changed my middlewares.py file as @TomášLinhart said in the answer, from:
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
to:
from stem import Signal
from stem.control import Controller

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

    def set_new_ip():
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password='tor_password')
            controller.signal(Signal.NEWNYM)
But now it is really slow and doesn't seem to change the IP… Did I do it right, or what is wrong?
Answer 0 (score: 8)
This blog post may help you, as it deals with the same issue.
EDIT: Depending on the concrete requirement (a new IP for every request, or after N requests), put an appropriate call to set_new_ip in the process_request method of the middleware. Note, however, that a call to the set_new_ip function doesn't always have to ensure a new IP (there's a link to the FAQ in the linked post with an explanation).
EDIT2: The module with the ProxyMiddleware class would look like this:
from stem import Signal
from stem.control import Controller

def _set_new_ip():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='tor_password')
        controller.signal(Signal.NEWNYM)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        _set_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
Answer 1 (score: 6)
But Tor connects with the same IP every time

One thing to note is that a new circuit does not necessarily mean a new IP address. Paths are randomly selected based on heuristics like speed and stability. There are only so many large exits in the Tor network, so it's not uncommon to reuse an exit you have had previously.

That's why using the code below can result in reusing the same IP address.
from stem import Signal
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='tor_password')
    controller.signal(Signal.NEWNYM)
https://github.com/DusanMadar/TorIpChanger can help you manage this behavior. Disclaimer - I wrote TorIpChanger.
I've also put together a guide on how to use Python with Tor and Privoxy: https://gist.github.com/DusanMadar/8d11026b7ce0bce6a67f7dd87b999f6b.
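The core idea behind a reuse threshold can be sketched roughly as follows. This is a simplified stand-in for illustration, not TorIpChanger's actual implementation: keep a short history of recently seen exit IPs and accept a new exit only if it is not among the last N distinct IPs used.

```python
from collections import deque

class IpReuseTracker(object):
    """Accept an exit IP only if it is not among the last
    `reuse_threshold` distinct IPs that were used."""

    def __init__(self, reuse_threshold=10):
        # deque with maxlen automatically evicts the oldest entry.
        self.used_ips = deque(maxlen=reuse_threshold)

    def accept(self, ip):
        if ip in self.used_ips:
            return False  # seen too recently; ask Tor for another circuit
        self.used_ips.append(ip)
        return True
```

In real use you would loop: send a NEWNYM signal, read the current exit IP (e.g. via an IP-echo service fetched through the proxy), and retry until accept() returns True.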
Below is an example of how to use TorIpChanger (pip install toripchanger) in your ProxyMiddleware.
from toripchanger import TorIpChanger

# A Tor IP will be reused only after 10 different IPs have been used.
ip_changer = TorIpChanger(reuse_threshold=10)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        ip_changer.get_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
Or, if you want to use a different IP after every 10 requests, you can do something like the below.
from toripchanger import TorIpChanger

# A Tor IP will be reused only after 10 different IPs have been used.
ip_changer = TorIpChanger(reuse_threshold=10)

class ProxyMiddleware(object):
    _requests_count = 0

    def process_request(self, request, spider):
        self._requests_count += 1
        if self._requests_count > 10:
            self._requests_count = 0
            ip_changer.get_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])