Question

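I am trying to scrape a website that is sophisticated enough to block bots: it allows only a small number of requests, after which Scrapy hangs.

Question 1: If Scrapy hangs, is there a way to restart the crawl from the same point? To work around the problem, I wrote a settings file like this: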

BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'

SPIDER_MODULES = ['yp.spiders']
NEWSPIDER_MODULE = 'yp.spiders'
DEFAULT_ITEM_CLASS = 'yp.items.YpItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 0.25
DUPEFILTER=True
COOKIES_ENABLED=False
RANDOMIZE_DOWNLOAD_DELAY=True
SCHEDULER_ORDER='BFO'

This is my spider:

# Imports for the Scrapy 0.x API this spider uses
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector

class ypSpider(CrawlSpider):

   name = "yp"

   start_urls = [
       SOME URL
   ]
   rules=(
      #These are some rules
   )
   def parse_item(self, response):
       ####################################################################
       #cleaning the html page by removing scripts html tags
       ####################################################################
       hxs=HtmlXPathSelector(response)

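The question is where I can configure the HTTP proxy, and whether I have to import any Tor-related classes. I am new to Scrapy. Having learned this much, I am now trying to learn how to use IP rotation or Tor.

As one of our members suggested, I started using Tor, and I set http_proxy to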

set http_proxy=http://localhost:8118

But it raises an error:

failure with no frames>: class 'twisted.internet.error.ConnectionRefusedError'   Connection was refused by other side 10061: No connection could be made because the target machine actively refused it.

So I changed http_proxy to

set http_proxy=http://localhost:9051

Now the error is

failure with no frames>: class 'twisted.internet.error.ConnectionDone' connection was closed cleanly.

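I checked the Firefox network settings. There is no HTTP proxy configured there; instead it uses SOCKS v5, and it shows 127.0.0.1:9051. (Before Tor, it worked without any proxy.) Please help me; I still do not understand how to use Tor through Scrapy. Which Tor bundle should I use, and how? I hope both of my questions can be resolved:

If the Scrapy crawler hangs for some reason (connection failure), I want to resume the crawl from that point.
How do I use rotating IPs in Scrapy?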

Answer 1

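Tor itself is not an HTTP proxy. Port 8118 and the connection-refused error suggest that you are not running privoxy [1] correctly. Try setting up privoxy properly, then try again with the environment variable http_proxy=http://localhost:8118.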
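Before starting the crawl, it can help to confirm that something is actually listening on the proxy port, since a refused connection usually means nothing is. The following is only a minimal sketch using the standard library; the helper name port_is_open is made up for illustration, and 8118 is assumed to be the port privoxy listens on (its usual default).

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers "connection refused" and timeouts alike.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Assumption: privoxy is expected on localhost:8118.
    if port_is_open("127.0.0.1", 8118):
        print("Something is listening on 8118")
    else:
        print("Nothing is listening on 8118; check that privoxy is running")
```

If this reports that nothing is listening, the problem is with the privoxy setup rather than with Scrapy.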

I have successfully crawled through Tor using privoxy and Scrapy.
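As an alternative to the environment variable, the proxy can also be set per request through request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honours. The sketch below is not the setup described above, just one possible way to do it; the class name PrivoxyProxyMiddleware and the constant PRIVOXY_URL are hypothetical, and the small stand-in request class exists only so the snippet runs without Scrapy installed.

```python
# Assumption: privoxy listens on its usual default address.
PRIVOXY_URL = "http://127.0.0.1:8118"


class PrivoxyProxyMiddleware(object):
    """Hypothetical downloader middleware that routes every request via privoxy."""

    def process_request(self, request, spider):
        # Scrapy reads the proxy for a request from request.meta['proxy'].
        request.meta["proxy"] = PRIVOXY_URL
        # Returning None lets Scrapy continue processing the request normally.
        return None


class StandInRequest(object):
    """Minimal stand-in for scrapy.http.Request, used only for this demo."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


if __name__ == "__main__":
    req = StandInRequest("http://example.com/")
    PrivoxyProxyMiddleware().process_request(req, spider=None)
    print(req.meta["proxy"])
```

In a real project the stand-in class would be dropped and the middleware registered under DOWNLOADER_MIDDLEWARES in the settings file, using whatever module path the project places it in.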

[1] http://www.privoxy.org/
