I am using Python 2.7 and Scrapy 1.3.0.
I need to set a proxy to access a website.
How do I set it up?
This is my parse code:

if theurl not in self.ProcessUrls:
    self.ProcessUrls.append(theurl)
    yield scrapy.Request(theurl, callback=self.parse)

Also, how do I make sure I only crawl URLs that have not been crawled before? If a URL is new, it should be crawled.
Answer 0 (score: 1)
You can set the proxy per request, like this:
request = Request(url="http://example.com")
request.meta['proxy'] = "host:port"
yield request
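If the proxy requires authentication, the credentials can usually be embedded in the proxy URL itself; Scrapy's built-in HttpProxyMiddleware extracts them and sends a Proxy-Authorization header. A minimal sketch of building such a URL (the user, password, host, and port below are placeholders, not real values):

```python
# Build a proxy URL with embedded credentials (all placeholder values).
# Scrapy's HttpProxyMiddleware parses user:password out of the URL
# and turns them into a Proxy-Authorization header.
user = "proxyuser"
password = "proxypass"
host = "host"
port = 8080
proxy_url = "http://%s:%s@%s:%d" % (user, password, host, port)

# This dict would be passed as the meta argument of scrapy.Request
meta = {'proxy': proxy_url}
print(meta['proxy'])  # http://proxyuser:proxypass@host:8080
```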
A simple implementation looks like this:
import scrapy

class MySpider(scrapy.Spider):
    name = "examplespider"
    allowed_domains = ["somewebsite.com"]
    start_urls = ['http://somewebsite.com/']

    def parse(self, response):
        # Here example.com is used. We usually get this URL by parsing the desired webpage
        request = scrapy.Request(url='http://example.com', callback=self.parse_url)
        request.meta['proxy'] = "host:port"
        yield request

    def parse_url(self, response):
        # Do the rest of the parsing work
        pass
If you want to use a proxy from the very first request, add the following as a spider class field:

class MySpider(scrapy.Spider):
    name = "examplespider"
    allowed_domains = ["somewebsite.com"]
    start_urls = ['http://somewebsite.com/']
    custom_settings = {
        'HTTPPROXY_ENABLED': True
    }

and then use the start_requests() method, as shown below:
def start_requests(self):
    urls = ['http://example.com']
    for url in urls:
        proxy = 'some proxy'
        yield scrapy.Request(url=url, callback=self.parse, meta={'proxy': proxy})

def parse(self, response):
    item = StatusCheckerItem()
    item['url'] = response.url
    return item
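As for the deduplication part of the question: Scrapy's scheduler already filters duplicate requests by default (a request is only re-crawled if you pass dont_filter=True), so the manual ProcessUrls list is usually unnecessary. If you do want to track visited URLs yourself, a set is a better fit than a list because membership tests are O(1). A standalone sketch of that idea:

```python
# Manual URL deduplication with a set: membership tests are O(1),
# unlike the O(n) scan that "url in some_list" performs.
seen_urls = set()

def is_new(url):
    """Return True (and remember the URL) only the first time it is seen."""
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True

print(is_new("http://somewebsite.com/page1"))  # True  (first visit)
print(is_new("http://somewebsite.com/page1"))  # False (duplicate)
```

Inside a spider you would call is_new(theurl) before yielding the Request, mirroring the check in the question.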
Answer 1 (score: 0)
You have to set the http_proxy environment variable. See: proxy for scrapy
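One way to set that variable from inside the process before the crawl starts is via os.environ; Scrapy's HttpProxyMiddleware picks up http_proxy from the environment. The proxy address below is a placeholder:

```python
import os

# Placeholder proxy address; HttpProxyMiddleware reads the http_proxy
# environment variable when proxy support is enabled.
os.environ['http_proxy'] = "http://host:8080"
print(os.environ['http_proxy'])  # http://host:8080
```

Alternatively, export http_proxy in the shell before running "scrapy crawl".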