I am using Scrapy with the proxy middleware enabled to scrape search-engine results pages, but every time I send a request my IP gets blocked. Everything I have tried so far has had no effect.
My settings.py:
import random

username = 'USERNAME'
password = 'password'
port = 'PORT'
session_id = random.random()
super_proxy_url = ('http://%s-session-%s:%s@zproxy.lum-superproxy.io:%s' %
                   (username, session_id, password, port))

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_MODE = 2  # scrapy_proxies: use the single custom proxy below
CUSTOM_PROXY = super_proxy_url

RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408, 429]
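One way to narrow this down is to confirm the super-proxy URL works outside Scrapy entirely. A minimal sanity check, assuming the requests library is available and using Bright Data's (formerly Luminati) usual super-proxy port 22225 and the httpbin.org/ip endpoint, both of which are my assumptions, not values from the question:

import random
import requests

username = 'USERNAME'
password = 'password'
port = 22225  # assumption: the common super-proxy port; use your own
session_id = random.random()
super_proxy_url = ('http://%s-session-%s:%s@zproxy.lum-superproxy.io:%d'
                   % (username, session_id, password, port))

# If this prints an IP that is not your own, the credentials work and the
# problem is in the Scrapy middleware chain rather than the proxy itself.
response = requests.get('https://httpbin.org/ip',
                        proxies={'http': super_proxy_url,
                                 'https': super_proxy_url},
                        timeout=30)
print(response.text)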
My spider:
import scrapy
from scrapy.utils.project import get_project_settings

# EmailsSpider is defined elsewhere in the project.

class RelSpider(EmailsSpider):
    name = 'rel_spider'
    custom_settings = {
        'ITEM_PIPELINES': {
            'scott_moses_emails.pipelines.RelPipeline': 400
        }
    }

    def __init__(self, start_urls='', relevanceKeyword='', rel=False):
        super(RelSpider, self).__init__(rel=rel)
        self.start_urls = self.remove_https_www(
            self.read_start_urls_file(self.flags['start_urls']))
        self.keyword = relevanceKeyword

    def start_requests(self):
        settings = get_project_settings()
        for url in self.start_urls:
            base_url = 'https://www.google.com/search?q=site%3A'
            no_country_redirect = '&pws=0&gl=us&gws_rd=cr'
            google_url = base_url + url + no_country_redirect
            print(google_url)
            yield scrapy.Request(url=google_url, callback=self.parse_count,
                                 dont_filter=True, meta={'root': url})
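(As an aside, concatenating the raw domain into the query string only works while it contains no reserved characters; a safer sketch, assuming Python 3's urllib.parse, would build the whole query in one place:)

from urllib.parse import urlencode

def build_google_url(domain):
    # Encodes the site: query plus the no-country-redirect parameters.
    params = {'q': 'site:%s' % domain, 'pws': 0, 'gl': 'us', 'gws_rd': 'cr'}
    return 'https://www.google.com/search?' + urlencode(params)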
What I get back is an error message telling me that my IP has been blocked, which surprises me because I thought the proxy would hide my IP.
Please help.
Answer 0 (score: 0)
To use a proxy, you have to define the proxy meta key on the corresponding request. Check the documentation.
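A minimal sketch of what that looks like inside start_requests, reusing the super_proxy_url built in settings.py (that reuse is my assumption):

# Set the 'proxy' meta key on the request itself; Scrapy's built-in
# HttpProxyMiddleware reads request.meta['proxy'] and routes the
# request through that proxy.
yield scrapy.Request(url=google_url, callback=self.parse_count,
                     dont_filter=True,
                     meta={'root': url, 'proxy': super_proxy_url})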