I am trying to use Scrapy with Splash and rotating proxies. Here is my settings.py:
ROBOTSTXT_OBEY = False
BOT_NAME = 'mybot'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
LOG_LEVEL = 'INFO'
USER_AGENT = 'Mozilla/5.0'
# JSON file pretty formatting
FEED_EXPORT_INDENT = 4
# Suppress dataloss warning messages of scrapy downloader
DOWNLOAD_FAIL_ON_DATALOSS = False
DOWNLOAD_DELAY = 1.25
# Enable or disable spider middlewares
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Enable or disable downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
# Splash settings
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
SPLASH_URL = 'http://localhost:8050'
I set ROTATING_PROXY_LIST in my spider:
import re
import requests

# Inside the spider class: pull every "IP:port" token out of the public list
proxy_list = re.findall(r'(\d*\.\d*\.\d*\.\d*\:\d*)\b',
    requests.get("https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list.txt").text)
custom_settings = {'ROTATING_PROXY_LIST': proxy_list}
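In case the proxy list itself is part of the problem, here is a minimal sketch for checking the extraction regex in isolation, without any network access. The sample text and the helper name extract_proxies are made up for illustration (the addresses are reserved documentation ranges, not real proxies); only the pattern is taken from my spider.

```python
import re

# Same pattern as in the spider: matches "IP:port" tokens.
PROXY_RE = re.compile(r'(\d*\.\d*\.\d*\.\d*\:\d*)\b')


def extract_proxies(text):
    """Return every IP:port token found in text, in order of appearance."""
    return PROXY_RE.findall(text)


# Hypothetical sample that only loosely mimics a proxy list file.
SAMPLE = """Proxy list (hypothetical header)
IP address:Port Country-Anonymity
192.0.2.10:8080 US-N
198.51.100.7:3128 DE-H-S +
"""

if __name__ == "__main__":
    for proxy in extract_proxies(SAMPLE):
        print(proxy)
```

If this returns the expected entries, the list passed to ROTATING_PROXY_LIST is probably well-formed and the issue is more likely in how the requests are routed.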
I start Splash with
docker run -p 8050:8050 scrapinghub/splash
Here is how the requests are started in start_requests:
from scrapy_splash import SplashRequest

# Method of the spider class: send each URL through Splash for rendering
def start_requests(self):
    urls = ['http://example-com/page_1.html', 'http://example-com/page_1.html']
    for url in urls:
        yield SplashRequest(url,
                            self.parse_url,
                            headers={'User-Agent': self.user_agent},
                            args={'render_all': 1, 'wait': 0.5}
                            )
However, when I run the scraper, I don't see any requests going through Splash. How can I fix this?
Thanks, Zin