I'm a beginner at web scraping, so I may be asking this the wrong way :) To make Scrapy and Selenium work together, I created this downloader middleware:
from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.utils.python import to_bytes
from selenium import webdriver
from selenium.common.exceptions import NoSuchWindowException, WebDriverException
from selenium.webdriver.chrome.options import Options


class SeleniumDownloaderMiddleware(object):
    def __init__(self):
        self.driver = None

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened,
                                signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed,
                                signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        try:
            # JS processing: render the page in the browser
            self.driver.get(request.url)
            body = to_bytes(self.driver.page_source)
            return HtmlResponse(self.driver.current_url, body=body,
                                encoding='utf-8', request=request)
        except (WebDriverException, NoSuchWindowException):
            # Browser crashed: restart it and retry once
            self.spider_opened(spider)
            self.driver.get(request.url)
            body = to_bytes(self.driver.page_source)
            return HtmlResponse(self.driver.current_url, body=body,
                                encoding='utf-8', request=request)

    def spider_opened(self, spider):
        options = Options()
        # Ban on downloading files inside the browser
        options.add_experimental_option("prefs", {
            "download.default_directory": "NUL",
            "download.prompt_for_download": False,
        })
        options.add_argument('--ignore-certificate-errors')
        options.add_argument("--test-type")
        self.driver = webdriver.Chrome(options=options)

    def spider_closed(self, spider):
        if self.driver:
            self.driver.quit()
            self.driver = None
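The middleware is enabled in settings.py in the usual way (the module path and priority here are placeholders for my project):

# settings.py -- module path and priority are placeholders
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumDownloaderMiddleware': 543,
}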
Now every request from Scrapy goes through this Selenium middleware first. However, I want to save PDFs without going through the middleware, straight from the Scrapy spider:
# assumes: from scrapy import Request; import os;
# self.folder and self.counter are set in the spider's __init__
def parse(self, response):
    # Follow links to PDF files (either extension casing)
    for href in response.css('a[href$=".pdf"]::attr(href)').extract() + \
            response.css('a[href$=".PDF"]::attr(href)').extract():
        url = response.urljoin(href)
        yield Request(url=url, callback=self.save_pdf, priority=1)

def save_pdf(self, response):
    path = response.url.split('/')[-1]
    self.logger.info('Saving PDF %s', path)
    self.counter += 1
    with open(os.path.join(self.folder, str(self.counter)), 'wb') as file:
        file.write(response.body)
How can I build a Scrapy request that bypasses the Selenium middleware?
Answer 0 (score: 1)
Consider using the existing scrapy-selenium Scrapy extension. The way it works makes it easy to download specific URLs without Selenium: only requests made as SeleniumRequest go through the browser, while regular requests use Scrapy's default downloader.
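For illustration, a minimal sketch of that split (it assumes the extension's SELENIUM_* settings are already configured; the spider name and URL are placeholders):

import scrapy
from scrapy_selenium import SeleniumRequest

class MixedSpider(scrapy.Spider):
    name = 'mixed'  # placeholder

    def start_requests(self):
        # JS-heavy pages are rendered in the browser via SeleniumRequest...
        yield SeleniumRequest(url='https://example.com', callback=self.parse)

    def parse(self, response):
        # ...while plain scrapy.Request objects bypass the Selenium
        # middleware and go to Scrapy's default downloader
        for href in response.css('a[href$=".pdf"]::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.save_pdf)

    def save_pdf(self, response):
        with open(response.url.split('/')[-1], 'wb') as f:
            f.write(response.body)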
Alternatively, don't use Selenium at all. What people starting out with Scrapy want Selenium for can often be achieved without Splash or Selenium. See the answers to Can scrapy be used to scrape dynamic content from websites that are using AJAX?
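The usual approach from that thread is to find the XHR endpoint the page calls (in the browser's network tab) and request it directly. A minimal sketch, where the endpoint URL and JSON shape are made-up examples:

import json
import scrapy

class ApiSpider(scrapy.Spider):
    name = 'api'  # placeholder
    # Hypothetical JSON endpoint discovered in the network tab
    start_urls = ['https://example.com/api/items?page=1']

    def parse(self, response):
        # The endpoint returns JSON, so no browser rendering is needed
        data = json.loads(response.text)
        for item in data.get('items', []):
            yield item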
Answer 1 (score: 0)
You can put a condition on request.url inside process_request and skip any Selenium processing there.
if request.url.endswith(('.pdf', '.PDF')):
    return None  # skip Selenium; fall through to the default downloader
Returning None passes the request on to the next middleware (and eventually Scrapy's default downloader), or you could download the file right there and return a response yourself.
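Put together with the middleware from the question, a minimal sketch (behavior for non-PDF URLs is unchanged):

def process_request(self, request, spider):
    # PDF requests skip Selenium entirely; Scrapy's default downloader
    # fetches them, and save_pdf receives the raw response
    if request.url.endswith(('.pdf', '.PDF')):
        return None
    self.driver.get(request.url)
    body = to_bytes(self.driver.page_source)
    return HtmlResponse(self.driver.current_url, body=body,
                        encoding='utf-8', request=request)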