I've created a script in Python that uses Scrapy in combination with Selenium to parse the links of different restaurants from a landing page and then parse the name of each restaurant from its inner page.
How do callbacks (or passing arguments between methods) work when I use Scrapy in association with Selenium, given that no real request gets sent?
The following script uses self.driver.get(response.url), which overrides the callbacks, and I can't get rid of it:
import scrapy
from selenium import webdriver
from scrapy.crawler import CrawlerProcess
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

class YPageSpider(scrapy.Spider):
    name = "yellowpages"
    link = 'https://www.yellowpages.com/search?search_terms=Pizza+Hut&geo_location_terms=San+Francisco%2C+CA'

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)
        yield scrapy.Request(self.link, callback=self.parse)

    def parse(self, response):
        self.driver.get(response.url)
        for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".v-card .info a.business-name"))):
            yield scrapy.Request(elem.get_attribute("href"), callback=self.parse_info)

    def parse_info(self, response):
        self.driver.get(response.url)
        elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".sales-info > h1"))).text
        yield {"title": elem}

if __name__ == '__main__':
    c = CrawlerProcess()
    c.crawl(YPageSpider)
    c.start()
Answer 0 (score: 2)
The linked answer that @vezunchik already pointed to gets you almost there. The only problem is that, if you use that exact same code, you end up with multiple chromedriver instances. To reuse the same driver across requests, you can try something like the following.
Create a file named middleware.py within your project and paste the code below in it:
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    def __init__(self):
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        self.driver = webdriver.Chrome(options=chromeOptions)

    def process_request(self, request, spider):
        # Fetch the page with the shared headless driver and hand the
        # rendered source back to Scrapy as a regular HtmlResponse.
        self.driver.get(request.url)
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
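One thing to keep in mind: process_request hands back page_source as soon as get() returns, so content that is rendered late by JavaScript may still be missing. If a page needs it, you could add an explicit wait inside the middleware. Below is a minimal sketch (not part of the original answer), assuming a hypothetical "wait_for" CSS selector is passed through request.meta:

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class SeleniumWaitMiddleware(object):
    def __init__(self):
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        self.driver = webdriver.Chrome(options=chromeOptions)
        self.wait = WebDriverWait(self.driver, 10)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # "wait_for" is a hypothetical meta key used here for illustration;
        # it holds a CSS selector that must be present before we return.
        selector = request.meta.get('wait_for')
        if selector:
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, selector)))
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)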
Here is an update in case you want to watch how chromedriver traverses the pages in visible mode. To let the browser move visibly, try this instead:
from scrapy import signals
from selenium import webdriver
from scrapy.http import HtmlResponse
from scrapy.xlib.pydispatch import dispatcher

class SeleniumMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome()
        # Close the shared driver once the spider finishes.
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)

    def spider_closed(self):
        self.driver.quit()
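Note that scrapy.xlib.pydispatch has been deprecated for a while and is gone from newer Scrapy releases. If your version no longer ships it, the same driver cleanup can be wired up through the from_crawler hook and crawler.signals; a sketch:

from scrapy import signals
from selenium import webdriver
from scrapy.http import HtmlResponse

class SeleniumMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome()

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when building the middleware; use it to
        # connect the spider_closed signal instead of pydispatch.
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)

    def spider_closed(self):
        self.driver.quit()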
Use the following script to grab the required content. There will be only one request (navigation) per URL when Selenium is used through the middleware, so you can now use Selector() within the spider to fetch the data, like below.
import sys
# The hardcoded path points at your project location so that the
# middleware referenced in CrawlerProcess below can be imported.
sys.path.append(r'C:\Users\WCS\Desktop\yourproject')
import scrapy
from scrapy import Selector
from scrapy.crawler import CrawlerProcess

class YPageSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ['https://www.yellowpages.com/search?search_terms=Pizza+Hut&geo_location_terms=San+Francisco%2C+CA']

    def parse(self, response):
        items = Selector(response)
        for elem in items.css(".v-card .info a.business-name::attr(href)").getall():
            yield scrapy.Request(url=response.urljoin(elem), callback=self.parse_info)

    def parse_info(self, response):
        items = Selector(response)
        title = items.css(".sales-info > h1::text").get()
        yield {"title": title}

if __name__ == '__main__':
    c = CrawlerProcess({
        'DOWNLOADER_MIDDLEWARES': {'yourspider.middleware.SeleniumMiddleware': 200},
    })
    c.crawl(YPageSpider)
    c.start()
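As a side note, if the spider lives inside a regular Scrapy project you can drop the sys.path hack and the CrawlerProcess block, register the middleware once in settings.py, and launch the crawl with the scrapy command instead. A sketch, assuming your project package is named yourproject (the dotted path just has to match wherever your middleware.py actually lives):

# settings.py -- "yourproject" is a placeholder for your real package name
DOWNLOADER_MIDDLEWARES = {
    'yourproject.middleware.SeleniumMiddleware': 200,
}

Then start it from the project directory with: scrapy crawl yellowpages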
Answer 1 (score: 1)
Do you mean passing variables between functions? Why not use meta for that? It works either way, with or without Selenium. I use the same code as yours, with just two small updates:
def parse(self, response):
    self.driver.get(response.url)
    for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".v-card .info a.business-name"))):
        yield scrapy.Request(elem.get_attribute("href"),
                             callback=self.parse_info,
                             meta={'test': 'test'})  # <- pass anything here

def parse_info(self, response):
    self.driver.get(response.url)
    elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".sales-info > h1"))).text
    yield {"title": elem, 'data': response.meta['test']}  # <- getting it here
So it yields:
...
2019-05-16 17:40:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yellowpages.com/san-francisco-ca/mip/pizza-hut-473437740?lid=473437740>
{'data': 'test', 'title': u'Pizza Hut'}
...
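On Scrapy 1.7+ you can also pass values with cb_kwargs, which delivers them to the callback as keyword arguments instead of going through response.meta. A minimal sketch of the same two callbacks rewritten that way (the rest of the spider stays as above):

def parse(self, response):
    self.driver.get(response.url)
    for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".v-card .info a.business-name"))):
        yield scrapy.Request(elem.get_attribute("href"),
                             callback=self.parse_info,
                             cb_kwargs={'test': 'test'})  # <- arrives as a keyword argument

def parse_info(self, response, test):
    self.driver.get(response.url)
    elem = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".sales-info > h1"))).text
    yield {"title": elem, 'data': test}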