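I'm trying to write a Scrapy spider that uses Selenium to reach some JavaScript-rendered content on the page I'm scraping. I've managed to open the page with Selenium and wait for the content to appear. Now I want to build a Scrapy TextResponse from the fully loaded page. My code looks like this (I removed the URL and selector strings, since they don't matter):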
import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class EexSpider(scrapy.Spider):
    name = "eex"
    allowed_domain = ["..."]
    start_urls = ["..."]

    def __init__(self):
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        self.driver.get(response.url)
        wait = WebDriverWait(self.driver, 10)
        element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '...')))
        # this is where things go wrong
        print response.url  # prints the correct url
        text_response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        # NameError: name 'response' is not defined
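When I run the spider, I get the error NameError: name 'response' is not defined on the line that calls the TextResponse constructor. The strange thing is that I can print response.url successfully on the line just before it.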
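Does anyone know why this happens?
P.S. Let me know if you want to see the stack trace; I just didn't want to make the question any longer.
Disclaimer: I'm a complete Python noob ;-)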
Answer 0 (score: 1)
Check that the line containing TextResponse is indented correctly.
For example, if I have the following code:
import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class EexSpider(scrapy.Spider):
    name = "eex"
    allowed_domain = ["google.com"]
    start_urls = ["http://google.com"]

    def __init__(self):
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        self.driver.get(response.url)
    # indented one level too shallow: this line belongs to the class body, not to parse()
    text_response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
I get exactly the same error:
NameError: name 'response' is not defined
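The same effect can be reproduced without Scrapy or Selenium. The sketch below is only an illustration: `Demo`, `parse`, and `define_misindented_class` are made-up names, not part of the spider above, and it assumes Python 3. A line that sits at class-body level runs while the class is being defined, where the method parameter `response` does not exist yet, so Python raises the NameError before any spider code runs. If the indentation looks right in your editor, mixed tabs and spaces could produce the same result, though that is a guess about your file.

```python
def define_misindented_class():
    """Define a class with one line dedented out of its method.

    Returns the NameError message raised while the class body executes,
    or None if no error occurs.
    """
    try:
        class Demo:
            def parse(self, response):
                # Inside the method, `response` is a parameter, so this is fine.
                url = response.url

            # One level too shallow: this line is part of the class body and
            # runs at class-definition time, where `response` is not defined.
            text = response.url
    except NameError as exc:
        return str(exc)
    return None


error_message = define_misindented_class()
print(error_message)
```

Moving the dedented line back to the same indentation as the other statements in `parse` makes the error go away, because `response` is then looked up as the method's parameter.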