Scrapy and Selenium work fine when everything sits in the same parse() function, but since I need to add more code to the parsing step, I want to split the scraping part out into a separate parse_data() function and reach it via Request() with a callback. The callback, however, never fires at all.
import time

from scrapy import Request, Spider, signals
from scrapy.http import TextResponse
from scrapy.selector import Selector
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver

# MyItem is the project's Item subclass, defined elsewhere and not shown here.

class MySpider(Spider):
    name = "myspider"
    start_urls = ["http://example.com/Data.aspx"]

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()
        # Close the browser when the spider finishes.
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        item = MyItem()
        self.driver.get(response.url)
        sel = Selector(response)
        buttons = len(self.driver.find_elements_by_xpath(
            "//input[@class='buttonRowDetails']"))
        for x in range(buttons):
            time.sleep(5)
            # Re-locate the buttons on every pass, since the click changes the DOM.
            button = self.driver.find_elements_by_xpath(
                "//input[@class='buttonRowDetails']")[x]
            button.click()
            time.sleep(5)
            # Wrap the Selenium-rendered page in a Scrapy response object.
            response = TextResponse(url=self.driver.current_url,
                                    body=self.driver.page_source,
                                    encoding='utf-8')
            print '\n\n\nHELLO FROM PARSE'
            yield Request(response.url, meta={'item': item},
                          callback=self.parse_data)

    def parse_data(self, response):
        item = response.meta['item']
        print '\n\nHELLO FROM PARSE_DATA'
Answer 0 (score: 0)
My guess is that your request is being filtered out because it goes to the same URL (Scrapy's duplicate-request filtering middleware is enabled by default). Turn the filtering off with the dont_filter argument:
yield Request(response.url,
              meta={'item': item},
              callback=self.parse_data,
              dont_filter=True)
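A side note beyond the fix above (my own sketch, not part of the original answer): yield Request(response.url, ...) re-downloads the page through Scrapy's downloader, so the DOM state produced by the Selenium clicks is discarded anyway. If parse_data() is meant to read the clicked-open details, one option is to skip the second request entirely and hand the Selenium-rendered TextResponse to parse_data() as a plain method call. The rendered name and the extra item parameter below are my additions:

    # A minimal sketch (assumption): delegate to parse_data() directly, so no
    # second request is scheduled, duplicate filtering never applies, and the
    # DOM state created by the Selenium clicks is preserved.
    def parse(self, response):
        item = MyItem()
        self.driver.get(response.url)
        buttons = len(self.driver.find_elements_by_xpath(
            "//input[@class='buttonRowDetails']"))
        for x in range(buttons):
            time.sleep(5)
            self.driver.find_elements_by_xpath(
                "//input[@class='buttonRowDetails']")[x].click()
            time.sleep(5)
            rendered = TextResponse(url=self.driver.current_url,
                                    body=self.driver.page_source,
                                    encoding='utf-8')
            # parse_data() yields items back up through parse().
            for result in self.parse_data(rendered, item):
                yield result

    def parse_data(self, response, item):
        # Populate the item from the rendered page here, then yield it.
        yield item

If you do want to keep the Request-based flow but disable filtering globally, recent Scrapy versions also let you set DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter' in settings.py, though the per-request dont_filter flag is usually the safer choice.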