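I'm learning Python Scrapy and am trying to figure out why the callback function of a request is never executed. The workflow scrapes a website. When an item page is found, the program checks whether the page has an active login session. If not, it calls a function to log in. The problem is that the session expires over time, so my spider needs to log in again after a while. Any help or guidance would be appreciated.
The function I want to have called is the following: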
def retrylogin_parse(self, response):
    #FUNCTION NOT CALLED
    self.logger.debug("Re-Login attempted for url " + self.login_url)
    return [FormRequest.from_response(
        response,
        formid='login-form',
        formdata={'login[username]': self.username,
                  'login[password]': self.password},
        clickdata={"type": "submit"},
        callback=self.after_relogin)
    ]
I'm trying to determine why the following line of code does not call retrylogin_parse:
yield Request(self.login_url, dont_filter=True, callback=self.retrylogin_parse)
Here is the code:
import ....

class MySpider(CrawlSpider):
    name = "bot-help"
    allowed_domains = ['www.somewebsite.com']
    start_urls = ["https://www.somewebsite.com/category/subcategory.html"]
    reloginCurrentUrl = ""
    username = 'username'
    password = '1234'
    login_msg = "Welcome"
    login_url = "https://www.somewebsite.com/login/"

    rules = (
        Rule(LinkExtractor(allow=('html')), callback='item_page'),
    )

    def item_page(self, response):
        image_item = Item()
        self.logger.info("item_page Called")
        str1 = response.xpath("//p[@class='welcome-msg']/text()").extract_first()
        self.logger.info("Testing if Response is still logged in")
        self.logger.debug("Message: " + str1)
        if (str1.find(self.login_msg) == -1):
            self.logger.error("Session Lost! Must Login")
            self.reloginCurrentUrl = response.url
            image_item['manu_product_url'] = response.url
            self.logger.debug("reloginCurrentUrl: " + self.reloginCurrentUrl)
            #HERE IS WHERE I WANT TO RE-LOGIN
            x = self.start_relogin(response)
            self.logger.debug("Relogin Request completed")
            return
        else:
            self.logger.info("Login Session is alive")
            self.logger.info("worked")
            #SCRAPE DATA.....
            yield image_item

    def __init__(self, **kwargs):
        CrawlSpider.__init__(self, **kwargs)

    def start_relogin(self, response):
        self.logger.debug("start_relogin function called")
        x = response.url
        self.logger.debug("Login Url: " + self.login_url)
        yield Request(self.login_url, dont_filter=True, callback=self.retrylogin_parse)

    def retrylogin_parse(self, response):
        #FUNCTION NOT CALLED
        self.logger.debug("Re-Login attempted for url " + self.login_url)
        return [FormRequest.from_response(
            response,
            formid='login-form',
            formdata={'login[username]': self.username,
                      'login[password]': self.password},
            clickdata={"type": "submit"},
            callback=self.after_relogin)]

    def after_relogin(self, response):
        self.logger.info("Post Re-Login Attempted")
        str1 = response.xpath("//p[@class='welcome-msg']/text()").extract_first()
        if (str1.find(self.login_msg) == -1):
            self.logger.info("Re-Login failed")
            return
        else:
            self.logger.info("Re-Login successful will continue to parse")
            return [Request(url=self.reloginCurrentUrl)]
Here is the debug output:
DEBUG: Crawled (200) <GET https://www.somewebsite.com/robots.txt> (referer: None)
2018-10-30 23:49:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.somewebsite.com/category/subcategory.html> (referer: None)
2018-10-30 23:49:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.somewebsite.com/category/abc-46467.html> (referer: https://www.somewebsite.com/category/subcategory.html)
2018-10-30 23:49:35 [foagroupbot-help] INFO: item_page Called
2018-10-30 23:49:35 [foagroupbot-help] INFO: Testing if Response is still logged in
2018-10-30 23:49:35 [foagroupbot-help] DEBUG: Message: Login Please!
2018-10-30 23:49:35 [foagroupbot-help] ERROR: Session Lost! Must Login
2018-10-30 23:49:35 [foagroupbot-help] DEBUG: reloginCurrentUrl: https://www.somewebsite.com/category/abc-46467.html
2018-10-30 23:49:35 [foagroupbot-help] DEBUG: Relogin Request completed
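One thing worth noting about the log above: "start_relogin function called" never appears, even though "Relogin Request completed" does. That would be consistent with start_relogin being a generator function (it contains yield), so calling it only creates a generator object and its body does not run until something iterates it. The sketch below reproduces that behavior in plain Python without Scrapy. The function names item_page_original and item_page_delegating, the calls list, and the string standing in for the Request are all hypothetical and exist only for illustration.

```python
# Plain-Python stand-ins for the spider methods; no Scrapy required.
calls = []  # records which function bodies actually ran


def start_relogin():
    # Contains `yield`, so calling it only builds a generator object.
    calls.append("start_relogin")
    yield "Request(login_url, callback=retrylogin_parse)"


def item_page_original():
    # Mirrors the question: the generator is created, assigned, and dropped.
    x = start_relogin()
    return
    yield  # unreachable; present only so this function is a generator too


def item_page_delegating():
    # Possible fix: pass every request from the helper on to the caller.
    yield from start_relogin()


original_output = list(item_page_original())
calls_after_original = list(calls)

delegating_output = list(item_page_delegating())
calls_after_delegating = list(calls)

print(original_output, calls_after_original)
print(delegating_output, calls_after_delegating)
```

If this is indeed the cause, replacing x = self.start_relogin(response) in item_page with yield from self.start_relogin(response) (or a for loop that yields each request) should hand the login Request back to Scrapy so it can be scheduled. This is untested against the actual site.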