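Hello, I am using Scrapy to log in to some random websites. I followed the Scrapy tutorial, but it does not seem to work. When I tried it, I noticed "isAuthenticated": False. The HTML body I get back does not contain everything the actual website shows. I am not sure what the problem is. I thought it was the CSRF token, but after some research I found that Scrapy is supposed to handle it. The code is below. Any suggestions?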
import scrapy
import sys
from scrapy import Spider
from scrapy import Request


class IvanaSpider(Spider):
    name = 'ivanaSpider'

    def start_requests(self):
        return [scrapy.FormRequest(
            'https://bitbucket.org/account/signin/?next=/',
            formdata={'username': 'username', 'password': 'password',
                      'form_build_id': 'form - v14V92zFkSSVFSerfvWyH1WEUoxrV2khjfhAETJZydk',
                      'form_id': 'account_api_form',
                      'op': 'Sign in'
                      },
            callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        # (response.text is the decoded body, so a str comparison works)
        if "It's recommended that you log in" in response.text:
            print("------------------------------------------")
            self.logger.error("Login failed")
            return
        # continue scraping with authenticated session...
        encoding = sys.stdout.encoding or 'utf-8'
        for line in response.xpath('//body').extract():
            # replace characters the console cannot display
            print(line.encode(encoding, errors='replace').decode(encoding))
Answer 0 (score: 0)
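To log in to a website you need to use FormRequest, but some sites, bitbucket among them,
rely on predefined form fields such as a CSRF token, session information and other tokens that are only available on the page the user visited just before submitting (here, the sign-in page).
In that case, use Scrapy's FormRequest.from_response method: it collects all the predefined params from the form in the response and posts them together with your formdata.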
# For example
import scrapy
import sys
from scrapy import Spider
from scrapy import Request


class IvanaSpider(Spider):
    name = 'ivanaSpider'
    # fetch the sign-in page first so its hidden form fields are available
    start_urls = (
        'https://bitbucket.org/account/signin/?next=/',
    )

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response=response,
            formdata={"username": "<your username>",
                      "password": "<your password>"},
            # formname="login",  # the page apparently has several social login forms, so select one by XPath (form id) instead
            formxpath=".//form[@id='aid-login-form']",
            callback=self.after_login,
            dont_click=True,
        )

    def after_login(self, response):
        # check login succeed before going on
        # (response.text is the decoded body, so a str comparison works)
        if "It's recommended that you log in" in response.text:
            print("------------------------------------------")
            self.logger.error("Login failed")
            return
        # continue scraping with authenticated session...
        encoding = sys.stdout.encoding or 'utf-8'
        for line in response.xpath('//body').extract():
            # replace characters the console cannot display
            print(line.encode(encoding, errors='replace').decode(encoding))
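
To make the idea of collecting a form's predefined fields more concrete, here is a minimal standard-library sketch. It is not Scrapy's implementation, only an illustration of the principle. The sample HTML, the field names (csrfmiddlewaretoken, next, provider), the form id social-login, and the helpers FormFieldCollector and build_login_payload are all hypothetical; only the form id aid-login-form is taken from the spider code in this answer. The real Bitbucket page may look quite different.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> tags inside the form with a given id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_target_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Only inputs of the selected form count; other forms are skipped.
            self.in_target_form = attrs.get("id") == self.form_id
        elif tag == "input" and self.in_target_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_target_form = False


def build_login_payload(html, form_id, credentials):
    """Merge the fields already present in the page with the user's credentials."""
    collector = FormFieldCollector(form_id)
    collector.feed(html)
    payload = dict(collector.fields)
    payload.update(credentials)  # user-supplied values override page values
    return payload


# Hypothetical sign-in page with a social login form and the real login form.
SAMPLE_PAGE = """
<form id="social-login">
  <input type="hidden" name="provider" value="google">
</form>
<form id="aid-login-form" method="post">
  <input type="hidden" name="csrfmiddlewaretoken" value="token-from-page">
  <input type="hidden" name="next" value="/">
  <input type="text" name="username" value="">
  <input type="password" name="password">
</form>
"""

payload = build_login_payload(
    SAMPLE_PAGE,
    "aid-login-form",
    {"username": "alice", "password": "secret"},
)
print(urlencode(payload))
```

The hidden token in the payload comes from the page that was just fetched. A formdata dict hard-coded in the spider cannot supply it, since such tokens typically change from one session to the next; this is why the sign-in page has to be requested first.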