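I am trying to scrape this website: https://www.albertacannabis.org/
To access its products, I have to authenticate the session. The problem is that I am not sure how authentication works in Scrapy, so I don't know what the best approach is for this site.
The login URL is: https://www.albertacannabis.org/login
I tried the following code: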
import scrapy
from scrapy.http import FormRequest


class AlbertaspiderSpider(scrapy.Spider):
    name = 'albertaspider'
    with open("./alberta_parsed_input.txt", "r") as f:
        start_urls = f.readlines()

    def parse(self, response):
        token = response.xpath('//*[@id="_CRSFform"]/input/@value').extract_first()
        return FormRequest.from_response(response,
                                         formdata={'__RequestVerificationToken': token,
                                                   'Password': 'foobar',
                                                   'UserName': 'foobar'},
                                         callback=self.scrape_pages)
        product_code = response.xpath('//*[@id="product-title-0"]/a/@href').extract_first()
        print(product_code)
And:
import scrapy
from scrapy.http import FormRequest
from scrapy.spiders.init import InitSpider


class AlbertaspiderSpider(scrapy.Spider):
    name = 'albertaspider'
    login_url = "https://www.albertacannabis.org/login"
    with open("./alberta_parsed_input.txt", "r") as f:
        start_urls = f.readlines()

    def init_request(self):
        return scrapy.Request(
            url=self.login_url,
            callback=self.login,
        )

    def login(self, response):
        yield scrapy.FormRequest.from_response(
            response=response,
            formid='__RequestVerificationToken',
            formdata={
                'UserName': 'foo@gmail.com',
                'Password': 'bar',
            },
            callback=self.initialized,
        )

    def parse(self, response):
        product_code = response.xpath('//*[@id="product-title-0"]/a/@href').extract_first()
        print(product_code)
Neither attempt worked. I have read several guides online, but I am not sure what I am doing wrong.