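I'm writing a scrapy-splash program and need to click the Display button on the page (shown in the image below) so that the 10th Edition data is displayed and can be scraped. Below is the code I tried, but it doesn't work. The information I need is only accessible after clicking the Display button. Update: I'm still struggling with this, and I have to believe there is a way to do it. I don't want to scrape the JSON, because that could be a red flag to the site owner.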
import scrapy
from ..items import NameItem

class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'email123@example.com', 'ex_usr_pass': 'password123'},
            callback=self.after_login
        )

    def after_login(self, response):
        item = NameItem()
        display_button = response.xpath('//a[contains(., "- Display>>")]/@href').get()
        response.follow(display_button, self.parse)
        item["Name"] = response.css("div.bl-result-title::text").get()
        return item
Answer 0 (score: 7)
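Your code doesn't work because there is no anchor element with an href attribute to follow. Clicking the button sends an XMLHttpRequest to http://www.starcitygames.com/buylist/search?search-type=category&id=5061, and the data you want is in the JSON response.

To inspect the request and response, open your browser's developer tools, switch to the Network tab, and click Display. In the Headers tab you will find the request URL; in the Preview or Response tab you can inspect the JSON.

You need the category id to build the request URL. You can find it by parsing the script element matched by the XPath //script[contains(., "categories")].

Your spider can then send a request to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and get the data you need:

$ curl 'http://www.starcitygames.com/buylist/search?search-type=category&id=5061'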
{"ok":true,"search":"10th Edition","results":[[{"id":"46269","name":"Abundance","subtitle":null,"condition":"NM\/M","foil":true,"is_parent":false,"language":"English","price":"20.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"},{"id":"176986","name":"Abundance","subtitle":null,"condition":"PL","foil":true,"is_parent":false,"language":"English","price":"12.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"}....
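The same request URL and response handling can be sketched in plain Python. This is only an illustration: the helper names `build_search_url` and `extract_cards` are hypothetical, and the sample string is a trimmed copy of the curl output above, on the assumption that `results` is always a list of groups of card objects.

```python
import json
from urllib.parse import urlencode

SEARCH_URL = "http://www.starcitygames.com/buylist/search"


def build_search_url(category_id):
    """Return the category search URL for one category id (hypothetical helper)."""
    return SEARCH_URL + "?" + urlencode({"search-type": "category", "id": category_id})


def extract_cards(body):
    """Flatten the nested "results" list of a search response into card dicts."""
    payload = json.loads(body)
    return [card for group in payload.get("results", []) for card in group]


# Trimmed sample based on the curl output above (some fields omitted).
sample = (
    r'{"ok":true,"search":"10th Edition","results":[['
    r'{"id":"46269","name":"Abundance","condition":"NM\/M","foil":true,"price":"20.000"},'
    r'{"id":"176986","name":"Abundance","condition":"PL","foil":true,"price":"12.000"}]]}'
)

print(build_search_url("5061"))
for card in extract_cards(sample):
    print(card["name"], card["condition"], card["price"])
```

Building the query string with `urlencode` rather than string concatenation keeps the URL valid if an id ever contains characters that need escaping.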
As you can see, you don't even need to log in to the website or use Splash.