Question

I want Scrapy to extract 'Round Size' in this case, but it turns out that Scrapy cannot capture any child nodes under dl.

response.xpath('//[@id="termsheet"]/div/section[1]/div/dl/li[2]/dt/span').extract()

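The XPath expression was generated by Chrome's inspector. I tested the expression separately, and it does capture the li tag. I have enabled Ajax in Scrapy, and it captures other dynamic items. Is there some other reason Scrapy would miss this data? Has anyone run into a similar problem?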


Answer 1

Your XPath and your extraction call are wrong. I can't explain much, but here is working code:

response.xpath('//*[@id="termsheet"]/div/section[1]/div/dl/li[2]/dt/span').extract_first()

This works unless the content is generated dynamically, in which case you have to use something like Selenium or scrapy-splash.

Answer 2

https://www.seedinvest.com/mf.fire/seed/termsheet loads "Round Size" using some JavaScript, with the data pulled from an API endpoint (in this case https://www.seedinvest.com/api/v1/entities/mf.fire/deal-fundraising-profile/seed). You can inspect the network requests in your browser's developer tools panel, for example in Chrome.

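The API endpoint returns the data as JSON (there is quite a lot of data in there!), so you can feed it to the standard library json module, as in the following example (using scrapy shell):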

$ scrapy shell https://www.seedinvest.com/api/v1/entities/mf.fire/deal-fundraising-profile/seed
2016-06-06 11:36:56 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
(...)
2016-06-06 11:36:58 [scrapy] DEBUG: Crawled (200) <GET https://www.seedinvest.com/api/v1/entities/mf.fire/deal-fundraising-profile/seed> (referer: None)
(...)
>>> import json
>>> d = json.loads(response.text)
>>> d['funding_round']['escrow_max']
1000000.0
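
The same lookup could be wrapped in a small helper for use outside the shell. This is only a sketch: the function name `extract_round_size` and the sample payload are made up for illustration, and it assumes the `funding_round` / `escrow_max` keys from the session above are where the value lives.

```python
import json


def extract_round_size(body_text):
    """Return funding_round.escrow_max from a JSON body, or None if absent."""
    data = json.loads(body_text)
    # "or {}" guards against a missing or null funding_round entry.
    funding_round = data.get("funding_round") or {}
    return funding_round.get("escrow_max")


# Made-up sample body containing only the two keys used in the shell session.
sample_body = '{"funding_round": {"escrow_max": 1000000.0}}'
print(extract_round_size(sample_body))
```

In a spider, a callback for a request to the API URL could presumably pass `response.text` to this helper, just as the shell session passes it to `json.loads`.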

Scrapy XPath not capturing tag

2 answers: