Question

I'm very new to this and have been trying to get my head around my first selector. Can anyone help? I'm trying to extract data from this page:

http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false

I want all the information under the div with class "list clearfix shelfListing", but I can't figure out how to write the response.xpath() call.

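I've managed to start the Scrapy shell, but no matter what I pass to response.xpath(), I can't seem to select the right node. I know it works, because when I type

>>> response.xpath('//div[@class="container"]')

I get a result back. However, I don't know how to navigate to the "list clearfix shelfListing" div. I'm hoping that once I've got this part, I can carry on and write the spider.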

PS: I was wondering whether this site simply can't be scraped. Is it possible that the owners are blocking spiders?

Answer 1

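The content of the div with the listings class (and id) is loaded asynchronously through an XHR request. In other words, the HTML that Scrapy fetches does not contain it: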

$ scrapy shell 'http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false'
>>> response.xpath('//div[@id="listings"]')
[]
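To make the difference concrete, here is a minimal sketch that uses only the standard library's ElementTree (which supports a small XPath subset) rather than Scrapy itself. Both markup snippets are hypothetical and heavily simplified; they only stand in for "what the server sends" and "what the browser shows after the XHR has completed".

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the page as served, before any JavaScript runs.
INITIAL_HTML = """
<html><body>
  <div class="container"></div>
</body></html>
""".strip()

# Hypothetical markup after the browser has fetched and inserted the listing.
RENDERED_HTML = """
<html><body>
  <div class="container">
    <div id="listings" class="listings">
      <a title="Kelloggs Rice Krispies">Kelloggs Rice Krispies</a>
    </div>
  </div>
</body></html>
""".strip()


def find_listings(markup):
    """Return every div whose id is 'listings' in the given markup."""
    root = ET.fromstring(markup)
    return root.findall(".//div[@id='listings']")


def link_titles(markup):
    """Return the title attribute of every titled link inside the listings div."""
    return [
        link.get("title")
        for div in find_listings(markup)
        for link in div.findall(".//a[@title]")
    ]


print(find_listings(INITIAL_HTML))   # nothing to select in the served HTML
print(link_titles(RENDERED_HTML))    # the links exist only once the page is rendered
```

The same idea applies to response.xpath(): the expression can be perfectly valid and still return an empty list when the node is simply not in the downloaded HTML.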

Using the browser developer tools, you can see a request to the http://groceries.asda.com/api/items/viewitemlist URL with a bunch of GET parameters.

One option is to simulate that request and parse the JSON it returns. How to do that is really a separate question.
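As a rough illustration of that first option, the sketch below builds the API URL and pulls names out of a JSON payload using only the standard library. The parameter name `example_param`, the keys `items` and `name`, and the sample payload are all placeholders invented for this example; the real parameters and response structure have to be copied from the request shown in the browser's network tab.

```python
import json
from urllib.parse import urlencode

API_URL = "http://groceries.asda.com/api/items/viewitemlist"


def build_request_url(params):
    """Append the GET parameters (copied from the browser's network tab) to the API URL."""
    return API_URL + "?" + urlencode(params)


def item_names(payload_text, list_key="items", name_key="name"):
    """Extract product names from a JSON payload.

    The key names are assumptions; adjust them to the real response.
    """
    payload = json.loads(payload_text)
    return [item[name_key] for item in payload.get(list_key, [])]


# Placeholder parameter, not one of the site's real parameters.
url = build_request_url({"example_param": "value"})

# Made-up payload that only mimics the general idea of a JSON item list.
sample = json.dumps({"items": [{"name": "Kellogg's Coco Pops"},
                               {"name": "Kelloggs Rice Krispies"}]})

print(url)
print(item_names(sample))
```

In a spider, the built URL would be what you request instead of the shelf page, and the callback would parse the response body as JSON rather than calling response.xpath().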

Here is one possible solution using the selenium package:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false')

# The product links live inside the asynchronously rendered div with id "listings".
for item in driver.find_elements(By.XPATH, '//div[@id="listings"]//a[@title]'):
    print(item.text.strip())

# Shut the browser down once the names have been printed.
driver.quit()

This prints:

Kellogg's Coco Pops
Kelloggs Rice Krispies
Kellogg's Coco Pops Croco Copters
...

My first Scrapy XPath selector
