Question

I am currently learning how to use XPath together with Python Scrapy to scrape websites, and I am stuck on the following:

I am looking at a Dutch website, http://www.ah.nl/producten/bakkerij/brood, and I want to scrape the product names:

Ultimately I want a CSV file containing the article names of all these breads. When I inspect the element, I see how these names are defined:

I need to find the right XPath to extract "AH Tijgerbrood bruin heel". So I think my spider should look like this:

import scrapy
from stack.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "ah"
    allowed_domains = ["ah.nl"]
    start_urls = ['http://www.ah.nl/producten/bakkerij/brood']
    def parse(self, response):
        for sel in response.xpath('//div[@class="product__description small-7 medium-12"]'):
            item = DmozItem()
            item['title'] = sel.xpath('h1/text()').extract()
            yield item

When I run this spider, I get no results. I don't know what I am missing here.

Answer 1

You have to use Selenium for this task, because all the elements are loaded by JavaScript:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# Implicit wait: element lookups keep retrying for up to 40 s so the page can load.
# The value is arbitrarily large and can be toned down.
driver.implicitly_wait(40)
driver.get("http://www.ah.nl/producten/bakkerij/brood")
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which Selenium 4 removed
elements = driver.find_elements(By.XPATH, '//*[local-name()= "div" and @class="product__description small-7 medium-12"]//*[local-name()="h1"]')
for elem in elements:
    print(elem.text)
driver.quit()
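Since the question asks for a CSV file in the end, here is a minimal sketch of how the collected names could be written out with the standard csv module. The helper name write_titles_csv and the second sample title are made up for illustration; in practice the list would come from something like [elem.text for elem in elements].

```python
import csv
import io


def write_titles_csv(titles, out_file):
    """Write one product title per row under a 'title' header."""
    writer = csv.writer(out_file)
    writer.writerow(["title"])
    for title in titles:
        cleaned = title.strip()
        if cleaned:  # skip entries that are empty after trimming
            writer.writerow([cleaned])


# Hypothetical sample data; only the first name appears in the question.
sample_titles = ["AH Tijgerbrood bruin heel", "   ", "Voorbeeld brood "]

buffer = io.StringIO()  # in-memory stand-in for a real file
write_titles_csv(sample_titles, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the same function can be passed open("brood.csv", "w", newline="", encoding="utf-8"), where the file name is again just an example.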

Answer 2

title = response.xpath('//div[@class="product__description small-7 medium-12"]/h1/text()').extract_first()
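A path like this can be sanity-checked offline against a small hand-written snippet before running a spider. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text() step, so the text is read from the element instead). The markup is an assumed stand-in for the real page, which may look different, and the function name product_titles is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup; the live page is not guaranteed to match it.
SAMPLE = """
<html><body>
  <div class="product__description small-7 medium-12"><h1>AH Tijgerbrood bruin heel</h1></div>
  <div class="other"><h1>Not a product</h1></div>
</body></html>
"""


def product_titles(markup):
    """Return the text of each h1 directly inside a product description div."""
    root = ET.fromstring(markup.strip())
    # Exact match on the full class attribute, then the direct h1 child.
    path = './/div[@class="product__description small-7 medium-12"]/h1'
    return [h1.text.strip() for h1 in root.findall(path) if h1.text]


titles = product_titles(SAMPLE)
print(titles)
```

If the same path returns nothing for the response Scrapy downloads, that points to the elements being absent from the raw HTML, which fits the JavaScript explanation in Answer 1.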

XPath selectors in Python Scrapy
