Question

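I am trying to use a Python scraper to collect some information from a specific website, namely the links to certain products. The site I am looking at is http://www.ah.nl/producten/verse-kant-en-klaar-maaltijden-salades and the links I am looking for are the ones in its "Maaltijdsalades" list.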

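If you visit this site and inspect the "Maaltijdsalades" element, you can see that, in XPath terms, the links sit under //ul/li. The problem is that the same HTML uses //ul/li in another place as well, for links I do not want. The spider below grabs exactly those unwanted links.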

I am using the following spider:

import scrapy

from ah_links.items import AhLinksItem

class AhSpider(scrapy.Spider):
    name = "ah_links"
    allowed_domains = ["ah.nl"]
    start_urls = ['http://www.ah.nl/producten/aardappel-groente-fruit']

    def parse(self, response):
        # Matches every <li> under any <ul> on the page, wanted or not
        for sel in response.xpath('//ul/li'):
            item = AhLinksItem()
            item['title'] = sel.xpath('a/@href').extract()
            yield item

I need help solving this problem. Thanks.

Answer 1

As I understand it, you should search the list inside the subcategory block:

for sel in response.css('nav.subcategorynav li'):
    item = AhLinksItem()
    item['title'] = sel.xpath('.//a/@href').extract()
    yield item

Here I am using a CSS selector, but you can also solve it with XPath:

response.xpath('//nav[contains(@class, "subcategorynav")]//li')
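To see why scoping the query to the nav block helps, here is a minimal sketch on made-up markup. The HTML snippet, its link targets, and the `hrefs` helper are assumptions for illustration only, not the real ah.nl page. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so the class is matched exactly instead of with `contains()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two <ul> lists, only one inside the nav block.
SAMPLE = """
<html><body>
  <ul class="footer">
    <li><a href="/over-ah">Over AH</a></li>
    <li><a href="/contact">Contact</a></li>
  </ul>
  <nav class="subcategorynav">
    <ul>
      <li><a href="/producten/maaltijdsalades">Maaltijdsalades</a></li>
      <li><a href="/producten/pizza">Pizza</a></li>
    </ul>
  </nav>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)


def hrefs(items):
    """Collect the href of each <a> that is a direct child of the given <li> elements."""
    return [a.get("href") for li in items for a in li.findall("a")]


# Unscoped: every <li> under any <ul>, so the footer links come along too.
all_links = hrefs(root.findall(".//ul/li"))

# Scoped: only <li> elements somewhere below the subcategory nav block.
nav_links = hrefs(root.findall(".//nav[@class='subcategorynav']//li"))

print(all_links)
print(nav_links)
```

On a real page the class attribute may hold several class names, which is why the Scrapy version above uses `contains(@class, "subcategorynav")` or the CSS selector instead of an exact match.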

Answer 2

Try:

item['title'] = sel.xpath("./a/@href").extract()
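For reference, `a/@href` and `./a/@href` both look only at direct children of the current node, while `.//a/@href` (as in Answer 1) looks at any depth. A small sketch of that difference follows, using an invented `<li>` element and the standard library's `xml.etree.ElementTree`. That module cannot select attributes inside the path, so the attribute is read with `.get()` afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical list item: one link directly under <li>, one nested in a <span>.
li = ET.fromstring(
    '<li><a href="/direct">Direct</a>'
    '<span><a href="/nested">Nested</a></span></li>'
)

# "./a" selects only <a> elements that are direct children of <li>.
direct = [a.get("href") for a in li.findall("./a")]

# ".//a" selects <a> elements at any depth below <li>.
nested = [a.get("href") for a in li.findall(".//a")]

print(direct)
print(nested)
```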

Edit: this works as expected:

import requests
from lxml.html import fromstring
response = requests.get("http://www.ah.nl/producten/aardappel-groente-fruit")
parsed_response = fromstring(response.text)
for item in parsed_response.xpath(".//ul/li"):
    print(item.xpath("a/@href"))

Cannot get the correct result using XPath and a Python scraper
