Question

I am using Scrapy to scrape the results on this page (Booking). The goal is to get the URLs of all the hotels listed on the page.

This is what I put in my Scrapy spider:

def __init__(self, *args, **kwargs):
    super(BookingScoreSpider, self).__init__(*args, **kwargs)
    self.start_urls = [kwargs.get('start_url')]

def parse(self, response):
    global url
    print(response.xpath('normalize-space(//a[@class="hotel_name_link url"]/@href)'))
    for hotelurl in response.xpath('normalize-space(//a[@class="hotel_name_link url"]/@href)'):
        url = response.urljoin(hotelurl.extract())
        print(url)

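But the loop seems to contain only one element (the first hotel). The spider works and I get the URL of the first hotel, but the loop does not continue, as if the spider had found only one element with the class "hotel_name_link url". When I look at the page, though, I see many such elements.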

Can you help me?

Answer 1

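You are using XPath 1.0 (unfortunately, Scrapy relies on 20-year-old technology here), and in XPath 1.0, normalize-space() applied to a node-set ignores every node except the first one in the set. (In the current version, XPath 3.1, you could write //a[...] ! normalize-space(), which applies the function to each node and returns a sequence of strings.)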

Your best option is to drop normalize-space(), return the un-normalized nodes, and then process them in the calling code.
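As a rough illustration of "process them in the calling code", the sketch below normalizes each extracted href in Python instead of in XPath. The helper names normalize_space and clean_hrefs are made up for this example, and the hand-written raw_hrefs list stands in for what the spider would get from the selector without normalize-space().

```python
import re

# Whitespace characters that XPath's normalize-space() collapses:
# space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    return _XML_WS.sub(" ", text).strip(" ")


def clean_hrefs(raw_hrefs):
    """Normalize every extracted href and drop the ones that end up empty."""
    cleaned = (normalize_space(h) for h in raw_hrefs)
    return [h for h in cleaned if h]


# In the spider, raw_hrefs would come from something like:
#   response.xpath('//a[@class="hotel_name_link url"]/@href').getall()
# Here a hand-written list stands in for that result (assumed data).
raw_hrefs = [
    "\n/hotel/jp/noum-osaka.en-gb.html\n",
    "  /hotel/jp/the-lively-osaka.en-gb.html ",
]
hotel_paths = clean_hrefs(raw_hrefs)
```

Because the XPath now returns every matching @href, the loop sees all hotels, and the whitespace cleanup happens once per node on the Python side.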

Answer 2

When called on a node-set, normalize-space() is applied only to the first node.

If you only need to extract the links, a simple fix looks like this (indentation added by pretty-printing):

>>> response.xpath('//a[@class="hotel_name_link url"]').xpath('normalize-space(@href)').getall()
[
    '/hotel/jp/hoteruwbfnanbahei-men.en-gb.html',
    '/hotel/jp/the-lively-osaka.en-gb.html',
    '/hotel/jp/apahoteru-rizoto-yu-tang-jin-ben-ting-yi-tawa.en-gb.html',
    '/hotel/jp/apollo-couples-apartment-at-namba.en-gb.html',
    '/hotel/jp/dotonbori-apartment-next-jr-namba.en-gb.html',
    '/hotel/jp/sotetsu-fresa-inn-osaka-namba.en-gb.html',
    '/hotel/jp/chuan-huose-mei-guo-cun.en-gb.html',
    '/hotel/jp/da-ban-di-yi-hoteru.en-gb.html',
    '/hotel/jp/hotel-wbf-namba-bunraku.en-gb.html',
    '/hotel/jp/amp-and-hostel-honmachi-east.en-gb.html',
    '/hotel/jp/unizo-inn-shin-osaka.en-gb.html',
    '/hotel/jp/hotel-wbf-honmachi.en-gb.html',
    '/hotel/jp/chuan-flat-xin-zhai-qiao-dong.en-gb.html',
    '/hotel/jp/hiyori-hotel-osaka-namba.en-gb.html',
    '/hotel/jp/noum-osaka.en-gb.html'
]

If you need to follow these links, just pass the link nodes to response.follow() and it will do the right thing.

for link in response.xpath('//a[@class="hotel_name_link url"]'):
    yield response.follow(link)
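To make "the right thing" a little more concrete: the hrefs above are relative, so they have to be resolved against the URL of the page they came from before they can be requested. The sketch below reproduces just that resolution step with the standard library. PAGE_URL is a made-up search-results address and absolute_hotel_urls is a hypothetical helper, not part of Scrapy.

```python
from urllib.parse import urljoin

# Hypothetical URL of the search-results page the spider was started on.
PAGE_URL = "https://www.booking.com/searchresults.en-gb.html"


def absolute_hotel_urls(page_url, hrefs):
    """Resolve each (possibly relative) href against the page URL."""
    return [urljoin(page_url, href.strip()) for href in hrefs]


# Two of the relative links from the output shown earlier.
hrefs = [
    "/hotel/jp/the-lively-osaka.en-gb.html",
    "/hotel/jp/noum-osaka.en-gb.html",
]
hotel_urls = absolute_hotel_urls(PAGE_URL, hrefs)
```

Inside a spider there is no need to do this by hand, since response.follow() and response.urljoin() already resolve relative links against the response URL.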

XPath only gives me the first item, but I want all of them (using Scrapy)
