Question

I'm new to Scrapy and have only just started looking into XPath.

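I'm trying to extract the titles and links from HTML list items inside a div. The code below is how I thought it should be done (select the ul inside the div by id, then loop over its list items):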

def parse(self, response):
    for t in response.xpath('//*[@id="categories"]/ul'):
        for x in t.xpath('//li'):
            item = TgmItem()
            item['title'] = x.xpath('a/text()').extract()
            item['link'] = x.xpath('a/@href').extract()
            yield item

But I get the same result as with this attempt:

def parse(self, response):
    for x in response.xpath('//li'):
        item = TgmItem()
        item['title'] = x.xpath('a/text()').extract()
        item['link'] = x.xpath('a/@href').extract()
        yield item

The exported CSV file contains the li data from the entire page source, top to bottom...

I'm no expert and I've made several attempts. If anyone could shed some light on this, I'd appreciate it.

Answer 1

You need to start the XPath expression used inside the inner loop with a dot:

for t in response.xpath('//*[@id="categories"]/ul'):
    for x in t.xpath('.//li'):

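An expression that begins with // is evaluated from the document root, even when it is called on a nested selector. The leading dot makes it search within the scope of the current element rather than the whole page.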

See "Working with relative XPaths" in the Scrapy documentation for more explanation.
