Question

I am using Scrapy and Python to extract data.

The data sometimes contains extra whitespace. I am using normalize-space with XPath to remove it:

xpath('normalize-space(.//li[2]/strong/text())').extract()

This works fine. However, I now want to use normalize-space together with CSS selectors.

I tried:

car['Location'] = site.css('normalize-space(div[class=location]::text)').extract()

If I remove normalize-space I get the correct result; with it, I get an empty result.

How can I use it with CSS selectors?

I also tried this helper function:

import re

def normalize_whitespace(text):
    # Trim the ends, then collapse each whitespace run into one space.
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)
    return text

I call the function like this:

car['Location'] = normalize_whitespace(site.css('div[class=location]::text').extract())

But I get an empty result. Why?

Answer 1

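Unfortunately, XPath functions cannot be used inside CSS selectors in Scrapy.

You could first translate the div[class=location]::text CSS selector into its equivalent XPath expression, then pass that expression as the argument to normalize-space() inside a call to .xpath().

In any case, since you are only interested in the final whitespace-normalized string, you can get the same effect by applying a Python function to the output extracted by the CSS selector.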

import re

def normalize_whitespace(text):
    # Trim the ends, then collapse each whitespace run into one space.
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)
    return text

If you include this function somewhere in your Scrapy project, you can use it like this:

    car['Location'] = normalize_whitespace(
        u''.join(site.css('div[class=location]::text').extract()))

or

    car['Location'] = normalize_whitespace(
        site.css('div[class=location]::text').extract()[0])
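
Both forms pass a single string to the function because .extract() returns a list of strings, one per matched text node, and a list has no .strip() method. The following is a minimal sketch of the difference between the two forms. The `extracted` list and its contents are made-up sample data standing in for the real output of site.css(...).extract(), and the function is a compact restatement of the helper above.

```python
import re

def normalize_whitespace(text):
    """Trim the ends and collapse each internal whitespace run to one space."""
    return re.sub(r'\s+', ' ', text.strip())

# Hypothetical stand-in for site.css('div[class=location]::text').extract():
# a list with one string per matched text node.
extracted = ['\n   New York,  ', '\n  NY \n']

# Form 1: join every text node, then normalize the combined string.
joined = normalize_whitespace(u''.join(extracted))

# Form 2: normalize only the first text node.
first_only = normalize_whitespace(extracted[0])

print(repr(joined))      # 'New York, NY'
print(repr(first_only))  # 'New York,'
```

If the selector matches nothing, the joined form yields an empty string, while indexing with [0] raises IndexError, so the joined form is the safer default when a match is not guaranteed.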

Answer 2

css() compiles its selector to XPath, so you can chain an xpath() call onto it to normalize the whitespace. Change your code to:

car['Location'] = site.css('div[class=location]').xpath('normalize-space(text())').extract()