Question

XPath statements:

item['title'] = response.xpath('//span[@class="title"]/text()').extract_first()
item['content'] = response.xpath('//div[@class="content"]').extract_first()

Result:

{
'title': '\t史蒂芬霍金',
'content': '<div class="content"><div>能够在过去这么多年的时间里研究并学习宇宙学<br>\r\n对我来说意义非凡</div></div>'
}

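Questions:

1. How do I remove the \t from the title field?
2. How do I remove the <div class="content"></div> wrapper from the content field? (The child nodes cannot be removed.)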

Answer 1

For the title, you can use Python's strip():

item['title'] = response.xpath(
                    '//span[@class="title"]/text()').extract_first().strip()
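One caveat with the line above: extract_first() returns None when the XPath matches nothing, so calling .strip() on the result directly would raise an AttributeError on pages without a title. A minimal sketch of a guard follows; clean_title is a hypothetical helper name used only for illustration and is not part of Scrapy.

```python
def clean_title(raw):
    """Strip surrounding whitespace, tolerating a missing value (None)."""
    # extract_first() yields None when the selector matches nothing,
    # so fall back to an empty string instead of calling None.strip().
    return raw.strip() if raw is not None else ""


# The title from the question, and the "no match" case.
print(clean_title("\t史蒂芬霍金"))
print(repr(clean_title(None)))
```

In a spider this would wrap the existing call, for example item['title'] = clean_title(response.xpath('//span[@class="title"]/text()').extract_first()). Passing a default such as extract_first(default='') before calling .strip() should achieve the same effect.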

For the content, you can chain your selector with XPath's string() or normalize-space():

item['content'] = response.xpath(
                      '//div[@class="content"]').xpath('string(.)').extract_first()

Answer 2

item['content'] = response.xpath('string(//div[@class="content"])').extract_first()

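string() returns the string value of the node, that is, the concatenation of all the text inside it, with the tags dropped.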

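If you also want to get rid of the whitespace, you can use normalize-space(), which behaves like Python's strip() applied on top of string(); it additionally collapses each internal run of whitespace into a single space: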

item['content'] = response.xpath('normalize-space(//div[@class="content"])').extract_first()
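To see what these two XPath functions compute without running a spider, here is a rough standard-library emulation. It is only a sketch: string_value and normalize_space are hypothetical helper names, the sample markup is modelled on the question's output, and <br> is written as <br/> because ElementTree requires well-formed XML. In Scrapy itself the XPath expressions above do this work.

```python
import re
import xml.etree.ElementTree as ET

# Sample modelled on the question's output; <br/> is self-closed only
# because ElementTree needs well-formed XML (an assumption of this sketch).
SAMPLE = (
    '<div class="content"><div>能够在过去这么多年的时间里研究并学习宇宙学<br/>\r\n'
    '对我来说意义非凡</div></div>'
)

# XPath 1.0 treats only these four characters as whitespace.
XPATH_WS = " \t\r\n"


def string_value(element):
    """Rough equivalent of XPath string(.): join all descendant text nodes."""
    return "".join(element.itertext())


def normalize_space(text):
    """Rough equivalent of XPath normalize-space(): trim, then collapse runs."""
    return re.sub("[ \t\r\n]+", " ", text.strip(XPATH_WS))


root = ET.fromstring(SAMPLE)
print(repr(string_value(root)))                   # tags gone, line break kept
print(repr(normalize_space(string_value(root))))  # line break becomes one space
```

The first result keeps the line break that followed the <br>, while the second replaces it with a single space, which is usually what you want to store in an item field.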

About XPath when using Scrapy
