Extracting all text within a specific node using XPath and Scrapy

Time: 2019-01-22 18:22:25

Tags: python xpath scrapy

So I have this HTML:

<html>
<p>
   This is my first sentence
   <br>
   This sentance should be considered as part of the first one.
   <br>
   And this also
</p>
<p>
   This is the second sentence
</p>
</html>

I want to extract the text from the p nodes, with all the text inside one node returned as a single element. I am using scrapy shell, like this:

scrapy shell path/to/file.html
response.xpath('//p/text()').extract()

The output I get is:

[
'This is my first sentence',
'This sentance should be considered as part of the first one.',
'And this also',
'This is the second sentence'
]

The output I want:

[
 'This is my first sentence This sentance should be considered as part of the first one And this also',
 'This is the second sentence'
]

Any help on how to solve this with an XPath expression would be appreciated.

Thanks a lot :))))

2 answers:

Answer 0 (score: 1)

This solves the problem...

from w3lib.html import remove_tags
two_texts = response.xpath('//p').extract()
two_texts = [remove_tags(text) for text in two_texts]
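
Note that remove_tags only strips the markup, so the newlines and indentation from the source file most likely stay in each string. A minimal sketch of collapsing them into single spaces follows; the collapse_whitespace helper is a hypothetical name, and the sample strings are assumed stand-ins for what remove_tags leaves behind, not captured output:

```python
def collapse_whitespace(text):
    """Collapse every run of whitespace (spaces, newlines, tabs) into one space."""
    # str.split() with no arguments splits on whitespace runs and drops empty pieces,
    # so leading and trailing whitespace disappears as well.
    return ' '.join(text.split())


# Assumed stand-ins for the two strings left after the tags are removed.
two_texts = [
    '\n   This is my first sentence\n   \n'
    '   This sentance should be considered as part of the first one.\n   \n'
    '   And this also\n',
    '\n   This is the second sentence\n',
]

cleaned = [collapse_whitespace(text) for text in two_texts]
print(cleaned)
```

In the scrapy shell the same helper could be applied directly to the two_texts list built above.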

Answer 1 (score: 1)

Alternatively, as suggested in the comments, you can use ' '.join() and avoid w3lib:

paragraphs = response.css('p')
paragraphs = [' '.join(p.xpath('./text()').getall()) for p in paragraphs]