Question

I'm using Scrapy with Python and want to use selectors to get all the words inside the HTML tags. For example, I have this page:

<!DOCTYPE html>
<html>
<head>
    <title>My Page</title>
</head>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

<div>Hello 
    <span>World!<b>Yes it is</b></span>
</div>

</body>
</html>

From it, I need to get all the words, either as a string or as a list:

"My Page My First Heading My first paragraph. Hello World! Yes it is"

or

["My", "Page", "My", "First", "Heading", "My", "first", "paragraph.", "Hello", "World!", "Yes", "it", "is"]

or even the words without punctuation.

How can I do this? I tried response.selector.xpath('//text()').extract(), but I get many unwanted results, such as empty strings, newline characters (\n), and so on.

Answer 1

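response.xpath('//text()').extract() is a good way to approach this. You just need the power of Scrapy's input and output processors (from Item Loaders) to filter out empty items, strip whitespace, and so on.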
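As a rough sketch of the kind of cleanup such processors would perform, the plain-Python function below strips each extracted text node and drops the ones that end up empty. The name clean_text_nodes and the sample list are made up for illustration; in a spider, the input would be the list returned by response.xpath('//text()').extract().

```python
def clean_text_nodes(texts):
    """Strip surrounding whitespace from each text node and drop empty ones."""
    stripped = (t.strip() for t in texts)
    return [t for t in stripped if t]


# Hypothetical sample resembling what '//text()' might return for the page above.
sample = ["\n", "My Page", "\n\n", "My First Heading", "Hello \n    ", "World!", "Yes it is", "\n"]
print(clean_text_nodes(sample))
```

The same function could plausibly be wired into an Item Loader as a processor, but it works just as well when called directly on the extracted list.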

Alternatively, you can use re:test() to require that the text contain at least one alphanumeric character (\w):

response.xpath(r'//text()[re:test(., "\w+")]').extract()

Example:

In [1]: map(unicode.strip, response.xpath('//text()[re:test(., "\w+")]').extract())
Out[1]: 
[u'My Page',
 u'My First Heading',
 u'My first paragraph.',
 u'Hello',
 u'World!',
 u'Yes it is']
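To get from these stripped chunks to the single string or word list the question asks for, something along the following lines might work. This is only a sketch: to_words and its strip_punctuation flag are hypothetical names, and "punctuation" is assumed to mean the ASCII characters in string.punctuation at the start or end of a word.

```python
import string


def to_words(chunks, strip_punctuation=False):
    """Split text chunks into words, optionally trimming punctuation from each word."""
    words = " ".join(chunks).split()
    if strip_punctuation:
        # Trim leading/trailing punctuation and drop words that become empty.
        words = [w.strip(string.punctuation) for w in words]
        words = [w for w in words if w]
    return words


chunks = ["My Page", "My First Heading", "My first paragraph.", "Hello", "World!", "Yes it is"]
print(" ".join(to_words(chunks)))                 # one space-separated string
print(to_words(chunks, strip_punctuation=True))   # list of words without punctuation
```

Joining before splitting means a chunk containing internal newlines or repeated spaces is still broken into clean words.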
