Question

I think I'm using Scrapy wrong, but I'm trying to use XPath to select only the text of each H2 on the page, with the inner markup removed.

For example:

<h2>Welcome to my <a href="#">page</a></h2>
<h2>Welcome to my Page</h2>

I tried //h2//text(), but it produces an array like this:

item["h2s"] = response.xpath('//h2//text()').extract()

['Welcome to my',
'page',
'Welcome to my Page']

I've tried many combinations, but I can't seem to get the array I want:

['Welcome to my page',
'Welcome to my Page']

Answer 1

You can join all the text nodes of each h2:

In [1]: [''.join(h2.xpath(".//text()").extract()) for h2 in response.xpath("//h2")]
Out[1]: [u'Welcome to my page', u'Welcome to my Page']
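
To see why the per-element join works, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree as a stand-in for response.xpath, so it is not the Scrapy API itself. The wrapping <div> and the variable names are assumptions made for illustration. itertext() plays the role of .//text() here.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a <div> so it parses as one tree.
# (The wrapper and the names below are illustrative, not part of Scrapy.)
html = (
    '<div>'
    '<h2>Welcome to my <a href="#">page</a></h2>'
    '<h2>Welcome to my Page</h2>'
    '</div>'
)
root = ET.fromstring(html)

# Flat selection: every text node under every h2, comparable to '//h2//text()'.
# The <a> splits the first heading into two separate text nodes.
flat = [text for h2 in root.iter('h2') for text in h2.itertext()]

# Per-element join: one string per h2, comparable to the list comprehension above.
joined = [''.join(h2.itertext()) for h2 in root.iter('h2')]

print(flat)
print(joined)
```

The first text node keeps its trailing space ('Welcome to my '), which is why joining with an empty string gives a correctly spaced heading. In Scrapy itself, evaluating the XPath string(.) on each h2 selector should give the same concatenated text without an explicit join.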

This thread is also relevant: