Question

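I'm using Scrapy to extract the text of news articles from news sites. I'm assuming that all text inside <p> tags is the actual article. (That's not necessarily a safe assumption, but it's the one I'm working with.) To find all the <p> tags, Scrapy lets me use a CSS selector like this: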

response.css("p::text")

The problem is that some news sites like to put a lot of markup inside their articles, like this:

<p>
    Senator <a href="/people/senator_whats_their_name">What&#39s-their-name</a> is <em>furious</em> about politics!
</p>

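Is there a CSS selector, or some other simple way in Scrapy, to extract the text with all the formatting stripped, so that it yields something like this?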

Senator What's-their-name is furious about politics!

The catch is that, in theory, these tags can be nested arbitrarily:

<p>
    <span class="some-annoying-markup"><a href="who cares"><em>Wow this link must be important </em></a></span>
<p>

From this I'd also want to extract the text

Wow this link must be important

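I know this is a very naive way to extract content from an HTML page, but that's beyond the scope of this question. If there's a simpler way to do this I'm open to suggestions, but what I've found on the topic seems far more complicated than what I've described here, so I'm only interested in solutions to the problem as I've posed it.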

Answer 1

In [6]: from scrapy import Selector

In [7]: sel = Selector(text='''<p>
   ...:     Senator <a href="/people/senator_whats_their_name">What&#39s-their-name</a> is <em>furious</em> about politics!
   ...: </p>''')

In [9]: sel.xpath('normalize-space(//p)').extract_first()
Out[9]: "Senator What's-their-name is furious about politics!"

Or:

In [10]: sel = Selector(text='''<p>
    ...:     <span class="some-annoying-markup"><a href="who cares"><em>Wow this link must be important </em></a></span>
    ...: <p>''')

In [11]: sel.xpath('normalize-space(//p)').extract_first()
Out[11]: 'Wow this link must be important'

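normalize-space() first converts its argument to a string, exactly as XPath's string() function would: it concatenates all the text under the node, however deeply the tags are nested. Given a node-set such as //p, only the first node in document order is used.

It then strips leading and trailing whitespace and collapses each internal run of whitespace into a single space.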
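
To make concrete what that combination computes, here is a minimal sketch that reproduces it with only Python's standard library. It is not how Scrapy does it; the names ParagraphText and paragraph_texts are made up for this illustration, and Python's str.split() treats a few more characters as whitespace than XPath does (non-breaking spaces, for example).

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p>, ignoring any tags nested inside it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &#39;
        self.paragraphs = []   # finished, whitespace-normalized paragraphs
        self._chunks = None    # text pieces of the <p> being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # a new <p> implicitly closes an unclosed one
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Text inside <a>, <em>, <span>, ... still arrives here, so nesting
        # depth does not matter.
        if self._chunks is not None:
            self._chunks.append(data)

    def _flush(self):
        if self._chunks is not None:
            # Concatenate (like string()), then trim and collapse whitespace
            # (like normalize-space()).
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
        self._chunks = None

    def close(self):
        super().close()
        self._flush()          # handle a <p> that was never closed


def paragraph_texts(html):
    """Return the normalized text of every non-empty <p> in `html`."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = '<p>\n    Senator <a href="/x">Somebody</a> is <em>furious</em>!\n</p>'
print(paragraph_texts(sample))  # ['Senator Somebody is furious!']
```

In a Scrapy callback the per-paragraph equivalent would presumably be [p.xpath("normalize-space(.)").get() for p in response.css("p")], which applies the function to each <p> in turn instead of only the first one.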
