Question

I'm using var tmp_title = $('title').text(); with cheerio.js to get the title from a page.

Question: is there any way to normalize the string or remove whitespace characters such as \n\t or \n?

Example

\n\t defense.gov news article: thousands lay wreaths at arlington cemetery gravesites\n

to

Thousands lay wreaths at arlington cemetery gravesites

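Or is there another way to get the title from a page? How does Google do it? In its results the title appears in an <h3> tag; does the Google crawler take the title from the <title> tag and then clean and normalize it to get a readable title string?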

Answer 1

I would do some analysis between:

Then the "analysis" could be as basic as:
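
A minimal sketch of what a basic analysis along these lines might look like. It assumes the two candidates are the <title> text and the text of the first <h1> (an assumption; a given page may need other elements), and clean and pickTitle are hypothetical helper names. It collapses \n, \t and repeated spaces, then prefers the heading when the <title> contains it, since the rest of the <title> is then probably site branding.

```javascript
// Collapse runs of whitespace (including \n and \t) into single spaces, then trim.
function clean(text) {
  return String(text || '').replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: choose between the <title> text and a heading text.
// If the cleaned <title> contains the cleaned heading (case-insensitively),
// the heading is returned; otherwise fall back to the cleaned <title>.
function pickTitle(titleText, headingText) {
  const title = clean(titleText);
  const heading = clean(headingText);
  if (heading && title.toLowerCase().includes(heading.toLowerCase())) {
    return heading;
  }
  return title || heading;
}
```

With cheerio this could be called as pickTitle($('title').text(), $('h1').first().text()). If the page has no usable heading, the result is just the whitespace-normalized <title>.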

Or, if you don't mind relying on a SaaS web service, you could take a look at http://www.diffbot.com/.