Question

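BeautifulSoup is a Python library with a function called get_text() that takes a parsed HTML page, for example: https://pastebin.com/DJwA3S5P

and extracts all of the text from it, turning it into: https://pastebin.com/qMqrj8RS

Here is another example of what the function does: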

Given the following:

<span id="sm_flash_225" onclick="sm_flash_process('bail', this,1)" onmouseover="sm_flash_add('bail', this, 1);" onmouseout="sm_flash_remove('bail', this, 1);">bail</span>

BeautifulSoup's get_text() function would simply return:

bail

In other words, it takes <span id="some_id" more random stuff...>text</span> and turns it into text.

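I have the HTML file of a website held in one large formatted string. I want to write the JavaScript equivalent of BeautifulSoup's get_text() so that I get only the text of the web page. I am happy to use a third-party library; I don't want to reinvent the wheel. It is worth noting, however, that I am writing this in the context of a Chrome/Firefox web extension, so I don't believe I can use just any third-party library.

I fetched the HTML file with the following code: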

text

Answer 1

Try this:

@-moz-document

Answer 2

It is safer not to insert live HTML (and JS) from other websites into your own page. Use DOMParser instead:

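// Fetch the page through a CORS proxy and read the response body as a string.
fetch("https://cors-anywhere.herokuapp.com/stackoverflow.com")
  .then(response => response.text())
  .then(responseText => {
    // DOMParser builds a detached document, so scripts in the fetched HTML do not run.
    const responseDocument = new DOMParser().parseFromString(responseText, 'text/html');
    console.log(responseDocument.head.textContent);
    console.log(responseDocument.body.textContent);
  });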
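
One caveat with textContent is that it also returns the contents of <script> and <style> elements, which is usually not what you want when you are after the readable text of a page. Below is a minimal sketch of a get_text()-like helper that walks the parsed tree and keeps only text nodes. The name getText and its separator and strip parameters are my own, chosen to resemble BeautifulSoup's arguments; this is not a built-in API, and skipping SCRIPT and STYLE is a choice made for this sketch rather than a guaranteed match for BeautifulSoup's behaviour.

```javascript
// Tags whose text is code or styling rather than readable text.
// Skipping them is an assumption of this sketch.
const SKIPPED_TAGS = new Set(["SCRIPT", "STYLE"]);

// Standard DOM nodeType values.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: joins the text nodes found under `root`.
// `separator` is placed between text nodes; `strip` trims each one
// and drops those that become empty.
function getText(root, separator = "", strip = false) {
  const parts = [];
  const visit = (node) => {
    if (node.nodeType === TEXT_NODE) {
      const text = strip ? node.nodeValue.trim() : node.nodeValue;
      if (text.length > 0) parts.push(text);
      return;
    }
    if (node.nodeType === ELEMENT_NODE &&
        SKIPPED_TAGS.has(node.tagName.toUpperCase())) {
      return; // ignore script and style contents
    }
    // Comments and other node types contribute no text nodes.
    for (const child of node.childNodes || []) visit(child);
  };
  visit(root);
  return parts.join(separator);
}

// In a browser or extension page (not run here):
//   const doc = new DOMParser().parseFromString(responseText, "text/html");
//   console.log(getText(doc.body, "\n", true));
```

It only relies on nodeType, nodeValue, tagName and childNodes, so it should work on the document returned by DOMParser without any third-party library.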

Get the text of a web page using JavaScript
