Question

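I want to implement a basic web scraper in Node.js that is as generic as possible. The application should be able to parse any HTML and return only its text, without any markup, CSS, or scripts, and without having to know the structure of the HTML ahead of time.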

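I have been using this library:

With the following code I am able to extract the text from the body tag, but it also contains CSS and JavaScript. What is the best way to extract only the text, without the CSS/JavaScript?

Code: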

import os
from os import path
# has no effect, presumably because this needs to be set before python starts
os.environ['LD_LIBRARY_PATH'] = path.abspath(path.dirname(__file__))  

import prog
prog.pyhelloworld()

Answer 1

Looking at other answers, I have seen that you can do this with regular expressions. Here is an example:

let scriptRegex = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
// Match any <style ...> block, including attributes and multi-line contents
let styleRegex = /<style\b[^>]*>[\s\S]*?<\/style>/gi;

// An example html content
const str = `
my cool html content
<style>
...
</style>
my cool html content
<style type="text/css">
...
</style>
my cool html content
<script> 
... 
</script>
my cool html content`;

// Strip the tags from the html
let result = str.replace(scriptRegex, '');
result = result.replace(styleRegex, '');

// There you go :)
console.log('Substitution result: ', result);

Hope this helps!
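
The same idea can be wrapped in a small helper that also drops the remaining tags and collapses whitespace. This is only a rough sketch: the `htmlToText` name and the sample markup are made up for illustration, HTML entities are not decoded, and regular expressions are fragile against malformed or unusual HTML.

```javascript
// Rough regex-based HTML-to-text helper (hypothetical name, for illustration only).
function htmlToText(html) {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, ' ') // drop <script> blocks and their contents
    .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, ' ')   // drop <style> blocks and their contents
    .replace(/<[^>]+>/g, ' ')                            // drop any remaining tags
    .replace(/\s+/g, ' ')                                // collapse runs of whitespace
    .trim();
}

// Made-up sample input
const sample = `
<p>my cool html content</p>
<style type="text/css">
  p { color: red; }
  a { color: blue; }
</style>
<script>
  var x = 1;
</script>
<p>more content</p>`;

console.log(htmlToText(sample)); // prints "my cool html content more content"
```

For anything beyond simple pages, a real HTML parser is usually a safer choice than regular expressions.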

Answer 2

I believe cheerio.load(body) is giving you a DOM. If so, you can use something like innerText:

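    // Parse the document body
    const cheerio = require('cheerio');
    const { JSDOM } = require('jsdom');
    // cheerio.load() returns a cheerio object, so serialize it to HTML before handing it to JSDOM
    const dom = new JSDOM(cheerio.load(body).html(), { url: pageToVisit }).window.document.body;
    // jsdom may not implement innerText (it would be undefined); fall back to textContent
    console.log(dom.innerText ?? dom.textContent);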

If cheerio is giving you HTML, you can use JSDOM to convert it to a DOM, like so:

curl -d '{"userName": "Tom and Jerry"}' -H "Content-Type: application/json" -H "Authorization: Bearer dwgqhsfjnfjjfldklkfldskglkylrkylktyl" -X POST http://nodejs-appfactory1:25002/appurl-service/api/appurl/getClientList

Implementing a generic web crawler with Node.js

2 answers: