Suppose I have a website whose HTML source has the following structure:
<html>
<head>
....
<table id="xxx">
<tr>
..
</table>
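I have already used a library to close all the HTML tags. Could you tell me which libraries or regular expressions can extract all the text from the HTML source that starts with <table> and ends with </table>, using Node.js?
Here is my code: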
console.log('todo list RESTful API server started on: ' + port);
var request = require('request');
var cheerio = require('cheerio');
request('https://fpp.mpfa.org.hk/tc_chi/mpp_list.jsp', function (error, response, body) {
console.log('error:', error); // Print the error if one occurred
console.log('statusCode:', response && response.statusCode); // Print the response status code if a response was received
var sanitizeHtml = require('sanitize-html');
var dirty = body.match(/\[(.*)\]/).pop();
var clean = sanitizeHtml(dirty, {
allowedTags: [ ],
allowedAttributes: {
},
allowedIframeHostnames: ['www.youtube.com']
});
console.log('body:', clean); // Print the sanitized text extracted from the response body.
});
Answer 0 (score: 1)
You just need to use cheerio's API to select the <table> and then print its text nodes.
Given a page with the following HTML:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Contacts</title>
</head>
<body>
<main>
<h1>Hello</h1>
<section>
<h2>World</h2>
<table>
<tr>
<td>foo</td>
<td>bar</td>
<td>fizz</td>
</tr>
<tr>
<td>buzz</td>
<td>hello</td>
<td>world</td>
</tr>
</table>
</section>
</main>
</body>
</html>
and running the following code:
const request = require("request");
const cheerio = require("cheerio");
const URL_TO_PARSE = "http://localhost/my-page.html";
// Make a request to get the HTML of the page
request(URL_TO_PARSE, (err, response, body) => {
if (err) throw err; // Rethrow the original error so its details are not lost
// Load the HTML into cheerio's DOM
const $ = cheerio.load(body);
// Print the text nodes of the <table> in the HTML
console.log($("table").text());
});
produces the following output:
foo
bar
fizz
buzz
hello
world
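You can then manipulate it as needed. Cheerio uses an API very similar to jQuery's. Note that .text() concatenates the text of every descendant node, so the exact whitespace in the output depends on how the source HTML is indented.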