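How do I read and parse HTML in Node.js?

Time: 2019-01-10 19:56:07

Tags: html node.js parsing

I have a simple project and need some help with it. I need to read an HTML file and convert it to JSON. I want to capture the matches as code and as text. How can I do this?

For example, I have these two pieces of HTML markup: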

<p>In practice, it is usually a bad idea to modify global variables inside the function scope since it often is the cause of confusion and weird errors that are hard to debug.<br />
If you want to modify a global variable via a function, it is recommended to pass it as an argument and reassign the return-value.<br />
For example:</p>

<pre><code class="{python} language-{python}">a_var = 2

def a_func(some_var):
    return 2**3

a_var = a_func(a_var)
print(a_var)
</code></pre>

My code:

const fs = require('fs')
const showdown  = require('showdown')

var read =  fs.readFileSync('./test.md', 'utf8')

function importer(mdFile) {

    var result = []
    let json = {}

    var converter = new showdown.Converter()
    var text      = mdFile
    var html      = converter.makeHtml(text);

    for (var i = 0; i < html.length; i++) {
        htmlRead = html[i]
        if(html == html.match(/<p>(.*?)<\/p>/g))
            json.text = html.match(/<p>(.*?)<\/p>/g)

        if(html == html.match(/<pre>(.*?)<\/pre>/g))
            json.code = html.match(/<pre>(.*?)<\/pre>/g)

    }

    return html
}
console.log(importer(read))

How can I get these matches in my code?

New code: this writes all the p tags into the same JSON object. How can I write each p tag to its own JSON block?

$('html').each(function(){
    if ($('p').text != undefined) {
        json.code = $('p').text()
        json.language = "Text"
    }
})

2 Answers:

Answer 0 (score: 2)

I suggest using Cheerio. It brings much of jQuery's functionality to Node.js.

const cheerio = require('cheerio')

var html = "<p>In practice, it is usually a bad idea to modify global variables inside the function scope since it often is the cause of confusion and weird errors that are hard to debug.<br />If you want to modify a global variable via a function, it is recommended to pass it as an argument and reassign the return-value.<br />For example:</p>"

const $ = cheerio.load(html)
var paragraph = $('p').html(); // Contents of the paragraph. You can manipulate this however you like

// ...You would do the same for any other element you need

You should take a look at Cheerio and read its documentation. I find it really neat!

Edit: regarding the new part of the question

You can loop over each element and insert it into an array of JSON objects, like this:

var jsonObject = []; // An array of JSON objects that will hold everything
$('p').each(function() { // Loop over each paragraph
    // Take the content of the paragraph and put it into a JSON object
    jsonObject.push({"paragraph": $(this).html()}); // Add the data to the main jsonObject
});

The resulting array then holds one JSON object per paragraph, each of the form {"paragraph": "..."}.

I believe you should also read up on JSON and how it works.
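
As a small illustration of that point: the array built in the answer is an ordinary JavaScript array of objects, and it only becomes JSON text once it is serialized. The sketch below uses made-up sample data in the same `{"paragraph": ...}` shape and relies only on the built-in `JSON` object.

```javascript
// Illustrative data in the same shape the answer's loop produces.
const paragraphs = [
  { paragraph: 'First paragraph' },
  { paragraph: 'Second paragraph' },
];

// Serializing turns the array of objects into a JSON string
// (the extra arguments only add two-space indentation).
const jsonText = JSON.stringify(paragraphs, null, 2);

// Parsing the string gives back an equivalent, but separate, array.
const roundTripped = JSON.parse(jsonText);

console.log(jsonText);
```

The string in `jsonText` is what would be written to a file or sent over the network. The objects themselves are only the in-memory representation.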

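Answer 1 (score: 0)

hpq is not one of the most widely used HTML parsing libraries, but I think it fits your request well, because its one-line description is:

A utility to parse and query HTML into an object shape.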

https://github.com/aduth/hpq

This live browser page gives a good illustration of what it does:

https://aduth.github.io/hpq/

The catch is that it was written for the browser (it takes an HTML string or a DOM element as input), so I am not sure whether it can be used with Node.