Parsing HTML text with JS - an extra node?

Time: 2015-10-12 23:25:17

Tags: javascript parsing dom html-parsing paragraph

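Hi everyone,

I'm building a piece of software that does some text parsing on a given HTML text, and when I save all the paragraphs from the HTML, I find an extra node.

I have created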

 <p id="original_content_js"> Original content via JS:<br> </p>

to hold the data received from the parsing, so that I can compare it with the data being parsed (the original text).

Here is the HTML code:

<p id="original_content_js">
Original content via JS:<br>
</p>

<div id="original_text">    

        <h3>Molly's Sheep</h3>
        <p>
            Molly had a little sheep. <br>
            Molly didn't like her sheep. Ir was too hairy.<br>
            So Molly took a big knife, and cut all of her sheep's fur.<br>
            Now Molly's sheep is cold.<br>
        </p>
        <p>
            But what Molly did not know, was that her sheep is a magical sheep;<br>
            Molly's sheep grows hair instantly, magically!<br>
            Oh, how wonderful, Molly's sheep,<br>
            Making hair, each and each<br>
            Hair grows quickly after cut,<br>
            That's what the story's all about.
        <p>     
    </div>

Here is the parsing code:

 var html_text_name = "original_text";
 var html_text = document.getElementById(html_text_name);
 var text_paragaphs = html_text.getElementsByTagName("p");
 for (var x=0; x<text_paragaphs.length; x++){
    document.getElementById("original_content_js").innerHTML += "ABC" +
    text_paragaphs[x].innerHTML + "CBA <br>";
 }

The result I get in the original_content_js paragraph is:

 Original content via JS:
 ABC Molly had a little sheep. 
 Molly didn't like her sheep. Ir was too hairy.
 So Molly took a big knife, and cut all of her sheep's fur.
 Now Molly's sheep is cold.
 CBA 
 ABC But what Molly did not know, was that her sheep is a magical sheep;
 Molly's sheep grows hair instantly, magically!     
 Oh, how wonderful, Molly's sheep,
 Making hair, each and each
 Hair grows quickly after cut,
 That's what the story's all about. CBA 
 ABC CBA

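So you can see that I get what I expected, the 2 paragraphs wrapped in "ABC" and "CBA", except that there is another, empty node at the end. Why is there an extra node?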

1 Answer:

Answer 0: (score: 1)

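You are not checking whether the paragraphs are closed properly. The last tag inside the div is an opening <p> where a closing </p> should be, so your code sees three opening p tags and assumes there are three paragraphs. That is the problem: text_paragaphs ends up with 3 entries instead of 2, and the third one is empty. You would need to write a regular expression to check for this... but be warned: writing regular expressions to parse HTML is a horrible thing to do, and it is usually impossible to get it right 100% of the time.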

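Edit: I'm not saying you shouldn't write a regular expression to check whether the tags are closed properly in your case... I'm just saying, be careful.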
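
For illustration, here is a minimal sketch of the kind of regular-expression check described above. The helper name countParagraphTags and the shortened sample markup are assumptions made for this example and are not part of the original post. It only counts opening and closing p tags in a raw HTML string, so it is a rough heuristic rather than a parser: it will miscount if an attribute value contains ">", and HTML allows the closing </p> to be omitted, so a mismatch is a hint and not proof of an error.

```javascript
// Hypothetical helper (not from the original post): counts opening and
// closing <p> tags in a raw HTML string using regular expressions.
function countParagraphTags(html) {
  // Matches "<p>" or "<p ...>", but not "<pre>", "<param>" and similar tags.
  var openTags = html.match(/<p(\s[^>]*)?>/gi) || [];
  // Matches "</p>", allowing whitespace before the closing ">".
  var closeTags = html.match(/<\/p\s*>/gi) || [];
  return { open: openTags.length, close: closeTags.length };
}

// Shortened stand-in for the markup in the question: the last tag is
// "<p>" where "</p>" was intended.
var sample =
  "<h3>Title</h3>" +
  "<p>First paragraph.<br></p>" +
  "<p>Second paragraph.<p>";

var counts = countParagraphTags(sample);
if (counts.open !== counts.close) {
  // Prints: Unbalanced <p> tags: 3 opening, 1 closing
  console.log("Unbalanced <p> tags: " + counts.open + " opening, " + counts.close + " closing");
}
```

In the browser, the same string could be taken from document.getElementById("original_text").innerHTML only if it is read before the parser has normalized the markup, so a check like this is probably most useful on the raw source text.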