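How do I parse HTML and preserve all line breaks?

Time: 2013-11-30 00:24:46

Tags: java jsoup

I have a document that contains <br/>, <p>, and <table> elements.

I have been trying to parse this HTML with Jsoup while preserving its line breaks.

I tried many methods from similar questions, but none of them worked.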

    // Read the whole HTML file into a String (IOUtils is from Apache Commons IO).
    FileInputStream in = new FileInputStream("C:............xxx.htm");
    String htmlText = IOUtils.toString(in);
    in.close();

    File file = new File("C:............xxx.txt");
    PrintWriter pr = new PrintWriter(file);

    // Swap every <br> variant for a marker, extract the text, then turn the marker into a newline.
    String text = Jsoup.parse(htmlText.replaceAll("(?i)<br[^>]*>", "br2n")).text();
    System.out.println(text.replaceAll("br2n", "\n"));
    pr.println(text.replaceAll("br2n", "\n"));

    // Earlier attempt: strip the tags one source line at a time.
    //    for (String line : htmlText.split("\n")) {
    //        String stripped = Jsoup.parse(line).text();
    //
    //        System.out.println(stripped);
    //        pr.println(stripped);
    //    }

    pr.close();
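
The marker step in the snippet above can be checked on its own, without Jsoup. The following is a minimal sketch that only exercises the regular expression and the marker round trip. The class name `BreakMarkers` and its two methods are hypothetical helpers introduced for illustration and are not part of Jsoup.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not Jsoup API.
class BreakMarkers {
    // Same marker string the question uses.
    static final String MARKER = "br2n";

    // Case-insensitive match for <br>, <br/>, <BR > and <br clear="all">.
    private static final Pattern BR = Pattern.compile("(?i)<br[^>]*>");

    // Step 1: swap every <br> variant for the marker before parsing.
    static String markBreaks(String html) {
        return BR.matcher(html).replaceAll(MARKER);
    }

    // Step 2: after text extraction, turn each marker back into a newline.
    static String restoreBreaks(String text) {
        return text.replace(MARKER, "\n");
    }

    public static void main(String[] args) {
        String html = "first<br/>second<BR >third<br clear=\"all\">fourth";
        String marked = markBreaks(html);
        System.out.println(marked);
        System.out.println(restoreBreaks(marked));
    }
}
```

Since the marker is plain text, it survives text extraction, which is why only the `<br>` positions come back as newlines and the table rows and paragraphs do not.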


Below is a representative part of my HTML file (the original starts with <html>, of course):

    <table border="0" cellspacing="0" cellpadding="0" bgcolor="white"
    width='650'>
    <tr>
    <td><font size="4"><br />
    &nbsp;<b>The scientific explantion of the syndrom</b></font>
    <table width='650' border="0" cellspacing="5" cellpadding="0">
    <tr>
    <td width='5%'>&nbsp;</td>
    <td width='25%'>&nbsp;</td>
    <td width='25%'>&nbsp;</td>
    <td width='15%'>&nbsp;</td>
    <td width='30%'>&nbsp;</td>
    </tr>
    <tr height="24">
    <td align="left" nowrap="nowrap" colspan="3"><font size=
    "3"><b>Recent Update</b></font></td>
    <td align="left" nowrap="nowrap"><a name=
    "9J003346248"></a><font size="3"><b>Issue:</b></font></td>
    <td align="left"><font size="3">9569865248</font></td>
    </tr>
    <tr>
    <td>&nbsp;</td>
    <td align="left"><b>Locust:</b></td>
    <td align="left" colspan="3">UYF78UIGK</td>
    </tr>

    </table>

    <br/> The explanation above does not necc....... <p> 
    Blah ....
    </p>

    <table border="2" cellspacing="1" cellpadding="0" bgcolor="white"
    width='750'>
    <tr>
    <td><font size="4"><br />
    &nbsp;<b>Syndrom of the main ......</b></font>
    <table width='650' border="0" cellspacing="5" cellpadding="0">
    <tr>
    <td width='5%'>&nbsp;</td>
    <td width='25%'>&nbsp;</td>
    <td width='25%'>&nbsp;</td>
    <td width='15%'>&nbsp;</td>
    <td width='30%'>&nbsp;</td>
    </tr>
    <tr height="24">
    <td align="left" nowrap="nowrap" colspan="3"><font size=
    "3"><b>Data</b></font></td>
    <td align="left" nowrap="nowrap"><a name=
    "9J003346248"></a><font size="3"><b>Issue:</b></font></td>
    <td align="left"><font size="3">9509809248</font></td>
    </tr>
    <tr>
    <td>&nbsp;</td>
    <td align="left"><b>Locust:</b></td>
    <td align="left" colspan="3">U344365GK</td>
    </tr>

    </table>

    <br/> The explanation above does not necc....... <p> 
    Blah ....
    </p>

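I need to make sure that all the rows in these tables come out one after another, in the same order as in the original document. However, I have multiple tables and other "line-breaking elements". How can I do this with Jsoup? Or is there another API that can parse the HTML and preserve the lines more effectively?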

1 Answer:

Answer 0 (score: 0):

You almost had it. Try this:

    // Add all markers to the String first; replaceAll is not available on the parsed Document.
    String marked = htmlText
            .replaceAll("(?i)</tr>", "</tr> br2n ")
            .replaceAll("(?i)<br[^>]*>", "br2n")
            .replaceAll("(?i)<p>", "<p> br2n ")
            .replaceAll("(?i)</p>", "</p> br2n ");
    String text = Jsoup.parse(marked).text();
    System.out.println(text.replaceAll("br2n", "\n"));
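
Two details of this approach are worth verifying against your own files. First, an HTML5-style parser such as Jsoup may relocate text that sits directly between table rows to a position before the table, so a row marker is probably safer inside the row's last cell. Second, markers padded with spaces leave stray spaces and empty lines after the final replacement. The sketch below is one possible variation under those assumptions, not the answer's exact method. The class `RowMarkers` and its methods are hypothetical, and it assumes every row ends with `</td>` followed by `</tr>` and that empty lines are unwanted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; class and method names are illustrative, not Jsoup API.
class RowMarkers {
    static final String MARKER = "br2n";

    // Adds the marker wherever a line break is wanted. The row marker goes
    // before the row's last </td>, so it stays inside a table cell.
    static String mark(String html) {
        return html
                .replaceAll("(?i)</td>\\s*</tr>", " " + MARKER + " </td></tr>")
                .replaceAll("(?i)<br[^>]*>", " " + MARKER + " ")
                // $0 keeps the original tag, including any attributes.
                .replaceAll("(?i)<p(\\s[^>]*)?>", "$0 " + MARKER + " ")
                .replaceAll("(?i)</p>", "$0 " + MARKER + " ");
    }

    // Splits extracted text on the marker, trims each piece and drops empty ones.
    static List<String> toLines(String extracted) {
        List<String> lines = new ArrayList<>();
        for (String piece : extracted.split(Pattern.quote(MARKER))) {
            String line = piece.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Locust:</td><td>UYF78UIGK</td>\n</tr></table><br/>Blah<p>more</p>";
        System.out.println(mark(html));

        // Stand-in for what text extraction might return for the marked HTML.
        String extracted = "Locust: UYF78UIGK br2n br2n Blah br2n more br2n";
        System.out.println(String.join("\n", toLines(extracted)));
    }
}
```

With Jsoup on the classpath, the two steps would be combined as `toLines(Jsoup.parse(mark(htmlText)).text())`, and the resulting lines written to the `PrintWriter` one by one.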