我有一个包含<br/> , <p> , and <table>
元素的文档
我一直在尝试使用Jsoup
和保留行来解析此HTML。
我从many methods尝试了similar questions,但没有结果
FileInputStream in = new FileInputStream("C:............xxx.htm");
String htmlText = IOUtils.toString(in);
File file = new File("C:............xxx.txt") ;
PrintWriter pr = new PrintWriter(file) ;
String text = Jsoup.parse(htmlText.replaceAll("(?i)<br[^>]*>", "br2n")).text();
System.out.println(text.replaceAll("br2n", "\n"));
pr.println(text.replaceAll("br2n", "\n"));
// for (String line : htmlText.split("\n")) {
// String stripped = Jsoup.parse(line).text();
//
// System.out.println(stripped);
// pr.println(stripped);
//
// }
pr.close();
以下是我的HTML文件的代表部分(原始文件以<html>
开头...当然)
<table border="0" cellspacing="0" cellpadding="0" bgcolor="white"
width='650'>
<tr>
<td><font size="4"><br />
<b>The scientific explantion of the syndrom</b></font>
<table width='650' border="0" cellspacing="5" cellpadding="0">
<tr>
<td width='5%'> </td>
<td width='25%'> </td>
<td width='25%'> </td>
<td width='15%'> </td>
<td width='30%'> </td>
</tr>
<tr height="24">
<td align="left" nowrap="nowrap" colspan="3"><font size=
"3"><b>Recent Update</b></font></td>
<td align="left" nowrap="nowrap"><a name=
"9J003346248"></a><font size="3"><b>Issue:</b></font></td>
<td align="left"><font size="3">9569865248</font></td>
</tr>
<tr>
<td> </td>
<td align="left"><b>Locust:</b></td>
<td align="left" colspan="3">UYF78UIGK</td>
</tr>
</table>
<br/> The explanation above does not necc....... <p>
Blah ....
</p>
<table border="2" cellspacing="1" cellpadding="0" bgcolor="white"
width='750'>
<tr>
<td><font size="4"><br />
<b>Syndrom of the main ......</b></font>
<table width='650' border="0" cellspacing="5" cellpadding="0">
<tr>
<td width='5%'> </td>
<td width='25%'> </td>
<td width='25%'> </td>
<td width='15%'> </td>
<td width='30%'> </td>
</tr>
<tr height="24">
<td align="left" nowrap="nowrap" colspan="3"><font size=
"3"><b>Data</b></font></td>
<td align="left" nowrap="nowrap"><a name=
"9J003346248"></a><font size="3"><b>Issue:</b></font></td>
<td align="left"><font size="3">9509809248</font></td>
</tr>
<tr>
<td> </td>
<td align="left"><b>Locust:</b></td>
<td align="left" colspan="3">U344365GK</td>
</tr>
</table>
<br/> The explanation above does not necc....... <p>
Blah ....
</p>
我需要确保这些表中的所有行都像原始文档中那样依次存在。但我有多个表和其他“断行元素”。我怎么能用Jsoup做到这一点?是否有可能解析html并更有效地使用其他api保持线路?
答案 0 :(得分:0)
你几乎是对的。试试这个
String text = Jsoup.parse(htmlText.replaceAll("(?i)</tr>", "</tr> br2n ").replaceAll("(?i)<br[^>]*>", "br2n")).replaceAll("(?i)<p>", "<p> br2n ").replaceAll("(?i)</p>", "</p> br2n ").text();
System.out.println(text.replaceAll("br2n", "\n"));