Extracting text from HTML while excluding specific element tags

Time: 2019-02-05 11:12:51

Tags: java html html5 htmlunit

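Is there a way to extract all of the text from an HTML document while leaving elements with certain tags intact, so that those elements keep their markup instead of being reduced to text?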

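I tried removing those elements, extracting the text from the remaining HTML, and then injecting the removed elements back in, but the extracted content comes out jumbled.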

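// parseHtmlFromString is a helper defined elsewhere (not shown).
HtmlPage pageWithoutTables = parseHtmlFromString(removeByTagName(htmlContent, "table"), webClient);

public static String removeByTagName(final String str, final String tag) {
    // (?s) lets '.' match line breaks, so elements spanning several lines are removed;
    // (?i) ignores tag case; \\b stops "table" from also matching a longer tag name.
    return str.replaceAll("(?is)<" + tag + "\\b.*?</" + tag + "\\s*>", "");
}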
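One possible way to avoid the remove-and-reinject round trip is to go through the document once, copying every block of the kept tag through unchanged and reducing everything between those blocks to plain text. The following is only a minimal sketch under stated assumptions: it uses plain `java.util.regex` rather than HtmlUnit, the names `TagPreservingExtractor` and `extractTextKeeping` are hypothetical, and it assumes the kept tag is never nested inside itself. It also does not decode entities such as `&amp;`, so a real HTML parser is likely the safer choice for untidy pages.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit: keeps blocks of one tag verbatim
// and turns everything else into whitespace-normalized plain text.
final class TagPreservingExtractor {

    // Sections whose content is never visible text.
    private static final Pattern NON_VISIBLE =
            Pattern.compile("(?is)<(head|script|style)\\b.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String extractTextKeeping(String html, String tag) {
        // Lazy match from the opening tag to the first matching close tag.
        // Assumption: the kept tag is not nested inside itself.
        String name = Pattern.quote(tag);
        Pattern kept = Pattern.compile("(?is)<" + name + "\\b.*?</" + name + "\\s*>");

        StringBuilder out = new StringBuilder();
        Matcher m = kept.matcher(html);
        int last = 0;
        while (m.find()) {
            appendPiece(out, toText(html.substring(last, m.start()))); // text before the block
            appendPiece(out, m.group());                               // the block itself, unchanged
            last = m.end();
        }
        appendPiece(out, toText(html.substring(last)));                // text after the last block
        return out.toString();
    }

    // Drops non-visible sections, comments and tags, then collapses whitespace.
    private static String toText(String fragment) {
        String s = NON_VISIBLE.matcher(fragment).replaceAll(" ");
        s = COMMENT.matcher(s).replaceAll(" ");
        s = ANY_TAG.matcher(s).replaceAll(" ");
        return s.replaceAll("\\s+", " ").trim();
    }

    // Appends a non-empty piece on its own line.
    private static void appendPiece(StringBuilder out, String piece) {
        if (piece.isEmpty()) {
            return;
        }
        if (out.length() > 0) {
            out.append('\n');
        }
        out.append(piece);
    }

    public static void main(String[] args) {
        String html = "<html><head><style>td { padding: 8px; }</style></head><body>"
                + "<h2>A heading</h2> some text"
                + "<table>\n<tr><td>Germany</td></tr>\n</table>"
                + "<br>more text</body></html>";
        System.out.println(extractTextKeeping(html, "table"));
    }
}
```

Because the pieces are emitted in document order, each kept block stays where it was relative to the surrounding text, which is the part that the remove-then-reinject approach loses.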

Input

     <!DOCTYPE html>
 <html>
 <head>
 <style>
 table {
   font-family: arial, sans-serif;
   border-collapse: collapse;
   width: 100%;
 }

  td, th {
   border: 1px solid #dddddd;
   text-align: left;
   padding: 8px;
 }

 tr:nth-child(even) {
   background-color: #dddddd;
 }
 </style>
 </head>
 <body>

 <h2><b>this sentence would be in bold print.</b>

Below is an example of a very simple page </h2>

anything others .............etc...

<table>
 <tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
 </tr>
 <tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
 </tr>
 <tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
 </tr>
 <tr>
<td>Ernst Handel</td>
<td>Roland Mendel</td>
<td>Austria</td>
 </tr>
<tr>
<td>Island Trading</td>
<td>Helen Bennett</td>
<td>UK</td>
</tr>
<tr>
<td>Laughing Bacchus Winecellars</td>
<td>Yoshi Tannamuri</td>
<td>Canada</td>
</tr>
<tr>
<td>Magazzini Alimentari Riuniti</td>
<td>Giovanni Rovelli</td>
<td>Italy</td>
 </tr>
</table>
<br> <br> <br> <br> 

other data that maybe contains the same tag again

 </body>
  </html>

Expected output

this sentence would be in bold print. Below is an example of a very simple 
page
anything others .............etc...
<table>
 <tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
 </tr>
 <tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
 </tr>
 <tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
 </tr>
 <tr>
<td>Ernst Handel</td>
<td>Roland Mendel</td>
<td>Austria</td>
 </tr>
<tr>
<td>Island Trading</td>
<td>Helen Bennett</td>
<td>UK</td>
</tr>
<tr>
<td>Laughing Bacchus Winecellars</td>
<td>Yoshi Tannamuri</td>
<td>Canada</td>
</tr>
<tr>
<td>Magazzini Alimentari Riuniti</td>
<td>Giovanni Rovelli</td>
<td>Italy</td>
 </tr>
</table>

other data that maybe contains the same tag again

0 answers:

No answers