How do I find tags in an HTML string that have no closing tag, and close them?
An HTML string containing a tag with no closing tag:
<html>
<head> </head>
<body>
<p style="margin-top: 0"> dasa </p>
<input size="1" type="text" value="a">
</body>
</html>
to
<html>
<head> </head>
<body>
<p style="margin-top: 0"> dasa </p>
<input size="1" type="text" value="a"> </input>
</body>
</html>
Thanks!
Answer 0 (score: 3)
I have two options (I like the second one best).
1. http://home.ccil.org/~cowan/XML/tagsoup
TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
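This is the tool we are using. I have also mentioned another tool, but I have not used it.
2. http://htmlcleaner.sourceforge.net/download.php
Just download it, extract it, and run the jar file as shown below.
For example, I have an HTML file with the following content: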
<table>
<tr>
<td>Wrong Table
It produces the output shown below:
C:\Users\Lasitha Benaragama\Downloads\htmlcleaner-2.8>java -jar htmlcleaner-2.8.jar src=http://localhost/fun/test.html
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:RequiredParentMissing(true) at tr
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at table
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at tbody
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at tr
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at td
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head />
<body><table>
<tbody><tr>
<td>Wrong Table</td></tr></tbody></table></body></html>
I also tested your HTML; the output is:
C:\Users\Lasitha Benaragama\Downloads\htmlcleaner-2.8>java -jar htmlcleaner-2.8.jar src=http://localhost/fun/test.html
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head />
<body>
<p style="margin-top: 0"> dasa </p>
<input size="1" type="text" value="a" />
</body></html>
C:\Users\Lasitha Benaragama\Downloads\htmlcleaner-2.8>
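Thanks.
Answer 1 (score: 0)
You can keep a stack of tags. When you encounter an opening tag, push it onto the stack. When you reach a closing tag, pop the stack and check that the popped tag matches the closing tag you are at. If it does not, the popped tag is missing its closing tag.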
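The stack idea can be sketched roughly as follows. This is only an illustration, not a real HTML parser: the class name `UnclosedTagFinder`, the method `findUnclosed`, and the regular expression are all assumptions made for this example, and the regex will misread comments, scripts, or attribute values that contain `>`. It also treats void elements such as `input` as unclosed, which is what the question asks for even though HTML itself does not require an end tag there.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of TagSoup or HtmlCleaner.
class UnclosedTagFinder {
    // Matches <name ...>, </name> and <name ... />. Deliberately naive.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>");

    /** Returns the names of tags that were opened but never closed, in the order detected. */
    static List<String> findUnclosed(String html) {
        Deque<String> open = new ArrayDeque<>();   // stack of currently open tags
        List<String> unclosed = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = m.group(3).trim().endsWith("/");
            if (!closing) {
                // Opening tag: push it unless it is written as <name ... />.
                if (!selfClosing) {
                    open.push(name);
                }
            } else if (open.contains(name)) {
                // Closing tag: everything opened after its partner was never closed.
                while (!open.peek().equals(name)) {
                    unclosed.add(open.pop());
                }
                open.pop(); // the matching opening tag itself
            }
            // A closing tag with no matching opening tag is ignored in this sketch.
        }
        // Anything still on the stack at the end of the input is unclosed too.
        while (!open.isEmpty()) {
            unclosed.add(open.pop());
        }
        return unclosed;
    }
}
```

Calling `UnclosedTagFinder.findUnclosed(...)` on the HTML from the question reports `input`, because `</body>` arrives while `input` is still on top of the stack. For real-world markup, a tolerant parser such as the ones in the first answer is the safer choice.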
Answer 2 (score: 0)
The code below works perfectly for me:
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import org.ccil.cowan.tagsoup.Parser;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;
import org.xml.sax.SAXException;

public class EmailUtil {
    // Parses possibly malformed HTML with TagSoup and returns it as well-formed XML.
    public static String getValidHtml(String html) throws SAXException, DocumentException, IOException {
        // Use TagSoup's Parser as the SAX XMLReader behind dom4j, so unclosed tags are repaired while parsing.
        SAXReader reader = new SAXReader(Parser.class.getName());
        // Read the String directly; going through getBytes() would depend on the platform charset.
        Document doc = reader.read(new StringReader(html));
        StringWriter out = new StringWriter();
        XMLWriter writer = new XMLWriter(out);
        writer.write(doc);
        writer.flush(); // make sure buffered output reaches the StringWriter
        return out.toString();
    }
}