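I have a string that I want to convert to a DocumentFragment. The problem is that the children of the list, <ul><li>... </li></ul>,
are stripped out completely. I don't know why this happens.
Is there any configuration I need to add or update?
Input: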
<div class="faq-content-area">
<p>You can receive Preferred Rewards benefits on your existing accounts, but you'll need:</p>
<ul>
<li>A <a target="_self" href="/deposits/savings/rewards-money-market-savings-account.go" id="rmms-prtfaq" name="">Rewards Money Market Savings account</a> to receive the money market savings interest rate booster</li>
<li>An eligibile <a target="_self" href="/credit-cards/overview.go" id="creditcard-prtfaq" name="">Bank of America credit card</a>, such as BankAmericard Cash Rewards™ or BankAmericard Travel Rewards<sup>®</sup>, to receive the credit card rewards bonus</li>
</ul>
<p>After you enroll in Preferred Rewards, you can talk to a specialist to convert your existing money market savings account to a Rewards Money Market Savings account or to open a new credit card account that’s eligible for the rewards bonus.</p>
<p>If you already have a Rewards Money Market Savings account or an eligible credit card, you’ll automatically receive Preferred Rewards benefits after you enroll.</p>
</div>
The output is as follows:
<DIV class="faq-content-area hide">
<P>You can receive Preferred Rewards benefits on your existing accounts, but you'll need:</P>
<UL>
</UL>
</DIV>
I don't understand why the <li> elements are dropped.
Java program:
    InputStream is = null;
    try {
        // Load the HTML snippet from the classpath.
        is = ClassLoader.getSystemResourceAsStream("test.txt");
        InputSource iss = new InputSource(is);
        DocumentFragment documentFragment = qaParser.parse(iss);
        // Serialize once and reuse the result; the stream has already been consumed by the parser.
        String html = qaParser.serialize(documentFragment);
        System.out.println(html);
        try {
            Path path = Paths.get("./qaAnswers.txt");
            Files.write(path, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            e.printStackTrace();
        }
    } finally {
        if (is != null) {
            is.close();
        }
    }
Creating the DocumentFragment object:
    DocumentFragment parse(InputSource input) throws Exception {
        DOMFragmentParser parser = new DOMFragmentParser();
        try {
            parser.setFeature("http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe",
                    true);
            parser.setFeature("http://cyberneko.org/html/features/augmentations",
                    true);
            parser.setProperty("http://cyberneko.org/html/properties/default-encoding",
                    defaultCharEncoding);
            parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset",
                    true);
            parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
                    false);
            parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment",
                    true);
            parser.setFeature("http://cyberneko.org/html/features/report-errors",
                    LOG.isTraceEnabled());
        } catch (SAXException e) {
            // Log a rejected feature or property instead of silently ignoring it.
            LOG.warn("Could not configure parser: ", e);
        }
        // Parse into fragments owned by an HTML DOM document and merge them into one result.
        HTMLDocumentImpl doc = new HTMLDocumentImpl();
        doc.setErrorChecking(false);
        DocumentFragment res = doc.createDocumentFragment();
        DocumentFragment frag = doc.createDocumentFragment();
        parser.parse(input, frag);
        res.appendChild(frag);
        try {
            // Keep parsing until the parser yields no more nodes.
            while (true) {
                frag = doc.createDocumentFragment();
                parser.parse(input, frag);
                if (!frag.hasChildNodes()) break;
                if (LOG.isInfoEnabled()) {
                    LOG.info(" - new frag, " + frag.getChildNodes().getLength() + " nodes.");
                }
                res.appendChild(frag);
            }
        } catch (Exception e) {
            LOG.error("Error: ", e);
        }
        return res;
    }
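One way to narrow this down is to inspect the parsed fragment directly rather than through serialize(): if the LI elements show up in a plain walk of the DOM, the parser kept them and the serializer is the place to look; if they do not, they were lost during parsing. The following is only a minimal sketch using the JDK's org.w3c.dom API. The class DomOutline and its methods are hypothetical debugging helpers, not part of NekoHTML or of the code above, and sample() builds a hand-made stand-in for the real fragment.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical debugging helper; not part of NekoHTML.
class DomOutline {

    // Returns one line per element, indented two spaces per nesting level.
    static String outline(Node node) {
        StringBuilder sb = new StringBuilder();
        walk(node, 0, sb);
        return sb.toString();
    }

    private static void walk(Node node, int depth, StringBuilder sb) {
        int childDepth = depth;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            sb.append(node.getNodeName()).append('\n');
            childDepth = depth + 1;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, childDepth, sb);
        }
    }

    // Counts elements with the given name, ignoring case
    // (HTML DOM implementations often report upper-case names).
    static int count(Node node, String name) {
        int n = (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(name)) ? 1 : 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            n += count(c, name);
        }
        return n;
    }

    // Hand-built stand-in for a parsed fragment: one UL with two LI children.
    static Node sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            DocumentFragment frag = doc.createDocumentFragment();
            Element ul = doc.createElement("UL");
            ul.appendChild(doc.createElement("LI"));
            ul.appendChild(doc.createElement("LI"));
            frag.appendChild(ul);
            return frag;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node frag = sample();
        System.out.print(outline(frag));
        System.out.println("LI count: " + count(frag, "li")); // prints "LI count: 2"
    }
}
```

In the program above, the same check could be run as DomOutline.count(documentFragment, "li") right after qaParser.parse(iss), before the fragment is passed to serialize().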
The serialization method:
    // Custom method to serialize HTML.
    String serialize(Node node) {
        try {
            TransformerFactory transformerFactory = TransformerFactory.newInstance();
            Transformer transformer = transformerFactory.newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            StringWriter sw = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(sw));
            return sw.toString();
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }
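To check the serializer in isolation, the same Transformer settings can be run against a hand-built DOM that is known to contain list items. This is a rough sketch, not a diagnosis: SerializeCheck is a hypothetical class name, and it uses the JDK's default DOM implementation rather than HTMLDocumentImpl, so it only shows whether these output properties by themselves keep nested elements.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical standalone check; mirrors the output properties used in serialize().
class SerializeCheck {

    static String serialize(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter sw = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(sw));
        return sw.toString();
    }

    // Builds <div><ul><li>first</li><li>second</li></ul></div> by hand and serializes it.
    static String sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            Element ul = doc.createElement("ul");
            for (String text : new String[] {"first", "second"}) {
                Element li = doc.createElement("li");
                li.appendChild(doc.createTextNode(text));
                ul.appendChild(li);
            }
            div.appendChild(ul);
            return serialize(div);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The exact whitespace depends on the Transformer implementation in use.
        System.out.println(sample());
    }
}
```

If the list items appear in this output, the Transformer settings are unlikely to be what removes them, and the fragment returned by parse() is the next thing to inspect.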