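I want to convert a block of HTML that contains named HTML entities into an XML-compliant block that uses numbered XML entities, while leaving all HTML tag elements in place.
This test illustrates the basic idea: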
@Test
public void testEvalHtmlEntitiesToXmlEntities() {
    String input = "<a href=\"test.html\">link&nbsp;</a>";
    String expected = "<a href=\"test.html\">link&#160;</a>";
    String actual = SomeUtil.eval(input);
    Assert.assertEquals(expected, actual);
}
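Does anyone know of a class that provides this? I could write a regular expression that iterates over the non-element matches and does:
xmlString += StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeHtml(htmlString));
but I am hoping there is an easier way, or a class that already does this.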
Answer 0 (score: 3)
Have you tried JTidy?
private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setPrintBodyOnly(true); // only print the content
    tidy.setXmlOut(true); // to XML
    tidy.setSmartIndent(true);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}
Be aware, though, that I think it will also repair some of the HTML along the way.
Answer 1 (score: 3)
This may work for you.
private static Map<String, String> entityMap = new HashMap<String, String>();
static{
entityMap.put("nbsp", "\u00A0");
entityMap.put("iexcl", "¡");
entityMap.put("cent", "¢");
entityMap.put("pound", "£");
entityMap.put("curren", "¤");
entityMap.put("yen", "¥");
entityMap.put("brvbar", "¦");
entityMap.put("sect", "§");
entityMap.put("uml", "¨");
entityMap.put("copy", "©");
entityMap.put("ordf", "ª");
entityMap.put("laquo", "«");
entityMap.put("not", "¬");
entityMap.put("shy", "\u00AD");
entityMap.put("reg", "®");
entityMap.put("macr", "¯");
entityMap.put("deg", "°");
entityMap.put("plusmn", "±");
entityMap.put("sup2", "²");
entityMap.put("sup3", "³");
entityMap.put("acute", "´");
entityMap.put("micro", "µ");
entityMap.put("para", "¶");
entityMap.put("middot", "·");
entityMap.put("cedil", "¸");
entityMap.put("sup1", "¹");
entityMap.put("ordm", "º");
entityMap.put("raquo", "»");
entityMap.put("frac14", "¼");
entityMap.put("frac12", "½");
entityMap.put("frac34", "¾");
entityMap.put("iquest", "¿");
entityMap.put("Agrave", "À");
entityMap.put("Aacute", "Á");
entityMap.put("Acirc", "Â");
entityMap.put("Atilde", "Ã");
entityMap.put("Auml", "Ä");
entityMap.put("Aring", "Å");
entityMap.put("AElig", "Æ");
entityMap.put("Ccedil", "Ç");
entityMap.put("Egrave", "È");
entityMap.put("Eacute", "É");
entityMap.put("Ecirc", "Ê");
entityMap.put("Euml", "Ë");
entityMap.put("Igrave", "Ì");
entityMap.put("Iacute", "Í");
entityMap.put("Icirc", "Î");
entityMap.put("Iuml", "Ï");
entityMap.put("ETH", "Ð");
entityMap.put("Ntilde", "Ñ");
entityMap.put("Ograve", "Ò");
entityMap.put("Oacute", "Ó");
entityMap.put("Ocirc", "Ô");
entityMap.put("Otilde", "Õ");
entityMap.put("Ouml", "Ö");
entityMap.put("times", "×");
entityMap.put("Oslash", "Ø");
entityMap.put("Ugrave", "Ù");
entityMap.put("Uacute", "Ú");
entityMap.put("Ucirc", "Û");
entityMap.put("Uuml", "Ü");
entityMap.put("Yacute", "Ý");
entityMap.put("THORN", "Þ");
entityMap.put("szlig", "ß");
entityMap.put("agrave", "à");
entityMap.put("aacute", "á");
entityMap.put("acirc", "â");
entityMap.put("atilde", "ã");
entityMap.put("auml", "ä");
entityMap.put("aring", "å");
entityMap.put("aelig", "æ");
entityMap.put("ccedil", "ç");
entityMap.put("egrave", "è");
entityMap.put("eacute", "é");
entityMap.put("ecirc", "ê");
entityMap.put("euml", "ë");
entityMap.put("igrave", "ì");
entityMap.put("iacute", "í");
entityMap.put("icirc", "î");
entityMap.put("iuml", "ï");
entityMap.put("eth", "ð");
entityMap.put("ntilde", "ñ");
entityMap.put("ograve", "ò");
entityMap.put("oacute", "ó");
entityMap.put("ocirc", "ô");
entityMap.put("otilde", "õ");
entityMap.put("ouml", "ö");
entityMap.put("divide", "÷");
entityMap.put("oslash", "ø");
entityMap.put("ugrave", "ù");
entityMap.put("uacute", "ú");
entityMap.put("ucirc", "û");
entityMap.put("uuml", "ü");
entityMap.put("yacute", "ý");
entityMap.put("thorn", "þ");
entityMap.put("yuml", "ÿ");
entityMap.put("fnof", "ƒ");
entityMap.put("Alpha", "Α");
entityMap.put("Beta", "Β");
entityMap.put("Gamma", "Γ");
entityMap.put("Delta", "Δ");
entityMap.put("Epsilon", "Ε");
entityMap.put("Zeta", "Ζ");
entityMap.put("Eta", "Η");
entityMap.put("Theta", "Θ");
entityMap.put("Iota", "Ι");
entityMap.put("Kappa", "Κ");
entityMap.put("Lambda", "Λ");
entityMap.put("Mu", "Μ");
entityMap.put("Nu", "Ν");
entityMap.put("Xi", "Ξ");
entityMap.put("Omicron", "Ο");
entityMap.put("Pi", "Π");
entityMap.put("Rho", "Ρ");
entityMap.put("Sigma", "Σ");
entityMap.put("Tau", "Τ");
entityMap.put("Upsilon", "Υ");
entityMap.put("Phi", "Φ");
entityMap.put("Chi", "Χ");
entityMap.put("Psi", "Ψ");
entityMap.put("Omega", "Ω");
entityMap.put("alpha", "α");
entityMap.put("beta", "β");
entityMap.put("gamma", "γ");
entityMap.put("delta", "δ");
entityMap.put("epsilon", "ε");
entityMap.put("zeta", "ζ");
entityMap.put("eta", "η");
entityMap.put("theta", "θ");
entityMap.put("iota", "ι");
entityMap.put("kappa", "κ");
entityMap.put("lambda", "λ");
entityMap.put("mu", "μ");
entityMap.put("nu", "ν");
entityMap.put("xi", "ξ");
entityMap.put("omicron", "ο");
entityMap.put("pi", "π");
entityMap.put("rho", "ρ");
entityMap.put("sigmaf", "ς");
entityMap.put("sigma", "σ");
entityMap.put("tau", "τ");
entityMap.put("upsilon", "υ");
entityMap.put("phi", "φ");
entityMap.put("chi", "χ");
entityMap.put("psi", "ψ");
entityMap.put("omega", "ω");
entityMap.put("thetasym", "ϑ");
entityMap.put("upsih", "ϒ");
entityMap.put("piv", "ϖ");
entityMap.put("bull", "•");
entityMap.put("hellip", "…");
entityMap.put("prime", "′");
entityMap.put("Prime", "″");
entityMap.put("oline", "‾");
entityMap.put("frasl", "⁄");
entityMap.put("weierp", "℘");
entityMap.put("image", "ℑ");
entityMap.put("real", "ℜ");
entityMap.put("trade", "™");
entityMap.put("alefsym", "ℵ");
entityMap.put("larr", "←");
entityMap.put("uarr", "↑");
entityMap.put("rarr", "→");
entityMap.put("darr", "↓");
entityMap.put("harr", "↔");
entityMap.put("crarr", "↵");
entityMap.put("lArr", "⇐");
entityMap.put("uArr", "⇑");
entityMap.put("rArr", "⇒");
entityMap.put("dArr", "⇓");
entityMap.put("hArr", "⇔");
entityMap.put("forall", "∀");
entityMap.put("part", "∂");
entityMap.put("exist", "∃");
entityMap.put("empty", "∅");
entityMap.put("nabla", "∇");
entityMap.put("isin", "∈");
entityMap.put("notin", "∉");
entityMap.put("ni", "∋");
entityMap.put("prod", "∏");
entityMap.put("sum", "∑");
entityMap.put("minus", "−");
entityMap.put("lowast", "∗");
entityMap.put("radic", "√");
entityMap.put("prop", "∝");
entityMap.put("infin", "∞");
entityMap.put("ang", "∠");
entityMap.put("and", "∧");
entityMap.put("or", "∨");
entityMap.put("cap", "∩");
entityMap.put("cup", "∪");
entityMap.put("int", "∫");
entityMap.put("there4", "∴");
entityMap.put("sim", "∼");
entityMap.put("cong", "≅");
entityMap.put("asymp", "≈");
entityMap.put("ne", "≠");
entityMap.put("equiv", "≡");
entityMap.put("le", "≤");
entityMap.put("ge", "≥");
entityMap.put("sub", "⊂");
entityMap.put("sup", "⊃");
entityMap.put("nsub", "⊄");
entityMap.put("sube", "⊆");
entityMap.put("supe", "⊇");
entityMap.put("oplus", "⊕");
entityMap.put("otimes", "⊗");
entityMap.put("perp", "⊥");
entityMap.put("sdot", "⋅");
entityMap.put("lceil", "⌈");
entityMap.put("rceil", "⌉");
entityMap.put("lfloor", "⌊");
entityMap.put("rfloor", "⌋");
entityMap.put("lang", "\u2329");
entityMap.put("rang", "\u232A");
entityMap.put("loz", "◊");
entityMap.put("spades", "♠");
entityMap.put("clubs", "♣");
entityMap.put("hearts", "♥");
entityMap.put("diams", "♦");
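// The XML predefined entities must be declared with character references;
// amp and lt need the double-escaped form.
entityMap.put("quot", "&#34;");
entityMap.put("amp", "&#38;#38;");
entityMap.put("lt", "&#38;#60;");
entityMap.put("gt", "&#62;");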
entityMap.put("OElig", "Œ");
entityMap.put("oelig", "œ");
entityMap.put("Scaron", "Š");
entityMap.put("scaron", "š");
entityMap.put("Yuml", "Ÿ");
entityMap.put("circ", "ˆ");
entityMap.put("tilde", "˜");
entityMap.put("ensp", "\u2002");
entityMap.put("emsp", "\u2003");
entityMap.put("thinsp", "\u2009");
entityMap.put("zwnj", "\u200C");
entityMap.put("zwj", "\u200D");
entityMap.put("lrm", "\u200E");
entityMap.put("rlm", "\u200F");
entityMap.put("ndash", "–");
entityMap.put("mdash", "—");
entityMap.put("lsquo", "‘");
entityMap.put("rsquo", "’");
entityMap.put("sbquo", "‚");
entityMap.put("ldquo", "“");
entityMap.put("rdquo", "”");
entityMap.put("bdquo", "„");
entityMap.put("dagger", "†");
entityMap.put("Dagger", "‡");
entityMap.put("permil", "‰");
entityMap.put("lsaquo", "‹");
entityMap.put("rsaquo", "›");
}
I then simply prepend that data to the document as a DOCTYPE:
StringBuffer buffer = new StringBuffer();
buffer.append("<?xml version=\"1.0\"?> " + " <!DOCTYPE some_name [ ");
Iterator<Entry<String, String>> iterator = entityMap.entrySet().iterator();
while (iterator.hasNext()) {
    Entry<String, String> entry = iterator.next();
    buffer.append("<!ENTITY " + entry.getKey() + " \"" + entry.getValue() + "\">");
}
buffer.append(" ]>");
convertedData = buffer.toString() + convertedData;
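To illustrate why the prepended DOCTYPE helps, here is a minimal sketch that uses only the JDK's built-in DOM parser. The class and method names (DoctypeEntitySketch, textOf) are hypothetical, and only nbsp is declared rather than the full map above; the assumption is that the parser is left at its default settings, which accept an internal DTD subset.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DoctypeEntitySketch {

    /**
     * Parses an XHTML fragment after prepending a DOCTYPE whose internal
     * subset declares the nbsp entity, and returns the root's text content.
     * rootName should match the fragment's root element.
     */
    static String textOf(String rootName, String fragment) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE " + rootName + " [<!ENTITY nbsp \"&#160;\">]>"
                + fragment;
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The parser has already expanded &nbsp; to U+00A0 at this point.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("fragment is not well-formed XML", e);
        }
    }
}
```

Without the entity declaration the same fragment would be rejected, because nbsp is not one of the entities predefined by XML.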
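Answer 2 (score: 3)
If you already have Commons Lang on the classpath, take a look at the arrays in EntityArrays; they contain the mappings for all entities.
To get the numeric value, just call codePointAt(0) on the first element (the Unicode character).
Now you need a regex-based loop that searches for &[^;]+;. This is fairly safe, because & is a special character that has to be escaped anyway. If you need to be 100% sure, look for CDATA sections and skip them.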
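The loop described above could look roughly like the following sketch. It is not the answerer's code: the class name NamedEntitySketch and its three-entry table are hypothetical stand-ins for a map built from EntityArrays, and the pattern is slightly stricter than &[^;]+; (it also excludes whitespace and a second &) so that a bare ampersand earlier in the text cannot swallow a later entity. CDATA sections are not handled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedEntitySketch {

    // Hypothetical, deliberately tiny table: entity name -> Unicode character.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("eacute", "\u00E9");
        NAMED.put("hellip", "\u2026");
    }

    // An ampersand, a name without ';', '&' or whitespace, then ';'.
    private static final Pattern ENTITY = Pattern.compile("&([^;&\\s]+);");

    /** Replaces known named entities with numeric character references. */
    static String toNumeric(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String ch = NAMED.get(m.group(1));
            // Unknown names (including amp, lt, gt, quot) are left untouched.
            String replacement = (ch == null)
                    ? m.group()
                    : "&#" + ch.codePointAt(0) + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Leaving amp, lt, gt and quot alone is intentional in this sketch, since XML predefines them and they are already valid in the output.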
Answer 3 (score: 3)
This is what I ended up using. It seems to work fine:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringEscapeUtils;

/**
 * Some helper methods for XHTML => HTML manipulation
 *
 * @author David Maple<d@davemaple.com>
 *
 */
public class XhtmlUtil {
    private static final Pattern ENTITY_PATTERN = Pattern.compile("(&[^\\s]+?;)");

    /**
     * Don't instantiate me
     */
    private XhtmlUtil() { }

    /**
     * Convert a String of HTML with named HTML entities to the
     * same String with entities converted to numbered XML entities
     *
     * @param html
     * @return xhtml
     */
    public static String htmlToXmlEntities(String html) {
        StringBuffer stringBuffer = new StringBuffer();
        Matcher matcher = ENTITY_PATTERN.matcher(html);
        while (matcher.find()) {
            String replacement = htmlEntityToXmlEntity(matcher.group(1));
            // Append the text before the match, then the replacement verbatim,
            // so '$' and '\' in the replacement are not interpreted.
            matcher.appendReplacement(stringBuffer, "");
            stringBuffer.append(replacement);
        }
        matcher.appendTail(stringBuffer);
        return stringBuffer.toString();
    }

    /**
     * Replace an HTML entity with an XML entity
     *
     * @param htmlEntity
     * @return xmlEntity
     */
    private static String htmlEntityToXmlEntity(String htmlEntity) {
        return StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeHtml(htmlEntity));
    }
}
And the corresponding tests:
import org.junit.Assert;
import org.junit.Test;

public class XhtmlUtilTest {
    @Test
    public void testEvalXmlEscape() {
        String input = "link&nbsp;1 &nbsp;|&nbsp; link2 &amp; &amp; dkdk;";
        String expected = "link&#160;1 &#160;|&#160; link2 &amp; &amp; dkdk;";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }

    @Test
    public void testEvalXmlEscape2() {
        String input = "<a href=\"test.html\">link&nbsp;</a>";
        String expected = "<a href=\"test.html\">link&#160;</a>";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }

    @Test
    public void testEvalXmlEscapeMultiLine() {
        String input = "<a href=\"test.html\">link&nbsp;</a>\n<a href=\"test.html\">link&nbsp;</a>";
        String expected = "<a href=\"test.html\">link&#160;</a>\n<a href=\"test.html\">link&#160;</a>";
        String actual = XhtmlUtil.htmlToXmlEntities(input);
        System.out.println(actual);
        Assert.assertEquals(expected, actual);
    }
}
Answer 4 (score: 1)
Here is another solution that I use:
/**
 * Converts the specified string which is in ASCII format to legal XML
 * format. Inspired by XMLWriter by http://www.megginson.com/Software/
 */
public static String convertAsciiToXml(String string) {
    if (string == null || string.equals(""))
        return "";
    StringBuffer sbuf = new StringBuffer();
    char ch[] = string.toCharArray();
    for (int i = 0; i < ch.length; i++) {
        switch (ch[i]) {
        case '&':
            sbuf.append("&amp;");
            break;
        case '<':
            sbuf.append("&lt;");
            break;
        case '>':
            sbuf.append("&gt;");
            break;
        case '\"':
            sbuf.append("&quot;");
            break;
        default:
            if (ch[i] > '\u007f') {
                // Non-ASCII characters become numeric character references.
                sbuf.append("&#");
                sbuf.append(Integer.toString(ch[i]));
                sbuf.append(';');
            }
            else if (ch[i] == '\t') {
                // Tabs are expanded to four spaces.
                sbuf.append(' ');
                sbuf.append(' ');
                sbuf.append(' ');
                sbuf.append(' ');
            }
            else if ((int) ch[i] >= 32 || (ch[i] == '\n' || ch[i] == '\r')) {
                // Other control characters are dropped.
                sbuf.append(ch[i]);
            }
        }
    }
    return sbuf.toString();
}
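One caveat with a char-by-char loop like the one above is that a character outside the Basic Multilingual Plane is stored as two surrogate chars and would be written as two separate numeric references. The following is a hedged variant, not part of the original answer, that iterates by code point instead. The class and method names are hypothetical, and it deliberately omits the tab expansion and control-character filtering of the original to stay short.

```java
final class NumericEscapeSketch {

    /** Escapes XML markup characters and writes non-ASCII code points as &#N;. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // reads a full surrogate pair if present
            i += Character.charCount(cp);      // advance by 1 or 2 chars
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp > 0x7F) {
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }
}
```

Like the method above, this escapes markup as well, so it is only suitable for text content and not for a string that still contains tags to be preserved.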