Sorry for the double post, but my earlier post was about the Flex side:
Flex TextArea - copy/paste from Word - Invalid unicode characters on xml parsing
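Now I am posting this for the Java side.
The problem:
We have an email feature (part of our application) in which we build an XML string and put it on a queue. Another application picks it up, parses the XML, and sends out the email.
When the email text (<BODY>....</BODY>) is copy/pasted from Word, we get an XML parser exception: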
Invalid character in attribute value BODY (Unicode: 0x1A)
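The failure can be reproduced outside the application. Below is a minimal sketch, assuming the JDK's built-in SAX parser; the class name InvalidCharRepro, the helper parseError, and the shortened sample documents are made up for illustration and are not part of the real application.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class InvalidCharRepro {

    /** Returns the parser's complaint, or null when the document parses cleanly. */
    static String parseError(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, so an illegal character lands here.
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        char sub = 0x1A; // the control character named in the exception
        String clean = "<EMAILS><EMAIL BODY=\"two systems.\t\tIt still is the same.\"/></EMAILS>";
        String dirty = "<EMAILS><EMAIL BODY=\"two systems." + sub + "It still is the same.\"/></EMAILS>";
        System.out.println("clean: " + parseError(clean));
        System.out.println("dirty: " + parseError(dirty));
    }
}
```

Real tab characters parse without complaint, so the tabs themselves are not the problem; only the 0x1A character in the attribute value is rejected.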
Since we also use Java, I tried to remove the invalid characters from the String with the following:
body = body.replaceAll("‘", "");
body = body.replaceAll("’", "");
// Remove invalid characters
public String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.
    if (in == null || ("".equals(in))) {
        return ""; // vacancy test.
    }
    for (int i = 0; i < in.length(); i++) {
        // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        current = in.charAt(i);
        if ((current == 0x9)
                || (current == 0xA)
                || (current == 0xD)
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                || ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.append(current);
        }
    }
    return out.toString();
}
// Strip again
private String stripNonValidXMLCharacter(String in) {
    if (in == null || ("".equals(in))) {
        return null;
    }
    StringBuffer out = new StringBuffer(in);
    for (int i = 0; i < out.length(); i++) {
        if (out.charAt(i) == 0x1a) {
            out.setCharAt(i, '-');
        }
    }
    return out.toString();
}
// Replace special characters (if any)
emailText = emailText.replaceAll("[\\u0000-\\u0008\\u000B\\u000C"
+ "\\u000E-\\u001F"
+ "\\uD800-\\uDFFF\\uFFFE\\uFFFF\\u00C5\\u00D4\\u00EC"
+ "\\u00A8\\u00F4\\u00B4\\u00CC\\u2211]", " ");
emailText = emailText.replaceAll("[\\x00-\\x1F]", "");
emailText = emailText.replaceAll(
"[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f]", "");
emailText = emailText.replaceAll("\\p{C}", "");
But none of them work. The XML string also starts with:
<?xml version="1.0" encoding="UTF-8"?>
<EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\\SMTPSchema.xsd\">
I think the problem occurs when the Word document contains multiple tabs. For example:
Text......text
<newLine>
<tab><tab><tab> text...text
<newLine>
The resulting XML string is:
<?xml version="1.0" encoding="UTF-8"?> <EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\SMTPSchema.xsd"> <EMAIL SOURCE="t@t.com" DEST="t@t.com" CC="" BCC="t@t.com" SUBJECT="test 61" BODY="As such there was no mechanism constructed to migrate the enrollment user base to Data Collection or to keep security attributes for common users in sync between the two systems. The purpose of this document is to outline two strategies for bring the user base between the two applications into sync.? It still is the same. ** Please note: This e-mail message was sent from a notification-only address that cannot accept incoming e-mail. Please do not reply to this message."/> </EMAILS>
Note that the "?" is where the Word document had multiple tabs. I hope my question is clear and that someone can help resolve it.
Thanks
Answer 0: (score: 0)
Have you tried using a library such as TagSoup / JSoup / JTidy to clean up the XML?
Answer 1: (score: 0)
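The invalid (hidden) characters come from the UI (the Flex TextArea), so they have to be handled in the UI so that they are never passed on to Java. I handled it by using a changingHandler on the Flex TextArea to restrict the characters and remove them.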
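If a server-side safeguard is wanted in addition to the UI fix, one option is to filter by code point instead of by char. This is only a sketch under assumptions: the class name XmlCharFilter and its method names are hypothetical, and the ranges are the same XML 1.0 character ranges that the question's stripNonValidXMLCharacters already lists. A char-based loop can never see a value above 0xFFFF, and it drops both halves of a valid surrogate pair, which is why this variant walks the string with codePointAt.

```java
final class XmlCharFilter {

    /** True if the code point falls in the XML 1.0 character ranges listed in the question. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point outside those ranges removed. */
    static String stripInvalidXmlChars(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as itself and is rejected.
            int cp = in.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char sub = 0x1A;
        String body = "two systems." + sub + " It still\tis the same.";
        System.out.println(stripInvalidXmlChars(body));
    }
}
```

The filter would have to run on the body text before the XML string is built and queued. It only removes characters that XML 1.0 forbids; characters such as &, < and " in an attribute value still need to be escaped separately.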