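I'm trying to convert all Windows special characters to their Unicode equivalents. We have a Flex application where users can save rich text, which a Java emailer then sends to their recipients by email. However, we keep running into Word's special characters, which just show up as ? in the email.
So far I've tried: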
private String replaceWordChars(String text_in) {
    String s = text_in;
    // Note: inside a character class, '|' is a literal pipe, not alternation,
    // so the code points are listed without separators.
    // smart single quotes and apostrophe
    s = s.replaceAll("[\\u2018\\u2019\\u201A]", "'");
    // smart double quotes
    s = s.replaceAll("[\\u201C\\u201D\\u201E]", "\"");
    // ellipsis
    s = s.replaceAll("\\u2026", "...");
    // dashes
    s = s.replaceAll("[\\u2013\\u2014]", "-");
    // circumflex
    s = s.replaceAll("\\u02C6", "^");
    // open angle bracket
    s = s.replaceAll("\\u2039", "<");
    // close angle bracket
    s = s.replaceAll("\\u203A", ">");
    // small tilde and non-breaking space
    s = s.replaceAll("[\\u02DC\\u00A0]", " ");
    return s;
}
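That works, but I'd rather not hand-map every Windows-1252 character to its UTF-16 equivalent (assuming that is Java's default character set).
However, our users keep finding more characters from Microsoft Word that Java can't handle. So I searched and searched and found this example: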
private String replaceWordChars(String text_in) {
    String s = text_in;
    try {
        byte[] b = s.getBytes("Cp1252");
        byte[] encoded = new String(b, "Cp1252").getBytes("UTF-16");
        s = new String(encoded, "UTF-16");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return s;
}
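But when I watch the encoding in the Eclipse debugger, nothing changes.
There has to be a simple solution for dealing with Microsoft's lovely encoding in Java.
Any ideas?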
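The two symptoms described in the question can be reproduced in isolation. The sketch below is only an illustration: the class and method names are hypothetical, and the guess that the mail path encodes with ISO-8859-1 is an assumption, not something the question states. It shows that a Java String already holds Unicode, so a round trip through windows-1252 bytes returns an equal String, and that encoding to a charset that lacks the smart quotes substitutes ?.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {
    // Round-trips a String through windows-1252 bytes and back again.
    static String roundTrip(String text) {
        Charset cp1252 = Charset.forName("windows-1252");
        return new String(text.getBytes(cp1252), cp1252);
    }

    // Encodes to ISO-8859-1 and decodes again. Assumption: this stands in
    // for a mail pipeline whose charset cannot represent smart quotes.
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String smartQuotes = "\u2018hi\u2019";
        // The round trip leaves the String unchanged.
        System.out.println(smartQuotes.equals(roundTrip(smartQuotes)));
        // Unmappable characters are replaced with '?' during encoding.
        System.out.println(throughLatin1(smartQuotes));
    }
}
```

If this matches what is happening, the place to look is likely the charset used when the email body is written out, rather than the String itself.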
Answer 0 (score: 4)
You could try java.nio.charset.Charset:
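// requires java.nio.ByteBuffer, java.nio.CharBuffer and java.nio.charset.Charset
final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
// 0x91 is the left single quotation mark (U+2018) in windows-1252
final CharBuffer windowsEncoded = windowsCharset.decode(ByteBuffer.wrap(new byte[] {(byte) 0x91}));
final ByteBuffer utfEncoded = utfCharset.encode(windowsEncoded);
// Read only up to limit(): the backing array may be larger than the encoded data
System.out.println(new String(utfEncoded.array(), 0, utfEncoded.limit(), utfCharset));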
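When the input really is raw windows-1252 bytes, the same decoding can be written with the String constructor that takes a Charset. This is a minimal sketch under that assumption; the class name Cp1252Decode and its methods are hypothetical, and UTF-8 is chosen as the output encoding only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252Decode {
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Interprets the raw bytes as windows-1252 and returns a Java String.
    static String decode(byte[] raw) {
        return new String(raw, WINDOWS_1252);
    }

    // Re-encodes windows-1252 bytes as UTF-8 bytes.
    static byte[] toUtf8(byte[] raw) {
        return decode(raw).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in windows-1252.
        byte[] raw = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        String text = decode(raw);
        System.out.println(text.charAt(0) == '\u201C');
        System.out.println(toUtf8(raw).length);
    }
}
```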
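Answer 1 (score: 2)
Wrap the source stream in an InputStreamReader that uses the source encoding and the destination stream in an OutputStreamWriter that uses the destination encoding, then use a BufferedReader and a BufferedWriter to copy the content line by line. So your code might look like this: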
public void reencode(InputStream source, OutputStream dest,
        String sourceEncoding, String destEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dest, destEncoding));
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.newLine();
    }
    // Push buffered characters through to dest; without this, output can be lost
    writer.flush();
}
Of course, this leaves out the try/catch handling and delegates it to the caller.
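One way a caller might handle the stream lifecycle is try-with-resources, which closes (and therefore flushes) the writer automatically. The sketch below is an illustration, not the answer's code: ReencodeDemo and its byte-array signature are hypothetical, it copies in character chunks instead of line by line so line endings pass through unchanged, and it wraps IOException in UncheckedIOException purely to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReencodeDemo {
    // Decodes input using 'from' and re-encodes the characters using 'to'.
    static byte[] reencode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // In-memory streams should not fail; rethrow unchecked if they do.
            throw new UncheckedIOException(e);
        }
        // The writer has been closed here, so everything has reached 'out'.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 0x85 is the horizontal ellipsis (U+2026) in windows-1252.
        byte[] utf8 = reencode(new byte[] {(byte) 0x85},
                Charset.forName("windows-1252"), StandardCharsets.UTF_8);
        System.out.println(utf8.length);
    }
}
```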
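If you just want the content as a String, you can replace the writer with a StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump the characters: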
public String decode(InputStream source, String sourceEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    StringWriter writer = new StringWriter();
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.write('\n'); // Java newline should be fine, test this just in case
    }
    return writer.toString();
}
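On the newline question raised in that comment: readLine strips the line terminator, so a line-based copy turns \r\n into \n and appends a newline after the last line. The sketch below compares that with a chunked copy that keeps the text exactly as decoded. The class DecodeDemo and both method names are hypothetical, and IOException is wrapped in UncheckedIOException only to keep the example compact.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class DecodeDemo {
    // Line-based decode, mirroring the approach above: terminators become '\n'.
    static String decodeByLine(InputStream source, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(source, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Chunked decode: keeps every character, including the original line endings.
    static String decodeExact(InputStream source, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(source, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] raw = {'a', '\r', '\n', 'b'};
        System.out.println(decodeByLine(new ByteArrayInputStream(raw), cp1252).length());
        System.out.println(decodeExact(new ByteArrayInputStream(raw), cp1252).length());
    }
}
```

Which variant fits depends on whether the recipient cares about the exact line endings; for email bodies the normalized form may be acceptable.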
Answer 2 (score: 1)
So far, this seems to work for everything I've tested:
private String replaceWordChars(String text_in) {
    final Charset windowsCharset = Charset.forName("windows-1252");
    final Charset utfCharset = Charset.forName("UTF-16");
    // Note: getBytes() with no argument uses the platform default charset
    byte[] incomingBytes = text_in.getBytes();
    final CharBuffer windowsEncoded = windowsCharset.decode(ByteBuffer.wrap(incomingBytes));
    final ByteBuffer utfEncoded = utfCharset.encode(windowsEncoded);
    // Decode with the same charset used to encode, and only up to limit()
    return new String(utfEncoded.array(), 0, utfEncoded.limit(), utfCharset);
}