Question

I created a text file on Windows, where I believe the default encoding is ANSI. The file contains the following:

This is\u2019 a sample text file \u2014and it can ....

I saved the file with the Windows default encoding, although other encodings such as UTF-8 and UTF-16 were also available.

Now I want to write a simple Java function that takes an input string and replaces every Unicode escape with the corresponding ASCII character.

For example, \u2019 should be replaced with "'", \u2014 should be replaced with "-", and so on.

Observation: when I create a string literal like this

  String s = "This is\u2019 a sample text file \u2014and it can ....";

my code works fine, but when I read the same text from the file, it does not work. I know that Java Strings use UTF-16 internally.

Here is the code I use to read the input file.

FileReader fileReader = new FileReader(new File("C:\\input.txt"));
BufferedReader bufferedReader = new BufferedReader(fileReader);
String record = bufferedReader.readLine();

I also tried using an InputStream and setting the Charset to UTF-8, but the result is still the same.

Replacement code:

public static String removeUTFCharacters(String data) {
    for (Entry<String, String> entry : utfChars.entrySet()) {
        data = data.replaceAll(entry.getKey(), entry.getValue());
    }
    return data;
}

The map:

    utfChars.put("\u2019","'");
    utfChars.put("\u2018","'");
    utfChars.put("\u201c","\"");
    utfChars.put("\u201d","\"");
    utfChars.put("\u2013","-");
    utfChars.put("\u2014","-");
    utfChars.put("\u2212","-");
    utfChars.put("\u2022","*");

Can anyone help me understand the concept behind this problem and how to solve it?

Answer 1

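Match the escape sequence \uXXXX with a regular expression. Then use a replacement loop to replace each occurrence of that escape sequence with the decoded value of the character.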

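Because Java string literals use \ to introduce escapes, the sequence \\ is used to represent a single \. In addition, Java regular expression syntax treats the sequence \u specially (it denotes a Unicode escape). So the \ must be escaped once more, with an additional \\. Thus, in the pattern, "\\\\u" really means "match \u in the input."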
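To make the layers of escaping concrete, here is a small sketch. The class name EscapeLayers and its method names are made up for illustration. It relies on the fact that the Java compiler decodes \uXXXX in source code, so a string literal containing \u2019 holds one character, whereas the same text read from a file stays as six separate characters. That difference is likely why the literal works with the replacement map and the file contents do not.

```java
import java.util.regex.Pattern;

public class EscapeLayers {
    // In source code the compiler decodes the escape: this literal holds ONE character.
    public static String fromLiteral() {
        return "\u2019";
    }

    // Text read from the file holds SIX characters: backslash, 'u', '2', '0', '1', '9'.
    // Doubling the backslash in the literal reproduces that.
    public static String asReadFromFile() {
        return "\\u2019";
    }

    // Four backslashes in source become two in the regex,
    // which the regex engine reads as one literal backslash to match.
    public static boolean containsEscape(String s) {
        return Pattern.compile("\\\\u").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(fromLiteral().length());            // 1
        System.out.println(asReadFromFile().length());         // 6
        System.out.println(containsEscape(fromLiteral()));     // false
        System.out.println(containsEscape(asReadFromFile()));  // true
    }
}
```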

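To match the numeric part (four hexadecimal characters), use the pattern \p{XDigit}, escaping the \ with an extra \. We want to extract the hex digits easily as a group, so that part is enclosed in parentheses to create a capturing group. Thus, "(\\p{XDigit}{4})" in the pattern means "match 4 hexadecimal characters in the input, and capture them."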

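In a loop, we search for occurrences of the pattern, replacing each occurrence with the decoded character value. The character value is decoded by parsing the hexadecimal number: Integer.parseInt(m.group(1), 16) means "parse the group captured in the previous match as a base-16 number." A replacement string is then created from that character. The replacement string must be escaped or quoted in case the character is $ (or \), which has a special meaning in replacement text.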

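import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Doubled backslashes simulate the literal text as it is read from the file
String data = "This is\\u2019 a sample text file \\u2014and it can ...";
Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher m = p.matcher(data);
StringBuffer buf = new StringBuffer(data.length());
while (m.find()) {
  // Parse the four captured hex digits and convert them to a char
  String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
  // quoteReplacement keeps a decoded '$' or backslash from being treated specially
  m.appendReplacement(buf, Matcher.quoteReplacement(ch));
}
m.appendTail(buf);
System.out.println(buf);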
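The loop above only decodes the escapes; the question also asks for ASCII stand-ins. One way to combine the two steps is sketched below, reusing the mapping from the question. The class name AsciiNormalizer and the method names decodeEscapes and toAscii are hypothetical, and the second step uses String.replace instead of replaceAll so that neither keys nor values are interpreted as regex syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsciiNormalizer {
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // Same mapping as in the question. In source code these keys are single characters.
    private static final Map<String, String> UTF_CHARS = new LinkedHashMap<>();
    static {
        UTF_CHARS.put("\u2019", "'");
        UTF_CHARS.put("\u2018", "'");
        UTF_CHARS.put("\u201c", "\"");
        UTF_CHARS.put("\u201d", "\"");
        UTF_CHARS.put("\u2013", "-");
        UTF_CHARS.put("\u2014", "-");
        UTF_CHARS.put("\u2212", "-");
        UTF_CHARS.put("\u2022", "*");
    }

    // Step 1: turn each six-character escape sequence into the character it names.
    public static String decodeEscapes(String data) {
        Matcher m = ESCAPE.matcher(data);
        StringBuffer buf = new StringBuffer(data.length());
        while (m.find()) {
            String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            m.appendReplacement(buf, Matcher.quoteReplacement(ch));
        }
        m.appendTail(buf);
        return buf.toString();
    }

    // Step 2: swap the decoded characters for their ASCII stand-ins.
    // String.replace treats both arguments literally, unlike replaceAll.
    public static String toAscii(String data) {
        String result = decodeEscapes(data);
        for (Map.Entry<String, String> e : UTF_CHARS.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Doubled backslashes simulate the text as it arrives from the file.
        String fromFile = "This is\\u2019 a sample text file \\u2014and it can ....";
        System.out.println(toAscii(fromFile));
        // prints: This is' a sample text file -and it can ....
    }
}
```

Characters that are not in the map are still decoded but left as they are, so the result is only pure ASCII if the map covers every escape that occurs in the input.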

Answer 2

If you can use an additional library, you can use Apache Commons Text: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html

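import org.apache.commons.text.StringEscapeUtils;

// The doubled backslash keeps the escape as literal text, as it would arrive from a file
String dirtyString = "Colocaci\\u00F3n";
String cleanString = StringEscapeUtils.unescapeJava(dirtyString);
// cleanString is now "Colocación"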
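If adding a dependency is not an option, roughly the same decoding can be done with the JDK alone. The following is a minimal sketch, assuming Java 9 or later for the Matcher.replaceAll overload that takes a function. The class name JdkUnescape and the method name unescapeUnicode are hypothetical. Unlike unescapeJava, it handles only \uXXXX sequences and leaves other escapes such as \n or \t untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JdkUnescape {
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    public static String unescapeUnicode(String input) {
        return ESCAPE.matcher(input).replaceAll(match -> {
            char decoded = (char) Integer.parseInt(match.group(1), 16);
            // Quote the result so that a decoded '$' or backslash is taken literally.
            return Matcher.quoteReplacement(String.valueOf(decoded));
        });
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as six literal characters.
        String dirty = "Colocaci\\u00F3n";
        String clean = unescapeUnicode(dirty);
        System.out.println(dirty.length());  // 15
        System.out.println(clean.length());  // 10
    }
}
```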

Replace Unicode with ASCII
