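I'm reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize Unicode characters and gives incorrect results on a number of symbols (it tokenizes "girl's" as "girl" and "'s", but if the apostrophe is a Unicode quote it is treated as a single token).
For example, the source sentence may contain the Unicode directional quote U+2018 (‘), and I'd like to convert that to U+0027 ('). Eventually I will strip the remaining Unicode.
I understand that I'm losing information, and I know I could write regular expressions to convert each of these symbols, but I'm asking whether there is code I can reuse to convert them.
This is what I could do, but I'm sure I'd make errors, miss things, and so on: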
// double quotation (")
replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
// single quotation (')
replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
Replacement is a custom class that I later iterate over to apply the replacements:
for (Replacement replacement : replacements) {
text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
}
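For readers who want to run the snippets above, here is a minimal sketch of what such a Replacement class might look like. The field names `pattern` and `replacement` follow the loop above; the `QuoteReplacer` wrapper and its `apply` method are hypothetical names added only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical holder for one regex and its ASCII substitute.
final class Replacement {
    final Pattern pattern;
    final String replacement;

    Replacement(Pattern pattern, String replacement) {
        this.pattern = pattern;
        this.replacement = replacement;
    }
}

// Hypothetical wrapper that registers the two quote rules and applies them in order.
final class QuoteReplacer {
    private static final List<Replacement> REPLACEMENTS = new ArrayList<>();

    static {
        // double quotation (")
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
        // single quotation (')
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
    }

    static String apply(String text) {
        for (Replacement replacement : REPLACEMENTS) {
            text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
        }
        return text;
    }
}
```

Calling `QuoteReplacer.apply` on a string with curly quotes should return the same string with straight ASCII quotes in their place.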
As you can see, I'd have to find:
Answer 0 (score: 15)
I found a pretty extensive table that maps Unicode punctuation to their closest ASCII equivalents.
Here's more info about that: Map Symbols & Punctuation to ASCII.
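Answer 1 (score: 6)
Each Unicode character is assigned a category. Two separate categories exist for quotes: Pi (Punctuation, Initial quote) and Pf (Punctuation, Final quote).
With these lists you should be able to handle all quotes appropriately if you want to write your regex by hand.
In Java, Character.getType gives you the category of a character, for example FINAL_QUOTE_PUNCTUATION.
Now you can get the category of each (punctuation) character and replace it with an appropriate counterpart in ASCII.
You can use the other punctuation categories accordingly. 'Punctuation, Other' contains some characters, for example PRIME ′, that you may also want to replace with an apostrophe.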
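The category-based approach could be sketched roughly as follows. The class name `QuoteCategoryNormalizer`, the method `normalizeQuotes`, and the `SINGLE_QUOTES` list are assumptions made for this example. The general category does not say whether a quote is single or double, so the sketch keeps a short hand-written list of single-style quotes and treats every other Pi/Pf character as a double quote.

```java
final class QuoteCategoryNormalizer {
    // Assumption: Pi/Pf characters that should map to an apostrophe.
    // Everything else in those two categories is mapped to a double quote.
    private static final String SINGLE_QUOTES = "\u2018\u2019\u201B\u2039\u203A";

    static String normalizeQuotes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            boolean isQuote = type == Character.INITIAL_QUOTE_PUNCTUATION
                    || type == Character.FINAL_QUOTE_PUNCTUATION;
            if (isQuote) {
                // Pick the ASCII counterpart based on the hand-written list.
                out.append(SINGLE_QUOTES.indexOf(cp) >= 0 ? '\'' : '"');
            } else {
                // Not a Pi/Pf character: copy it through unchanged.
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

Note that low quotes such as U+201A and U+201E are classified as opening punctuation (Ps) rather than Pi or Pf, so a sketch like this leaves them untouched; they would need their own rule.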
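Answer 2 (score: 6)
I followed @marek-stoj's link and created a Scala application that cleans Unicode out of strings while preserving the string length. It removes diacritics (accents) and uses the map suggested by @marek-stoj to convert non-ASCII Unicode characters to their ASCII approximations.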
import java.text.Normalizer
object Asciifier {
def apply(string: String) = {
var cleaned = string
for ((unicode, ascii) <- substitutions) {
// literal replacement; replaceAll would reject the lone backslash used as a replacement below
cleaned = cleaned.replace(unicode, ascii)
}
// convert diacritics to a two-character form (NFD)
// http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
cleaned = Normalizer.normalize(cleaned, Normalizer.Form.NFD)
// remove all characters that combine with the previous character
// to form a diacritic. Also remove control characters.
// http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
cleaned = cleaned.replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{Cntrl}]", "")
// size must not change
require(cleaned.size == string.size)
cleaned
}
val substitutions = Set(
(0x00AB, '"'),
(0x00AD, '-'),
(0x00B4, '\''),
(0x00BB, '"'),
(0x00F7, '/'),
(0x01C0, '|'),
(0x01C3, '!'),
(0x02B9, '\''),
(0x02BA, '"'),
(0x02BC, '\''),
(0x02C4, '^'),
(0x02C6, '^'),
(0x02C8, '\''),
(0x02CB, '`'),
(0x02CD, '_'),
(0x02DC, '~'),
(0x0300, '`'),
(0x0301, '\''),
(0x0302, '^'),
(0x0303, '~'),
(0x030B, '"'),
(0x030E, '"'),
(0x0331, '_'),
(0x0332, '_'),
(0x0338, '/'),
(0x0589, ':'),
(0x05C0, '|'),
(0x05C3, ':'),
(0x066A, '%'),
(0x066D, '*'),
(0x200B, ' '),
(0x2010, '-'),
(0x2011, '-'),
(0x2012, '-'),
(0x2013, '-'),
(0x2014, '-'),
(0x2015, '-'),
(0x2016, '|'),
(0x2017, '_'),
(0x2018, '\''),
(0x2019, '\''),
(0x201A, ','),
(0x201B, '\''),
(0x201C, '"'),
(0x201D, '"'),
(0x201E, '"'),
(0x201F, '"'),
(0x2032, '\''),
(0x2033, '"'),
(0x2034, '\''),
(0x2035, '`'),
(0x2036, '"'),
(0x2037, '\''),
(0x2038, '^'),
(0x2039, '<'),
(0x203A, '>'),
(0x203D, '?'),
(0x2044, '/'),
(0x204E, '*'),
(0x2052, '%'),
(0x2053, '~'),
(0x2060, ' '),
(0x20E5, '\\'),
(0x2212, '-'),
(0x2215, '/'),
(0x2216, '\\'),
(0x2217, '*'),
(0x2223, '|'),
(0x2236, ':'),
(0x223C, '~'),
(0x2264, '<'),
(0x2265, '>'),
(0x2266, '<'),
(0x2267, '>'),
(0x2303, '^'),
(0x2329, '<'),
(0x232A, '>'),
(0x266F, '#'),
(0x2731, '*'),
(0x2758, '|'),
(0x2762, '!'),
(0x27E6, '['),
(0x27E8, '<'),
(0x27E9, '>'),
(0x2983, '{'),
(0x2984, '}'),
(0x3003, '"'),
(0x3008, '<'),
(0x3009, '>'),
(0x301B, ']'),
(0x301C, '~'),
(0x301D, '"'),
(0x301E, '"'),
(0xFEFF, ' ')).map { case (unicode, ascii) => (unicode.toChar.toString, ascii.toString) }
}
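Since the question is about Java, the same idea might look like this there. This is only a sketch: the class name `AsciiApproximator`, the method `toAscii`, and the deliberately tiny substitution table are assumptions for illustration, not a port of the full table above.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class AsciiApproximator {
    // Hypothetical, abbreviated table; extend it with the mappings you need.
    private static final Map<Character, Character> SUBSTITUTIONS = new HashMap<>();

    static {
        SUBSTITUTIONS.put('\u2018', '\'');
        SUBSTITUTIONS.put('\u2019', '\'');
        SUBSTITUTIONS.put('\u201C', '"');
        SUBSTITUTIONS.put('\u201D', '"');
        SUBSTITUTIONS.put('\u2013', '-');
        SUBSTITUTIONS.put('\u2014', '-');
    }

    static String toAscii(String input) {
        // Step 1: one-for-one punctuation substitutions (length-preserving).
        StringBuilder mapped = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            char substitute = SUBSTITUTIONS.getOrDefault(c, c);
            mapped.append(substitute);
        }
        // Step 2: decompose accented letters into base letter + combining mark (NFD).
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
        // Step 3: drop the combining marks, keeping only the base letters.
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}
```

Unlike the Scala version, this sketch does not remove control characters and does not assert that the length is unchanged.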
Answer 3 (score: 3)
While this does not exactly answer your question, you can convert your Unicode text to US-ASCII, replacing non-ASCII characters with '?' symbols.
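import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

String input = "aáeéiíoóuú"; // 10 chars.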
Charset ch = Charset.forName("US-ASCII");
CharsetEncoder enc = ch.newEncoder();
enc.onUnmappableCharacter(CodingErrorAction.REPLACE);
enc.replaceWith(new byte[]{'?'});
ByteBuffer out = null;
try {
out = enc.encode(CharBuffer.wrap(input));
} catch (CharacterCodingException e) {
/* ignored, shouldn't happen */
}
String outStr = ch.decode(out).toString();
// Prints "a?e?i?o?u?"
System.out.println(outStr);
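Answer 4 (score: 2)
What I did for similar substitutions was create a Map (usually a HashMap) with the Unicode characters as keys and their substitutes as values.
Pseudo-Java; the for loop depends on what sort of character container you use as the parameter to the method that does this, e.g. String, CharSequence, etc.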
// Assumes inputString is a String and xlateMap is a Map<Character, Character>.
StringBuilder output = new StringBuilder();
for (char c : inputString.toCharArray())
{
Character replacement = xlateMap.get( c );
output.append( replacement != null ? replacement : c );
}
return output.toString();
Anything in the map gets replaced; anything not in the map is left unchanged and copied to the output.
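A self-contained variant of this map lookup might look like the sketch below. It is keyed by code point and uses String values so that one character can expand to several, such as an ellipsis becoming three dots. The class name `MapTransliterator`, the method `translate`, and the table contents are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class MapTransliterator {
    // Hypothetical table: Unicode code point -> ASCII substitute.
    private final Map<Integer, String> table = new HashMap<>();

    MapTransliterator() {
        table.put(0x2018, "'");
        table.put(0x2019, "'");
        table.put(0x201C, "\"");
        table.put(0x201D, "\"");
        table.put(0x2014, "-");
        table.put(0x2026, "...");
    }

    String translate(CharSequence input) {
        StringBuilder output = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String replacement = table.get(cp);
            if (replacement != null) {
                // In the map: emit the substitute.
                output.append(replacement);
            } else {
                // Not in the map: copy the code point unchanged.
                output.appendCodePoint(cp);
            }
        });
        return output.toString();
    }
}
```

Iterating over code points rather than `char` values keeps characters outside the Basic Multilingual Plane intact when they are copied through.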
Answer 5 (score: 1)
Here is a Python package that does this well. It is based on the Perl module Text::Unidecode. I think it could be ported to Java.
http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/
Answer 6 (score: 0)
String lstring = "my string containing all different symbols";
// String.replace does literal (non-regex) replacement of every occurrence
lstring = lstring.replace("\u2013", "-")
.replace("\u2014", "-")
.replace("\u2015", "-")
.replace("\u2017", "_")
.replace("\u2018", "'")
.replace("\u2019", "'")
.replace("\u201a", ",")
.replace("\u201b", "'")
.replace("\u201c", "\"")
.replace("\u201d", "\"")
.replace("\u201e", "\"")
.replace("\u2026", "...")
.replace("\u2032", "'")
.replace("\u2033", "\"");