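Is there a way to transliterate characters between charsets in Java? Something like the Unix command (or the similar PHP function):
iconv -f UTF-8 -t ASCII//TRANSLIT < some_doc.txt > new_doc.txt
Preferably it would operate on strings and have nothing to do with files.
I know you can change encodings with the String constructor, but that does not handle transliteration of characters that are missing from the resulting charset.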
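A minimal sketch of that limitation, assuming US-ASCII as the target charset (the class name EncodingOnly is made up for illustration). Re-encoding alone swaps each unmappable character for the charset's replacement byte instead of transliterating it:

```java
import java.nio.charset.StandardCharsets;

final class EncodingOnly {

    // Round-trips the input through US-ASCII bytes; no transliteration happens.
    static String reencodeToAscii(String input) {
        byte[] ascii = input.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // U+00E8 (e with grave) has no ASCII mapping, so it becomes '?'.
        System.out.println(reencodeToAscii("Mich\u00E8le")); // prints "Mich?le"
    }
}
```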
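Answer 0 (score: 11)
I'm not aware of any library that does exactly what iconv purports to do (which doesn't seem to be very well defined). However, you can use "normalization" in Java to do things like removing accents from characters. This process is well defined by the Unicode standard.
I think NFKD (compatibility decomposition) followed by filtering out non-ASCII characters might get you close to what you want. Obviously, this is a lossy process; you can never recover all of the information that was in the original string, so be careful.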
/* Decompose original "accented" string to basic characters. */
String decomposed = Normalizer.normalize(accented, Normalizer.Form.NFKD);
/* Build a new String with only ASCII characters. */
StringBuilder buf = new StringBuilder();
for (int idx = 0; idx < decomposed.length(); ++idx) {
    char ch = decomposed.charAt(idx);
    if (ch < 128)
        buf.append(ch);
}
String filtered = buf.toString();
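With the filtering used here, you might render some strings unreadable. For example, a string of Chinese characters would be filtered away completely, because none of them has an ASCII representation (this is more like iconv's //IGNORE).
Overall, it would be safer to build your own lookup table of valid character substitutions, or at least of combining characters (accents and the like) that are safe to strip. The best solution depends on the range of input characters you expect to handle.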
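One way the lookup-table idea could look, as a rough sketch rather than a complete solution: apply a small hand-written table first, then decompose with NFD and strip only combining marks (Unicode category M). The class name AccentStripper and the table entries are assumptions chosen for illustration, and Map.of requires Java 9 or later.

```java
import java.text.Normalizer;
import java.util.Map;

final class AccentStripper {

    // Hypothetical replacement table; extend it for the input you expect.
    private static final Map<Character, String> TABLE = Map.of(
            '\u00DF', "ss",   // sharp s
            '\u00F8', "o",    // o with stroke
            '\u00D8', "O",    // O with stroke
            '\u00E6', "ae",   // ae ligature
            '\u00C6', "AE",   // AE ligature
            '\u20AC', "EUR"); // euro sign

    static String strip(String input) {
        // Step 1: apply explicit replacements from the table.
        StringBuilder mapped = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            String replacement = TABLE.get(ch);
            if (replacement != null) {
                mapped.append(replacement);
            } else {
                mapped.append(ch);
            }
        }
        // Step 2: decompose and remove only the combining marks.
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Unlike the ASCII filter above, this sketch leaves characters it does not know about (such as Chinese text) in place, so the caller still has to decide what to do with them.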
Answer 1 (score: 4)
Let's start with a slight variation of Ericson's answer and build more //TRANSLIT features on top of it:
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;

public class Translit {
    private static final Charset US_ASCII = Charset.forName("US-ASCII");

    private static String toAscii(final String input) {
        final CharsetEncoder charsetEncoder = US_ASCII.newEncoder();
        final char[] decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD).toCharArray();
        final StringBuilder sb = new StringBuilder(decomposed.length);
        // Walk the decomposed text one code point at a time and keep only what the target charset can encode.
        for (int i = 0; i < decomposed.length; ) {
            final int codePoint = Character.codePointAt(decomposed, i);
            final int charCount = Character.charCount(codePoint);
            if (charsetEncoder.canEncode(CharBuffer.wrap(decomposed, i, charCount))) {
                sb.append(decomposed, i, charCount);
            }
            i += charCount;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        final String a = "Michèleäöüß";
        System.out.println(a + " => " + toAscii(a));
        System.out.println(a.toUpperCase() + " => " + toAscii(a.toUpperCase()));
    }
}
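While this should behave the same for US-ASCII, this solution is easier to adapt to different target encodings. (Because the characters are decomposed first, it will not necessarily yield better results for other encodings.)
The function is safe for supplementary code points (which is a bit of overkill with ASCII as the target, but may reduce headaches if another target encoding is chosen).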
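To make the supplementary code point remark concrete, here is a small sketch of the same iteration pattern in isolation. The class name CodePointWalk is hypothetical; the point is only that a character outside the Basic Multilingual Plane occupies two char values but is visited once.

```java
final class CodePointWalk {

    // Counts code points by advancing Character.charCount(cp) chars per step,
    // the same stepping used in toAscii above.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            i += Character.charCount(codePoint);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1D49C is stored as the surrogate pair D835 DC9C.
        String s = "a\uD835\uDC9Cb";
        System.out.println(s.length() + " chars, " + countCodePoints(s) + " code points");
    }
}
```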
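Also note that a regular Java String is returned; if you need an ASCII byte[], you still have to convert it (but since we made sure there are no offending characters, that conversion is safe).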
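For that final byte[] conversion, one possible sketch uses a CharsetEncoder configured to report errors, so a leftover non-ASCII character fails loudly instead of silently turning into '?'. The class and method names are hypothetical, and wrapping the checked exception in IllegalArgumentException is just one choice.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class AsciiBytes {

    // Encodes text as US-ASCII and rejects anything that cannot be encoded.
    static byte[] toAsciiBytes(String text) {
        try {
            ByteBuffer encoded = StandardCharsets.US_ASCII.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[encoded.remaining()];
            encoded.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not pure ASCII", e);
        }
    }
}
```

Calling it on the result of toAscii should never throw; calling it on untreated input such as "Michèle" would.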
This is how you could extend it to more charsets:
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Created for http://stackoverflow.com/a/22841035/1266906
 */
public class Translit {
    public static final Charset US_ASCII = Charset.forName("US-ASCII");
    public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
    public static final Charset UTF_8 = Charset.forName("UTF-8");

    public static final HashMap<Integer, String> REPLACEMENTS = new ReplacementBuilder().put('„', '"')
                                                                                        .put('“', '"')
                                                                                        .put('”', '"')
                                                                                        .put('″', '"')
                                                                                        .put('€', "EUR")
                                                                                        .put('ß', "ss")
                                                                                        .put('•', '*')
                                                                                        .getMap();

    private static String toCharset(final String input, Charset charset) {
        return toCharset(input, charset, Collections.<Integer, String>emptyMap());
    }

    private static String toCharset(final String input,
                                    Charset charset,
                                    Map<? super Integer, ? extends String> replacements) {
        final CharsetEncoder charsetEncoder = charset.newEncoder();
        return toCharset(input, charsetEncoder, replacements);
    }

    private static String toCharset(String input,
                                    CharsetEncoder charsetEncoder,
                                    Map<? super Integer, ? extends String> replacements) {
        char[] data = input.toCharArray();
        final StringBuilder sb = new StringBuilder(data.length);
        for (int i = 0; i < data.length; ) {
            final int codePoint = Character.codePointAt(data, i);
            final int charCount = Character.charCount(codePoint);
            CharBuffer charBuffer = CharBuffer.wrap(data, i, charCount);
            if (charsetEncoder.canEncode(charBuffer)) {
                sb.append(data, i, charCount);
            } else if (replacements.containsKey(codePoint)) {
                sb.append(toCharset(replacements.get(codePoint), charsetEncoder, replacements));
            } else {
                // Only perform NFKD normalization after ensuring the original character is invalid, as this is an irreversible process
                final char[] decomposed = Normalizer.normalize(charBuffer, Normalizer.Form.NFKD).toCharArray();
                for (int j = 0; j < decomposed.length; ) {
                    int decomposedCodePoint = Character.codePointAt(decomposed, j);
                    int decomposedCharCount = Character.charCount(decomposedCodePoint);
                    if (charsetEncoder.canEncode(CharBuffer.wrap(decomposed, j, decomposedCharCount))) {
                        sb.append(decomposed, j, decomposedCharCount);
                    } else if (replacements.containsKey(decomposedCodePoint)) {
                        sb.append(toCharset(replacements.get(decomposedCodePoint), charsetEncoder, replacements));
                    }
                    j += decomposedCharCount;
                }
            }
            i += charCount;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        final String a = "Michèleäöü߀„“”″•";
        System.out.println(a + " => " + toCharset(a, US_ASCII));
        System.out.println(a + " => " + toCharset(a, ISO_8859_1));
        System.out.println(a + " => " + toCharset(a, UTF_8));
        System.out.println(a + " => " + toCharset(a, US_ASCII, REPLACEMENTS));
        System.out.println(a + " => " + toCharset(a, ISO_8859_1, REPLACEMENTS));
        System.out.println(a + " => " + toCharset(a, UTF_8, REPLACEMENTS));
    }

    public static class MapBuilder<K, V> {
        private final HashMap<K, V> map;

        public MapBuilder() {
            map = new HashMap<K, V>();
        }

        public MapBuilder<K, V> put(K key, V value) {
            map.put(key, value);
            return this;
        }

        public HashMap<K, V> getMap() {
            return map;
        }
    }

    public static class ReplacementBuilder extends MapBuilder<Integer, String> {
        public ReplacementBuilder() {
            super();
        }

        @Override
        public ReplacementBuilder put(Integer input, String replacement) {
            super.put(input, replacement);
            return this;
        }

        public ReplacementBuilder put(Integer input, char replacement) {
            return this.put(input, String.valueOf(replacement));
        }

        public ReplacementBuilder put(char input, String replacement) {
            return this.put((int) input, replacement);
        }

        public ReplacementBuilder put(char input, char replacement) {
            return this.put((int) input, String.valueOf(replacement));
        }
    }
}
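I would strongly recommend building an extensive replacement table, as the simple example already shows how you might otherwise lose desired information such as €. For ASCII this implementation is of course a bit slower, because decomposition is only done on demand and the StringBuilder may now need to grow to hold the replacements.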
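GNU's iconv uses the replacements listed in translit.def to perform the //TRANSLIT conversion. If you want to use that file as a replacement map, you can read it with a method like this:
private static Map<Integer, String> readReplacements() {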
    HashMap<Integer, String> map = new HashMap<>();
    InputStream stream = Translit.class.getResourceAsStream("/translit.def");
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(stream, UTF_8));
    // Each data line is expected to be: hex code point, TAB, replacement, TAB, # comment
    Pattern pattern = Pattern.compile("^([0-9A-Fa-f]+)\t(.?[^\t]*)\t#(.*)$");
    try {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            // Skip empty lines (charAt(0) would throw on them) and comment lines.
            if (!line.isEmpty() && line.charAt(0) != '#') {
                Matcher matcher = pattern.matcher(line);
                if (matcher.find()) {
                    map.put(Integer.valueOf(matcher.group(1), 16), matcher.group(2));
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return map;
}
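To try the parsing logic without the resource file, here is a sketch that reads the same tab-separated layout from any Reader. The sample lines in main are made up to match the pattern above and are not copied from translit.def; the class name TranslitTableParser is hypothetical. It skips empty lines and comment lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TranslitTableParser {

    // Same pattern as above: hex code point, TAB, replacement, TAB, # comment
    private static final Pattern LINE = Pattern.compile("^([0-9A-Fa-f]+)\t(.?[^\t]*)\t#(.*)$");

    static Map<Integer, String> parse(Reader source) {
        Map<Integer, String> map = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty() || line.charAt(0) == '#') {
                    continue; // blank line or comment
                }
                Matcher matcher = LINE.matcher(line);
                if (matcher.find()) {
                    map.put(Integer.valueOf(matcher.group(1), 16), matcher.group(2));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    public static void main(String[] args) {
        // Made-up sample input in the assumed layout.
        String sample = "# sample table\n\n"
                + "20AC\tEUR\t# EURO SIGN\n"
                + "00DF\tss\t# LATIN SMALL LETTER SHARP S\n";
        Map<Integer, String> table = parse(new StringReader(sample));
        System.out.println(table.get(0x20AC)); // prints "EUR"
        System.out.println(table.get(0x00DF)); // prints "ss"
    }
}
```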
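Answer 2 (score: 3)
One solution is to run iconv as an external process. It will certainly offend purists, and it depends on iconv being present on the system, but it works and does exactly what you want: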
public static String utfToAscii(String input) throws IOException {
    Process p = Runtime.getRuntime().exec("iconv -f UTF-8 -t ASCII//TRANSLIT");
    // iconv is told to read UTF-8, so write UTF-8 explicitly instead of relying on the platform default charset.
    BufferedWriter bwo = new BufferedWriter(new OutputStreamWriter(p.getOutputStream(), "UTF-8"));
    BufferedReader bri = new BufferedReader(new InputStreamReader(p.getInputStream(), "US-ASCII"));
    // Note: all input is written before any output is read, so very large inputs may block on a full pipe.
    bwo.write(input, 0, input.length());
    bwo.flush();
    bwo.close();
    String line = null;
    StringBuilder stringBuilder = new StringBuilder();
    String ls = System.getProperty("line.separator");
    while ((line = bri.readLine()) != null) {
        stringBuilder.append(line);
        stringBuilder.append(ls);
    }
    bri.close();
    try {
        p.waitFor();
    } catch (InterruptedException e) {
        // Restore the interrupt flag rather than swallowing the interruption.
        Thread.currentThread().interrupt();
    }
    return stringBuilder.toString();
}