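I have some strings containing a wide variety of characters that need to be written to Google BigQuery, which requires strictly valid UTF-8 strings. When I try to write strings that contain assorted emoji input, I get this error: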
java.lang.IllegalArgumentException: Unpaired surrogate at index 3373
at org.apache.beam.sdk.repackaged.com.google.common.base.Utf8.encodedLengthGeneral(Utf8.java:93)
at org.apache.beam.sdk.repackaged.com.google.common.base.Utf8.encodedLength(Utf8.java:67)
at org.apache.beam.sdk.coders.StringUtf8Coder.getEncodedElementByteSize(StringUtf8Coder.java:145)
...
I have a workaround for this that simply removes every surrogate character from the string:
private static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char c = query.charAt(i);
        if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
            sb.append(c);
        }
    }
    return sb.toString();
}
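However, this causes a string like
⚔⌨⛳⛏
to be reduced to just the four emoji
⚔⌨⛳⛏
Is there a proper way to convert these characters into UTF-8 without losing them and without leaving unpaired surrogates?
(Apologies, my understanding of character sets is generally not very good.)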
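For reference, a less lossy variant of the workaround above would drop only the surrogates that have no partner and keep valid high/low pairs, so that emoji outside the Basic Multilingual Plane survive. This is a minimal sketch, not a fix for whatever produces the broken strings in the first place; the class name SurrogateCleaner and the method removeUnpairedSurrogates are hypothetical names chosen for illustration.

```java
public class SurrogateCleaner {

    // Hypothetical helper: keeps valid surrogate pairs, drops lone surrogates.
    static String removeUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Valid pair: copy both code units and skip past them.
                sb.append(c).append(s.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                // Lone high or low surrogate: drop it.
                i++;
            } else {
                // Ordinary BMP character.
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One valid pair (U+1F600), one lone high and one lone low surrogate.
        String input = "a\uD83D\uDE00b\uD83Dc\uDE00";
        String cleaned = removeUnpairedSurrogates(input);
        System.out.println(cleaned.equals("a\uD83D\uDE00bc")); // prints "true"
    }
}
```

Dropping a lone surrogate still loses whatever character it was once part of, so this only makes the string encodable; it does not recover the original emoji.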
Answer 0 (score: 2)
I found the problem. We are using org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 to convert HTML entities in strings to their unescaped forms. This seems to mangle some non-Latin characters. For example, passing the string "Italien " through this method converts it into "Italien ?" (the last character gets mangled).
Passing "⚔⌨⛳⛏" through this method converts it to "????????⚔⌨?⛳???":
import org.apache.commons.lang3.StringEscapeUtils;

public class CharacterTest {
    public static void main(String[] args) {
        String good = "⚔⌨⛳⛏";
        String bad = StringEscapeUtils.unescapeHtml4(good);
        System.out.println(good + "->" + bad);
    }
}
⚔⌨⛳⛏->????????⚔⌨?⛳???
Now to find an alternative HTML entity decoder...
Answer 1 (score: 0)
Is there a proper way to convert these characters into UTF8
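Probably. If you just send the string, it will be converted to UTF-8; that is how the Java encoders work.
If it is not, and you are sending binary data, you can do the conversion directly: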
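import java.nio.charset.StandardCharsets;

// Encodes the string as UTF-8. Despite the name, nothing is removed:
// an unpaired surrogate is replaced with '?' in the output bytes.
private static byte[] removeSurrogates(String query) {
    return query.getBytes(StandardCharsets.UTF_8);
}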
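To illustrate the difference between lenient and strict conversion: String.getBytes silently substitutes '?' for an unpaired surrogate, whereas a CharsetEncoder configured to report errors throws instead, which is closer to what the Beam coder in the stack trace does. The following is only a sketch; the class name StrictUtf8 and the helpers encodeStrict and isEncodable are hypothetical names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {

    // Hypothetical helper: encodes to UTF-8 and throws on an unpaired
    // surrogate instead of substituting a replacement byte.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Hypothetical helper: true if the string can be encoded without loss.
    static boolean isEncodable(String s) {
        try {
            encodeStrict(s);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String lone = "x\uD83Dy";          // lone high surrogate in the middle
        String paired = "x\uD83D\uDE00y";  // valid pair (U+1F600)

        // Lenient path: the lone surrogate becomes a single '?' byte.
        byte[] lenient = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(lenient.length);      // prints 3
        System.out.println(isEncodable(lone));   // prints false
        System.out.println(isEncodable(paired)); // prints true
    }
}
```

A check like isEncodable can be used before handing strings to a strict consumer, so that bad input is caught at the point where it is produced rather than deep inside the pipeline.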
Answer 2 (score: 0)