Question

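I have some strings containing a wide variety of characters that need to be written to Google BigQuery, which requires strict UTF-8 strings. When I try to write strings that contain a wide variety of emoji input, I get this error: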

java.lang.IllegalArgumentException: Unpaired surrogate at index 3373
    at org.apache.beam.sdk.repackaged.com.google.common.base.Utf8.encodedLengthGeneral(Utf8.java:93)
    at org.apache.beam.sdk.repackaged.com.google.common.base.Utf8.encodedLength(Utf8.java:67)
    at org.apache.beam.sdk.coders.StringUtf8Coder.getEncodedElementByteSize(StringUtf8Coder.java:145)
...

I have a workaround for this, which is to simply remove all surrogate characters from the string:

    private static String removeSurrogates(String query) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

However, this causes strings such as

⚔⌨⛳⛏

to be reduced to only four emoji

⚔⌨⛳⛏

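Is there a proper way to convert these characters into UTF-8 without losing them and without leaving unpaired surrogates?

(Apologies, my understanding of character sets is generally not very good.)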

Answer 1

I found the problem. We are using org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 to convert HTML entities in strings to their unescaped forms. This seems to mangle some non-Latin characters. For example, passing the string "Italien " through this method converts it into "Italien ?" (the last character gets mangled).

Passing "⚔⌨⛳⛏" through this method converts it to "????????⚔⌨?⛳???"

import org.apache.commons.lang3.StringEscapeUtils;

public class CharacterTest {
    public static void main(String[] args) {
        String good = "⚔⌨⛳⛏";
        String bad = StringEscapeUtils.unescapeHtml4(good);
        System.out.println(good + "->" + bad);
    }
}

⚔⌨⛳⛏->????????⚔⌨?⛳???

Now to find an alternative HTML entity decoder...
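
One way to narrow down which step breaks a string is to check it for unpaired surrogates before and after that step. Below is a minimal sketch that uses only java.lang.Character. The names SurrogateCheck and firstUnpairedSurrogate are made up for illustration and are not part of any library.

```java
class SurrogateCheck {
    /** Returns the index of the first unpaired surrogate in s, or -1 if there is none. */
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip the low half as well
                } else {
                    return i; // high surrogate with no low half after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high half before it
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String paired = "a\uD83D\uDE00b"; // one emoji stored as a valid surrogate pair
        String broken = "a\uD83Db";       // high surrogate whose low half is missing
        System.out.println(firstUnpairedSurrogate(paired)); // -1
        System.out.println(firstUnpairedSurrogate(broken)); // 1
    }
}
```

Calling it on the input and on the output of unescapeHtml4 should show whether that call is where the pair gets split.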

Answer 2

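"Is there a proper way to convert these characters into UTF8"

Probably. If you just send the string, it will be converted to UTF-8. That is how the Java encoders work.

If not, and you are sending binary data, you can do the conversion directly: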

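import java.nio.charset.StandardCharsets;

private static byte[] removeSurrogates(String query) {
    // StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException
    // thrown by getBytes("UTF-8"). Unpaired surrogates are encoded as '?'.
    return query.getBytes(StandardCharsets.UTF_8);
}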
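
Note that String.getBytes with a Charset does not fail on an unpaired surrogate; it silently substitutes a '?' byte. If you would rather keep valid pairs intact and touch only the broken halves before encoding, something along these lines might work. This is a sketch under assumptions: SurrogateSanitizer and replaceUnpairedSurrogates are hypothetical names, and replacing with U+FFFD (rather than dropping the character) is just one possible choice.

```java
import java.nio.charset.StandardCharsets;

class SurrogateSanitizer {
    /** Replaces every unpaired surrogate with U+FFFD; valid pairs are copied unchanged. */
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // codePointAt combines a valid high+low pair into one code point;
            // a lone surrogate comes back as its own char value.
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp);
            if (len == 1 && Character.isSurrogate((char) cp)) {
                sb.append('\uFFFD'); // replacement character (an assumed policy)
            } else {
                sb.appendCodePoint(cp);
            }
            i += len;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String paired = "a\uD83D\uDE00b"; // valid pair, left as it is
        String broken = "a\uD83Db";       // lone high surrogate, gets replaced
        byte[] ok = replaceUnpairedSurrogates(paired).getBytes(StandardCharsets.UTF_8);
        byte[] fixed = replaceUnpairedSurrogates(broken).getBytes(StandardCharsets.UTF_8);
        System.out.println(ok.length);    // 6 (1 + 4 + 1 bytes)
        System.out.println(fixed.length); // 5 (1 + 3 + 1 bytes)
    }
}
```

This only hides the symptom. If the emoji were valid when they entered the pipeline, the step that splits the pair is still the real thing to fix.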

Answer 3

Let me step away from Java for a second to show that BigQuery can handle emoji:

CREATE TABLE `public_dump.emoji_test`
AS
SELECT "⚔⌨⛳⛏" emojis

Then test that they are there:

SELECT COUNT(*)
FROM `fh-bigquery.public_dump.emoji_test`
WHERE emojis LIKE '%%'

1

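Doing this from Python is straightforward:

Inserting new data is not a problem either:

Sorry, I don't know how to fix this in Java, but I wanted to show proof that BigQuery's API handles emoji gracefully.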

How to encode characters like emoji to UTF8 without unpaired surrogates?
