应用错误收集

我想在python中编写一个程序，它以一种java应用程序可以读取的方式将字符串写入文件。此java应用程序以修改的UTF-8格式读取字符串（请参阅this question）。它基本上意味着所有unicode字符都必须写为代理字符，即UTF-16BE。是否有一种优雅，有效的方法可以做到这一点。

我尝试过以下操作：

import re;
o_string = b''
for splitter in re.split("([\U00010000-\U0010FFFF])", w_string):
    if len(splitter) == 1 and ord(splitter) > 0xFFFF:
        o_string += splitter.encode("UTF-16")
    else:
        o_string += splitter.decode("UTF-8")
print(o_string)

但这看起来很笨重（也使用正则表达式，似乎很容易躲闪）。有人可以提出更好的解决方案吗？

Unicode：自动检测高字符并将其转换为代理对

0 个答案: