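I've recently been learning Python and ran into a problem with Unicode escape literals in Python 3.
It seems that, as in Java, \u escapes are interpreted as the UTF-16 code units Java uses, but then problems came up:
For example, suppose I try to add a 3-byte UTF-8 character such as "♬" (https://unicode-table.com/en/266C/), or even a supplementary Unicode character like the one at https://unicode-table.com/en/2070E/, written as \uXXXX or \UXXXXXXXX in a normal string, as follows: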
print('\u00E2\u99AC') # UTF-8 bytes, garbled output as expected
print('\U00E299AC') # UTF-8 bytes in an 8-digit \U escape, (unicode error) as expected
print('\u266C') # UTF-16 BE, the music note appears
# from which I assume \u and \U work the same way they do in Java
# (maybe slightly differently, since in Java they behave like macros and can even be used in comments)
# However, while print('\u266C') gives me '♬', '\u266C' == '♬' evaluates to False,
# whereas it is true under Java semantics.
# Furthermore, print('\UD841DF0E') didn't give me '𠜎': (unicode error) 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character
# which I expected it to, so it seems I may have gotten something wrong
# Here again: print('\uD841\uDF0E') # Error, 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
print('\xD8\x41\xDF\x0E') # also tried this, garbled output
# maybe UTF-16 LE?
print('\u41D8\u0EDF') # garbled output
print('\U41D80EDF') # error
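So, as far as I can see, Python "doesn't support supplementary escape literals", and its behavior is odd as well.
Well, I already know a correct way to decode and encode these characters: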
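# Turn the literal text \xe2\x99\xac into the three UTF-8 bytes, then decode them
s_decoded = ('\\xe2\\x99\\xac'.encode().decode('unicode-escape')
             .encode('latin-1').decode('utf-8'))
print(b'\xf0\xa0\x9c\x8e'.decode('utf-8'))  # U+2070E from its UTF-8 bytes
print(b'\xd8\x41\xdf\x0e'.decode('utf-16-be'))  # U+2070E from its UTF-16 BE surrogate pair
assert s_decoded == '♬'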
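But I still don't understand how to use the \u and \U escape literals correctly. I hope someone can point out what I'm doing wrong and how this differs from the way Java does it. Thanks!
By the way, my environment is PyCharm on Windows, Python 3.6.1, with the source file encoded as UTF-8.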
Answer 0 (score: 1)
Python 3.6.3:
>>> print('\u266c') # U+266C
♬
>>> print('\U0002070E') # U+2070E. Python is not Java
𠜎
>>> '\u266c' == '♬'
True
>>> '\U0002070E' == '𠜎'
True
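
To spell out what the session above relies on: in a Python 3 str literal, \u takes exactly four hex digits and \U exactly eight, and both name a Unicode code point rather than UTF-8 bytes or UTF-16 code units. The sketch below checks that against the byte sequences from the question. The variable names and the combine_surrogates helper are made up for illustration and are not part of any library; the last step assumes a string that already holds a Java-style surrogate pair.

```python
# \u and \U name code points, not encoded bytes or UTF-16 code units.
note = '\u266c'        # U+266C, fits in four hex digits
han = '\U0002070E'     # U+2070E, outside the BMP, so the 8-digit form is needed

assert note == '♬'
assert len(han) == 1   # one code point (Java would report two chars)
assert ord(han) == 0x2070E

# The byte sequences tried in the question are encodings of these code points.
assert note.encode('utf-8') == b'\xe2\x99\xac'
assert han.encode('utf-8') == b'\xf0\xa0\x9c\x8e'
assert han.encode('utf-16-be') == b'\xd8\x41\xdf\x0e'


def combine_surrogates(high, low):
    """Hypothetical helper: map a UTF-16 surrogate pair to its code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The Java-style pair \uD841\uDF0E corresponds to the single code point U+2070E.
assert combine_surrogates(0xD841, 0xDF0E) == 0x2070E

# A str that already contains the two surrogates can be repaired by a
# round trip through UTF-16 with the 'surrogatepass' error handler.
broken = '\ud841\udf0e'   # two lone surrogates, cannot be encoded as UTF-8
fixed = broken.encode('utf-16-be', 'surrogatepass').decode('utf-16-be')
assert len(broken) == 2
assert fixed == han
```

So '\UD841DF0E' is rejected because 0xD841DF0E is far above the largest code point (U+10FFFF), and '\uD841\uDF0E' produces two separate surrogate code points instead of one character, which is why printing it fails.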