Question

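I have a Unicode string in a "narrow" build of Python 2.7.10 that contains a Unicode character. I am trying to use that character as a key for a dictionary lookup, but when I index the string to get the last Unicode character, it returns a different string: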

>>> s = u'Python is fun \U0001f44d'
>>> s[-1]
u'\udc4d'

Why is this happening, and how do I retrieve '\U0001f44d' from the string?

Edit: unicodedata.unidata_version is 5.2.0 and sys.maxunicode is 65535.

Answer 1

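It looks like your Python 2 build uses surrogates to represent code points outside the Basic Multilingual Plane. See, for example, "How to work with surrogate pairs in Python?" for background.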
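To make the surrogate mechanism concrete, here is a minimal sketch of the standard UTF-16 arithmetic that splits a code point above U+FFFF into a surrogate pair and joins it back. The helper names to_surrogates and from_surrogates are made up for illustration and are not part of any library.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not above U+FFFF")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits -> leading surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> trailing surrogate
    return high, low


def from_surrogates(high, low):
    """Join a (high, low) surrogate pair back into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F44D)
print(hex(high), hex(low))
```

For U+1F44D the trailing surrogate comes out as 0xDC4D, which is the u'\udc4d' that s[-1] returns in the question.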

My recommendation is to move to Python 3 as soon as possible for anything that involves string processing.

Answer 2

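Python 2 "narrow" builds store Unicode strings as UTF-16 (a so-called leaky abstraction), so code points above U+FFFF are stored as two UTF-16 surrogates. To retrieve the code point, you have to take both the leading and the trailing surrogate: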

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]     # Just the trailing surrogate
u'\udc4d'
>>> s[-2:]    # leading and trailing
u'\U0001f44d'
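If you have to stay on a narrow build, one possible approach is a small helper that checks whether the string ends in a surrogate pair before slicing. This is only a sketch: last_code_point and the emoji_names dictionary are hypothetical names, not something from the question.

```python
def last_code_point(s):
    """Return the last code point of s as a string.

    On a narrow build a code point above U+FFFF occupies two code units,
    so a trailing (leading, trailing) surrogate pair is returned together.
    """
    if (len(s) >= 2
            and u'\ud800' <= s[-2] <= u'\udbff'    # leading surrogate
            and u'\udc00' <= s[-1] <= u'\udfff'):  # trailing surrogate
        return s[-2:]
    return s[-1:]


# Hypothetical lookup table keyed by the full character.
emoji_names = {u'\U0001f44d': 'thumbs up'}

s = u'Python is fun \U0001f44d'
print(emoji_names[last_code_point(s)])
```

The same function also works unchanged on Python 3.3+, where the pair check simply never matches for well-formed strings and the last character is returned as is.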

Switching to Python 3.3+ solves the problem, because it does not expose the storage details of Unicode code points in a Unicode string:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]   # code points are stored in Unicode strings.
'\U0001f44d'
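As a quick check on Python 3.3+, the following sketch shows that indexing and len() count code points, and that the surrogates only appear once the string is explicitly encoded as UTF-16. It assumes nothing beyond the built-in str methods.

```python
s = 'Python is fun \U0001f44d'

last = s[-1]
assert last == '\U0001f44d'      # indexing yields the whole code point
assert ord(last) == 0x1F44D
assert len(s) == 15              # 14 ASCII characters plus one emoji

# The surrogate pair D83D DC4D exists only in the UTF-16 encoded bytes.
encoded = last.encode('utf-16-be')
assert encoded == b'\xd8\x3d\xdc\x4d'
assert encoded.decode('utf-16-be') == last
```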

Python Unicode indexing shows a different character
