Question

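I'm using Elasticsearch with the Python client, and I have a question about how unicode, ES, analyzers, and emoji interact. When I run a unicode text string containing an emoji character through an ES analyzer, the term offsets in the resulting output seem to be thrown off.

For example: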

>> es.indices.analyze(body=u'\U0001f64f testing')
{u'tokens': [{u'end_offset': 10,
   u'position': 1,
   u'start_offset': 3,
   u'token': u'testing',
   u'type': u'<ALPHANUM>'}]}

This gives me the wrong offsets for the word "testing".

>> u'\U0001f64f testing'[3:10]
u'esting'

If I use a different non-ASCII unicode character instead (for example, the yen sign), I don't get the same error.

>> es.indices.analyze(body=u'\u00A5 testing')
{u'tokens': [{u'end_offset': 9,
   u'position': 1,
   u'start_offset': 2,
   u'token': u'testing',
   u'type': u'<ALPHANUM>'}]}

>> u'\u00A5 testing'[2:9]
u'testing'

Can someone explain what is going on?

Answer 1

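Python 3.2 or earlier? Before Python 3.3 there were narrow and wide Unicode builds, and Python on Windows was a narrow build. Narrow builds use two bytes per character and encode Unicode code points internally as UTF-16, which needs two UTF-16 surrogates to encode any code point above U+FFFF.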

Python 3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> len('\U0001f64f')
1
>>> '\U0001f64f'[0]
'\U0001f64f'

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> len(u'\U0001f64f')
2
>>> u'\U0001f64f'[0]
u'\ud83d'
>>> u'\U0001f64f'[1]
u'\ude4f'
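The two surrogate values shown in the narrow-build session follow from the standard UTF-16 arithmetic. As a minimal sketch (the helper name `to_surrogate_pair` is made up for illustration and is not part of Python or Elasticsearch), the split can be reproduced like this:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # 20-bit value to distribute
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F64F)
print(hex(high), hex(low))  # 0xd83d 0xde4f
```

These are the same `\ud83d` and `\ude4f` values that indexing returns on the narrow build.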

In your case, however, the offsets are correct. Because U+1F64F takes two UTF-16 surrogates, the offset of "t" is 3. I'm not sure how you got your output:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x=u'\U0001f64f testing'
>>> x
u'\U0001f64f testing'
>>> x[3:10]
u'testing'
>>> y = u'\u00a5 testing'
>>> y[2:9]
u'testing'
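On a wide build, or on Python 3.3 and later, a string is indexed by code point, so offsets counted in UTF-16 code units no longer line up with string indices. One possible way to bridge the two, sketched here under the assumption that the analyzer reports offsets in UTF-16 code units (the function name `utf16_to_codepoint_offsets` is hypothetical), is to build a lookup table:

```python
def utf16_to_codepoint_offsets(text):
    """Return a list mapping each UTF-16 code unit offset to a str index."""
    mapping = []
    for index, char in enumerate(text):
        # Characters above U+FFFF occupy two UTF-16 code units.
        units = 2 if ord(char) > 0xFFFF else 1
        mapping.extend([index] * units)
    mapping.append(len(text))  # lets an end offset point just past the text
    return mapping


text = "\U0001f64f testing"
mapping = utf16_to_codepoint_offsets(text)
start, end = mapping[3], mapping[10]  # offsets reported in the question
print(text[start:end])  # testing
```

The table costs one pass over the text and can then be reused for every token returned for that text.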

Answer 2

I ran into exactly the same problem and managed to map the offsets correctly by encoding to UTF-16 and back:

TEXT = "\U0001F955 carrot"  # U+1F955 CARROT emoji, then the word "carrot"
# `es` is assumed to be an already-configured Elasticsearch client.
TOKENS = es.indices.analyze(body=TEXT)["tokens"]
# [
#   {
#     "token" : """🥕""",
#     "start_offset" : 0,
#     "end_offset" : 2,
#     "type" : "<EMOJI>",
#     "position" : 0
#   },
#   {
#     "token" : "carrot",
#     "start_offset" : 3,
#     "end_offset" : 9,
#     "type" : "<ALPHANUM>",
#     "position" : 1
#   }
# ]

# The "utf-16" codec prepends a 2-byte BOM; the bytes below are little-endian.
ENCODED_TEXT = TEXT.encode("utf-16")
# b'\xff\xfe>\xd8U\xdd \x00c\x00a\x00r\x00r\x00o\x00t\x00'
BOM_MARK_OFFSET = 2  # size of the BOM in bytes

def get_decoded_token(encoded_text, token):
    # Offsets count UTF-16 code units; each code unit is 2 bytes.
    start_offset = (token["start_offset"] * 2) + BOM_MARK_OFFSET
    end_offset = (token["end_offset"] * 2) + BOM_MARK_OFFSET
    return encoded_text[start_offset:end_offset].decode("utf-16")

assert get_decoded_token(ENCODED_TEXT, TOKENS[0]) == "\U0001F955"
assert get_decoded_token(ENCODED_TEXT, TOKENS[1]) == "carrot"

For BOM_MARK_OFFSET, see https://en.wikipedia.org/wiki/Byte_order_mark#UTF-16
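A variation on the same idea, offered only as a sketch: Python's `"utf-16-le"` codec fixes the byte order and writes no BOM, so the 2-byte BOM adjustment can be dropped. The function name `slice_by_utf16_offsets` is hypothetical, and the offsets used below are the ones shown in the analyzer output above rather than a live call.

```python
def slice_by_utf16_offsets(text, start_offset, end_offset):
    """Slice text using offsets counted in UTF-16 code units."""
    encoded = text.encode("utf-16-le")  # explicit byte order, no BOM
    # Each UTF-16 code unit is 2 bytes, so byte positions are offsets * 2.
    return encoded[start_offset * 2:end_offset * 2].decode("utf-16-le")


text = "\U0001F955 carrot"
emoji = slice_by_utf16_offsets(text, 0, 2)
word = slice_by_utf16_offsets(text, 3, 9)
print(word)  # carrot
```

Naming the byte order explicitly also keeps the result independent of the machine's native endianness, which the plain `"utf-16"` codec relies on.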

Elasticsearch, Python, emoji, and term offsets in the analyzer
