Question

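I have some text that I process to find the offsets of certain words in it. Those offsets will be used by another application, which treats the text as a sequence of bytes, so str indices are wrong for it.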

Example:

>>> text = "“Hello there!” He said"
>>> text[7:12]
'there'
>>> text.encode('utf-8')[7:12]
b'o the'

So how can I convert an index into a str to the corresponding index into the encoded bytes?

Answer 1

Encode the substrings and take their lengths in bytes:

text = "“Hello there!” He said"
start = len(text[:7].encode('utf-8'))
count = len(text[7:12].encode('utf-8'))
text.encode('utf-8')[start:start+count]

This gives b'there'.
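The same steps can be wrapped in a small helper. This is only a sketch: the name char_span_to_byte_span is made up for illustration, and it assumes UTF-8, where encoding a prefix yields exactly the bytes that precede the slice in the fully encoded text.

```python
def char_span_to_byte_span(text, start, end):
    """Map the str slice text[start:end] to (byte_start, byte_end) in text.encode('utf-8').

    Hypothetical helper name; assumes UTF-8 encoding.
    """
    # Bytes taken up by everything before the slice.
    byte_start = len(text[:start].encode('utf-8'))
    # Bytes taken up by the slice itself.
    byte_end = byte_start + len(text[start:end].encode('utf-8'))
    return byte_start, byte_end


text = "“Hello there!” He said"
byte_start, byte_end = char_span_to_byte_span(text, 7, 12)
print(byte_start, byte_end)                       # 9 14
print(text.encode('utf-8')[byte_start:byte_end])  # b'there'
```

The start moves from 7 to 9 because the opening curly quote is a single character in the str but three bytes in UTF-8.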

Answer 2

This should work:

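def byte_array_index(s, str_index):
    # Byte offset of s[str_index] in s.encode('utf-8'):
    # the encoded length of everything that comes before it.
    return len(s[:str_index].encode('utf-8'))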
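If many offsets have to be converted for the same text, re-encoding a prefix for each one repeats a lot of work. One possible alternative, sketched here under the assumption of UTF-8 and with a made-up function name byte_offsets, is to build a lookup table of byte offsets once and index into it:

```python
from itertools import accumulate


def byte_offsets(s):
    """Return a list whose entry i is the UTF-8 byte offset of character i in s.

    Hypothetical helper name. The list has len(s) + 1 entries, so the end
    of the string can be looked up as well.
    """
    # Running total of each character's encoded size, starting from 0.
    return [0] + list(accumulate(len(ch.encode('utf-8')) for ch in s))


text = "“Hello there!” He said"
offsets = byte_offsets(text)
data = text.encode('utf-8')
print(data[offsets[7]:offsets[12]])  # b'there'
```

Building the table takes one pass over the text; after that each str index maps to its byte index with a single list lookup.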