Question

Suppose I have a bunch of UTF-8 files that I send as unicode to an external API. The API operates on each unicode string and returns a list of (character_offset, substr) tuples.

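The output I need is the begin and end byte offset of each found substring. If I'm lucky, the input text contains only ASCII characters (which makes character offsets and byte offsets identical), but that is not always the case. How can I find the begin and end byte offsets, given a known begin character offset and the substring?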
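To make the mismatch concrete, here is a small illustration, assuming Python 3. The sample string is made up for this example and does not come from the API.

```python
# Hypothetical sample text containing two non-ASCII characters.
text = "naïve café"
encoded = text.encode("utf-8")

# The character offset is what the API would report for the substring.
substr = "café"
char_offset = text.index(substr)

# The byte offset is the encoded length of everything before the match.
byte_offset = len(text[:char_offset].encode("utf-8"))
byte_end = byte_offset + len(substr.encode("utf-8"))

# The two offsets differ because "ï" takes two bytes in UTF-8.
print(char_offset, byte_offset, byte_end)
```

Slicing `encoded[byte_offset:byte_end]` and decoding it gives back the substring, which is the property I need.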

I have answered this question myself, but I am looking forward to other solutions that are more robust, more efficient, and/or more readable.

Answer 1

I would solve this with a dictionary that maps character offsets to byte offsets, and then look the offsets up in it.

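def get_char_to_byte_map(unicode_string):
    """
    Generates a dictionary mapping character offsets to byte offsets for unicode_string.
    """
    response = {}
    byte_offset = 0
    for char_offset, character in enumerate(unicode_string):
        response[char_offset] = byte_offset
        byte_offset += len(character.encode('utf-8'))
    # Map the end-of-string offset too, so that a substring ending at the
    # very end of the text can be looked up without a KeyError.
    response[len(unicode_string)] = byte_offset
    return response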

char_to_byte_map = get_char_to_byte_map(text)

for character_offset, substring in api_response:
    begin_offset = char_to_byte_map[character_offset]
    end_offset = char_to_byte_map[character_offset + len(substring)]
    # do something

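How this solution performs compared to yours depends heavily on the size of the input and the number of substrings involved. A local microbenchmark suggests that encoding each individual character of the text takes roughly 1000 times as long as encoding the whole text once.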
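If the per-character encode() calls turn out to be the bottleneck, one possible variation is to derive each character's UTF-8 width from its code point and keep the running totals in a list. This is only a sketch, assuming Python 3 and text without lone surrogates. The names utf8_width and build_byte_offsets are made up for this example, and I have not benchmarked it.

```python
from itertools import accumulate


def utf8_width(character):
    """Return the number of bytes UTF-8 needs for a single character."""
    code_point = ord(character)
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def build_byte_offsets(text):
    """
    Return a list where entry i is the byte offset of character i.
    The final entry is the total encoded length, so end offsets work too.
    """
    return [0, *accumulate(utf8_width(character) for character in text)]


# Hypothetical sample: 1-, 2-, 3- and 4-byte characters, then ASCII again.
sample = "aé€😀b"
byte_offsets = build_byte_offsets(sample)
```

Lookups then work the same way as with the dictionary: byte_offsets[character_offset] for the begin offset and byte_offsets[character_offset + len(substring)] for the end offset.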

Answer 2

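To convert character offsets to byte offsets only when needed, I check whether the input text contains any non-ASCII characters. If it does, I encode('utf8') the text leading up to the found substring and take the length of the result as the begin offset.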

# Check if text contains non-ASCII characters
needs_offset_conversion = len(text) != len(text.encode('utf8'))

def get_byte_offsets(text, character_offset, substr, needs_conversion):
    if needs_conversion:
        begin_offset = len(text[:character_offset].encode('utf8'))
        end_offset = begin_offset + len(substr.encode('utf8'))
    else:
        begin_offset = character_offset
        end_offset = character_offset + len(substr)
    return begin_offset, end_offset

This implementation works, but it encodes a (large) part of the text for each found substring.
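One way the repeated prefix encoding might be avoided is to process the matches in order of their character offset and encode only the text between one match start and the next. This is a rough sketch, assuming Python 3. The name iter_byte_offsets and the sample data are made up for this example.

```python
def iter_byte_offsets(text, matches):
    """
    Yield (begin_offset, end_offset) byte offsets for (character_offset, substr)
    pairs. Results come out sorted by character offset, not in input order.
    """
    previous_char_offset = 0
    previous_byte_offset = 0
    for character_offset, substr in sorted(matches):
        # Encode only the slice since the previous match start, so each part
        # of the text before the last match start is encoded exactly once.
        gap = text[previous_char_offset:character_offset]
        previous_byte_offset += len(gap.encode("utf8"))
        previous_char_offset = character_offset
        yield previous_byte_offset, previous_byte_offset + len(substr.encode("utf8"))


# Hypothetical API response, deliberately out of order.
sample_text = "héllo wörld"
sample_matches = [(6, "wörld"), (1, "él")]
byte_ranges = list(iter_byte_offsets(sample_text, sample_matches))
```

The trade-off is that the output order follows the sorted offsets, so if the original order of the API response matters, the match would have to be yielded alongside its byte offsets.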

Converting character offsets to byte offsets (in Python)
