Question

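I am trying to use this library, https://github.com/pytries/datrie, to work with Chinese text.

But I have run into a problem: Chinese Unicode keys come back incorrectly encoded/decoded: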

import datrie
text = htmls_2_text(input_dir)
trie = datrie.Trie(''.join(set(text))) # about 2221 unique chars
trie['今天天气真好'] = 111
trie['今天好'] = 222
trie['今天'] = 444

print(trie.items())

[('今义', 444), ('今义义傲兢于', 111), ('今义于', 222)]

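The unique characters: https://pastebin.com/n2i280i8

The result is wrong; there is obviously an encoding/decoding error somewhere.

I then looked at the source code, https://github.com/pytries/datrie/blob/master/src/datrie.pyx: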

cdef cdatrie.AlphaChar* new_alpha_char_from_unicode(unicode txt):
    """
    Converts Python unicode string to libdatrie's AlphaChar* format.
    libdatrie wants null-terminated array of 4-byte LE symbols.
    The caller should free the result of this function.
    """
    cdef int txt_len = len(txt)
    cdef int size = (txt_len + 1) * sizeof(cdatrie.AlphaChar)

    # allocate buffer
    cdef cdatrie.AlphaChar* data = <cdatrie.AlphaChar*> malloc(size)
    if data is NULL:
        raise MemoryError()

    # Copy text contents to buffer.
    # XXX: is it safe? The safe alternative is to decode txt
    # to utf32_le and then use memcpy to copy the content:
    #
    #    py_str = txt.encode('utf_32_le')
    #    cdef char* c_str = py_str
    #    string.memcpy(data, c_str, size-1)
    #
    # but the following is much (say 10x) faster and this
    # function is really in a hot spot.
    cdef int i = 0
    for char in txt:
        data[i] = <cdatrie.AlphaChar> char
        i+=1

    # Buffer must be null-terminated (last 4 bytes must be zero).
    data[txt_len] = 0
    return data


cdef unicode unicode_from_alpha_char(cdatrie.AlphaChar* key, int len=0):
    """
    Converts libdatrie's AlphaChar* to Python unicode.
    """
    cdef int length = len
    if length == 0:
        length = cdatrie.alpha_char_strlen(key)*sizeof(cdatrie.AlphaChar)
    cdef char* c_str = <char*> key
    return c_str[:length].decode('utf_32_le')

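I tried replacing the current faster hack with the commented-out txt.encode('utf_32_le') block, but that did not work either.

I don't see any bug in this code, so what is the problem?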
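As a sanity check, the two conversion helpers quoted above can be mimicked in pure Python. This is only a sketch: `to_alpha_chars` and `from_alpha_chars` are hypothetical names, and a plain list of integers stands in for the C `AlphaChar*` buffer. It suggests that a UTF-32-LE round trip by itself does not corrupt Chinese text.

```python
import struct

def to_alpha_chars(txt):
    """Mimic new_alpha_char_from_unicode: 4-byte LE code units plus a 0 terminator."""
    data = txt.encode('utf_32_le')  # 4 bytes per code point, no BOM
    return list(struct.unpack('<%dI' % len(txt), data)) + [0]

def from_alpha_chars(codes):
    """Mimic unicode_from_alpha_char: read up to the terminator and decode."""
    body = codes[:codes.index(0)]
    return struct.pack('<%dI' % len(body), *body).decode('utf_32_le')

key = '今天天气真好'
print(from_alpha_chars(to_alpha_chars(key)) == key)
```

If the round trip is lossless here, the garbling is more likely to happen after the conversion, inside the trie itself.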

Answer 1

It looks like the problem is that the datrie package supports at most 255 distinct values for the characters in the key set: https://github.com/pytries/datrie/blob/master/libdatrie/datrie/alpha-map.h#L59
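A quick way to see whether a set of keys runs into this is to count the distinct characters before building the trie. The sketch below is plain Python; `alphabet_report` is a hypothetical helper, and the limit of 255 is simply the figure stated above, not a constant exported by datrie.

```python
# Assumed limit, taken from the explanation above rather than from the datrie API.
MAX_ALPHABET_SIZE = 255

def alphabet_report(keys, limit=MAX_ALPHABET_SIZE):
    """Return (number of distinct characters, whether that fits within the limit)."""
    alphabet = set()
    for key in keys:
        alphabet.update(key)
    return len(alphabet), len(alphabet) <= limit

keys = ['今天天气真好', '今天好', '今天']
print(alphabet_report(keys))
```

The three sample keys alone fit easily; it is the roughly 2221 unique characters passed to datrie.Trie in the question that exceed the limit.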

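I suggest you use marisa_trie::RecordTrie from https://pypi.python.org/pypi/marisa-trie instead.

Unfortunately, it is a static data structure, so you cannot modify it after it has been built, but it fully supports Unicode, serialization to disk, and a variety of value types.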

>>> from marisa_trie import RecordTrie
>>> rt = RecordTrie(">I", [(u'今天天气真好', (111,)), (u'今天好', (222,)), (u'今天', (444,))])
>>> for x in rt.items():
...     print x[0], x[1]
...
今天天气真好 (111,)
今天好 (222,)
今天 (444,)

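(Note that I am using Python 2.7 in this example, hence the u'' prefixes and the print statement in the loop.)

Edit

If you absolutely must use datrie.Trie, you can work around the limit in a rather silly way: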

def encode(s):
    return ''.join('%08x' % ord(x) for x in s)

def decode(s):
    return ''.join(chr(int(s[n:n+8], 16)) for n in range(0, len(s), 8))

>>> trie = datrie.Trie('0123456789abcdef')
>>> trie[encode('今天天气真好')] = 111
>>> trie[encode('今天好')] = 222
>>> trie[encode('今天')] = 444
>>> [decode(x) for x in trie.keys()]
['今天', '今天天气真好', '今天好']

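I used a width of 8 hex digits because that covers 32 bits, which is more than enough for any Unicode code point. You can save space by computing max(ord(x) for x in text) and padding only to the number of hex digits that value needs. Or you can come up with your own encoding scheme that uses at most 255 distinct char values. This is just a very quick and inefficient solution.

Of course, this rather defeats the purpose of using a trie...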
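The narrower padding could look roughly like this. It is a sketch, not a tuned implementation: `make_codec` is a hypothetical helper, and it assumes every key only contains characters that appear in the `text` used to compute the width.

```python
def make_codec(text):
    """Build encode/decode padded to the hex width of the largest code point in text."""
    width = len('%x' % max(ord(ch) for ch in text))

    def encode(s):
        # Fixed-width hex per character, so key prefixes stay prefixes after encoding.
        return ''.join('%0*x' % (width, ord(ch)) for ch in s)

    def decode(s):
        return ''.join(chr(int(s[n:n + width], 16)) for n in range(0, len(s), width))

    return width, encode, decode

width, encode, decode = make_codec('今天天气真好')
print(width)  # 4 hex digits per character instead of 8
print(decode(encode('今天好')))
print(encode('今天好').startswith(encode('今天')))
```

Because every character still maps to the same number of hex digits, prefix lookups on the encoded keys correspond to prefix lookups on the original strings, with half the key length of the 8-digit version for text in the Basic Multilingual Plane.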
