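I am trying to use this library, https://github.com/pytries/datrie, to work with Chinese text.
But I ran into a problem: keys containing Chinese Unicode characters are encoded/decoded incorrectly: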
import datrie
text = htmls_2_text(input_dir)
trie = datrie.Trie(''.join(set(text))) # about 2221 unique chars
trie['今天天气真好'] = 111
trie['今天好'] = 222
trie['今天'] = 444
print(trie.items())
[('今义', 444), ('今义义傲兢于', 111), ('今义于', 222)]
Unique characters: https://pastebin.com/n2i280i8
The result is wrong; there is clearly a decoding/encoding error.
So I looked at the source code, https://github.com/pytries/datrie/blob/master/src/datrie.pyx:
cdef cdatrie.AlphaChar* new_alpha_char_from_unicode(unicode txt):
    """
    Converts Python unicode string to libdatrie's AlphaChar* format.
    libdatrie wants null-terminated array of 4-byte LE symbols.
    The caller should free the result of this function.
    """
    cdef int txt_len = len(txt)
    cdef int size = (txt_len + 1) * sizeof(cdatrie.AlphaChar)
    # allocate buffer
    cdef cdatrie.AlphaChar* data = <cdatrie.AlphaChar*> malloc(size)
    if data is NULL:
        raise MemoryError()
    # Copy text contents to buffer.
    # XXX: is it safe? The safe alternative is to decode txt
    # to utf32_le and then use memcpy to copy the content:
    #
    #    py_str = txt.encode('utf_32_le')
    #    cdef char* c_str = py_str
    #    string.memcpy(data, c_str, size-1)
    #
    # but the following is much (say 10x) faster and this
    # function is really in a hot spot.
    cdef int i = 0
    for char in txt:
        data[i] = <cdatrie.AlphaChar> char
        i+=1
    # Buffer must be null-terminated (last 4 bytes must be zero).
    data[txt_len] = 0
    return data

cdef unicode unicode_from_alpha_char(cdatrie.AlphaChar* key, int len=0):
    """
    Converts libdatrie's AlphaChar* to Python unicode.
    """
    cdef int length = len
    if length == 0:
        length = cdatrie.alpha_char_strlen(key)*sizeof(cdatrie.AlphaChar)
    cdef char* c_str = <char*> key
    return c_str[:length].decode('utf_32_le')
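I tried replacing the current, faster trick with the commented-out txt.encode('utf_32_le') block, but that did not work either.
I don't see any bug in this code. What is the problem?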
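Answer 0 (score: 2)
It looks like the problem is that the datrie package supports at most 255 distinct values for the characters in the key set: https://github.com/pytries/datrie/blob/master/libdatrie/datrie/alpha-map.h#L59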
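A quick way to check whether a key set runs into this limit is to count its distinct characters. The sketch below is only an illustration: the helper name fits_datrie_alphabet is hypothetical, and the limit of 255 is an assumption taken from the header linked above, not something exposed by the datrie API.

```python
# Hypothetical helper, not part of datrie. The limit of 255 is an assumption
# taken from the alpha-map.h line linked above.
DATRIE_MAX_ALPHABET = 255

def fits_datrie_alphabet(keys, limit=DATRIE_MAX_ALPHABET):
    """Return (fits, n_unique) for the distinct characters used in keys."""
    unique_chars = set()
    for key in keys:
        unique_chars.update(key)
    n_unique = len(unique_chars)
    return n_unique <= limit, n_unique

# The three keys from the question use only a handful of distinct characters.
keys = ['今天天气真好', '今天好', '今天']
small_ok, small_n = fits_datrie_alphabet(keys)

# A stand-in alphabet of 2221 CJK characters, roughly the size reported
# in the question, is far over the limit.
big_alphabet = [chr(0x4E00 + i) for i in range(2221)]
big_ok, big_n = fits_datrie_alphabet(big_alphabet)

print(small_ok, small_n, big_ok, big_n)
```

In the question the alphabet is built from the whole corpus, not from the keys alone, so it is the corpus-wide count that matters there.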
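I suggest using marisa_trie.RecordTrie from https://pypi.python.org/pypi/marisa-trie instead.
Unfortunately it is a static data structure, so you cannot modify it after it is built, but it fully supports Unicode, serialization to disk, and a variety of value types.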
>>> from marisa_trie import RecordTrie
>>> rt = RecordTrie(">I", [(u'今天天气真好', (111,)), (u'今天好', (222,)), (u'今天', (444,))])
>>> for x in rt.items():
...     print x[0], x[1]
...
今天天气真好 (111,)
今天好 (222,)
今天 (444,)
(Note that I am using Python 2.7 in this example, hence the u'' prefixes and the print statement in the loop.)
Edit
If you absolutely must use datrie.Trie, you can make it work in a rather silly way:
def encode(s):
    return ''.join('%08x' % ord(x) for x in s)

def decode(s):
    return ''.join(chr(int(s[n:n+8], 16)) for n in range(0, len(s), 8))

>>> trie = datrie.Trie('0123456789abcdef')
>>> trie[encode('今天天气真好')] = 111
>>> trie[encode('今天好')] = 222
>>> trie[encode('今天')] = 444
>>> [decode(x) for x in trie.keys()]
['今天', '今天天气真好', '今天好']
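I used a width of 8 hex digits, i.e. 32 bits, which is enough for any Unicode code point (the largest, U+10FFFF, needs only 6 hex digits). You can save space by computing max(ord(x) for x in text) and using the number of hex digits it needs as the padding width. Or you can come up with your own encoding scheme that uses at most 255 distinct char values. This is just a very quick and inefficient solution.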
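The narrower padding could look roughly like this. It is a minimal sketch in Python 3; hex_width is a hypothetical helper name, and encode/decode are the functions above with an extra width parameter added for illustration.

```python
def hex_width(text):
    """Number of hex digits needed for the largest code point in text."""
    return len('%x' % max(ord(ch) for ch in text))

def encode(s, width):
    # Fixed-width, zero-padded hex, so the key can be split back unambiguously.
    return ''.join('%0*x' % (width, ord(ch)) for ch in s)

def decode(s, width):
    return ''.join(chr(int(s[n:n + width], 16))
                   for n in range(0, len(s), width))

# Stand-in for the corpus; every character here is in the Basic Multilingual Plane.
text = '今天天气真好今天好今天'
width = hex_width(text)
encoded = encode('今天', width)
print(width, encoded, decode(encoded, width))
```

The trie alphabet stays '0123456789abcdef'. The width has to cover every key that will ever be inserted, since a character that needs more digits than width would make the encoding ambiguous.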
Of course, this rather defeats the purpose of using a trie...
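One possible version of "your own encoding scheme" is to key the trie on UTF-8 bytes, with each byte mapped to the character that has the same code (Latin-1). This is a sketch that has not been tested against datrie: the helper names to_byte_key and from_byte_key are hypothetical, and whether datrie copes well with an alphabet of this size is an assumption that would need checking. Text without NUL characters never produces a zero byte in UTF-8, so the alphabet has at most 255 values.

```python
def to_byte_key(s):
    # Each UTF-8 byte becomes one character in the range U+0000..U+00FF.
    return s.encode('utf-8').decode('latin-1')

def from_byte_key(k):
    # Inverse mapping: characters back to bytes, then bytes back to text.
    return k.encode('latin-1').decode('utf-8')

keys = ['今天天气真好', '今天好', '今天']
byte_keys = [to_byte_key(k) for k in keys]

# Alphabet that would be handed to the trie: the distinct byte-characters in use.
alphabet = ''.join(sorted(set(''.join(byte_keys))))

round_trip = [from_byte_key(k) for k in byte_keys]

# A prefix in the original text is still a prefix after encoding.
shares_prefix = all(bk.startswith(to_byte_key('今天')) for bk in byte_keys)

print(len(alphabet), round_trip == keys, shares_prefix)
```

Compared with the hex scheme, this uses three characters per Chinese character in the key, not eight, and keys that share a prefix in the original text still share one in the trie.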