如何可能,使用相同的输入我有时会出现ascii codec
错误,有时候它可以正常工作?代码会清除名称并构建其Soundex
和DMetaphone
值。它可以在5次运行中运行,有时更常见:)
UPD:看起来这是fuzzy.DMetaphone
的问题,至少在使用Unicode的Python2.7上是这样。现在计划集成Metaphone。 fuzzy.DMetaphone
问题的所有解决方案都非常受欢迎:)
UPD 2: fuzzy
更新到1.2.2后问题消失了。相同的代码工作正常。
import re
import fuzzy
import sys
def make_namecard(full_name):
soundex = fuzzy.Soundex(4)
dmeta = fuzzy.DMetaphone(4)
names = process_name(full_name)
print names
soundexes = map(soundex, names)
dmetas = []
for name in names:
print name
dmetas.extend(list(dmeta(name)))
dmetas = filter(bool, dmetas)
return {
"full_name": full_name,
"soundex": soundexes,
"dmeta": dmetas,
"names": names,
}
def process_name(full_name):
full_name = re.sub("[_-]", " ", full_name)
full_name = re.sub(r'[^A-Za-z0-9 ]', "", full_name)
names = full_name.split()
names = filter(valid_name, names)
return names
def valid_name(name):
COMMON_WORDS = ["the", "of"]
return len(name) >= 2 and name.lower() not in COMMON_WORDS
print make_namecard('Jerusalem Warriors')
输出:
➜ python2.7 make_namecard.py
['Jerusalem', 'Warriors']
Jerusalem
Warriors
{'soundex': [u'J624', u'W624'], 'dmeta': [u'\x00\x00\x00\x00', u'ARSL', u'ARRS', u'FRRS'], 'full_name': 'Jerusalem Warriors', 'names': ['Jerusalem', 'Warriors']}
➜ python2.7 make_namecard.py
['Jerusalem', 'Warriors']
Jerusalem
Traceback (most recent call last):
File "make_namecard.py", line 38, in <module>
print make_namecard('Jerusalem Warriors')
File "make_namecard.py", line 16, in make_namecard
dmetas.extend(list(dmeta(name)))
File "src/fuzzy.pyx", line 258, in fuzzy.DMetaphone.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)