I am trying to compute the Levenshtein distance for the following pandas DataFrame. I am using the jellyfish package.
In [22]: df = pd.DataFrame({'id' : [1,2,3,4,5,6,7],
                            'path' : ["abc,cde,eg,ba","abc,cde,ba","abc,yz,zx,eg","abc,cde,eg,ba","abc,cde","abc","cde,eg,ba"]})
In [23]: df
Out[23]:
   id           path
0   1  abc,cde,eg,ba
1   2     abc,cde,ba
2   3   abc,yz,zx,eg
3   4  abc,cde,eg,ba
4   5        abc,cde
5   6            abc
6   7      cde,eg,ba
Here is my implementation.
In [18]: d = {'abc':'1', 'cde':'2', 'eg':'3', 'ba':'4', 'yz':'5', 'zx':'6'}
In [19]: d
Out[19]: {'abc': '1', 'ba': '4', 'cde': '2', 'eg': '3', 'yz': '5', 'zx': '6'}
In [20]: a = [jellyfish.levenshtein_distance(*map(d.get, item)) for item in itertools.combinations(d,2)]
In [21]: a
Out[21]: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Why aren't the strings compared pairwise as listed below? Why does it only print 1?
In [22]: list(itertools.combinations(d,2))
Out[22]:
[('cde', 'abc'),
('cde', 'ba'),
('cde', 'eg'),
('cde', 'yz'),
('cde', 'zx'),
('abc', 'ba'),
('abc', 'eg'),
('abc', 'yz'),
('abc', 'zx'),
('ba', 'eg'),
('ba', 'yz'),
('ba', 'zx'),
('eg', 'yz'),
('eg', 'zx'),
('yz', 'zx')]
Answer 0 (score: 0)
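The list comprehension does not seem to be set up correctly. I don't really understand how your DataFrame relates to the implementation, but the comprehension is not doing what you expect: map(d.get, item) replaces each key with its value in d, so every call compares two different one-character strings such as '1' and '2', and the distance between those is always 1. Is the following what you want?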
a = [jellyfish.levenshtein_distance(x, y) for x, y in itertools.combinations(d, 2)]
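To see the difference without installing anything, here is a minimal sketch. It assumes a hand-written levenshtein helper as a stand-in for jellyfish.levenshtein_distance (the helper and the encode function are illustrative names, not part of jellyfish). The last part is only a guess at what the DataFrame was meant for: encoding each path token as one character and comparing whole paths.

```python
import itertools


def levenshtein(s, t):
    """Dynamic-programming edit distance; stand-in for jellyfish.levenshtein_distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete a character from s
                            curr[j - 1] + 1,      # insert a character into s
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


d = {'abc': '1', 'cde': '2', 'eg': '3', 'ba': '4', 'yz': '5', 'zx': '6'}
pairs = list(itertools.combinations(d, 2))

# Original comprehension: d.get swaps each key for its one-character value,
# so every comparison is between two different single characters.
mapped = [levenshtein(*map(d.get, pair)) for pair in pairs]

# Comparing the keys themselves gives distances that actually vary.
direct = {pair: levenshtein(*pair) for pair in pairs}


# Assumption about intent: encode each path as one character per token,
# then compare the encoded paths.
def encode(path):
    return ''.join(d[token] for token in path.split(','))


path_distance = levenshtein(encode("abc,cde,eg,ba"), encode("abc,cde,ba"))

print(mapped)
print(direct)
print(path_distance)
```

With the encoding approach the distance counts inserted, deleted, or substituted tokens rather than characters, which may or may not be what you need.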