Pandas: querying using Levenshtein distance

Time: 2017-08-29 10:38:47

Tags: python string pandas levenshtein-distance

Given the following dataset:

name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;26
jenny;female;boston2;30
mattia;na;BostonDynamics;50

and the constraints:

source = "john"
max_dist = 2

My goal is to get a list of all name values whose Levenshtein distance from source is <= max_dist. Is it possible to do this with the pandas.DataFrame.query() method, or does it have to be done a different way?

1 answer:

Answer 0: (score: 3)

You would do it a different way.

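import editdistance  # install first: pip install editdistance
import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

# constraints from the question
source = "john"
max_dist = 2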

s = StringIO("""name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;26
jenny;female;boston2;30
mattia;na;BostonDynamics;50""")

df = pd.read_csv(s, sep=';')

# keep only the rows whose name is within max_dist edits of source
df[df["name"].apply(lambda x: int(editdistance.eval(source, x)) <= max_dist)]

   name   sex     city  age
0  john  male  newyork   20


# same filter, returning just the matching names as a plain list
df[df["name"].apply(lambda x: int(editdistance.eval(source, x)) <= max_dist)]["name"].tolist()

['john']
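
If installing editdistance is not an option, the same filter can be built on a hand-written distance function. The following is only a minimal sketch: `levenshtein` is a hypothetical helper name (it is not part of pandas or editdistance), and it implements the textbook dynamic-programming recurrence, which will be slower than a compiled library on large columns.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # Keep the shorter string in `b` so each DP row is as small as possible.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Names taken from the dataset in the question.
names = ["john", "jack", "mary", "maryanne", "eric", "jenny", "mattia"]
source = "john"
max_dist = 2

matches = [n for n in names if levenshtein(source, n) <= max_dist]
print(matches)  # ['john']
```

With the DataFrame from the answer, the helper can be dropped into the same apply call: df[df["name"].apply(lambda x: levenshtein(source, x) <= max_dist)]. Note that "jack" and "jenny" are both 3 edits away from "john", which is why only one row survives with max_dist = 2.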