I have a DataFrame df that looks like this:
         a       b
    0  Jon     Jon
    1  Jon    John
    2  Jon  Johnny
I want to compare these two strings and create a new column like this:
    df['compare'] = df['a'] == df['b']
         a       b  compare
    0  Jon     Jon     True
    1  Jon    John    False
    2  Jon  Johnny    False
I would also like to be able to pass columns a and b through a Levenshtein function:
    def levenshtein_distance(a, b):
        """Return the Levenshtein edit distance between two strings *a* and *b*."""
        if a == b:
            return 0
        if len(a) < len(b):
            a, b = b, a
        if not a:
            return len(b)
        previous_row = range(len(b) + 1)
        for i, column1 in enumerate(a):
            current_row = [i + 1]
            for j, column2 in enumerate(b):
                insertions = previous_row[j + 1] + 1
                deletions = current_row[j] + 1
                substitutions = previous_row[j] + (column1 != column2)
                current_row.append(min(insertions, deletions, substitutions))
            previous_row = current_row
        return previous_row[-1]
and add a column like this:
    df['compare'] = levenshtein_distance(df['a'], df['b'])
         a       b  compare
    0  Jon     Jon      100
    1  Jon    John      .95
    2  Jon  Johnny      .87
But when I try this, I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I format my data/DataFrame so that I can compare the two columns and add the result of the comparison as a third column?
Answer 0 (score: 3)
Simply call the function once per pair of values:
    df['compare'] = [levenshtein_distance(a, b) for a, b in zip(df['a'], df['b'])]
Or, if you want an equality comparison:
    df['compare'] = (df['a'] == df['b'])
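The error in the question comes from calling a function written for two strings with two whole columns: `a == b` on two Series gives a Series of booleans, and `if` cannot reduce that to a single truth value. The following is a minimal sketch of both cases on the sample data. The compact `lev` helper is a hypothetical stand-in for the question's `levenshtein_distance`, included only so the example is self-contained, and the column names `distance` and `equal` are likewise illustrative.

```python
import pandas as pd


def lev(a, b):
    """Levenshtein distance between two strings (stand-in for the question's function)."""
    previous_row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a):
        current_row = [i + 1]
        for j, ch_b in enumerate(b):
            current_row.append(min(
                previous_row[j + 1] + 1,
                current_row[j] + 1,
                previous_row[j] + (ch_a != ch_b),
            ))
        previous_row = current_row
    return previous_row[-1]


df = pd.DataFrame({"a": ["Jon", "Jon", "Jon"], "b": ["Jon", "John", "Johnny"]})

# Comparing two Series yields a boolean Series; using it in `if` raises
# the "truth value of a Series is ambiguous" ValueError from the question.
error_message = None
try:
    if df["a"] == df["b"]:
        pass
except ValueError as exc:
    error_message = str(exc)

# Calling the string function once per row pair avoids the problem.
df["distance"] = [lev(a, b) for a, b in zip(df["a"], df["b"])]

# Plain equality is already element-wise, so it can be assigned directly.
df["equal"] = df["a"] == df["b"]
```

The list comprehension keeps the original string function unchanged, which is why it is usually the least invasive fix.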
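Answer 1 (score: 0)
I think your comparison is wrong. Change:
    if a == b
and
    not a
to
    if a[0] == b[0]
and
    not a[0]
and you will see that your function works; it just needs to iterate over the df that is passed in, and it should return a list.
Here is a working version: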
    def levenshtein_distance(a, b):
        """Return a list of Levenshtein edit distances, one per pair of strings in *a* and *b*."""
        y = len(a)
        thelist = []
        for x in range(0, y):
            # a[x] looks up by label, so this assumes the default 0..n-1 index
            c = a[x]
            d = b[x]
            if c == d:
                thelist.append(0)
                continue
            if len(c) < len(d):
                c, d = d, c
            if not c:
                thelist.append(len(d))
                continue
            previous_row = range(len(d) + 1)
            for i, column1 in enumerate(c):
                current_row = [i + 1]
                for j, column2 in enumerate(d):
                    insertions = previous_row[j + 1] + 1
                    deletions = current_row[j] + 1
                    substitutions = previous_row[j] + (column1 != column2)
                    current_row.append(min(insertions, deletions, substitutions))
                previous_row = current_row
            thelist.append(previous_row[-1])
        return thelist
    df['compare'] = levenshtein_distance(df.a, df.b)
    df
    #      a       b  compare
    # 0  Jon     Jon        0
    # 1  Jon    John        1
    # 2  Jon  Johnny        3
It just doesn't compute a percentage. It uses only your code, and according to Levenshtein Calc these are the correct answers.
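If a percentage-like score is wanted, one possible approach is to divide the distance by the length of the longer string and subtract the result from 1. This is an assumption, not something the question specifies: the helper name `similarity` is hypothetical, and this normalisation will not reproduce the illustrative .95 and .87 from the question, which does not say what formula those values come from. A minimal sketch on plain lists:

```python
def levenshtein_distance(a, b):
    """Levenshtein edit distance between two strings."""
    previous_row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a):
        current_row = [i + 1]
        for j, ch_b in enumerate(b):
            current_row.append(min(
                previous_row[j + 1] + 1,
                current_row[j] + 1,
                previous_row[j] + (ch_a != ch_b),
            ))
        previous_row = current_row
    return previous_row[-1]


def similarity(a, b):
    """Hypothetical helper: 1.0 for identical strings, 0.0 when nothing matches."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


# Stand-ins for the values of columns a and b in the question.
col_a = ["Jon", "Jon", "Jon"]
col_b = ["Jon", "John", "Johnny"]
scores = [similarity(a, b) for a, b in zip(col_a, col_b)]
```

With a DataFrame, the same comprehension can be fed from `zip(df['a'], df['b'])` and the result assigned to a new column.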