I have the following DataFrame:
import pandas as pd

df = pd.DataFrame(
    {'id': [1, 2, 3, 4, 5, 6],
     'fruits': ['apple', 'apples', 'orange', 'apple tree', 'oranges', 'mango']
    })
   id      fruits
0   1       apple
1   2      apples
2   3      orange
3   4  apple tree
4   5     oranges
5   6       mango
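I want to find fuzzy-matching strings within the fruits column and build a new DataFrame, as shown below, that keeps only the pairs whose ratio_score is above 80.
How can I do this in Python with the fuzzywuzzy package? Thanks. Note that the ratio_score values below are made up for the sake of the example.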
My attempt:
from fuzzywuzzy import fuzz

df.loc[:, 'fruits_copy'] = df['fruits']
df['ratio_score'] = df[['fruits', 'fruits_copy']].apply(lambda row: fuzz.ratio(row['fruits'], row['fruits_copy']), axis=1)
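One thing worth checking about the attempt above: `apply(..., axis=1)` scores each row's `fruits` against the same row's `fruits_copy`, so every string is only ever compared with itself. A minimal sketch of that behaviour follows. It uses `difflib.SequenceMatcher` as an assumed stand-in for `fuzz.ratio` so that it runs without fuzzywuzzy installed; the helper `ratio` is hypothetical and not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Assumption: stand-in for fuzz.ratio, a 0-100 similarity score
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


fruits = ['apple', 'apples', 'orange', 'apple tree', 'oranges', 'mango']
fruits_copy = list(fruits)

# Row-wise comparison: element i is only compared with element i of the copy
row_wise = [ratio(a, b) for a, b in zip(fruits, fruits_copy)]
print(row_wise)

# A cross-row comparison is what the expected result actually needs
print(ratio(fruits[0], fruits[1]))
```

Since identical strings always score 100, the row-wise column carries no information; the comparison has to be made across different rows instead.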
Expected result:
   id  fruits  matched_id matched_fruits  ratio_score
0   1   apple           2         apples           95
1   1   apple           4     apple tree           85
2   2  apples           4     apple tree           80
3   3  orange           5        oranges           95
4   6   mango
Related questions:
Fuzzy matching a sorted column with itself using python
Apply fuzzy matching across a dataframe column and save results in a new column
How do I fuzzy match items in a column of an array in python?
Using fuzzywuzzy to create a column of matched results in the data frame
Answer 0 (score: 0)
My solution, adapted from: Apply fuzzy matching across a dataframe column and save results in a new column
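import pandas as pd
from fuzzywuzzy import fuzz

# Build every (fruit, fruit) pair from the column and its copy
df.loc[:, 'fruits_copy'] = df['fruits']
compare = pd.MultiIndex.from_product([df['fruits'],
                                      df['fruits_copy']]).to_series()

def metrics(tup):
    # Score one pair with two fuzzywuzzy scorers
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)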
                       ratio  token
apple      apple         100    100
           apples         91     91
           orange         36     36
           apple tree     67     67
           oranges        33     33
           mango          20     20
apples     apple          91     91
           apples        100    100
           orange         33     33
           apple tree     62     62
           oranges        46     46
           mango          18     18
orange     apple          36     36
           apples         33     33
           orange        100    100
           apple tree     25     25
           oranges        92     92
           mango          55     55
apple tree apple          67     67
           apples         62     62
           orange         25     25
           apple tree    100    100
           oranges        24     24
           mango          13     13
oranges    apple          33     33
           apples         46     46
           orange         92     92
           apple tree     24     24
           oranges       100    100
           mango          50     50
mango      apple          20     20
           apples         18     18
           orange         55     55
           apple tree     13     13
           oranges        50     50
           mango         100    100
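The table above still contains every self-match (score 100) and lists each pair in both orders, and it is not yet filtered by score. A possible way to get closer to the shape asked for in the question is sketched below. This is an illustration, not part of the original answer: `fuzzy_pairs` is a hypothetical helper, and the `difflib` fallback is an assumption added only so the sketch runs when fuzzywuzzy is not installed.

```python
from difflib import SequenceMatcher
from itertools import combinations

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Assumption: stand-in scorer used only when fuzzywuzzy is unavailable
    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def fuzzy_pairs(ids, names, threshold=80):
    """Return one row per unordered pair whose score is above `threshold`."""
    records = list(zip(ids, names))
    rows = []
    # combinations() yields each pair once and never pairs an item with itself
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
        score = ratio(name_a, name_b)
        if score > threshold:
            rows.append({'id': id_a, 'fruits': name_a,
                         'matched_id': id_b, 'matched_fruits': name_b,
                         'ratio_score': score})
    return rows


ids = [1, 2, 3, 4, 5, 6]
fruits = ['apple', 'apples', 'orange', 'apple tree', 'oranges', 'mango']
matches = fuzzy_pairs(ids, fruits)
for row in matches:
    print(row)
```

Passing `matches` to `pd.DataFrame` should give a table with the same columns as the expected result. With the scores in the table above, only apple/apples (91) and orange/oranges (92) clear the threshold, so the made-up rows for "apple tree" would not appear. Items with no match, such as mango, are left out of this sketch and would need to be merged back in separately.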