I have data like this:
name      name in another column
---------------------------------
raju      vasu
ramana    seshu
seshu     ramana
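I want to compute the similarity between these two columns, for example the similarity of raju and vasu.
In this way, I want to get a similarity score for each row: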
name      name in another column    similarity
-----------------------------------------------
raju      vasu                      0.1
ramana    seshu                     0.2
seshu     ramana                    0
Answer 0 (score: 0)
This post may answer your question.
A short code example:
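from difflib import SequenceMatcher

# The two columns from the question, in row order.
names_a = ["raju", "ramana", "seshu"]
names_b = ["vasu", "seshu", "ramana"]
# ratio() returns a float in [0, 1]; zip pairs the names row by row.
similar = [SequenceMatcher(None, a, b).ratio() for a, b in zip(names_a, names_b)]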
Output:
In [7]: similar
Out[7]: [0.5, 0.0, 0.0]
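To get the three-column table the question asks for, the same call can be applied row by row. This is only a minimal sketch: the `rows` list and the `similarity` helper are hypothetical names introduced for illustration, and the scores come from difflib's ratio, so they will not match the illustrative values (0.1, 0.2, 0) in the question.

```python
from difflib import SequenceMatcher

# Sample rows taken from the question: (name, name in another column).
rows = [("raju", "vasu"), ("ramana", "seshu"), ("seshu", "ramana")]


def similarity(a, b):
    """Return difflib's ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


# One (name, other_name, score) tuple per input row.
scored = [(a, b, similarity(a, b)) for a, b in rows]

# Print a plain-text table in the layout the question asks for.
print(f"{'name':<10}{'name in another column':<26}similarity")
for a, b, score in scored:
    print(f"{a:<10}{b:<26}{score:.2f}")
```

If the data lives in a pandas DataFrame, the same helper could presumably be applied to each row, but that is left out here to keep the sketch free of third-party dependencies.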
Answer 1 (score: 0)
The fuzzywuzzy module can be used for string matching.
For example:
>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
97
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100
For more details, see https://pypi.org/project/fuzzywuzzy/
Answer 2 (score: 0)
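fuzzywuzzy does what you want well, but it is very slow if the dataset has many rows.
I would use a vectorizer from sklearn (for example TfidfVectorizer) to convert the strings into vectors, and then pass those vectors to cosine_similarity (also from sklearn).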
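The vectorize-then-compare idea can be sketched without any third-party dependency. Note that this is a stand-in and not the sklearn API: it builds raw character-bigram count vectors, so it omits the IDF weighting that TfidfVectorizer would apply. The helper names `char_ngrams` and `cosine`, the choice of bigrams, and the extra pair "ramana" / "ramanna" are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count the overlapping character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a string shorter than n has no n-grams
    return dot / (norm_u * norm_v)


names_a = ["raju", "ramana", "seshu"]
names_b = ["vasu", "seshu", "ramana"]

# Compare row by row, not every name against every other name.
scores = [cosine(char_ngrams(a), char_ngrams(b)) for a, b in zip(names_a, names_b)]
print(scores)

# Hypothetical near-duplicate pair (not from the question) that shares bigrams.
print(cosine(char_ngrams("ramana"), char_ngrams("ramanna")))
```

With the names from the question, no pair shares a character bigram, so all three scores are 0.0; the near-duplicate pair shows what a high score looks like. On a large dataset, the sklearn route described above would likely be faster, because the vectors are built and multiplied in bulk.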