I am trying to determine the similarity between two columns in a pandas DataFrame:
Text1    All
Performance results achieved by the approaches submitted to this Challenge.    The six top approaches and three others outperform the strong baseline.
Accuracy is one of the basic principles of perfectionist.    Where am I?
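I want to compare 'Performance results ... ' with 'The six...', and 'Accuracy is one...' with 'Where am I?'.
The first row should have a higher similarity between the two columns because they share some words. The second row should be 0 because the two columns have no words in common.
To compare the two columns I used SequenceMatcher, as follows: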
from difflib import SequenceMatcher
ratio = SequenceMatcher(None, df.Text1, df.All).ratio()
But passing df.Text1, df.All seems to be wrong.
Can you tell me why?
Answer 0 (score: 0)
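SequenceMatcher is not designed for a pandas Series. It compares one pair of sequences (for example, two strings) and returns a single ratio, so passing two whole columns does not produce a score per row.
You need to .apply the function to each row instead.
See the SequenceMatcher Examples in the difflib documentation.
With isjunk=None, even spaces are not considered junk.
With isjunk=lambda y: y == " ", spaces are considered junk.
from difflib import SequenceMatcher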
import pandas as pd
data = {'Text1': ['Performance results achieved by the approaches submitted to this Challenge.', 'Accuracy is one of the basic principles of perfectionist.'],
'All': ['The six top approaches and three others outperform the strong baseline.', 'Where am I?']}
df = pd.DataFrame(data)
# isjunk=lambda y: y == " "
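# Access the row values by column label; positional x[0] / x[1] on a labeled Series is deprecated in recent pandas
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x['Text1'], x['All']).ratio(), axis=1)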
# display(df)
Text1    All    ratio
0    Performance results achieved by the approaches submitted to this Challenge.    The six top approaches and three others outperform the strong baseline.    0.356164
1    Accuracy is one of the basic principles of perfectionist.    Where am I?    0.088235
# isjunk=None
# Same comparison with no junk filter, again accessing the row values by column label
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(None, x['Text1'], x['All']).ratio(), axis=1)
# display(df)
Text1    All    ratio
0    Performance results achieved by the approaches submitted to this Challenge.    The six top approaches and three others outperform the strong baseline.    0.410959
1    Accuracy is one of the basic principles of perfectionist.    Where am I?    0.117647
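Note that both ratios above are character-level, which is why the second row is not 0 even though the two texts share no words. If the goal is the word-level behaviour described in the question, one possible approach is to pass lists of words to SequenceMatcher instead of raw strings, since it accepts any sequences of hashable items. The sketch below is only an illustration under stated assumptions: the helper name word_ratio is hypothetical, and "word" here means whatever str.split() returns after lowercasing, so punctuation stays attached to the word.

```python
from difflib import SequenceMatcher


def word_ratio(a: str, b: str) -> float:
    """Similarity of two strings measured over words instead of characters."""
    # Assumption: a "word" is whatever str.split() yields after lowercasing,
    # so punctuation stays attached ("challenge." is not equal to "challenge").
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


# The two rows from the question, written as (Text1, All) pairs.
pairs = [
    ("Performance results achieved by the approaches submitted to this Challenge.",
     "The six top approaches and three others outperform the strong baseline."),
    ("Accuracy is one of the basic principles of perfectionist.",
     "Where am I?"),
]

ratios = [word_ratio(text1, text_all) for text1, text_all in pairs]
print(ratios)
```

The first pair shares the words "the" and "approaches", so its ratio is positive; the second pair shares no words, so its ratio is 0.0. On the DataFrame above, the same helper could be applied per row with df.apply(lambda row: word_ratio(row['Text1'], row['All']), axis=1).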