Comparing strings in two pandas columns

Time: 2020-08-12 18:30:48

Tags: python pandas nlp sequencematcher

I am trying to determine the similarity between two columns in a pandas DataFrame:

Text1                                                                             All
Performance results achieved by the approaches submitted to this Challenge.       The six top approaches and three others outperform the strong baseline.
Accuracy is one of the basic principles of perfectionist.                             Where am I?

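I want to compare 'Performance results ...' with 'The six...', and 'Accuracy is one...' with 'Where am I?'. The first row should have a higher similarity between the two columns because they share some words. The second row should be 0, because the two columns have no words in common.

To compare the two columns I used SequenceMatcher, as follows: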

from difflib import SequenceMatcher

ratio = SequenceMatcher(None, df.Text1, df.All).ratio()

But passing df.Text1, df.All seems to be wrong.

Can you tell me why?

1 answer:

Answer 0: (score: 0)

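  • SequenceMatcher is not designed for pandas Series; it compares two sequences element by element.
  • You can use .apply to call the function on each row.
  • SequenceMatcher Examples
    • With isjunk=None, nothing is treated as junk, not even spaces.
    • With isjunk=lambda y: y == " ", spaces are treated as junk.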
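To illustrate the first point, here is a minimal sketch that uses plain Python lists in place of the two columns (the lists and the variable names are assumptions made for illustration, not part of the question). When whole collections are passed in, each complete string counts as a single element, so nothing matches unless two cells are identical. Comparing one pair of strings at a time gives a character-level ratio per row.

```python
from difflib import SequenceMatcher

# Plain lists stand in for the two DataFrame columns (illustrative assumption).
text1 = [
    "Performance results achieved by the approaches submitted to this Challenge.",
    "Accuracy is one of the basic principles of perfectionist.",
]
all_ = [
    "The six top approaches and three others outperform the strong baseline.",
    "Where am I?",
]

# Passing the whole collections: every full string is one element,
# and no string in text1 equals a string in all_, so nothing matches.
column_ratio = SequenceMatcher(None, text1, all_).ratio()
print(column_ratio)  # 0.0

# Passing one pair of strings at a time: characters are the elements.
row_ratios = [SequenceMatcher(None, a, b).ratio() for a, b in zip(text1, all_)]
print(row_ratios)
```

This is why the comparison has to be done per row, which is what the .apply call does.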
from difflib import SequenceMatcher
import pandas as pd

data = {'Text1': ['Performance results achieved by the approaches submitted to this Challenge.', 'Accuracy is one of the basic principles of perfectionist.'],
        'All': ['The six top approaches and three others outperform the strong baseline.', 'Where am I?']}

df = pd.DataFrame(data)

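# isjunk=lambda y: y == " " treats spaces as junk
# Access the row values by column label; positional x[0] / x[1] is deprecated on a labeled Series
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x['Text1'], x['All']).ratio(), axis=1)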

# display(df)
                                                                         Text1                                                                      All     ratio
0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.356164
1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.088235

# isjunk=None treats no character as junk
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(None, x['Text1'], x['All']).ratio(), axis=1)

# display(df)
                                                                         Text1                                                                      All     ratio
0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.410959
1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.117647
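Note that the second row does not come out as 0 in either output above, because SequenceMatcher compares strings character by character and the two sentences still share some letters. If the goal is a score of 0 when no words are shared, one possible approach is to split each string into words first and compare the word lists. The sketch below is only one way to do that; the helper name `word_ratio` and the tokenization (lowercasing, then extracting `\w+` runs) are assumptions, not part of the original answer.

```python
import re
from difflib import SequenceMatcher


def word_ratio(a: str, b: str) -> float:
    """Similarity of two strings based on whole words instead of characters."""
    # Lowercase and keep only runs of word characters, which drops punctuation.
    words_a = re.findall(r"\w+", a.lower())
    words_b = re.findall(r"\w+", b.lower())
    # SequenceMatcher accepts any sequences of hashable items, including lists of words.
    return SequenceMatcher(None, words_a, words_b).ratio()


first = word_ratio(
    "Performance results achieved by the approaches submitted to this Challenge.",
    "The six top approaches and three others outperform the strong baseline.",
)
second = word_ratio(
    "Accuracy is one of the basic principles of perfectionist.",
    "Where am I?",
)

print(first)   # greater than 0: the sentences share "the" and "approaches"
print(second)  # 0.0: no words in common
```

With the DataFrame from the answer, this could be applied per row with something like df.apply(lambda row: word_ratio(row['Text1'], row['All']), axis=1).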