Group by one column and compare text similarity in pandas

Time: 2020-04-27 21:40:02

Tags: python pandas sentence-similarity

I have a DataFrame with nearly one million rows and the following columns:

  VIN                  Complaints                                  Repairs  Key
 12234          Customer states engine issues                        yes     1
 12234          Car wont start. Engine broke down                    no      2
 12234          Vehicle battery was replaced                         yes     3
 12231          Car shut down, battery problem                       no      4
 12231          Cool and hot air coming from ac                      yes     5
 12231          Issue with temperature moderator, ac replaced        yes     6
 12231          air conditioner not working fine                     no      7

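I want to group df by 'VIN' and, within each group, compute the pairwise text similarity between every entry in 'Complaints' and the first complaint whose 'Repairs' value is 'yes'.

For example, in VIN group 12234 the first complaint with Repairs = 'yes' is "Customer states engine issues", and it should be compared with the other two complaints in the same VIN group (and with itself). By Key, the pairs are (1,1), (1,2), (1,3) for VIN 12234 and (5,4), (5,5), (5,6), (5,7) for VIN 12231.

Desired output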
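The pairing described above can be sketched in pandas as follows. This is only an illustration on the sample rows; it assumes every VIN has at least one row with Repairs equal to 'yes', and the names `first_yes`, `ref_key` and `pairs` are made up for the example.

```python
import pandas as pd

# Sample rows from the question (Complaints omitted, since only the pairing is shown).
df = pd.DataFrame({
    "VIN": [12234, 12234, 12234, 12231, 12231, 12231, 12231],
    "Repairs": ["yes", "no", "yes", "no", "yes", "yes", "no"],
    "Key": [1, 2, 3, 4, 5, 6, 7],
})

# Key of the first row in each VIN group whose Repairs value is "yes".
first_yes = df[df["Repairs"] == "yes"].groupby("VIN")["Key"].first()

# Look up the reference Key for every row, then pair it with the row's own Key.
ref_key = df["VIN"].map(first_yes)
pairs = list(zip(ref_key, df["Key"]))
print(pairs)
```

If some VIN had no 'yes' row, `map` would leave a missing value for its rows, so that case would need separate handling.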

  VIN                  Complaints                                  Repairs  Key   Text_distance
 12234          Customer states engine issues                        yes     1       1
 12234          Car wont start. Engine broke down                    no      2       1
 12234          Vehicle battery was replaced                         yes     3       0 
 12231          Car shut down, battery problem                       no      4       0
 12231          Cool and hot air coming from ac                      yes     5       1
 12231          Issue with temperature moderator, ac replaced        yes     6       1
 12231          air conditioner not working fine                     no      7       1

I tried the following code, but it does not give the expected result:

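from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
trsfm = vectorizer.fit_transform(df['Complaints'])  # sparse TF-IDF matrix, one row per complaint
# This line fails: cosine_similarity is not a DataFrame method, and the sparse matrix has no .values
df['Text_distance'] = df.groupby('VIN').apply(lambda x: x.cosine_similarity(trsfm.values, trsfm.values[:, None]))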
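For comparison, here is a minimal sketch of what the attempt above seems to be aiming for: fit TF-IDF once, then within each VIN group compare every complaint with the first complaint marked 'yes'. It is one possible reading of the requirement, not a confirmed solution. It assumes df has a default 0..n-1 index so that row labels line up with rows of the TF-IDF matrix, and it stores raw cosine similarity scores (not the 0/1 values shown in the desired output) in `Text_distance`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "VIN": [12234, 12234, 12234, 12231, 12231, 12231, 12231],
    "Complaints": [
        "Customer states engine issues",
        "Car wont start. Engine broke down",
        "Vehicle battery was replaced",
        "Car shut down, battery problem",
        "Cool and hot air coming from ac",
        "Issue with temperature moderator, ac replaced",
        "air conditioner not working fine",
    ],
    "Repairs": ["yes", "no", "yes", "no", "yes", "yes", "no"],
    "Key": [1, 2, 3, 4, 5, 6, 7],
})

# Assumption: a 0..n-1 index, so label i in df is row i of the TF-IDF matrix.
df = df.reset_index(drop=True)

# Fit once on all complaints; the result is a sparse matrix with one row per complaint.
tfidf = TfidfVectorizer().fit_transform(df["Complaints"])

df["Text_distance"] = float("nan")
for vin, group in df.groupby("VIN", sort=False):
    yes_rows = group.index[group["Repairs"] == "yes"]
    if len(yes_rows) == 0:
        continue  # no reference complaint in this group; leave NaN
    rows = group.index.to_numpy()
    ref = tfidf[[yes_rows[0]]]  # 1 x n_features row for the first "yes" complaint
    # Similarity of every complaint in the group to the reference complaint.
    df.loc[rows, "Text_distance"] = cosine_similarity(tfidf[rows], ref).ravel()

print(df[["VIN", "Key", "Text_distance"]])
```

The explicit loop avoids the index reshaping that `groupby(...).apply` can introduce. With close to a million rows and many VINs it may be slow, so a vectorised variant could be worth exploring.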

How can I fix this? Also, please suggest whether Jaccard, jaro_distance, cosine, or some other method would work better in my case.
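To make the Jaccard option concrete, here is a small word-level sketch using only the standard library. The tokenisation (lowercase, split on non-word characters) and the function name `jaccard_similarity` are assumptions for illustration. Whether this suits the data better than cosine similarity would need to be checked on real complaints.

```python
import re


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # assumption: two empty texts are treated as identical
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


reference = "Cool and hot air coming from ac"
for complaint in [
    "Car shut down, battery problem",
    "Issue with temperature moderator, ac replaced",
    "air conditioner not working fine",
]:
    print(complaint, jaccard_similarity(reference, complaint))
```

Word-level Jaccard ignores word frequency and word order, so it treats a rare shared term the same as a common one. TF-IDF weighting is one way to account for that difference.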

0 answers:

No answers yet