I have a DataFrame with nearly one million rows and the columns shown below:
VIN    Complaints                                     Repairs  Key
12234  Customer states engine issues                  yes      1
12234  Car wont start. Engine broke down              no       2
12234  Vehicle battery was replaced                   yes      3
12231  Car shut down, battery problem                 no       4
12231  Cool and hot air coming from ac                yes      5
12231  Issue with temperature moderator, ac replaced  yes      6
12231  air conditioner not working fine               no       7
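I want to group df by "VIN" and, within each group, compute the pairwise text similarity between every entry in "Complaints" and the complaint on the first row where "Repairs" is "yes".
For example, in VIN group 12234 the first row with Repairs = yes is "Customer states engine issues" (Key 1), so that complaint should be compared with every complaint in the same VIN group, itself included. The pairs are (1,1) (1,2) (1,3) for VIN group 12234 and (5,4) (5,5) (5,6) (5,7) for VIN group 12231.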
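To make the pairing rule concrete, here is a minimal sketch that lists the (reference Key, row Key) pairs for the sample data. The names `ref_key` and `pairs` are only illustrative, and the sketch assumes the rows are already in the order in which "first" should be interpreted.

```python
import pandas as pd

# Sample data from the question (only the columns needed for pairing).
df = pd.DataFrame({
    "VIN": [12234, 12234, 12234, 12231, 12231, 12231, 12231],
    "Repairs": ["yes", "no", "yes", "no", "yes", "yes", "no"],
    "Key": [1, 2, 3, 4, 5, 6, 7],
})

# Key of the first row with Repairs == "yes" inside each VIN group.
ref_key = df.loc[df["Repairs"].eq("yes")].groupby("VIN")["Key"].first()

# Pair every row's Key with the reference Key of its own VIN group.
pairs = list(zip(df["VIN"].map(ref_key), df["Key"]))
print(pairs)
```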
Desired output:
VIN    Complaints                                     Repairs  Key  Text_distance
12234  Customer states engine issues                  yes      1    1
12234  Car wont start. Engine broke down              no       2    1
12234  Vehicle battery was replaced                   yes      3    0
12231  Car shut down, battery problem                 no       4    0
12231  Cool and hot air coming from ac                yes      5    1
12231  Issue with temperature moderator, ac replaced  yes      6    1
12231  air conditioner not working fine               no       7    1
I tried the following code, but it does not give the expected result:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
vectorizer = TfidfVectorizer()
trsfm = vectorizer.fit_transform(df['Complaints'])
df['Text_distance']= df.groupby('VIN').apply(lambda x: x.cosine_similarity(trsfm.values, trsfm.values[:, None]))
How can I fix this? Also, please advise whether Jaccard, Jaro distance, cosine, or some other method would be more efficient in my case.
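For reference, the attempt above cannot run as written: `cosine_similarity` is a function from scikit-learn rather than a DataFrame method, and the sparse matrix returned by `fit_transform` has no `.values` attribute. One possible approach is sketched below. It relies on the fact that `TfidfVectorizer` L2-normalises each row by default, so the cosine similarity between a row and its group's reference row is simply their dot product, and no full pairwise matrix is needed. The helper column `pos`, the extra column `Text_similarity`, and the rule "flag is 1 when the similarity is greater than 0" are assumptions made for this sketch; the question does not say how the 0/1 values are meant to be derived.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "VIN": [12234, 12234, 12234, 12231, 12231, 12231, 12231],
    "Complaints": [
        "Customer states engine issues",
        "Car wont start. Engine broke down",
        "Vehicle battery was replaced",
        "Car shut down, battery problem",
        "Cool and hot air coming from ac",
        "Issue with temperature moderator, ac replaced",
        "air conditioner not working fine",
    ],
    "Repairs": ["yes", "no", "yes", "no", "yes", "yes", "no"],
    "Key": [1, 2, 3, 4, 5, 6, 7],
})

# Sparse TF-IDF matrix; rows are unit length (default norm="l2").
tfidf = TfidfVectorizer().fit_transform(df["Complaints"])

# Row position of the first "yes" in each VIN group (helper column, assumed name).
df["pos"] = np.arange(len(df))
first_yes = df.loc[df["Repairs"].eq("yes")].groupby("VIN")["pos"].first()
ref_pos = df["VIN"].map(first_yes)  # NaN for VINs without any "yes"

# Compare each row with its group's reference row via a row-wise dot product.
sim = np.full(len(df), np.nan)
rows = np.flatnonzero(ref_pos.notna().to_numpy())
refs = ref_pos.to_numpy()[rows].astype(int)
sim[rows] = np.asarray(tfidf[rows].multiply(tfidf[refs]).sum(axis=1)).ravel()

df["Text_similarity"] = sim
# Assumed rule: any shared vocabulary counts as a match (NaN becomes 0).
df["Text_distance"] = (sim > 0).astype(int)
print(df.drop(columns="pos"))
```

The alignment step (`ref_pos`) does not depend on the metric, so a token-set Jaccard or a Jaro score could be substituted at the same place. Which of them is fastest on a million rows would have to be measured on the real data.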