Question

I have a pd.DataFrame (named df_sim) that looks like the following (sample only):

Edited

Type                                     Sent
icro soft student advantage      why should i connect to vpn
Microsoft Student Advantage      why should i connect to vpn
the V  P  N                      why should i connect to vpn
the V PN                         why should i connect to vpn
my laundry bucks                 why should i connect to vpn
CSI                              why should i connect to vpn

My goal is to find the 'Type' that is most similar to the given 'Sent'. I am computing cosine similarity, and the relevant code is below.

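import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

# Replace every non-alphanumeric character in 'Type' with a space
df_sim['Type'] = df_sim['Type'].apply(lambda x: re.sub('[^A-Za-z0-9]', ' ', x))
# Fit one TF-IDF vocabulary on both columns, then vectorize each column separately
vect = TfidfVectorizer()
vect.fit(df_sim['Type'] + " " + df_sim['Sent'])
A = vect.transform(df_sim['Type'])
B = vect.transform(df_sim['Sent'])
# Row-wise cosine distance between each 'Type' and its 'Sent' (smaller = more similar)
sim = paired_cosine_distances(A, B)
y = df_sim.iloc[np.argmin(sim)]['Type']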
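
One thing that may be worth checking in the code above is which tokens the vectorizer actually keeps. The sketch below is only an approximation: it uses the standard library to apply the default `token_pattern` documented for scikit-learn's `TfidfVectorizer` (runs of two or more word characters, after lowercasing) to the sample rows. The helper name `default_tokens` and the list `shared_per_row` are made up for illustration.

```python
import re

# Default token_pattern documented for scikit-learn's TfidfVectorizer:
# only runs of two or more word characters count as tokens.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_tokens(text):
    """Approximate the default tokenization: lowercase, then apply the pattern."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

# Sample rows copied from the table in the question
types = [
    "icro soft student advantage",
    "Microsoft Student Advantage",
    "the V  P  N",
    "the V PN",
    "my laundry bucks",
    "CSI",
]
sent = "why should i connect to vpn"

sent_tokens = set(default_tokens(sent))
shared_per_row = []
for row, text in enumerate(types):
    tokens = default_tokens(text)
    shared = sent_tokens & set(tokens)  # tokens this row has in common with Sent
    shared_per_row.append(shared)
    print(row, tokens, "shared with Sent:", sorted(shared))
```

If this approximation holds, single letters such as V, P and N are dropped, and no 'Type' row shares a token with the sentence. Every paired distance would then be about 1, so the position of the minimum would likely be decided by ties or floating-point noise and not by similarity.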

I expected np.argmin(sim) to be 2 or 3, so that y would be 'the V P N' or 'the V PN'. Instead, np.argmin(sim) is 5 and y is 'CSI'. This leaves me with two questions:

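Should I be using the cosine similarity measure at all? Would something like a regex be sufficient?
Why is the phrase picked from the 'Type' column incorrect? The technique above works in many cases, but not here. Why?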
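
As a point of comparison for the first question, here is a minimal sketch of a regex-plus-cosine variant that does not depend on word boundaries. It is a swapped-in technique, not the TF-IDF pipeline above: it strips everything except letters and digits (so 'the V P N' becomes 'thevpn'), counts character trigrams, and takes the cosine similarity of those raw counts. The helper names `squash`, `char_ngrams` and `cosine_similarity`, and the choice of n=3, are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def squash(text):
    """Lowercase and drop everything except letters and digits (spaces included)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def char_ngrams(text, n=3):
    """Count the character n-grams of the squashed text."""
    s = squash(text)
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Sample rows copied from the table in the question
types = [
    "icro soft student advantage",
    "Microsoft Student Advantage",
    "the V  P  N",
    "the V PN",
    "my laundry bucks",
    "CSI",
]
sent = "why should i connect to vpn"

sent_vec = char_ngrams(sent)
scores = [cosine_similarity(char_ngrams(t), sent_vec) for t in types]
# Similarity, not distance: take the largest score (the first one wins on ties)
best = max(range(len(types)), key=scores.__getitem__)
print(best, types[best], scores)
```

On the sample rows only the two 'the V P N' variants share a trigram ('vpn') with the sentence, so one of them is selected. Whether this generalizes to the full data is untested; short trigrams can also produce accidental matches on longer text.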

Pandas text similarity between a list and a pd.DataFrame column

0 answers: