I'm trying to learn data science and found this great article online.
https://bergvca.github.io/2017/10/14/super-fast-string-matching.html
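I have a database full of company names, but I found that the results with a similarity of 1 are actually the exact same row. I obviously want to catch duplicates, but I don't want a row to match itself.
On a side note, this has really opened my eyes to pandas and NLP. It is a fascinating field, and I hope someone here can help me.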
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct
pd.set_option('display.max_colwidth', None)  # show full column contents; -1 is deprecated in newer pandas
df = pd.read_csv('CSV/Contacts.csv', dtype=str)
print(df.shape)
df.head(2)
Shape: (72489, 3)
                   Id                     Name                  Email
0  0031J00001bvXFTQA2  FRESHPOINT ATLANTA, INC      dotcomp@sysco.com
1  0031J00001aJtFaQAK                   VIRGIL  dotcom@corp.sysco.com
Then I clean the data
# Clean the data
df.dropna()  # note: returns a new DataFrame; the result is not assigned, so df is unchanged
# df['Email'] = df['Email'].str.replace('[^a-zA-Z]', '')
# df['Email'] = df['Email'].str.replace(r'[^\w\s]+', '')
contact_emails = df['Email']
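Then I implement the n-grams function
def ngrams(string, n=3):
    # strip commas, hyphens, periods, slashes and a trailing " BD" token
    string = re.sub(r'[,-./]|\sBD',r'', string)
    # slide a window of n characters over the cleaned string
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]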
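To make the behaviour of `ngrams` concrete, here is a small self-contained sketch that repeats the function from the post and runs it on two made-up strings (the inputs are illustrative only, not taken from the data):

```python
import re

def ngrams(string, n=3):
    # same cleaning step as in the post: drop , - . / and " BD"
    string = re.sub(r'[,-./]|\sBD', r'', string)
    # overlapping windows of n characters
    grams = zip(*[string[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]

short = ngrams('abcd')      # no punctuation to strip
cleaned = ngrams('a.b-cd')  # punctuation is removed before the windows are cut
print(short)
print(cleaned)
```

Both calls should give the same two trigrams, because the punctuation is stripped before the string is split.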
Then I get the TF-IDF matrix
# get Tf-IDF Matrix
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(contact_emails.apply(lambda x: np.str_(x)))
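As a rough illustration of what this step produces, the sketch below fits the same kind of vectorizer on three made-up addresses (hypothetical data). It relies on `TfidfVectorizer` using its default `norm='l2'`, so each row of the result should have unit length.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(string, n=3):
    string = re.sub(r'[,-./]|\sBD', r'', string)
    return [''.join(g) for g in zip(*[string[i:] for i in range(n)])]

# made-up sample, not from the real CSV
sample = ['dotcom@corp.sysco.com', 'dotcomp@sysco.com', 'info@example.org']

toy_vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
toy_tfidf = toy_vectorizer.fit_transform(sample)

# one row per input string, one column per distinct trigram
print(toy_tfidf.shape)

# Euclidean length of every row
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1))).ravel()
print(row_norms)
```

Because the rows are unit vectors, a plain dot product between two rows is already their cosine similarity.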
Then I implement the cosine similarity function. I am still not sure what each parameter does.
def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
    idx_dtype = np.int32
    # at most ntop results are stored per row of A
    nnz_max = M*ntop
    # pre-allocated output buffers in CSR layout
    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)
    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)
    return csr_matrix((data,indices,indptr),shape=(M,N))
Then we actually find the matches. I am not sure what the transpose is doing here, or how the matches are found.
matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 10, 0.7)
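A small dense sketch may make this call easier to read. Since `TfidfVectorizer` L2-normalises each row by default, multiplying the matrix by its own transpose gives the cosine similarity between every pair of rows, and the diagonal (each row against itself) is always 1. `dense_cossim_top` below is a hypothetical NumPy-only stand-in for `awesome_cossim_top`, not the library's implementation. It assumes `ntop` caps the number of scores kept per row and `lower_bound` is the threshold a score must exceed to be kept.

```python
import numpy as np

def dense_cossim_top(A, B, ntop, lower_bound=0.0):
    """Illustrative dense version: per row, keep the ntop largest scores above lower_bound."""
    scores = A @ B                       # (M, K) @ (K, N) -> (M, N)
    result = np.zeros_like(scores)
    for i, row in enumerate(scores):
        top = np.argsort(row)[::-1][:ntop]      # column indices of the ntop largest scores
        keep = top[row[top] > lower_bound]      # drop scores at or below the threshold
        result[i, keep] = row[keep]
    return result

# three made-up unit-length rows (stand-ins for TF-IDF vectors)
X = np.array([[1.0, 0.0, 0.0],
              [0.8, 0.6, 0.0],
              [0.0, 0.0, 1.0]])

sims = dense_cossim_top(X, X.T, ntop=2, lower_bound=0.7)
print(sims)
```

In this toy case rows 0 and 1 are similar to each other, row 2 is similar to nothing else, and every row still scores 1 against itself on the diagonal.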
Then there is the function that extracts the matches.
def get_matches_df(sparse_matrix, email_vector,email_ids, top=5):
    # row and column positions of every stored similarity score
    non_zeros = sparse_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size
    left_name_Ids = np.empty([nr_matches], dtype=object)
    right_name_Ids = np.empty([nr_matches], dtype=object)
    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)
    for index in range(nr_matches):
        left_name_Ids[index] = email_ids[sparserows[index]]
        left_side[index] = email_vector[sparserows[index]]
        right_name_Ids[index] = email_ids[sparsecols[index]]
        right_side[index] = email_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]
    return pd.DataFrame({
        'SFDC_ID': left_name_Ids,
        'left_side': left_side,
        'right_SFDC_ID':right_name_Ids,
        'right_side': right_side,
        'similairity': similairity})
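As a side sketch, the same (row, column, score) triples could probably be read from the COO form of the matrix without a Python loop. `matches_to_frame` is a hypothetical helper, not part of the article. It assumes `values` and `ids` are in the same positional order as the rows of the matrix, and it is shown on a made-up 2x2 matrix.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def matches_to_frame(sparse_matrix, values, ids):
    # COO format exposes parallel arrays: row index, column index, stored value
    coo = sparse_matrix.tocoo()
    values = np.asarray(values, dtype=object)
    ids = np.asarray(ids, dtype=object)
    return pd.DataFrame({
        'SFDC_ID': ids[coo.row],
        'left_side': values[coo.row],
        'right_SFDC_ID': ids[coo.col],
        'right_side': values[coo.col],
        'similairity': coo.data,
    })

# made-up similarity matrix for two records
toy_matches = csr_matrix(np.array([[1.0, 0.8],
                                   [0.8, 1.0]]))
toy_frame = matches_to_frame(toy_matches,
                             ['a@example.org', 'aa@example.org'],
                             ['id0', 'id1'])
print(toy_frame)
```

Positional indexing through NumPy arrays avoids depending on the pandas index labels, which matters if rows have been dropped or filtered beforehand.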
Then I call the function and pass in the arguments
name_Ids = df['Id']
matches_df = get_matches_df(matches, contact_emails,name_Ids, top=72489)
Now I only want to extract the matches that are 90% similar or higher.
matches_df = matches_df[matches_df['similairity'] > 0.9]
Then I sort the values by similarity
matches_df.sort_values('similairity' )
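What I found is that identical rows match each other. I know this because the SFDC IDs are exactly the same. Why does this happen, and how can I avoid it in the future? Obviously, when looking for similar entries, I don't want a row to be compared against itself.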
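One possible way to handle this, sketched on made-up data: since the matrix is compared against its own transpose, every row scores 1 against itself, so those pairs could be filtered out afterwards by comparing the two ID columns. The toy frame below is hypothetical and only reuses the column names from `get_matches_df`. The second filter is an optional extra that assumes the IDs are comparable strings, and it keeps just one of each mirrored pair (A, B) and (B, A).

```python
import pandas as pd

# made-up matches in the same layout as matches_df
toy = pd.DataFrame({
    'SFDC_ID':       ['a',  'a',  'b',  'b'],
    'right_SFDC_ID': ['a',  'b',  'b',  'a'],
    'similairity':   [1.0, 0.95, 1.0, 0.95],
})

# drop rows where a record was matched against itself
no_self = toy[toy['SFDC_ID'] != toy['right_SFDC_ID']]

# optionally keep each unordered pair only once
unique_pairs = no_self[no_self['SFDC_ID'] < no_self['right_SFDC_ID']]

print(no_self)
print(unique_pairs)
```

Two different records that share exactly the same email would still appear with a similarity of 1, since their IDs differ, so genuine duplicates are not lost by this filter.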