Tf-idf matching of one list against another list, rather than within a single list

Asked: 2018-03-26 15:09:23

Tags: python string-matching tf-idf

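I'm new to Python and I'm trying to use tf-idf matching. I followed the tutorial in this article. I'd like to know whether I can match an input list against a second list of already processed data, so that the script returns, for each item in the input list, its potential matches from that existing second list.

I hope one of you can point me in the right direction! Thanks!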

import pandas as pd

# None means "no truncation"; the older value -1 is rejected by recent pandas versions
pd.set_option('display.max_colwidth', None)
names = pd.read_csv('sample-data/descriptions_1.csv')


import re
def ngrams(string, n=4):
    string = re.sub(r'[,-./]|\sBD', r'', str(string))
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

from sklearn.feature_extraction.text import TfidfVectorizer

company_names = names['name']
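# NOTE: comparer (the second list) is never loaded in this script; it has to be
# read into a DataFrame first, otherwise this line raises a NameError
comparer_names = comparer['name']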
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(company_names)

import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct


def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape

    idx_dtype = np.int32

    nnz_max = M * ntop

    indptr = np.zeros(M + 1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)

    return csr_matrix((data, indices, indptr), shape=(M, N))

import time
t1 = time.time()
matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 30, 0.5)
t = time.time()-t1
print("SELFTIMED:", t)


def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    # Cap at the number of matches actually found to avoid an IndexError
    if top:
        nr_matches = min(top, sparsecols.size)
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similairity': similairity})

matches_df = get_matches_df(matches, company_names, top=1000)
matches_df = matches_df[matches_df['similairity'] < 0.99999] # Remove all exact matches
print(matches_df.sample(20))

file_name = str("hallo.csv")
matches_df.to_csv(file_name, sep=',', encoding='utf-8')

1 Answer:

Answer 0: (score: 0)

I was able to solve this with a few modifications to the method from the article you linked. The key change is the following:

vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
first_idf_matrix = vectorizer.fit_transform(first_lines)
second_idf_matrix = vectorizer.transform(second_lines)
# B must have shape (vocabulary, N), so the second matrix is transposed
matches = awesome_cossim_top(first_idf_matrix, second_idf_matrix.transpose(), 1, 0)
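
To make the shapes concrete, here is a minimal sketch of the same fit/transform idea. It uses plain sparse matrix multiplication instead of awesome_cossim_top, and everything in it is an assumption for illustration: the sample strings, the simplified analyzer char_ngrams, and n=3 are not from the question. Because the vectorizer is fitted on the first list only, n-grams that occur only in the second list are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def char_ngrams(text, n=3):
    # Hypothetical simplified analyzer: overlapping character n-grams
    text = str(text).lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical sample data, not taken from the question
first_lines = ["Acme Corporation", "Globex Inc"]
second_lines = ["ACME Corp", "Initech LLC", "Globex Incorporated"]

vectorizer = TfidfVectorizer(min_df=1, analyzer=char_ngrams)
first_idf_matrix = vectorizer.fit_transform(first_lines)  # shape (2, V)
second_idf_matrix = vectorizer.transform(second_lines)    # shape (3, V)

# Reusing the fitted vectorizer gives both matrices the same V columns
assert first_idf_matrix.shape[1] == second_idf_matrix.shape[1]

# Rows are L2-normalised by default (norm='l2'), so the product of the
# first matrix with the transposed second matrix holds cosine similarities
similarity = first_idf_matrix @ second_idf_matrix.transpose()  # shape (2, 3)

print(similarity.shape)
print(similarity.toarray().round(2))
```

Row i of the result holds the scores of first_lines[i] against every entry of second_lines. A second-list entry that shares no n-gram with the first list ends up as an all-zero column.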

This means the function get_matches_df is no longer useful. Instead, to extract the matching row for each row of the first list, we do the following:

for first_idx, _ in enumerate(first_lines):
    # Column index of the best-scoring row in the second list
    second_idx = matches[first_idx].argmax()
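
As a rough sketch of how that loop could collect the results, the example below reads the best match per row out of a small sparse score matrix. The lists and the scores in matches are made-up stand-ins for the real output of awesome_cossim_top. The check on row.nnz is my own addition: without it, a row with no candidates would still yield an index from argmax and be reported as a match.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical data standing in for the two lists and their similarity scores
first_lines = ["Acme Corporation", "Globex Inc", "Umbrella"]
second_lines = ["ACME Corp", "Initech LLC", "Globex Incorporated"]
matches = csr_matrix(np.array([
    [0.71, 0.00, 0.43],
    [0.00, 0.00, 0.76],
    [0.00, 0.00, 0.00],  # no candidate at all for "Umbrella"
]))

results = []
for first_idx, line in enumerate(first_lines):
    row = matches[first_idx]  # 1 x N sparse row
    if row.nnz == 0:
        # Nothing scored above zero, so record "no match"
        results.append((line, None, 0.0))
        continue
    second_idx = row.argmax()  # column of the highest score in this row
    results.append((line, second_lines[second_idx], row[0, second_idx]))

for left, right, score in results:
    print(left, "->", right, score)
```

From here the tuples in results can be passed to pd.DataFrame and written out with to_csv, as in the question.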