Question

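I am trying to measure the Word Mover's Distance between many texts in Python using Gensim's Word2Vec tools. I am comparing each text with every other text, so I first use itertools to create pairwise combinations, e.g. [1,2,3] -> [(1,2), (1,3), (2,3)]. For memory's sake, I don't build the combinations by repeating all the texts in one big dataframe; instead I make a reference dataframe, combinations, that holds only the indices of the texts.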
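To make the index-only idea concrete, here is a minimal sketch. The three texts are hypothetical, and plain Python lists stand in for the dataframes used in the question.

```python
import itertools

# Hypothetical toy corpus; in the question the texts live in df['text'].
texts = ["first text", "second text", "third text"]

# Pair up positions, not the texts themselves.
index_pairs = list(itertools.combinations(range(len(texts)), 2))
print(index_pairs)  # [(0, 1), (0, 2), (1, 2)]

# Look the texts up only when a pair is actually compared.
i, j = index_pairs[1]
tokens_i, tokens_j = texts[i].split(), texts[j].split()
```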

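Then, in the comparison function, I use these indices to look up the texts in the original dataframe. The solution works fine, but I wonder whether it can cope with large datasets. For example, I have a dataset of 300,000 rows of text, which gives me about 100 years of computation on my laptop: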

C(300000, 2) = 300000! / (2! * (300000 - 2)!)
             = (300000 * 299999) / (2 * 1)
             = 44,999,850,000 combinations
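The count can be double-checked with the standard library. The sketch below also derives the throughput implied by the "about 100 years" figure; that figure is taken from the question as an assumption, not measured here.

```python
import math

n_texts = 300_000
n_pairs = math.comb(n_texts, 2)  # n * (n - 1) / 2 unordered pairs
print(n_pairs)  # 44999850000

# Throughput implied by the "about 100 years" estimate (an assumption taken
# from the question, not a measurement).
seconds_per_year = 365.25 * 24 * 3600
pairs_per_second = n_pairs / (100 * seconds_per_year)
print(round(pairs_per_second, 1))
```

In other words, the estimate corresponds to roughly 14 comparisons per second, so parallelism alone would need a very large speedup to make the full pairwise run practical.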

Is there any way to optimize this better?

My current code:

import multiprocessing
import itertools
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from gensim.models.word2vec import Word2Vec
from gensim.corpora.wikicorpus import WikiCorpus

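def get_distance(row):
    # row holds the two index labels of the texts to compare
    try:
        sent1 = df.loc[row[0], 'text'].split()
        sent2 = df.loc[row[1], 'text'].split()
        return model.wv.wmdistance(sent1, sent2)  # Compute WMD
    except Exception:
        # Any failure (e.g. a non-string text) yields NaN for this pair
        return np.nan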

df = pd.read_csv('data.csv')

# I then set up the gensim model, let me know if you need that bit of code too.

# Make pairwise combination of all indices
combinations = pd.DataFrame(itertools.combinations(df.index, 2))

# To dask df and apply function
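dcombinations = dd.from_pandas(combinations, npartitions=2 * multiprocessing.cpu_count())
# meta tells dask the name and dtype of the result so it does not have to infer them
dcombinations['distance'] = dcombinations.apply(get_distance, axis=1, meta=('distance', 'f8'))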
with ProgressBar():
    combinations = dcombinations.compute()
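A side note on the memory concern: pd.DataFrame(itertools.combinations(df.index, 2)) materializes every pair at once. Assuming 64-bit integer indices, two columns of about 45 billion rows come to roughly 720 GB before any distance is computed. One possible way around that, sketched below, is to consume the combinations lazily in fixed-size chunks. The helper name iter_pair_chunks and the chunk size are hypothetical, and this does not reduce the number of WMD computations, only the memory held at any moment.

```python
import itertools

def iter_pair_chunks(indices, chunk_size):
    """Yield lists of (i, j) index pairs, at most chunk_size pairs each."""
    pairs = itertools.combinations(indices, 2)  # lazy: nothing is materialized yet
    while True:
        chunk = list(itertools.islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk

# Hypothetical toy input: 5 document indices, 4 pairs per chunk.
chunks = list(iter_pair_chunks(range(5), chunk_size=4))
print([len(c) for c in chunks])  # [4, 4, 2]
```

Each chunk could then be turned into a small dataframe and passed through get_distance, with the results written out before the next chunk is generated.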

Answer 1

You can use wmd-relax to improve performance. However, you first have to convert your model to spaCy and use the "SimilarityHook" as described on its web page:

import spacy
import wmd

nlp = spacy.load('en_core_web_md')
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))

Can I optimize this Word Mover's Distance lookup function?
