Question

I have a DataFrame `df` of string values:

   Keyword
    plant
    cell
    cat
    Pandas

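I want to find the relationship, or correlation, between these string values.

I tried pandas: corr = df1.corrwith(df2,axis=0). That is useful for finding the correlation between numeric values, but I want to see whether two strings are related by finding a correlation distance between them. How can I do that?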

Answer 1

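There are a few steps here. The first thing to do is extract some kind of vector for each word.

A good way to do that is gensim's word2vec (you need to download the pretrained vectors file first):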

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

Once the pretrained vectors are loaded, extract the vector for each word:

vector = model['plant']

Or, for the pandas column in the example:

df['Vectors'] = df['Keyword'].apply(lambda x: model[x])
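
Note that `model[x]` raises a `KeyError` for any keyword that is missing from the model's vocabulary. A minimal sketch of a guarded lookup follows. `lookup_vector` is a hypothetical helper, not part of gensim, and the toy dictionary only stands in for the loaded model; the sketch assumes nothing beyond the object supporting `in` and `[]`, which gensim's `KeyedVectors` does:

```python
import numpy as np

def lookup_vector(model, word):
    """Return the vector for `word`, or None if the model does not know it."""
    # Works with any mapping-like object that supports `in` and `[]`.
    return model[word] if word in model else None

# Toy stand-in for the real model: made-up 3-dimensional vectors.
toy_model = {
    "plant": np.array([0.9, 0.1, 0.0]),
    "cell": np.array([0.7, 0.3, 0.1]),
}

known = lookup_vector(toy_model, "plant")
missing = lookup_vector(toy_model, "Pandas")  # not in the toy vocabulary, so None
```

With the DataFrame this would become df['Keyword'].apply(lambda x: lookup_vector(model, x)); rows that come back as None then have to be dropped before computing distances.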

Once that is done, there are several ways to compute the distance between two vectors, for example the Euclidean distance:

from sklearn.metrics.pairwise import euclidean_distances
distances = euclidean_distances(list(df['Vectors']))

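`distances` will be a matrix with 0 on the diagonal that holds the distance between every pair of words. The closer a distance is to 0, the more similar the two words are.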
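
To make such a matrix easier to read, one option is to list each pair of words once, sorted by distance. The sketch below is only an illustration: the three vectors are made up (real word2vec vectors have 300 dimensions), and the distances are computed with plain NumPy rather than scikit-learn so that the snippet runs on its own:

```python
from itertools import combinations
import numpy as np

words = ["plant", "cell", "cat"]
# Made-up 3-dimensional vectors, for illustration only.
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.0, 0.2, 0.9],
])

# Pairwise Euclidean distances: a square matrix with zeros on the diagonal.
diff = vectors[:, None, :] - vectors[None, :, :]
distances = np.linalg.norm(diff, axis=-1)

# Each unordered pair once, sorted from most similar to least similar.
pairs = sorted(
    ((words[i], words[j], float(distances[i, j]))
     for i, j in combinations(range(len(words)), 2)),
    key=lambda p: p[2],
)
for a, b, d in pairs:
    print(f"{a} / {b}: {d:.3f}")
```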

You can use different models and different distance metrics, but this should work as a starting point.
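
As one example of a different metric, cosine distance compares only the direction of two vectors and ignores their length, which is a common choice for word embeddings. The function below is a small sketch with made-up 2-dimensional inputs; `cosine_distance` is a name chosen here for illustration:

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 3.0])  # orthogonal to a

d_ab = cosine_distance(a, b)  # same direction, so the distance is 0
d_ac = cosine_distance(a, c)  # orthogonal, so the distance is 1
```

For a whole matrix at once, scikit-learn provides cosine_distances in sklearn.metrics.pairwise, which can be called the same way as euclidean_distances.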

Answer 2

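The way of loading the model shown above often does not work, so I will share what worked for me. I am using Google Colab, which is why every command is prefixed with "!".

Download the file (that is, the model) with wget like this: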

!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

Next, decompress the file with gzip:

!gzip -d GoogleNews-vectors-negative300.bin.gz
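
Outside a notebook, or on a machine without the gzip command, the same step can be done with Python's standard library. This is a sketch rather than part of the original recipe; `gunzip` is a hypothetical helper name, and unlike gzip -d it leaves the .gz file in place:

```python
import gzip
import shutil
from pathlib import Path

def gunzip(src, dst=None):
    """Decompress `src` (a .gz file) to `dst`; by default, drop the .gz suffix."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # Copies in chunks, so the file is never held in memory all at once.
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Calling gunzip("GoogleNews-vectors-negative300.bin.gz") would write GoogleNews-vectors-negative300.bin next to the archive.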

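Next, load the downloaded file with `models` from the gensim library, using the code below. This gives you the word vector model for further use. I am using Google Colab, so the file path will differ if you run this locally: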

from gensim import models
model = models.KeyedVectors.load_word2vec_format(
    '/content/GoogleNews-vectors-negative300.bin', binary=True)