Dynamically assigning each document's similarity matrix to a dictionary for export to JSON

Asked: 2017-03-11 20:04:10

Tags: python dictionary scikit-learn sparse-matrix cosine-similarity

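I'm very new to Python, so I'm sure I'm missing something simple, but I can't figure it out. I have created a similarity matrix for every document in my corpus, and I want to assign each one back to a dictionary keyed by document name, so that I can keep track of the similarity between every pair of documents.

However, the code keeps assigning the last matrix to every key instead of assigning each key its own matrix.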

import pandas as pd
import numpy as np
import nltk
import string
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import json
import os

path = "stories/"
token_dict = {}
stemmer = PorterStemmer()

def tokenize(text):
   tokens = nltk.word_tokenize(text)
   stems = stem_tokens(tokens, stemmer)
   return stems

def stem_tokens(tokens, stemmer):
    stemmed_words = []
    for token in tokens:
        stemmed_words.append(stemmer.stem(token))
    return stemmed_words


for subdir, dirs, files in os.walk(path):
    for file_name in files:
        file_path = os.path.join(subdir, file_name)
        with open(file_path, "r", encoding="utf-8") as story:
            text = story.read()
        lowers = text.lower()
        # Translation table that deletes every punctuation character
        table = str.maketrans('', '', string.punctuation)
        no_punctuation = lowers.translate(table)
        # Key each document by its file name
        token_dict[file_name] = no_punctuation

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

termarray = tfs.toarray()
nparray = np.array(termarray)
rows, cols = nparray.shape

similarity = []
for document in docdict:
    for row in range(0, rows-1):
        similarity = cosine_similarity(tfs[row:row+1], tfs)
        docdict[document] = similarity

Everything works as expected until the assignment back to the dictionary.

This produces a dictionary like this:

{'98ststory1.txt': array([[ 0.10586559,  0.04742287,  0.02478352,    0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]]),
 'alfredststory1.txt': array([[ 0.10586559,  0.04742287,  0.02478352,  0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]]),
 'alfredststory2.txt': array([[ 0.10586559,  0.04742287,  0.02478352,     0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]])

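Every key is assigned the matrix of the second-to-last document. Landing on the second-to-last document is just a simple off-by-one error; the real problem is that every key is assigned the same matrix.

The matrix I get for a single document looks like this: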

array([[ 1.        ,  0.07015725,  0.01593837,  0.05618977,  0.03892873,
         0.02434279,  0.06029888,  0.02261425,  0.03531677,  0.02975444,
         0.01835854,  0.02145624,  0.00985163,  0.03645598,  0.0497407 ,
         0.04482995,  0.06677013,  0.03153055,  0.10919878,  0.12029462,
         0.07255828,  0.05499581,  0.06330188,  0.04719668,  0.08909685,
         0.04484428,  0.06725359,  0.04453039,  0.02381673,  0.02639529,
         0.01012012,  0.0218679 ,  0.09989828,  0.10586559,  0.01535069]])

These are the similarities of every document to the first document. What I want is a dictionary that looks like this:

{
    story1:
          {
              story1: 1.,
              story2: 0.07015725,
              story3: 0.01593837,
              story4: 0.05618977... 
          }
    story2:
          {
              story1: ...
          }
 }

...and so on.

A sample dataset looks like this:

story1 = """Four other streets were renamed in Cork at the turn of the last   century to celebrate this event: Wolfe Tone St. (Previously Fair Lane), John Philpot Curran St. (Philpot’s Lane), Emmet (Nelson’s) Place and Sheare’s (Nile) St."""
story2 = """Oliver Plunkett Street was originally named George's Street after George I, the then reigning King of Great Britain and Ireland. In 1920, during the Burning of Cork, large parts of the street were destroyed by British troops."""
story3 = """Alfred Street is a connecting Street between Kent Train Station and MacCurtain Street. Present Cork city centre signage uses letters inspired by the book of Kells. This has been an inspiration for many typefaces in the past, including the Petrie's 'B' typface, and Monotype's 'Column Cille', which was widely used for school textbooks."""

Running the script on this data produces similarity matrices like these:

[[ 1.          0.05814422  0.06032458]]
[[ 0.05814422  1.          0.21323354]]
[[ 0.06032458  0.21323354  1.        ]]

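Each of these is a 1 × n matrix holding one document's similarity to every document. I want to turn them into a dictionary that lets me look up the similarity between any document and any other, like this: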

{
    story1: {
                story1: 1.,
                story2: 0.05814422,
                story3: 0.06032458
            },
    story2: {
                story1: 0.05814422,
                story2: 1.,
                story3: 0.21323354
            },
    story3: {
                story1: 0.06032458,
                story2: 0.21323354,
                story3: 1.
            }
}

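I'm sure this is a basic problem, but my grasp of Python's data structures is weak, and any help would be greatly appreciated!

1 Answer:

Answer 0 (score: 0)

Suppose you have the following similarity matrix: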

sim = cosine_similarity(tfs)

In [261]: sim
Out[261]:
array([[ 1.        ,  0.09933054,  0.08911641],
       [ 0.09933054,  1.        ,  0.27252107],
       [ 0.08911641,  0.27252107,  1.        ]])
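
For intuition about what each entry of that matrix means: the cosine similarity of two vectors is their dot product divided by the product of their lengths. The following is only a pure-Python sketch of that formula, not scikit-learn's implementation; the `cosine` helper and the `docs` vectors are made up for illustration, and the sketch does not handle all-zero vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up term-weight vectors, one per document (illustration only).
docs = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [3.0, 1.0, 0.0],
]

# Entry [i][j] compares document i with document j, so the matrix is
# symmetric and every document has similarity 1 with itself.
sim = [[cosine(u, v) for v in docs] for u in docs]
```

Row i of such a matrix is the same information the question computes one row at a time with `cosine_similarity(tfs[row:row+1], tfs)`.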

Note: we don't need a loop to compute the similarity matrix.

Using the pandas module, we can do the following:
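
As for why the loop in the question gives every key the same matrix: the inner loop runs to completion for each key, so each key ends up holding whichever row was computed last, and `range(0, rows-1)` stops one row early. Below is a minimal sketch of that pattern and of one possible fix without pandas. The `names` and `sim` values are stand-ins taken from the question's sample output, and the sketch assumes the rows of `sim` are in the same order as the document names.

```python
# Stand-ins for the question's data: document names and a precomputed
# similarity matrix, one row per document, in the same order.
names = ["story1", "story2", "story3"]
sim = [
    [1.0, 0.05814422, 0.06032458],
    [0.05814422, 1.0, 0.21323354],
    [0.06032458, 0.21323354, 1.0],
]

# The pattern from the question: for every key the inner loop runs over
# all rows, so each key keeps only the last row it saw.
buggy = {}
for name in names:
    for row in range(0, len(sim) - 1):  # also skips the final row
        buggy[name] = sim[row]

# One possible fix: walk names and rows together, one row per name,
# and label each value in the row with the document it refers to.
docdict = {}
for name, row in zip(names, sim):
    docdict[name] = dict(zip(names, row))
```

With the real data, `names` would be `list(token_dict.keys())` and `sim` could be `cosine_similarity(tfs).tolist()`, which turns the NumPy array into plain nested lists of Python floats.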

In [262]: df = pd.DataFrame(sim,
                            columns=list(token_dict.keys()),
                            index=list(token_dict.keys()))

The resulting DataFrame:

In [263]: df
Out[263]:
          story1    story2    story3
story1  1.000000  0.099331  0.089116
story2  0.099331  1.000000  0.272521
story3  0.089116  0.272521  1.000000

Now we can easily convert the DataFrame to a dict:

In [264]: df.to_dict()
Out[264]:
{'story1': {'story1': 1.0000000000000009,
  'story2': 0.099330538266243495,
  'story3': 0.089116410701360893},
 'story2': {'story1': 0.099330538266243495,
  'story2': 0.99999999999999911,
  'story3': 0.27252107037687257},
 'story3': {'story1': 0.089116410701360893,
  'story2': 0.27252107037687257,
  'story3': 1.0}}

Or go straight to JSON:

df.to_json('/path/to/file.json')
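
If you would rather work from the nested dict than from the DataFrame, the standard-library `json` module can serialize it too. This is a small sketch under the assumption that the values are plain Python floats; the `similarities` dict below is a hand-written stand-in with the rounded values shown in the DataFrame above.

```python
import json

# Hand-written stand-in for the nested dict (rounded sample values).
similarities = {
    "story1": {"story1": 1.0, "story2": 0.099331, "story3": 0.089116},
    "story2": {"story1": 0.099331, "story2": 1.0, "story3": 0.272521},
    "story3": {"story1": 0.089116, "story2": 0.272521, "story3": 1.0},
}

# Serialize to a JSON string; indent=2 only makes it easier to read.
payload = json.dumps(similarities, indent=2)

# Reading it back gives the same nested structure.
loaded = json.loads(payload)
```

To write a file instead of a string, `json.dump(similarities, fh)` with an open file handle `fh` does the same job.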