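What should be given as input to the linkage function: the tfidf matrix, or the similarity between different elements of the tfidf matrix?

Time: 2019-01-06 13:04:10

Tags: machine-learning cluster-analysis hierarchical-clustering cosine-similarity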

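I have the following Python notebook, which aims to cluster groups of abstracts based on the text similarity between them. I have two approaches here: one passes the documents' tfidf numpy array to the linkage function as it is, and the second computes the similarity between the tfidf arrays of the different documents and then uses that similarity matrix for clustering. I can't work out which one is correct.

Approach 1:

I computed the similarity matrix (cosine) of the tfidf matrix using cosine similarity. I first converted the redundant matrix (cosine) into a condensed distance matrix (distance_matrix) using the squareform function. I then fed distance_matrix into the linkage function and plotted the result as a dendrogram.

Approach 2:

I passed the dense form of the tfidf array (via todense()) to the linkage function and plotted the dendrogram.

My question is: which one is correct? From what I understand of the data, approach 2 seems to be correct, but approach 1 makes sense to me. It would be great if someone could explain which is right in this situation. Thanks in advance.

Let me know if anything in the question is unclear.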

import pandas, numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

###Data Cleaning

stop_words = stopwords.words('english')
tokenizer = RegexpTokenizer(r'\w+')
df=pandas.read_csv('WIPO_CSV.csv')


import sys
reload(sys)
sys.setdefaultencoding('utf8')


documents_no_stopwords=[]

def preprocessing(word):
    tokens = tokenizer.tokenize(word)

    processed_words = []
    for w in tokens:
        if w in stop_words:
            continue
        else:
            processed_words.append(w)

    # This step creates a list of text documents with the stop words removed
    documents_no_stopwords.append(' '.join(processed_words))

for text in df['TEXT'].tolist():
    preprocessing(text)

# Converting into tfidf form
# latin1 is used because the utf8 decoder was having some trouble with the text.

vectoriser = TfidfVectorizer(encoding='latin1')

# The result is a matrix in normalised form (a scipy sparse matrix, not a numpy array)

tfidf_documents = vectoriser.fit_transform(documents_no_stopwords)


##Cosine Similarity as the input to linkage should be a distance vector

from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import squareform

cosine = cosine_similarity(tfidf_documents)
distance_matrix = squareform(cosine,force='tovector',checks=False)

from scipy.cluster.hierarchy import dendrogram, linkage

##Linkage based on tfidf of each document
z_num=linkage(tfidf_documents.todense(),'ward')

z_num  #tfidf

array([[11.        , 12.        ,  0.        ,  2.        ],
   [18.        , 19.        ,  0.        ,  2.        ],
   [20.        , 31.        ,  0.        ,  3.        ],
   [21.        , 32.        ,  0.        ,  4.        ],
   [22.        , 33.        ,  0.        ,  5.        ],
   [17.        , 34.        ,  0.38208619,  6.        ],
   [15.        , 28.        ,  1.19375843,  2.        ],
   [ 6.        ,  9.        ,  1.24241258,  2.        ],
   [ 7.        ,  8.        ,  1.27069483,  2.        ],
   [13.        , 37.        ,  1.28868301,  3.        ],
   [ 4.        , 24.        ,  1.30850122,  2.        ],
   [36.        , 39.        ,  1.32090275,  5.        ],
   [10.        , 16.        ,  1.32602346,  2.        ],
   [27.        , 38.        ,  1.32934025,  3.        ],
   [23.        , 25.        ,  1.32987072,  2.        ],
   [ 3.        , 29.        ,  1.35143582,  2.        ],
   [ 5.        , 14.        ,  1.35401753,  2.        ],
   [26.        , 42.        ,  1.35994878,  3.        ],
   [ 2.        , 45.        ,  1.40055438,  3.        ],
   [ 0.        , 40.        ,  1.40811825,  3.        ],
   [ 1.        , 46.        ,  1.41383622,  3.        ],
   [44.        , 50.        ,  1.4379821 ,  5.        ],
   [41.        , 43.        ,  1.44575227,  8.        ],
   [48.        , 51.        ,  1.45876241,  8.        ],
   [49.        , 53.        ,  1.47130328, 11.        ],
   [47.        , 52.        ,  1.49944936, 11.        ],
   [54.        , 55.        ,  1.69814818, 22.        ],
   [30.        , 56.        ,  1.91299937, 24.        ],
   [35.        , 57.        ,  3.1967033 , 30.        ]])

from matplotlib import pyplot as plt

plt.figure(figsize=(25, 10))
dn = dendrogram(z_num)
plt.show()

##Linkage based on similarity

z_sim=linkage(distance_matrix,'ward')
z_sim  #Cosine Similarity

array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 2.00000000e+00],
   [2.00000000e+00, 3.00000000e+01, 0.00000000e+00, 3.00000000e+00],
   [1.70000000e+01, 3.10000000e+01, 0.00000000e+00, 4.00000000e+00],
   [3.00000000e+00, 4.00000000e+00, 0.00000000e+00, 2.00000000e+00],
   [1.00000000e+01, 3.30000000e+01, 0.00000000e+00, 3.00000000e+00],
   [5.00000000e+00, 7.00000000e+00, 0.00000000e+00, 2.00000000e+00],
   [6.00000000e+00, 1.80000000e+01, 0.00000000e+00, 2.00000000e+00],
   [1.10000000e+01, 1.90000000e+01, 0.00000000e+00, 2.00000000e+00],
   [1.20000000e+01, 2.00000000e+01, 0.00000000e+00, 2.00000000e+00],
   [8.00000000e+00, 2.40000000e+01, 0.00000000e+00, 2.00000000e+00],
   [1.60000000e+01, 2.10000000e+01, 0.00000000e+00, 2.00000000e+00],
   [2.20000000e+01, 2.70000000e+01, 0.00000000e+00, 2.00000000e+00],
   [9.00000000e+00, 2.90000000e+01, 0.00000000e+00, 2.00000000e+00],
   [2.60000000e+01, 4.20000000e+01, 0.00000000e+00, 3.00000000e+00],
   [1.40000000e+01, 3.40000000e+01, 3.97089886e-03, 4.00000000e+00],
   [2.30000000e+01, 4.40000000e+01, 1.81733052e-02, 5.00000000e+00],
   [3.20000000e+01, 3.50000000e+01, 2.14592323e-02, 6.00000000e+00],
   [2.50000000e+01, 4.00000000e+01, 2.84944415e-02, 3.00000000e+00],
   [1.30000000e+01, 4.70000000e+01, 5.02045376e-02, 4.00000000e+00],
   [4.10000000e+01, 4.30000000e+01, 5.10902795e-02, 5.00000000e+00],
   [3.70000000e+01, 4.50000000e+01, 5.40176402e-02, 7.00000000e+00],
   [3.80000000e+01, 3.90000000e+01, 6.15118462e-02, 4.00000000e+00],
   [1.50000000e+01, 4.60000000e+01, 7.54874869e-02, 7.00000000e+00],
   [2.80000000e+01, 5.00000000e+01, 9.55487454e-02, 8.00000000e+00],
   [5.20000000e+01, 5.30000000e+01, 3.86911095e-01, 1.50000000e+01],
   [4.90000000e+01, 5.40000000e+01, 4.16693529e-01, 2.00000000e+01],
   [4.80000000e+01, 5.50000000e+01, 4.58764920e-01, 2.40000000e+01],
   [3.60000000e+01, 5.60000000e+01, 5.23422380e-01, 2.60000000e+01],
   [5.10000000e+01, 5.70000000e+01, 5.49419077e-01, 3.00000000e+01]])

from matplotlib import pyplot as plt

plt.figure(figsize=(25, 10))
dn = dendrogram(z_sim)
plt.show()

Compare the accuracy of the clustering against this photo: https://drive.google.com/file/d/1EgkPqwh7AKhGqOe1zf9KNjSMxPQ9Xfd9/view?usp=sharing

The dendrograms I obtained can be found in the following notebook link: https://drive.google.com/file/d/1TB7aFK4lPDo43GY74FPOqVOx1AxWV-A_/view?usp=sharing Open this html in a web browser.

1 Answer:

Answer 0: (score: 1)

Scipy only supports distances for HAC (hierarchical agglomerative clustering), not similarities.
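As a minimal sketch of what that implies for the first approach: the cosine similarity matrix would be turned into a distance before it goes through squareform and linkage. The conversion used here (1 - similarity) is one common choice, and the toy corpus and the 'average' linkage method are assumptions for illustration, not taken from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the abstracts in the question.
docs = [
    "solar panel energy storage",
    "solar cell energy conversion",
    "protein folding simulation",
    "protein structure prediction",
]

tfidf = TfidfVectorizer().fit_transform(docs)

similarity = cosine_similarity(tfidf)        # 1.0 on the diagonal
distance = 1.0 - similarity                  # turn similarity into a distance
distance = (distance + distance.T) / 2.0     # enforce exact symmetry
distance = np.clip(distance, 0.0, None)      # remove tiny negative round-off
np.fill_diagonal(distance, 0.0)              # a document is at distance 0 from itself

condensed = squareform(distance)             # condensed vector of length n*(n-1)/2
Z = linkage(condensed, method="average")     # 'average' is an assumption, see note

# Cut the tree into two flat clusters to inspect the grouping.
labels = fcluster(Z, t=2, criterion="maxclust")
```

The scipy documentation states that the 'centroid', 'median' and 'ward' methods are only correctly defined for Euclidean distances, which is why this sketch uses 'average' on the cosine distances.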

Once you use distances, the results should be the same. So there is no "right" or "wrong".
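One way to see why the two routes can agree, assuming TfidfVectorizer is left at its default norm='l2': every row then has unit length, so the squared Euclidean distance between two rows equals twice their cosine distance. The sketch below checks that identity numerically on a hypothetical toy corpus. It illustrates the relationship between the two distances only, and is not a proof that every linkage method yields an identical tree.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
docs = [
    "solar panel energy storage",
    "solar cell energy conversion",
    "protein folding simulation",
    "protein structure prediction",
]

# Default norm="l2": each row of X has unit Euclidean length.
X = TfidfVectorizer().fit_transform(docs).toarray()

euclidean = pdist(X, metric="euclidean")   # what linkage(X, ...) uses on raw rows
cosine = pdist(X, metric="cosine")         # 1 - cosine similarity

# For unit vectors: ||u - v||^2 = 2 - 2 u.v = 2 * (1 - cos(u, v))
identity_holds = np.allclose(euclidean ** 2, 2.0 * cosine)
print(identity_holds)
```

Because one distance is a monotonic function of the other, pairs of documents are ranked in the same order by both.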

At some point, you need the distance matrix in linearized form. It is probably most efficient to use a method that a) can process sparse data (avoiding any todense() calls) and b) produces the linearized form directly, rather than producing the entire matrix and then throwing half of it away.
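The two points above can be sketched as follows, as one possible reading rather than a definitive recipe. scipy's pdist returns the linearized (condensed) form directly but needs a dense array. The second route stays sparse by relying on the rows being L2-normalised (the TfidfVectorizer default), so that row dot products are cosine similarities. The toy corpus and the variable names are assumptions, and for a large corpus the pair list would need to be processed in chunks.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
docs = [
    "solar panel energy storage",
    "solar cell energy conversion",
    "protein folding simulation",
    "protein structure prediction",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse, rows L2-normalised by default
n = tfidf.shape[0]

# Route 1: pdist yields the condensed vector directly, but requires a dense array.
condensed_dense = pdist(tfidf.toarray(), metric="cosine")

# Route 2: stay sparse. Enumerate the pairs (i, j) with i < j in the same
# order pdist uses, then take the dot product of each pair of sparse rows.
rows, cols = np.triu_indices(n, k=1)
pair_similarity = np.asarray(
    tfidf[rows].multiply(tfidf[cols]).sum(axis=1)
).ravel()
condensed_sparse = np.clip(1.0 - pair_similarity, 0.0, None)

# Either condensed vector can be passed to linkage as is.
Z = linkage(condensed_sparse, method="average")
```

Neither route builds the full square matrix and discards half of it; only the n*(n-1)/2 pairwise values are ever stored.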