I want to compute the Adjusted Rand Index (ARI) for Affinity Propagation. I have a dataset consisting of sentences like these:
Youtube
Facebook
Whatsapp
Open Youtube
My Affinity Propagation code is as follows:

    import string

    import nltk
    import pandas as pd
    from sklearn.cluster import AffinityPropagation
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Map every punctuation character to None so that translate() strips it
    punctuation_map = dict((ord(char), None) for char in string.punctuation)
    stemmer = nltk.stem.snowball.SpanishStemmer()

    def stem_tokens(tokens):
        return [stemmer.stem(item) for item in tokens]

    def normalize(text):
        # Lowercase, strip punctuation, tokenize, then stem
        return stem_tokens(nltk.word_tokenize(text.lower().translate(punctuation_map)))

    vectorizer = TfidfVectorizer(tokenizer=normalize)

    def get_clusters(sentences):
        tf_idf_matrix = vectorizer.fit_transform(sentences)
        # TF-IDF rows are L2-normalized, so this product is the cosine similarity
        similarity_matrix = (tf_idf_matrix * tf_idf_matrix.T).toarray()
        affinity_propagation = AffinityPropagation(affinity="precomputed", damping=0.5)
        affinity_propagation.fit(similarity_matrix)
        labels = affinity_propagation.labels_
        global cluster_centers
        cluster_centers = affinity_propagation.cluster_centers_indices_
        tagged_sentences = zip(sentences, labels)
        clusters = {}
        # Group the sentences under the exemplar sentence of their cluster
        for sentence, cluster_id in tagged_sentences:
            clusters.setdefault(sentences[cluster_centers[cluster_id]], []).append(sentence)
            #print(len(sentence))
        return clusters

    #csv file
    filename = "/home/ubuntu/data/local_queries.csv"
    df = pd.read_csv(filename, header=None)
    sentences = df.iloc[:, 0].values.tolist()
    clusters = get_clusters(sentences)
    print()
    for cluster in clusters:
        print(cluster, ':')
        for element in clusters[cluster]:
            print(' - ', element)

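ARI requires both the true labels and the predicted labels. I don't have true labels, because my dataset contains only sentences. Can anyone explain how I should compute ARI in this situation?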
Answer 0 (score: 2)
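ARI is an external evaluation measure.
It can only be used to compare two results. Typically, you compare a clustering against known class labels, to test whether an implementation works correctly.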
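To illustrate what "comparing two results" means, here is a minimal sketch using `adjusted_rand_score` from scikit-learn. The label lists are made up for the four example sentences in the question (they are assumptions, not real ground truth); the point is only that ARI needs two labelings of the same items and ignores how the cluster ids are named.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical reference labels for: Youtube, Facebook, Whatsapp, Open Youtube
# (assumed for illustration: the two Youtube queries belong together)
true_labels = [0, 1, 2, 0]

# A predicted clustering with the same grouping but different cluster ids
predicted_same_grouping = [1, 0, 2, 1]

# A predicted clustering with a different grouping
predicted_other_grouping = [0, 0, 1, 1]

# ARI compares partitions, so renaming cluster ids does not change the score
ari_same = adjusted_rand_score(true_labels, predicted_same_grouping)
ari_other = adjusted_rand_score(true_labels, predicted_other_grouping)

print("identical grouping:", ari_same)
print("different grouping:", ari_other)
```

Without something to play the role of `true_labels` (known classes, or a second clustering of the same sentences), there is nothing to pass as the first argument.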
If you have only one result and no "true" labels, you can only use internal evaluation measures, with all of their drawbacks.
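As one possible internal measure, here is a rough sketch using the silhouette coefficient (`silhouette_score` in scikit-learn) with cosine distance on TF-IDF vectors. The answer does not name a specific measure, so this choice is an assumption. The `labels` list is also hypothetical and stands in for `affinity_propagation.labels_`, and the default `TfidfVectorizer` tokenizer is used instead of the stemming tokenizer from the question to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# The example sentences from the question
sentences = ["Youtube", "Facebook", "Whatsapp", "Open Youtube"]

# Hypothetical cluster labels, standing in for affinity_propagation.labels_
labels = [0, 1, 2, 0]

# Default tokenizer here; the question uses a custom stemming tokenizer
tf_idf = TfidfVectorizer().fit_transform(sentences).toarray()

# The silhouette coefficient uses only the data and the predicted labels.
# It needs at least 2 clusters and fewer clusters than samples.
score = silhouette_score(tf_idf, labels, metric="cosine")

print("silhouette:", score)
```

The score lies between -1 and 1, with higher values meaning tighter, better-separated clusters. It tends to favor clusterings that match its own notion of compactness, which is one of the drawbacks mentioned above.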