I am using Affinity Propagation Clustering to cluster sentences. As an intermediate step I compute a similarity matrix. This works for small datasets, but throws a memory error for large ones. I have a dataset of sentences.
Sample dataset:
'open contacts',
'open music player',
'play song',
'call john',
'open camera',
'video download',
...
My code:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AffinityPropagation
import pandas as pd

punctuation_map = dict((ord(char), None) for char in string.punctuation)
stemmer = nltk.stem.snowball.SpanishStemmer()

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize)

def get_clusters(sentences):
    tf_idf_matrix = vectorizer.fit_transform(sentences)
    similarity_matrix = (tf_idf_matrix * tf_idf_matrix.T).A
    affinity_propagation = AffinityPropagation(affinity="precomputed", damping=0.5)
    affinity_propagation.fit(similarity_matrix)

    # global labels
    labels = affinity_propagation.labels_
    # global cluster_centers
    cluster_centers = affinity_propagation.cluster_centers_indices_

    tagged_sentences = zip(sentences, labels)
    clusters = {}
    for sentence, cluster_id in tagged_sentences:
        clusters.setdefault(sentences[cluster_centers[cluster_id]], []).append(sentence)
        # print(len(sentence))
    return clusters

# csv file
filename = "/home/ubuntu/VA_data/first_50K.csv"
df = pd.read_csv(filename, header=None)
sentences = df.iloc[:, 0].values.tolist()
clusters = get_clusters(sentences)
Can anyone suggest an efficient way to compute the similarity matrix? My dataset contains 1 million sentences.
Answer 0 (score 0):
One possible approach is to store the data in Spark, which also provides scalable matrix multiplication.
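A minimal sketch of that idea, assuming PySpark is available; the column names, file path, and feature count are placeholders, not part of the original answer. It builds TF-IDF vectors with Spark ML and multiplies the distributed matrix by its transpose, so the cosine-similarity matrix stays partitioned across the cluster instead of being materialised as one dense NumPy array:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, Normalizer
from pyspark.mllib.linalg import Vectors as OldVectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

spark = SparkSession.builder.appName("sentence-similarity").getOrCreate()

# One-column CSV of sentences, as in the question.
df = spark.read.csv("/home/ubuntu/VA_data/first_50K.csv").toDF("sentence")

# TF-IDF with hashed features (the feature count is an arbitrary choice here).
words = Tokenizer(inputCol="sentence", outputCol="words").transform(df)
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 16).transform(words)
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)

# L2-normalise so that a dot product equals cosine similarity.
normed = Normalizer(inputCol="tfidf", outputCol="norm", p=2.0).transform(tfidf)

# Build a distributed row matrix of sentence vectors.
rows = normed.rdd.zipWithIndex().map(
    lambda pair: IndexedRow(pair[1], OldVectors.fromML(pair[0]["norm"])))
mat = IndexedRowMatrix(rows).toBlockMatrix()

# Distributed matrix multiplication: entry (i, j) is the cosine similarity
# between sentence i and sentence j, computed block by block.
similarities = mat.multiply(mat.transpose())

Keep in mind that for one million sentences the full similarity matrix still has about 10^12 entries, so keeping the result distributed (or sparsifying it, for example by dropping similarities below a threshold) matters as much as how it is computed.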