Semantic clustering based on WordNet in Python

Date: 2018-12-05 03:52:58

Tags: python k-means pca wordnet

I am working with a PDF corpus and generating a list of tokens from it. From those tokens I take the 10 most common words and create one cluster per word, and I then need to plot the entire token list according to the related meanings of the tokens. I can generate the clusters, but I would like the scatter plot to also place every token from the full list next to the most common word that shares a matching lemma with it. For example, if the word "time" has a cluster, then the word "year", which appears in the token list but not among the 10 most common words, shares a lemma with it and should land in that cluster. How can the tokens be clustered based on their synsets? Here is my progress so far on plotting the clusters:

import string
import re
import nltk
import PyPDF4
import numpy
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from nltk.corpus import wordnet
import matplotlib.pyplot as plt
import matplotlib.cm as cm


# Declaring all the variables
stopwords = nltk.corpus.stopwords.words('english')
# additional stopwords to be removed manually.
with open('Corpus.txt', 'r') as file:
    moreStopwords = file.read().splitlines()
ps = nltk.PorterStemmer()
wn = nltk.WordNetLemmatizer()

data = PyPDF4.PdfFileReader(open('ReadyPlayerOne.pdf', 'rb'))
pageData = ''
for page in data.pages:
    pageData += page.extractText()


def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokenize = re.split(r"\W+", text)
    text = [wn.lemmatize(word) for word in tokenize if word not in stopwords]
    final = [word for word in text if word not in moreStopwords]
    # Accessing wordnet synset corpora to find the meaning of the words.
    # lemmas = []
    # for token in text:
    #     lemmas += [synset.lemmas()[0].name() for synset in wordnet.synsets(token)]
    # # return list(set(lemmas)) # returns unique words
    # return list(lemmas)
    return final


filter_data = clean_text(pageData)
# get most common words & plot them on bar graph
most_common_words = [word for word, word_count in Counter(filter_data).most_common(10)]
word_freq = [word_count for word, word_count in Counter(filter_data).most_common(10)]
# Vectorizing most common words & filter data
mcw_vec = TfidfVectorizer()
fd_vec = TfidfVectorizer()
tfidf_mcw = mcw_vec.fit_transform(most_common_words)
tfidf_fd = fd_vec.fit_transform(filter_data)
# print(mcw_vec.get_feature_names())

# Create cluster
cluster = KMeans(n_clusters=len(most_common_words), max_iter=300, precompute_distances='auto', n_jobs=-1)
X = cluster.fit_transform(tfidf_mcw)
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:, 0], data2D[:, 1], c=numpy.random.random(len(most_common_words)))
plt.show()
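
As a small, optional variation of the plotting step above, the projected points can be annotated with the word each one represents, assuming the row order of data2D follows the order of most_common_words (it should, since tfidf_mcw was built from that list):

# Annotate each projected cluster point with the most common word it represents.
plt.scatter(data2D[:, 0], data2D[:, 1], c=numpy.random.random(len(most_common_words)))
for word, (x, y) in zip(most_common_words, data2D):
    plt.annotate(word, (x, y), textcoords='offset points', xytext=(3, 3))
plt.show()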

filter_data is the list of all tokens from the PDF source. most_common_words is the list of the most common words that I use to create the 10 clusters. Words that do not fall into the semantic category of any of those 10 words can be discarded. I have already vectorized everything I need; what is left is to cluster the filter data based on synset.lemmas().
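
For that last step, one direction I am considering is to assign every token to the first of the 10 most common words with which it shares a WordNet lemma name. The sketch below is only a rough illustration of that idea, not working production code: lemma_names and synset_clusters are placeholder names I made up, and it reuses filter_data and most_common_words from above, silently skipping tokens that share no lemma with any of the 10 words.

from collections import defaultdict
from nltk.corpus import wordnet


def lemma_names(word):
    # Gather every lemma name that appears in any synset of the word.
    names = set()
    for synset in wordnet.synsets(word):
        names.update(lemma.name().lower() for lemma in synset.lemmas())
    return names


# Pre-compute the lemma sets of the 10 most common words once.
seed_lemmas = {word: lemma_names(word) for word in most_common_words}

# Assign each distinct token to the first common word whose lemma set
# overlaps its own; tokens with no overlap are skipped.
synset_clusters = defaultdict(list)
for token in set(filter_data):
    token_lemmas = lemma_names(token)
    for seed, lemmas in seed_lemmas.items():
        if token_lemmas & lemmas:
            synset_clusters[seed].append(token)
            break

for seed, members in synset_clusters.items():
    print(seed, '->', members[:10])

Each resulting group could then be drawn onto the existing scatter by reusing the 2D coordinates of its seed word from data2D, for example with a small random jitter, so that "year" ends up next to "time".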

0 Answers