Uneven k-means clustering: the same data ends up in two different clusters (Python)

Asked: 2019-03-13 08:49:58

Tags: python machine-learning cluster-computing k-means text-classification

I am new to machine learning and would like some help with text clustering. Please suggest code changes if you see any.

My problem statement is to cluster the input data into several clusters. For this I use TfidfVectorizer with stemming and tokenization, and then apply the k-means algorithm.

In the output, the same data ends up in two different clusters, but I expect it to land in a single cluster.

Please find my sample data and the code I wrote below.

Seat Allocation has been delayed, please wait sometime., Next Update Date : 03/03/2016 16:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 04/05/2018 15:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 05/06/2013 14:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 06/07/2014 13:05:21

Seat Allocation has been delayed., Next Update Date : 06/03/2018 08:44:48
Seat Allocation has been delayed., Next Update Date : 23/02/2018 15:36:18  
Seat Allocation has been delayed., Next Update Date : 08/03/2018 11:19:26
Seat Allocation has been delayed., Next Update Date : 20/03/2018 09:41:21
Seat Allocation has been delayed., Next Update Date : 27/07/2018 11:13:37
Seat Allocation has been delayed., Next Update Date : 22/01/2018 13:46:25

Need background Verification
Need background Verification
Need background Verification
Need background Verification

Sent for verification
Sent for verification
Sent for verification
Sent for verification

The data "Seat Allocation has been delayed..." is being split across two different clusters. For example:

Data:

Seat Allocation has delayed, Next Update Date : 03/03/2015 16:05:21 0
Seat Allocation has delayed, Next Update Date : 03/04/2016 16:05:22 0

will go into cluster 0, and

Seat Allocation has been delayed, please wait sometime., Next Update Date : 04/05/2018 15:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 05/06/2013 14:05:21

will go into cluster 1.

I have also tried decreasing and increasing the number of clusters, but it still does not work as expected. The programming language I am using is Python.
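As a sanity check, here is a minimal sketch (with made-up duplicate strings, separate from my actual code below): k-means always gives identical tf-idf rows the same label, so any split has to come from the inputs vectorizing differently.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up duplicates: identical strings produce identical tf-idf rows,
# so k-means must assign all copies of a sentence the same label.
docs = ["seat allocation has been delayed"] * 3 + ["need background verification"] * 3
vecs = TfidfVectorizer().fit_transform(docs)
print(KMeans(n_clusters=2, random_state=0).fit_predict(vecs))
# e.g. [0 0 0 1 1 1] -- all copies of each sentence share one label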

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
get_ipython().run_line_magic('matplotlib', 'inline')

data = pd.read_excel("C:\\Users\\Desktop\\project\\SampleInput.xlsx")  

punc = ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}',"%"]
stop_words = text.ENGLISH_STOP_WORDS.union(punc)
desc = data['comments_long'].values
vectorizer = TfidfVectorizer(stop_words = stop_words)
X = vectorizer.fit_transform(desc)

word_features = vectorizer.get_feature_names()

# Stemming and tokenizing

stemmer = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'[a-zA-Z\']+')

def tokenize(text):
    return [stemmer.stem(word) for word in tokenizer.tokenize(text.lower())]
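# For example, tokenize("Seat Allocation has been delayed, Next Update Date : 03/03/2016")
# lowercases the text, keeps only alphabetic runs (so digits, dates and
# punctuation are dropped), and stems each token, e.g. "delayed" -> "delay".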

# Vectorization with stop words and the stemming tokenizer
vectorizer2 = TfidfVectorizer(stop_words = stop_words, tokenizer = tokenize)
X2 = vectorizer2.fit_transform(desc)
word_features2 = vectorizer2.get_feature_names()
print(len(word_features2))
print(word_features2[:50])

vectorizer3 = TfidfVectorizer(stop_words = stop_words, tokenizer = tokenize, max_features = 1000)
X3 = vectorizer3.fit_transform(desc)
words = vectorizer3.get_feature_names()

# K-means: elbow method to choose the number of clusters
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i,init='k-means++',max_iter=300,n_init=10,random_state=0)
    kmeans.fit(X3)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.savefig('elbow.png')
plt.show()
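
# Optional sketch: silhouette score as a second opinion on k;
# values closer to 1 mean better-separated clusters.
from sklearn.metrics import silhouette_score
for k in range(2, 8):
    sil_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X3)
    print(k, round(silhouette_score(X3, sil_labels), 3))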

true_k = 3
kmeans = KMeans(n_clusters = true_k, n_init = 20, n_jobs = 1) # n_init: number of k-means runs with different centroid seeds; n_jobs: number of CPU cores to use
kmeans.fit(X3)
# Look at the top 25 terms of each cluster generated by k-means.
common_words = kmeans.cluster_centers_.argsort()[:,-1:-26:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))

data_new = pd.DataFrame()
data['Cluster_Id'] = kmeans.labels_
data_new['X']=desc
data_new['Cluster_Id'] = kmeans.labels_
data_new.to_excel('outputNew'+str(true_k)+'test.xlsx',sheet_name='All_Data', index=False)

from openpyxl import load_workbook
book = load_workbook('outputNew'+str(true_k)+'test.xlsx')
writer = pd.ExcelWriter('outputNew'+str(true_k)+'test.xlsx', engine='openpyxl') 
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)

for i in range(true_k):
    data_new[data_new['Cluster_Id']==i].to_excel(writer,sheet_name='cluster'+str(i), index=False)

writer.save()
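
For reference, the Excel export could also be done in a single pass with one ExcelWriter (a sketch using the same file and sheet names as above), which avoids reloading the workbook:

with pd.ExcelWriter('outputNew' + str(true_k) + 'test.xlsx', engine='openpyxl') as writer:
    # One writer handles every sheet; the file is saved when the block exits.
    data_new.to_excel(writer, sheet_name='All_Data', index=False)
    for i in range(true_k):
        data_new[data_new['Cluster_Id'] == i].to_excel(writer, sheet_name='cluster' + str(i), index=False)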

0 Answers:

No answers yet.