I am new to machine learning and am looking for help with text clustering. Please do suggest code changes.
My problem statement is to cluster the input data into a number of clusters. For this I am using TfidfVectorizer with stemming and tokenization, and then applying the k-means algorithm.
In the output I am getting the same data in two different clusters, but I want those rows to end up in the same cluster.
Please find below the sample data I have and the code I wrote.
Seat Allocation has been delayed, please wait sometime., Next Update Date : 03/03/2016 16:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 04/05/2018 15:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 05/06/2013 14:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 06/07/2014 13:05:21
Seat Allocation has been delayed., Next Update Date : 06/03/2018 08:44:48
Seat Allocation has been delayed., Next Update Date : 23/02/2018 15:36:18
Seat Allocation has been delayed., Next Update Date : 08/03/2018 11:19:26
Seat Allocation has been delayed., Next Update Date : 20/03/2018 09:41:21
Seat Allocation has been delayed., Next Update Date : 27/07/2018 11:13:37
Seat Allocation has been delayed., Next Update Date : 22/01/2018 13:46:25
Need background Verification
Need background Verification
Need background Verification
Need background Verification
Sent for verification
Sent for verification
Sent for verification
Sent for verification
The "Seat Allocation has been delayed..." rows are landing under two different clusters. For example, the rows
Seat Allocation has delayed, Next Update Date : 03/03/2015 16:05:21 0
Seat Allocation has delayed, Next Update Date : 03/04/2016 16:05:22 0
go into cluster 0, while
Seat Allocation has been delayed, please wait sometime., Next Update Date : 04/05/2018 15:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 05/06/2013 14:05:21
go into cluster 1.
I have tried decreasing and increasing the number of clusters as well; it still does not work as expected. The programming language I am using is Python.
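One change I have been considering (not yet in my code below) is normalizing away the variable timestamp before vectorizing, so that rows that differ only in the date collapse to the same string. A minimal sketch, where the regex and the helper name strip_timestamp are just my illustration:

import re

def strip_timestamp(text):
    # Drop the trailing "Next Update Date : dd/mm/yyyy hh:mm:ss" fragment (illustrative pattern).
    return re.sub(r'Next Update Date\s*:\s*\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}', '', text).strip(' ,.')

# strip_timestamp("Seat Allocation has been delayed., Next Update Date : 06/03/2018 08:44:48")
# -> "Seat Allocation has been delayed"

The full code I have so far: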
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
get_ipython().run_line_magic('matplotlib', 'inline')
data = pd.read_excel("C:\\Users\\Desktop\\project\\SampleInput.xlsx")
# Punctuation never survives the tokenizers below, but adding it to the stop word list is harmless.
punc = ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', "%"]
stop_words = text.ENGLISH_STOP_WORDS.union(punc)
desc = data['comments_long'].values
vectorizer = TfidfVectorizer(stop_words = stop_words)
X = vectorizer.fit_transform(desc)
word_features = vectorizer.get_feature_names()  # get_feature_names_out() on newer scikit-learn versions
#STEMMING and Tokenizing
stemmer = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'[a-zA-Z\']+')
def tokenize(text):
    return [stemmer.stem(word) for word in tokenizer.tokenize(text.lower())]
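# Quick sanity check of the tokenizer (my own addition): the [a-zA-Z'] pattern
# drops digits entirely, so the dates never reach the vectorizer, and the
# stemmer maps e.g. 'delayed' -> 'delay'.
# print(tokenize("Seat Allocation has been delayed., Next Update Date : 06/03/2018 08:44:48"))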
# Vectorization with stop words
vectorizer2 = TfidfVectorizer(stop_words = stop_words, tokenizer = tokenize)
X2 = vectorizer2.fit_transform(desc)
word_features2 = vectorizer2.get_feature_names()
print(len(word_features2))
print(word_features2[:50])
vectorizer3 = TfidfVectorizer(stop_words = stop_words, tokenizer = tokenize, max_features = 1000)
X3 = vectorizer3.fit_transform(desc)
words = vectorizer3.get_feature_names()
# k-means
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X3)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.savefig('elbow.png')
plt.show()
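# An optional extra check besides the elbow curve (my own addition): the mean
# silhouette score for each candidate k; higher generally means better-separated clusters.
from sklearn.metrics import silhouette_score
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X3)
    print(k, silhouette_score(X3, labels))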
true_k = 3
# n_init: number of times k-means is run with different centroid seeds; the best run is kept.
kmeans = KMeans(n_clusters=true_k, n_init=20)
kmeans.fit(X3)
# Look at the top 25 terms for each of the clusters generated by k-means.
common_words = kmeans.cluster_centers_.argsort()[:, -1:-26:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))
data_new = pd.DataFrame()
data['Cluster_Id'] = kmeans.labels_
data_new['X']=desc
data_new['Cluster_Id'] = kmeans.labels_
data_new.to_excel('outputNew'+str(true_k)+'test.xlsx',sheet_name='All_Data', index=False)
# Append one sheet per cluster to the workbook written above.
with pd.ExcelWriter('outputNew' + str(true_k) + 'test.xlsx', engine='openpyxl', mode='a') as writer:
    for i in range(true_k):
        data_new[data_new['Cluster_Id'] == i].to_excel(writer, sheet_name='cluster' + str(i), index=False)
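As a sanity check on the clustering output, I also count how many distinct cluster ids each unique comment string received; anything above 1 would mean literally identical text was split across clusters (this check is my own addition):

check = data_new.groupby('X')['Cluster_Id'].nunique()
print(check[check > 1])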