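I'm using sklearn in Python to read an Excel sheet and predict a group name from its description. The next step I want to take is to group similar groups together. I'm not sure which approach would best suit my purpose.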
from __future__ import print_function
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer
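# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split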
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import numpy as np
path = 'data/mydata.csv'
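exp = pd.read_csv(path, names=['group_name', 'description'])
# Drop incomplete rows from both columns at once; filtering X and y
# separately can leave them with different lengths and misaligned rows
exp = exp.dropna(subset=['group_name', 'description'])
fixed_X = exp.description
fixed_y = exp.group_name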
vect = CountVectorizer(token_pattern=u'(?u)\\b\\w\\w+\\b')
nb = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(fixed_X, fixed_y,
                                                    random_state=1)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
print(metrics.classification_report(y_test, y_pred_class))
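This prints how accurate the predicted groups are (precision, recall and F1 per group), which is what I wanted. How can I group similar "group_names" together, based on the provided data and the predictions?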
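One possible starting point, as a minimal sketch rather than a definitive answer: treat all descriptions of a group_name as one document, vectorise those documents with TF-IDF, and run hierarchical clustering on the cosine distances between them. The toy DataFrame, the choice of average linkage, and the 0.9 cut-off are all assumptions made for illustration; they are not taken from the real data and would need tuning.

```python
from collections import defaultdict

import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for data/mydata.csv
exp = pd.DataFrame({
    'group_name': ['cats', 'cats', 'dogs', 'dogs', 'cars', 'trucks'],
    'description': [
        'small furry pet that purrs',
        'furry pet that likes milk',
        'furry pet that barks',
        'loyal pet that barks loudly',
        'vehicle with four wheels and an engine',
        'large vehicle with wheels and an engine',
    ],
})

# One document per group_name: join all of its descriptions
docs = exp.groupby('group_name')['description'].agg(' '.join)

# One TF-IDF row per group_name
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Average-linkage hierarchical clustering on cosine distance
Z = linkage(tfidf, method='average', metric='cosine')

# Cut the tree at a distance threshold (0.9 is an arbitrary, assumed value)
labels = fcluster(Z, t=0.9, criterion='distance')

# Collect the group_names that fall into each cluster
clusters = defaultdict(list)
for name, label in zip(docs.index, labels):
    clusters[label].append(name)

for label, names in sorted(clusters.items()):
    print('group %d: %s' % (label, names))
```

With a distance threshold the number of resulting groups does not have to be fixed in advance; passing `criterion='maxclust'` to `fcluster` instead would request a specific number of groups.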
Desired output (based on the predictions and the data behind them):
if there are 10 groups total
group 1: [group_name1, group_name5, group_name10]
group 2: [group_name2, group_name4]
group 3: [group_name3, group_name6, group_name7, group_name9]
group 4: [group_name8]
(The number of groups doesn't matter; I just want similar group_names to end up in the same group.)
Or a visualization that shows how all the group names cluster.
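For the visual option, a dendrogram might be enough. The sketch below is a heuristic, not an established recipe: it assumes that two groups the classifier often mistakes for each other are similar, turns the confusion matrix into a distance matrix, and builds a tree from it. The `y_true` and `y_pred` lists are made-up stand-ins for `y_test` and `y_pred_class`.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins for y_test and y_pred_class
y_true = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd']
y_pred = ['a', 'a', 'b', 'b', 'b', 'a', 'c', 'c', 'd', 'd', 'd', 'c']

names = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=names).astype(float)

# Row-normalise: share of each true group's samples assigned to each label
row_sums = cm.sum(axis=1, keepdims=True)
rates = cm / np.where(row_sums == 0, 1, row_sums)

# Symmetric similarity: how often two groups are mistaken for each other
sim = (rates + rates.T) / 2
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)

# linkage expects a condensed (upper-triangle) distance vector
Z = linkage(squareform(dist, checks=False), method='average')

# no_plot=True only computes the layout; remove it (with matplotlib
# installed) to draw the tree of group names
tree = dendrogram(Z, labels=names, no_plot=True)
print(tree['ivl'])  # leaf order: groups that get confused sit next to each other
```

Groups that are never confused with each other all end up at the maximum distance of 1, so this view is only informative where the classifier actually makes mistakes; clustering the description vectors directly does not have that limitation.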