Python classification prediction

Time: 2018-01-02 20:55:20

Tags: python pandas scikit-learn

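I am using pandas and sklearn in Python to read a spreadsheet and predict each row's group name from its description. The next step I would like to take is to group similar groups together, but I am not sure which approach best fits my purpose.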

from __future__ import print_function
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

path = 'data/mydata.csv'
exp = pd.read_csv(path, names=['group_name', 'description'])

# Drop rows where either column is missing, so descriptions and labels stay aligned
exp = exp.dropna(subset=['group_name', 'description'])
fixed_X = exp.description
fixed_y = exp.group_name

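# Bag-of-words features; this token_pattern is the same as CountVectorizer's default
vect = CountVectorizer(token_pattern=u'(?u)\\b\\w\\w+\\b')
nb = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(fixed_X, fixed_y,
                                                    random_state=1)
# Learn the vocabulary on the training split only, then reuse it for the test split
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
# Per-group precision, recall and F1
print(metrics.classification_report(y_test, y_pred_class))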

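This prints a report on how accurately the groups are predicted, which is what I want. How can I group similar "group_names" together, based on the data provided and the predictions?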
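One idea I have considered, as a rough sketch only: groups that the classifier often mistakes for each other are probably similar, so the confusion matrix could be turned into a group-to-group similarity. The helper name confusion_similarity is hypothetical, and the toy labels below merely stand in for y_test and y_pred_class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def confusion_similarity(y_true, y_pred, labels):
    """Symmetric matrix: how often two groups are mistaken for each other."""
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    # Turn counts into per-row rates (fraction of each true group's samples)
    row_sums = cm.sum(axis=1, keepdims=True)
    rates = np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)
    # Average both directions so that sim[i, j] == sim[j, i]
    sim = (rates + rates.T) / 2.0
    # Correct predictions say nothing about similarity to other groups
    np.fill_diagonal(sim, 0.0)
    return sim


# Toy labels (assumption): stand-ins for y_test and y_pred_class
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "a", "c", "c"]
labels = ["a", "b", "c"]

sim = confusion_similarity(y_true, y_pred, labels)
print(sim)
```

A high value in sim[i, j] would suggest that labels[i] and labels[j] are candidates for the same group. This only works when the test set contains enough samples of each group.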

Desired output (based on the predictions and the data behind them):

if there are 10 groups total
group 1: [group_name1, group_name5, group_name10]
group 2: [group_name2, group_name4]
group 3: [group_name3, group_name6, group_name7, group_name9]
group 4: [group_name10]

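(The number of groups does not matter; I just want group_names that belong together to end up in the same group.)

Alternatively, a visualization showing how all the group names cluster would also work.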
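A minimal sketch of one approach that might give both outputs, assuming hierarchical clustering is acceptable here: build one TF-IDF vector per group_name from all of its descriptions, cluster those vectors with SciPy, then cut the tree into lists and lay it out as a dendrogram. The toy DataFrame, the switch from CountVectorizer to TfidfVectorizer, and the 0.7 distance threshold are all assumptions for illustration, not values taken from my data.

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (assumption): stands in for the two-column CSV in the question
exp = pd.DataFrame({
    "group_name": ["cats", "cats", "kittens", "kittens", "cars", "trucks"],
    "description": [
        "small furry cat pet", "cat pet that purrs",
        "young furry cat pet", "small kitten cat purrs",
        "vehicle with engine and wheels",
        "large vehicle with engine and wheels",
    ],
})

# One document per group: concatenate all descriptions of that group
docs = exp.groupby("group_name")["description"].apply(" ".join)
names = docs.index.tolist()

# One TF-IDF vector per group
group_vectors = TfidfVectorizer().fit_transform(docs).toarray()

# Average-linkage hierarchical clustering on cosine distance
Z = linkage(group_vectors, method="average", metric="cosine")

# Cut the tree at an arbitrary distance threshold (0.7 is an assumption)
cluster_ids = fcluster(Z, t=0.7, criterion="distance")

# Collect the group names that fall into each cluster
groups = {}
for name, cid in zip(names, cluster_ids):
    groups.setdefault(int(cid), []).append(name)
for cid, members in sorted(groups.items()):
    print("group {}: {}".format(cid, members))

# no_plot=True only computes the layout, so matplotlib is not required.
# With matplotlib installed, drop no_plot=True and call plt.show() to draw it.
tree = dendrogram(Z, labels=names, no_plot=True)
print(tree["ivl"])  # leaf order, left to right
```

Raising or lowering the threshold changes how many groups come out, which suits a case where the number of groups is not fixed in advance.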

0 answers:

No answers