Different clustering labels

Date: 2017-02-24 21:50:59

Tags: machine-learning cluster-analysis k-means

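I am trying to cluster new data that was not seen during training and appears only in the test data. The training file has five classes, while the test data has seven (5 + 2), two of which are new. I want to run k-means to find a suitable existing cluster for each of the new classes, or, if they are not close to any cluster, to create a new cluster for each of them.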

Here is part of my code:



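import numpy as np
import pandas as pd

print("Reading training data...")
# raw strings keep the backslash in the Windows path literal
#mydata = pd.read_csv(r'.\KDDTrain.csv', header=0)
mydata = pd.read_csv(r'.\PTraining.csv', header=0)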

# select all but the last column as data
# (.ix was removed in pandas 1.0; .iloc selects by position)
X_train = mydata.iloc[1:, :-1]
X_train = np.array(X_train)
n_samples, n_features = np.shape(X_train)
# print np.shape(X_train)

# select last column as target/class
y_train = mydata.iloc[1:, n_features]
y_train = np.array(y_train)

# encode target labels with numeric values from 0 to no of classes
# print "Encoding class labels..."
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(y_train)
# print list(label_encoder.classes_)
# print 'total no of classes in dataset=' + str(len(label_encoder.classes_))
y_train = label_encoder.transform(y_train)

# n_samples, n_features = data.shape
n_digits = len(np.unique(y_train))

print("Training data statistics")
print("n_attack_categories: %d, \t n_samples %d, \t n_features %d"
      % (n_digits, n_samples, n_features))

sample_size = 300

# Read test data
mytestdata = pd.read_csv(r'.\KDDTest+.csv', header=0)

print("Reading test data...")
# select all but the last column as data
X_test = mytestdata.iloc[1:, :-1]
X_test = np.array(X_test)
# print np.shape(X_test)

# select last column as target/class
y_test = mytestdata.iloc[1:, n_features]
# print "actual labels"
# print y_test
y_test = label_encoder.transform(y_test)
# print "Encoded labels"
# print y_test
y_test = np.array(y_test)

n_samples_test, n_features_test = np.shape(X_test)
n_digits_test = len(np.unique(y_test))
print("Test data statistics")
print("n_attack_categories: %d, \t n_samples %d, \t n_features %d"
      % (n_digits_test, n_samples_test, n_features_test))

print(79 * '_')



and it gives this error:



File "C:/Users/aalsham4/PycharmProjects/clusteringtask/clustering.py", line 87, in <module>
    y_test = label_encoder.transform(y_test)
  File "C:\Users\aalsham4\AppData\Local\Continuum\Miniconda3\lib\site-packages\sklearn\preprocessing\label.py", line 153, in transform
    raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: ['calss6' 'class7' ]
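The `ValueError` is raised because `LabelEncoder.transform` only accepts labels that were present when `fit` was called. A minimal sketch of one way to detect the unseen labels before encoding, assuming they should be marked with a placeholder value rather than encoded (the label arrays and the `-1` marker are illustrative assumptions, not taken from the data above):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label arrays standing in for the last CSV column.
y_train_raw = np.array(["class1", "class2", "class3", "class1"])
y_test_raw = np.array(["class2", "class6", "class7", "class1"])

encoder = LabelEncoder().fit(y_train_raw)

# Labels that occur in the test set but were never seen during fit.
unseen = np.setdiff1d(y_test_raw, encoder.classes_)

# Encode the known labels normally; mark unseen ones with -1 (an assumed
# placeholder meaning "not one of the training classes").
known_mask = np.isin(y_test_raw, encoder.classes_)
y_test_encoded = np.full(y_test_raw.shape, -1, dtype=int)
y_test_encoded[known_mask] = encoder.transform(y_test_raw[known_mask])

print("unseen labels:", unseen)
print("encoded test labels:", y_test_encoded)
```

An alternative is to fit the encoder on the union of training and test labels, which avoids the error but hides the fact that some classes are new.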

Now I am not sure whether this is the right way to go about clustering the already-labeled classes.

Any suggestions?

1 answer:

Answer 0: (score: 0)

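As @Anony-Mousse said, this is not a k-means problem. k-means finds "natural" groupings, given the number of classes you want. Once those labels have been assigned, further updates are no longer a k-means problem.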

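You can use a variety of statistical-analysis heuristics to decide whether the new classes are "close enough" to the existing ones. These usually rely on measures of mean and deviation (which you already have for the k-means classes), density, and anything else you find relevant to your problem.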
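One such mean-and-deviation heuristic can be sketched as follows: compare each new point's distance to its nearest centroid against the spread of distances seen in that cluster during training. This is only an illustration on synthetic data; the 3-standard-deviation cutoff and all variable names are assumptions, not something prescribed by k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in data: two known groups, then new points of which
# the first five resemble a known group and the last five lie far away.
known = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
new = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(20, 0.5, (5, 2))])

n_clusters = 2
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(known)

# Distance of every training point to its own centroid, summarized per cluster.
train_dist = np.linalg.norm(known - km.cluster_centers_[km.labels_], axis=1)
mean_d = np.array([train_dist[km.labels_ == k].mean() for k in range(n_clusters)])
std_d = np.array([train_dist[km.labels_ == k].std() for k in range(n_clusters)])

# Assign each new point to its nearest centroid, then flag it as novel if it
# lies farther out than mean + 3 * std for that cluster (assumed cutoff).
nearest = km.predict(new)
new_dist = np.linalg.norm(new - km.cluster_centers_[nearest], axis=1)
is_novel = new_dist > mean_d[nearest] + 3 * std_d[nearest]

print("flagged as novel:", is_novel)
```

Points flagged as novel are candidates for a new cluster; the rest can keep the label of their nearest centroid.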

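I suggest looking into spectral clustering algorithms and trying them on the entire data set; they are better suited to finding gaps, reacting to density, and so on (depending on the algorithm you choose for this application).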
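A minimal sketch of that suggestion using scikit-learn's `SpectralClustering` on a combined data set. The data here is synthetic, and the number of clusters, the RBF affinity, and the `gamma` value are assumptions that would need tuning for real data:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# Synthetic stand-in for "training + test rows stacked together":
# three blobs, 40 points each.
centers = [(0, 0), (4, 0), (0, 4)]
X_all = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])

# n_clusters still has to be chosen by the user (assumed to be 3 here).
sc = SpectralClustering(n_clusters=3, affinity="rbf", gamma=1.0,
                        assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X_all)

print("cluster sizes:", np.bincount(labels))
```

Note that `SpectralClustering` has no `predict` method for unseen points, which is one reason to run it on the whole data set at once.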