Question

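Scikit-Learn is a great Python module that provides many algorithms for support vector machines. I've spent the past few days learning how to use it, and I've noticed that it relies heavily on the separate numpy module.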

I understand what the module does, but I'm still learning how it works. Here is a short example of how I use sklearn:

from sklearn import datasets, svm
import numpy

digits = datasets.load_digits() #image pixel data of digits 0-9 as well as a chart of the corresponding digit to each image

clf = svm.SVC(gamma=0.001,C=100) #SVC is the algorithm used for classifying this type of data

x,y = digits.data[:-1], digits.target[:-1] #feed it all the data except the last sample
clf.fit(x,y) #"train" the SVM

print(clf.predict(digits.data[:1])) #>>>[0]  (predict expects a 2D array, one row per sample)
#with 99% accuracy, all of the data consists of 1797 samples.
#if this number gets smaller, the accuracy decreases. with 10 samples (0-9),
#accuracy can still be up to as high as 90%.
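The accuracy figures in the comments above are easier to check when some samples are kept out of training. A minimal sketch, assuming a 75/25 split (the split ratio and the fixed random_state are illustrative choices, not part of the original code):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out 25% of the samples; random_state is fixed only for repeatability.
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(x_train, y_train)

# score() returns the fraction of held-out samples classified correctly.
accuracy = clf.score(x_test, y_test)
print(accuracy)
```

Predicting on samples the classifier was trained on, as in the example above, tends to overstate how well it does on new data.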

This is very basic classification. There are 10 classes: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Running the following code with matplotlib.pyplot:

import matplotlib.pyplot as plt #in shell after running previous code
plt.imshow(digits.images[0],cmap=plt.cm.gray_r,interpolation="nearest")
plt.show()

gives the following image:

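The first pixel (reading left to right, top to bottom) is represented by 0. The second pixel is the same, but the third pixel is represented by 5 (values range from 0 to 16), and the fourth pixel by 13. Here is the actual data for the image: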

[[  0.   0.   5.  13.   9.   1.   0.   0.]
 [  0.   0.  13.  15.  10.  15.   5.   0.]
 [  0.   3.  15.   2.   0.  11.   8.   0.]
 [  0.   4.  12.   0.   0.   8.   8.   0.]
 [  0.   5.   8.   0.   0.   9.   8.   0.]
 [  0.   4.  11.   0.   1.  12.   7.   0.]
 [  0.   2.  14.   5.  10.  12.   0.   0.]
 [  0.   0.   6.  13.  10.   0.   0.   0.]]
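To connect this grid to what the classifier actually receives, the sketch below flattens the 8x8 image row by row into a 64-element vector and compares it with the matching row of digits.data. The variable names are only for illustration:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

image = digits.images[0]    # 2D array, shape (8, 8)
flat = image.reshape(-1)    # 1D array, shape (64,), rows laid end to end

print(image.shape, flat.shape)
# digits.data holds the same pixels, one flattened image per row.
print(np.array_equal(flat, digits.data[0]))
```

So each sample handed to fit() is just a fixed-length list of numbers, which is the form text data also has to be brought into.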

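So my question is: if I wanted to classify text data, for example forum posts that were put in the wrong subforum/category, how would I convert that data into the numeric representation used in this dataset example?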

Answer 1

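For each sample (e.g. each forum post) you need a vector (a list, in Python). For example, if you have 200 posts and their categories, you need 200 lists of training data plus one list of 200 elements holding the category of each post. Each training list can be built with a model such as bag of words (see https://en.wikipedia.org/wiki/Bag-of-words_model). Note that all training lists must have the same number of elements (the same dimension): for example, every list has 3000 elements, each giving how many times a particular word occurs in the post, or 0 if the word is absent. Try this tutorial, which is easy for beginners: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words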
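As a rough sketch of the bag-of-words idea, scikit-learn's CountVectorizer is one way to turn posts into equal-length count vectors. The posts and category names below are made-up toy data, and the linear kernel is just a simple choice for such a small example:

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: post texts and the subforum each one belongs to.
posts = [
    "my python script raises an import error",
    "how do I install numpy with pip",
    "best graphics card for gaming under 300",
    "which monitor is good for gaming",
]
categories = ["programming", "programming", "hardware", "hardware"]

# Learn the vocabulary and count word occurrences: one row per post,
# one column per distinct word, so every row has the same length.
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(posts)

clf = svm.SVC(kernel="linear")
clf.fit(x, categories)

# A new post must be encoded with the SAME vocabulary via transform().
new_post = ["pip cannot install my python package"]
x_new = vectorizer.transform(new_post)

print(x.shape)
print(clf.predict(x_new))
```

With real forum data you would fit the vectorizer on all training posts; words in a new post that were never seen during fitting are simply ignored by transform().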

How do I convert data into SKLearn's ndarray format?
