I'm building a classifier with Weka, and my dataset is sparse (text data). I need to build my feature vectors myself, without using Weka's utility classes to convert my text documents into feature vectors. The problem is that training any classifier is very slow, even though the number of features and samples is small!

I wrote a test case with artificial sparse feature vectors to show how slow it is. You can run it yourself:
import java.util.Date;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.functions.SimpleLogistic;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SparseInstance;

public static void test() throws Exception {
    System.out.println("Started test ... " + new Date());
    Classifier clf = new SimpleLogistic();
    int numberOfFeatures = 2000;
    int numberOfSamples = 6000;
    Random rnd = new Random(0);

    // Define the dataset: numeric attributes plus a nominal class attribute
    FastVector attributes = new FastVector(numberOfFeatures + 1);
    for (int i = 0; i < numberOfFeatures; i++) {
        attributes.addElement(new Attribute(Integer.toString(i)));
    }
    FastVector classes = new FastVector(2);
    classes.addElement("Positive");
    classes.addElement("Negative");
    attributes.addElement(new Attribute("class", classes));
    Instances data = new Instances("test", attributes, numberOfSamples);
    data.setClassIndex(data.numAttributes() - 1);

    // Create artificial sparse feature vectors for the positive class
    for (int i = 0; i < numberOfSamples / 2; i++) {
        double[] vec = new double[numberOfFeatures + 1];
        for (int k = 0; k < 4; k++) {
            vec[rnd.nextInt(numberOfFeatures)] = 1; // activate 4 random features
        }
        Instance instance = new Instance(1.0, vec);
        instance.setDataset(data);
        Instance sparseInstance = new SparseInstance(instance);
        sparseInstance.setDataset(data);
        sparseInstance.setClassValue("Positive");
        data.add(sparseInstance);
    }

    // Create artificial sparse feature vectors for the negative class
    for (int i = 0; i < numberOfSamples / 2; i++) {
        double[] vec = new double[numberOfFeatures + 1];
        for (int k = 0; k < 4; k++) {
            vec[rnd.nextInt(numberOfFeatures)] = 1; // activate 4 random features
        }
        Instance instance = new Instance(1.0, vec);
        instance.setDataset(data);
        Instance sparseInstance = new SparseInstance(instance);
        sparseInstance.setDataset(data);
        sparseInstance.setClassValue("Negative");
        data.add(sparseInstance);
    }

    System.out.println("Building classifier ... ");
    clf.buildClassifier(data);
    System.out.println(new Date());
}
I'm not sure what I should do to speed this up! It really doesn't make sense to me, because gradient descent should be very fast. I also tried a MultilayerPerceptron classifier with a single hidden layer, a single hidden unit, and one epoch, and it was still really slow!
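For scale, here is a rough back-of-envelope sketch in numpy (my own addition, not part of the Weka test): a single full-batch logistic-regression gradient over 6000 samples and 2000 features is just two matrix-vector products, so each gradient-descent step should only take milliseconds even on dense data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 6000, 2000

# Dense stand-in for the sparse data: 4 active features per sample
X = np.zeros((n_samples, n_features))
for i in range(n_samples):
    X[i, rng.integers(0, n_features, size=4)] = 1.0
y = np.concatenate([np.ones(n_samples // 2), np.zeros(n_samples // 2)])

# One full-batch logistic-regression gradient: two matrix-vector products
w = np.zeros(n_features)
p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
grad = X.T @ (p - y) / n_samples     # gradient of the log loss
print(grad.shape)  # (2000,)
```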
Edit

I tried the same idea as in the test case, but with scikit-learn, and it was lightning fast! Here it is:
import random

import numpy as np
from sklearn import linear_model

numberOfFeatures = 2000
numberOfSamples = 6000

X = np.zeros((numberOfSamples, numberOfFeatures))
y = np.zeros(numberOfSamples)
for i in range(numberOfSamples):
    for _ in range(8):
        X[i][random.randint(0, numberOfFeatures - 1)] = 1  # activate 8 random features
for i in range(100):
    y[i] = 1

clf = linear_model.LogisticRegression()
print('fitting')
clf.fit(X, y)
print('done!')
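As a further sketch (my own addition, assuming scipy and scikit-learn are available): scikit-learn estimators also accept scipy.sparse matrices directly, so the dense X above can be replaced by a genuinely sparse matrix, keeping the fit fast while avoiding the dense memory cost:

```python
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.linear_model import LogisticRegression

numberOfFeatures = 2000
numberOfSamples = 6000

rng = np.random.default_rng(0)
X = lil_matrix((numberOfSamples, numberOfFeatures))  # LIL is cheap to build row by row
for i in range(numberOfSamples):
    for j in rng.integers(0, numberOfFeatures, size=8):
        X[i, j] = 1.0
X = X.tocsr()  # convert to CSR, the format estimators handle efficiently

y = np.zeros(numberOfSamples)
y[:100] = 1

clf = LogisticRegression()
clf.fit(X, y)
print(clf.coef_.shape)  # (1, 2000)
```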