Training a classifier with Weka is too slow

Time: 2015-05-05 17:56:27

Tags: java weka

I am building a classifier with Weka and my dataset is sparse (text data). I need to build the feature vectors myself rather than use Weka's utility classes to convert my text documents into feature vectors. The problem is that training any classifier is very slow, even though the number of features and samples is small!
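For context, each hand-built feature vector is essentially a handful of term indices mapped to weights. A minimal sketch of that idea with Weka's SparseInstance index/value constructor (Weka 3.6 API; the test case below wraps a dense array instead, and the concrete indices here are just made up for illustration) could look like this:

import weka.core.Instance;
import weka.core.SparseInstance;

// Illustration only: a document containing three terms with binary weights,
// 2000 word attributes plus one class attribute (hence 2001 slots).
// Indices are given in ascending order.
int[] indices = { 5, 42, 1377 };
double[] values = { 1.0, 1.0, 1.0 };
Instance doc = new SparseInstance(1.0, values, indices, 2001);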

I wrote a test case with artificially generated sparse feature vectors to show how slow it is. You can run it yourself.

import java.util.Date;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.functions.SimpleLogistic;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SparseInstance;

public static void test() throws Exception {
        System.out.println( "Started test ... " + new Date() );

        Classifier clf = new SimpleLogistic();
        int numberOfFeatures = 2000;
        int numberOfSamples = 6000;
        Random rnd = new Random(0);

        //Define dataset
        FastVector attributes = new FastVector(numberOfFeatures + 1);
        for (int i = 0; i < numberOfFeatures; i++) {
            attributes.addElement( new Attribute(Integer.toString(i)) );
        }

        FastVector classes = new FastVector( 2 );
        classes.addElement( "Positive" );
        classes.addElement( "Negative" );

        attributes.addElement( new Attribute( "class", classes ) );
        Instances data = new Instances("", attributes, 100);
        data.setClassIndex(data.numAttributes()-1);

        //Create artificial sparse feature vectors for the positive class
        for ( int i = 0; i < numberOfSamples/2; i++ ) {
            double[] vec = new double[numberOfFeatures + 1];
            vec[rnd.nextInt(numberOfFeatures)] = 1;
            vec[rnd.nextInt(numberOfFeatures)] = 1;
            vec[rnd.nextInt(numberOfFeatures)] = 1;
            vec[rnd.nextInt(numberOfFeatures)] = 1;

            Instance instance = new Instance(1.0, vec);
            instance.setDataset(data);
            Instance sparseInstance = new SparseInstance(instance);
            sparseInstance.setDataset(data);
            sparseInstance.setClassValue("Positive");
            data.add(sparseInstance);
        }

        //Create artificial sparse feature vectors for the negative class
        for ( int i = 0; i < numberOfSamples/2; i++ ) {
            double[] vec = new double[numberOfFeatures + 1];
            vec[rnd.nextInt(numberOfFeatures)] = 1;
            vec[rnd.nextInt(numberOfFeatures)] = 1;
            vec[rnd.nextInt(numberOfFeatures)] = 1;
            vec[rnd.nextInt(numberOfFeatures)] = 1;

            Instance instance = new Instance(1.0, vec);
            instance.setDataset(data);
            Instance sparseInstance = new SparseInstance(instance);
            sparseInstance.setDataset(data);
            sparseInstance.setClassValue("Negative");
            data.add(sparseInstance);
        }
        System.out.println( "Building classifier ... " );
        clf.buildClassifier(data);
        System.out.println( new Date() );
    }
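For completeness, a minimal wrapper to run and time the test could look like the sketch below (the class name SlowWekaTest is just a placeholder, and it assumes the test() method above sits inside this class):

// Placeholder wrapper class; only the static test() method above matters
public class SlowWekaTest {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        test();
        // Total wall-clock time in seconds
        System.out.println("Elapsed: " + (System.currentTimeMillis() - start) / 1000.0 + " s");
    }
}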

I am not sure what, if anything, I should do to speed this up! It does not really make sense to me, because gradient descent ought to be very fast. I also tried a MultilayerPerceptron classifier with one hidden layer containing a single hidden unit, trained for one epoch, and it was still really slow!
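For reference, the MultilayerPerceptron setup I describe corresponds roughly to the following sketch (setHiddenLayers and setTrainingTime are Weka's options for the layer layout and the number of training epochs):

import weka.classifiers.Classifier;
import weka.classifiers.functions.MultilayerPerceptron;

// One hidden layer with a single unit, trained for a single epoch
MultilayerPerceptron mlp = new MultilayerPerceptron();
mlp.setHiddenLayers("1");   // comma-separated layer sizes; "1" = one layer with one unit
mlp.setTrainingTime(1);     // number of training epochs
Classifier clf = mlp;       // drop-in replacement for SimpleLogistic in the test above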

Edit

I tried the same idea as in the test case above, but with scikit-learn, and it is lightning fast! Here it is:

import numpy as np
import random
from sklearn import linear_model

numberOfFeatures = 2000
numberOfSamples = 6000

# Dense feature matrix; each sample gets up to 8 randomly placed 1s
X = np.zeros((numberOfSamples, numberOfFeatures))
y = np.zeros(numberOfSamples)

for i in range(numberOfSamples):
    for _ in range(8):
        X[i][random.randint(0, numberOfFeatures - 1)] = 1

# Label the first 100 samples as the positive class
for i in range(100):
    y[i] = 1


clf = linear_model.LogisticRegression()
print('fitting')
clf.fit(X, y)

print('done!')

0 Answers:

No answers yet.