Dimensionality reduction with PCA on a small dataset is too slow

Time: 2015-07-17 11:52:48

Tags: python numpy pca

I have the following dataset stored with numpy: https://www.dropbox.com/sh/ppseiv9skqlhljr/AACQEWZh11oszL5-Z_NHqre3a?dl=0

The training and development partitions of the dataset are stored in separate numpy files, with the following shape:

[50,1,396]

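I am using PCAFast from the mlpy library to perform the dimensionality reduction. However, the whole process is far too slow and I cannot find the reason. Before running PCA, I convert the dataset to the following shape: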

[50,396]

So the shape of the dataset is not the cause of my problem.
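For illustration, the conversion from [50,1,396] to [50,396] described above can be done without a per-sample loop. This is a minimal sketch on random stand-in data (the array `features` here is hypothetical, not the real dataset):

```python
import numpy as np

# Stand-in for the real data: 50 samples, a singleton axis, 396 features
features = np.random.rand(50, 1, 396)

# Drop the singleton axis by indexing it away
X = features[:, 0, :]                              # shape (50, 396)

# Equivalent result via reshape, which also works if the middle axis is 1
X_alt = features.reshape(features.shape[0], -1)    # shape (50, 396)
```

Either form yields the 2-D matrix that a PCA routine expects, one row per sample.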

The code I am using is the following:

import os
import numpy as np
import sys
import csv
import mlpy

inputfiletrain=''
outputfiletrain=''
inputfiledev=''
outputfiledev=''

def parseCommandLineArgs():
        global inputfiletrain
        global outputfiletrain
        global inputfiledev
        global outputfiledev

        for i in range(0, len(sys.argv)):

                if sys.argv[i] == 'inputfiletrain':
                        inputfiletrain = sys.argv[i + 1]
                        print
                        print "------*****Using Directory :*****------"
                        print 'inputfiletrain=' + inputfiletrain
                        print "------**********************------"
                        print

                if sys.argv[i] == 'outputfiletrain':
                        outputfiletrain = sys.argv[i + 1]
                        print
                        print "------*****Using Directory :*****------"
                        print 'outputfiletrain=' + outputfiletrain
                        print "------**********************------"
                        print

                if sys.argv[i] == 'inputfiledev':
                        inputfiledev = sys.argv[i + 1]
                        print
                        print "------*****Using Directory :*****------"
                        print 'inputfiledev=' + inputfiledev
                        print "------**********************------"
                        print

                if sys.argv[i] == 'outputfiledev':
                        outputfiledev = sys.argv[i + 1]
                        print
                        print "------*****Using outputFeatures Filename :*****------"
                        print 'outputfiledev=' + outputfiledev
                        print "------**********************------"
                        print




def pcaDimRed(features, nDims):
    # Flatten [n_samples, 1, n_features] into a 2-D matrix [n_samples, n_features]
    X = np.empty([features.shape[0], features.shape[2]])
    print features.shape[2]
    print X.shape

    for i, f in enumerate(features):
        X[i] = f[0]

    print X
    print "PCAStarting"
    #pca = mlpy.PCA(method='cov')
    pca = mlpy.PCAFast(k=nDims, eps=0.1)
    pca.learn(X)
    coeff = pca.coeff()
    coeff = coeff[:, 0:nDims]
    print "PCAEnding"

    # Project every sample onto the retained principal directions
    featuresNew = []
    for f in X:
        #ft = pca.transform(ft, k=nDims)
        ft = np.dot(f, coeff)
        featuresNew.append(ft)

    # Restore the original [n_samples, 1, n_dims] layout
    thodwrisformat = np.empty((len(featuresNew), 1, coeff.shape[1]))
    for i, f in enumerate(featuresNew):
        thodwrisformat[i][0] = f

    return (thodwrisformat, coeff)

def pcaDevelopmentSet(features, nDims, coeff):
    # Project the development set with the coefficients learned on the training set
    featuresNew = []
    for f in features:
        #ft = pca.transform(ft, k=nDims)
        ft = np.dot(f, coeff)
        featuresNew.append(ft)
    return featuresNew

parseCommandLineArgs()
print inputfiledev
FeaturesDev = np.load(inputfiledev)
FeaturesTrain = np.load(inputfiletrain)

PCATrain=pcaDimRed(FeaturesTrain,68)
# pcaDimRed returns a 2-tuple: (projected features, coefficients)
FeaturesTrain=PCATrain[0]
coeff=PCATrain[1]
FeaturesDev=pcaDevelopmentSet(FeaturesDev, 68,coeff)


np.save(outputfiledev,FeaturesDev)
np.save(outputfiletrain,FeaturesTrain)
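As a side note, the per-sample projection loops in pcaDimRed and pcaDevelopmentSet compute the same thing as a single matrix product. The sketch below checks this on random stand-in data. The names `project_loop` and `project_vectorized` are hypothetical, and the random `coeff` only stands in for the matrix returned by `pca.coeff()`:

```python
import numpy as np

rng = np.random.RandomState(0)
features = rng.rand(50, 1, 396)   # hypothetical stand-in for the loaded .npy array
coeff = rng.rand(396, 68)         # hypothetical stand-in for pca.coeff()[:, 0:68]

def project_loop(features, coeff):
    # Same per-sample loop as in the script: each f has shape (1, 396)
    return np.array([np.dot(f, coeff) for f in features])

def project_vectorized(features, coeff):
    # One product over the last axis; the [n, 1, k] layout is preserved
    return np.dot(features, coeff)

loop_result = project_loop(features, coeff)
vec_result = project_vectorized(features, coeff)
```

With only 50 samples the loop is unlikely to be the bottleneck, so this mainly simplifies the code rather than explaining the slowdown.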

I am running this code under Ubuntu Linux with Python 2.7. To install mlpy, the following commands are needed:

wget http://sourceforge.net/projects/mlpy/files/mlpy%203.5.0/mlpy-3.5.0.tar.gz
tar xvf mlpy-3.5.0.tar.gz
cd mlpy-3.5.0
sudo python setup.py install

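Finally, to run the code, assuming the script is saved as pca.py in the same folder as the feature_vectors directory that contains the dataset partitions, use the following command: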

python pca.py inputfiletrain feature_vectors/train/featuresShape.npy outputfiletrain feature_vectors/train/featuresShapePCA.npy inputfiledev feature_vectors/development/featuresShape.npy outputfiledev feature_vectors/development/featuresShapePCA.npy 

I need some ideas as to why PCA is so slow on this dataset...

1 Answer:

Answer 0: (score: 1)

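Regarding your discussion:

  • If you measure speed per batch: your process will be slower because of the higher dimensionality, i.e. a data shape of 396.
  • If you measure speed per epoch: your process will be slower because there is more data, i.e. 50x396 = 19800 values compared with the 100x100 (10000 values) random examples.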
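One way to check the two points above is to time a PCA on random data of both shapes. The sketch below is only an assumption-laden baseline: it uses a plain NumPy SVD rather than mlpy's PCAFast, the helper names `pca_svd` and `time_pca` are hypothetical, and the choice of 10 components is arbitrary.

```python
import time
import numpy as np

def pca_svd(X, n_dims):
    # PCA via SVD of the centered data (plain NumPy, not mlpy's PCAFast)
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coeff = Vt[:n_dims].T              # columns are principal directions
    return np.dot(Xc, coeff), coeff

def time_pca(shape, n_dims, repeats=5):
    # Best wall-clock time over a few runs on random data of the given shape
    X = np.random.rand(*shape)
    best = float("inf")
    for _ in range(repeats):
        start = time.time()
        pca_svd(X, n_dims)
        best = min(best, time.time() - start)
    return best

t_small = time_pca((100, 100), 10)     # shape of the random examples
t_question = time_pca((50, 396), 10)   # shape used in the question
print("100x100: %.6f s" % t_small)
print("50x396:  %.6f s" % t_question)
```

If both timings are small, the data size alone would not account for a very slow run. A side observation, not a diagnosis: 50 samples give a centered data matrix of rank at most 49, so at most 49 components carry variance, which is fewer than the 68 requested in the question.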