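I have the following dataset stored with numpy: https://www.dropbox.com/sh/ppseiv9skqlhljr/AACQEWZh11oszL5-Z_NHqre3a?dl=0
The training and development partitions of the dataset are stored in separate numpy files, with the following shape: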
[50,1,396]
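I am using PCAFast from the mlpy library to perform dimensionality reduction. However, the whole process is far too slow and I cannot find the reason. Before running PCA, I convert the dataset to the following shape: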
[50,396]
So the shape of the dataset is not the cause of my problem.
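As a minimal sketch of that reshaping step, assuming the array has a singleton middle axis as described above (the random array is only a stand-in for the real data, which would come from np.load):

```python
import numpy as np

# Stand-in for the loaded array; the real data comes from np.load(...)
features = np.random.rand(50, 1, 396)

# Drop the singleton middle axis: (50, 1, 396) -> (50, 396)
X = features[:, 0, :]

# Equivalent alternatives that give the same values
X_reshaped = features.reshape(features.shape[0], features.shape[2])
X_squeezed = np.squeeze(features, axis=1)

print(X.shape)
```

Any of the three forms avoids the explicit copy loop and should be effectively instantaneous for an array of this size.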
The code I am using is as follows:
import os
import numpy as np
import sys
import csv
import mlpy
inputfiletrain=''
outputfiletrain=''
inputfiledev=''
outputfiledev=''
def parseCommandLineArgs():
    # Scan argv for "<name> <value>" pairs and store each value in its global
    global inputfiletrain
    global outputfiletrain
    global inputfiledev
    global outputfiledev
    for i in range(0, len(sys.argv)):
        if sys.argv[i] == 'inputfiletrain':
            inputfiletrain = sys.argv[i + 1]
            print
            print "------*****Using Directory :*****------"
            print 'inputfiletrain=' + inputfiletrain
            print "------**********************------"
            print
        if sys.argv[i] == 'outputfiletrain':
            outputfiletrain = sys.argv[i + 1]
            print
            print "------*****Using Directory :*****------"
            print 'outputfiletrain=' + outputfiletrain
            print "------**********************------"
            print
        if sys.argv[i] == 'inputfiledev':
            inputfiledev = sys.argv[i + 1]
            print
            print "------*****Using Directory :*****------"
            print 'inputfiledev=' + inputfiledev
            print "------**********************------"
            print
        if sys.argv[i] == 'outputfiledev':
            outputfiledev = sys.argv[i + 1]
            print
            print "------*****Using outputFeatures Filename :*****------"
            print 'outputfiledev=' + outputfiledev
            print "------**********************------"
            print

def pcaDimRed(features, nDims):
    # Flatten the (n_samples, 1, n_features) array to (n_samples, n_features)
    X = np.empty([features.shape[0], features.shape[2]])
    print features.shape[2]
    print X.shape
    for i, f in enumerate(features):
        #np.append(X,f[0],axis=0)
        X[i] = f[0]
    #np.vstack(X)
    print X
    print "PCAStarting"
    #pca = mlpy.PCA(method='cov')
    pca = mlpy.PCAFast(k=nDims, eps=0.1)
    pca.learn(X)
    coeff = pca.coeff()
    coeff = coeff[:, 0:nDims]
    print "PCAEnding"
    # Project every sample onto the retained principal directions
    featuresNew = []
    for f in X:
        ft = f.copy()
        # ft = pca.transform(ft, k=nDims)
        ft = np.dot(f, coeff)
        featuresNew.append(ft)
    # Restore the (n_samples, 1, n_dims) layout of the input
    thodwrisformat = np.empty((len(featuresNew), 1, coeff.shape[1]))
    for i, f in enumerate(featuresNew):
        thodwrisformat[i][0] = f
    return (thodwrisformat, coeff)

def pcaDevelopmentSet(features, nDims, coeff):
    # Project the development set with the coefficients learned on the training set
    featuresNew = []
    for f in features:
        ft = f.copy()
        # ft = pca.transform(ft, k=nDims)
        ft = np.dot(f, coeff)
        featuresNew.append(ft)
    return featuresNew

parseCommandLineArgs()
print inputfiledev
FeaturesDev = np.load(inputfiledev)
FeaturesTrain = np.load(inputfiletrain)
# pcaDimRed returns a (projected features, coefficients) tuple
PCATrain = pcaDimRed(FeaturesTrain, 68)
FeaturesTrain = PCATrain[0]
coeff = PCATrain[1]
FeaturesDev = pcaDevelopmentSet(FeaturesDev, 68, coeff)
np.save(outputfiledev, FeaturesDev)
np.save(outputfiletrain, FeaturesTrain)
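The per-sample loops in pcaDimRed and pcaDevelopmentSet compute one dot product per row. The same result can be obtained with a single matrix product, sketched below. The random matrices are stand-ins for the real features and for pca.coeff(), and project_all is an illustrative helper name that does not appear in the script. With only 50 rows this loop is unlikely to be the bottleneck, so treat it as a simplification rather than a fix for the slowness.

```python
import numpy as np

def project_all(features, coeff):
    """Project an (n, 1, d) array onto the columns of a (d, k) matrix, giving (n, 1, k)."""
    # np.dot contracts the last axis of features with the first axis of coeff
    return np.dot(features, coeff)

rng = np.random.RandomState(0)
features = rng.rand(50, 1, 396)   # stand-in for the loaded feature array
coeff = rng.rand(396, 68)         # stand-in for the PCA coefficient matrix

# Row-by-row projection, as in the script
looped = np.array([np.dot(f, coeff) for f in features])

# One matrix product over the whole array
vectorized = project_all(features, coeff)

print(vectorized.shape)
```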
I am running this code on Ubuntu Linux with Python 2.7. To install mlpy, use the following commands:
wget http://sourceforge.net/projects/mlpy/files/mlpy%203.5.0/mlpy-3.5.0.tar.gz
tar xvf mlpy-3.5.0.tar.gz
cd mlpy-3.5.0
sudo python setup.py install
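Finally, to run this code, assuming the script is saved as pca.py and sits in the same folder as the feature_vectors directory that contains the dataset partitions, use the following command: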
python pca.py inputfiletrain feature_vectors/train/featuresShape.npy outputfiletrain feature_vectors/train/featuresShapePCA.npy inputfiledev feature_vectors/development/featuresShape.npy outputfiledev feature_vectors/development/featuresShapePCA.npy
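The command line above passes each path as a name followed by its value. A compact sketch of that convention, assuming every name is immediately followed by its value (parse_pairs is a hypothetical helper, not part of the script, and the argv list is hard-coded here instead of being read from sys.argv):

```python
def parse_pairs(argv, names):
    """Collect the value that follows each known name in argv."""
    values = {}
    # Stop one short of the end so argv[i + 1] always exists
    for i, arg in enumerate(argv[:-1]):
        if arg in names:
            values[arg] = argv[i + 1]
    return values

names = ("inputfiletrain", "outputfiletrain", "inputfiledev", "outputfiledev")
argv = [
    "pca.py",
    "inputfiletrain", "feature_vectors/train/featuresShape.npy",
    "outputfiletrain", "feature_vectors/train/featuresShapePCA.npy",
    "inputfiledev", "feature_vectors/development/featuresShape.npy",
    "outputfiledev", "feature_vectors/development/featuresShapePCA.npy",
]

paths = parse_pairs(argv, names)
for name in names:
    print(name + "=" + paths[name])
```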
I need some ideas as to why PCA is so slow on this dataset...
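One way to narrow this down is to time a plain SVD-based PCA on data of the same shape as a baseline. The sketch below uses only numpy, not mlpy's PCAFast, and random data as a stand-in for the real features; pca_svd is an illustrative helper name. It also counts the nonzero singular values: after centering, 50 samples span at most 49 directions, so fewer than the 68 requested components are available. Whether that is what slows PCAFast down here is an assumption, not something established above.

```python
import time
import numpy as np

def pca_svd(X, n_dims):
    """Return (projected data, coefficient matrix, singular values) via SVD."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    coeff = Vt[:n_dims].T                         # (d, k) with k <= min(n, d)
    return np.dot(Xc, coeff), coeff, s

rng = np.random.RandomState(0)
X = rng.rand(50, 396)                             # stand-in for the real data

start = time.time()
Z, coeff, s = pca_svd(X, 68)
elapsed = time.time() - start

print("elapsed seconds: %.4f" % elapsed)
print("components returned: %d" % coeff.shape[1])
print("singular values above 1e-10: %d" % np.sum(s > 1e-10))
```

If this baseline finishes almost immediately on the real arrays while PCAFast does not, the delay is more likely in the iterative fit than in loading or reshaping the data.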
Answer 0 (score: 1)
Regarding your discussion: