Question

I wanted to understand PCA via SVD, so I implemented it and tried it on the MNIST data.

import numpy as np

class PCA(object):

    def __init__ (self, X):

        self.N, self.dim, *rest = X.shape
        self.X = X

        '''
        U S V' = svd(X) 
        '''
        X_std = (X - np.mean(X, axis=0))/(np.std(X, axis=0)+1e-13)

        [self.U, self.s, self.Vt] = np.linalg.svd(X_std)
        self.V = self.Vt.T
        # Variance along each component is proportional to the squared singular value
        self.variance_ratio = self.s ** 2


    def variance_explained_ratio (self):

        '''
        Returns the cumulative variance captured with each added principal component
        '''
        return np.cumsum(self.variance_ratio)/np.sum(self.variance_ratio)

    def X_projected (self, r):

        '''
        Returns the data X projected along the first r principal components
        '''

        if r is None:
            r = self.dim
        X_proj = np.zeros((r, self.N))
        P_reduce = self.V[:,0:r]
        X_proj = self.X.dot(P_reduce)
        return X_proj

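With this PCA implementation in place, I applied it to the MNIST data to compare classification performance with and without PCA, using softmax for classification. The code is as follows: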

import random
import time

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Using first 10000 images 
train_data = mnist.train.images[:10000,:]
train_labels = mnist.train.labels[:10000,:]
pca1 = PCA(train_data)
pca_test = PCA(mnist.test.images)

n_components = 14
X_proj1 = pca1.X_projected(r=n_components)
X_projTest = pca_test.X_projected(r=n_components)

t1 = time.time()

x = tf.placeholder(tf.float32, [None, n_components])
W = tf.Variable(tf.zeros([n_components, 10]))
b = tf.Variable(tf.zeros([10]))


y = tf.cast(tf.nn.softmax(tf.matmul(x, W) + b), tf.float32)
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_*tf.log(y), reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.7).minimize(cross_entropy)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

m = 10000

for _ in range(1000):
    indices = random.sample(range(0, m), 100)
    batch_xs = X_proj1[indices]
    batch_ys = train_labels[indices]
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))


accuracy = sess.run(accuracy, feed_dict={x: X_projTest, y_: mnist.test.labels})
print("Accuracy: %f" % accuracy)
sess.close()
t2 = time.time()
print ("Total time taken: %f seconds" % (t2-t1))

The accuracy I get this way is only about 19%, whereas using train_data and train_labels directly gives over 90%. Can anyone suggest where I went wrong?

Answer 1

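When we use PCA or feature scaling, we fit the underlying parameters on the training data set and then apply (transform) them to the test data set. The test data set is not used to compute those parameters; in this case, the SVD should be computed on the training data set only. For example, with sklearn's PCA we use the following code: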

from sklearn.decomposition import PCA
pca = PCA(n_components=14)  # use whatever number of components you want
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

Note that we fit on the training data set X_train and only transform X_test.
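
As a minimal sketch of this fit-on-train, transform-on-test pattern, the snippet below uses synthetic data in place of MNIST (the array shapes, the random seed, and the component count are illustrative assumptions, not values from the question). It also checks that transform only re-uses the mean and components learned from the training set.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for MNIST: shapes and seed are assumptions for illustration.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))
X_test = rng.normal(size=(50, 20))

pca = PCA(n_components=5)
X_train_pca = pca.fit_transform(X_train)  # learns the mean and components from X_train
X_test_pca = pca.transform(X_test)        # re-uses them; nothing is re-fitted on X_test

# With the default settings (no whitening), transform() is equivalent to
# centering with the *training* mean and projecting onto the *training* components.
manual = (X_test - pca.mean_) @ pca.components_.T
print(np.allclose(X_test_pca, manual))
```

If the check prints True, the test projection depends only on quantities estimated from X_train, which is the property the question's code was missing.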

Likewise, for the implementation above there is no need to create the pca_test object. Change the X_projTest variable to:

X_projTest = mnist.test.images.dot(pca1.V[:,0:n_components])

This should fix the low test accuracy.
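
The same idea can also be sketched directly in NumPy. The class below is a hypothetical variant of the question's PCA class (the names TrainFittedPCA, fit, and transform are not from the original code), and it makes one further assumption: the data is standardized with the training mean and standard deviation before projection, whereas the one-line fix above projects the raw pixels. Synthetic data stands in for MNIST.

```python
import numpy as np

class TrainFittedPCA:
    """Hypothetical sketch: PCA via SVD whose parameters come only from training data."""

    def fit(self, X):
        # Statistics and basis are computed once, from the training set.
        self.mean = X.mean(axis=0)
        self.std = X.std(axis=0) + 1e-13  # same guard against zero variance as the question
        X_std = (X - self.mean) / self.std
        # full_matrices=False avoids building the N x N matrix U.
        _, self.s, Vt = np.linalg.svd(X_std, full_matrices=False)
        self.V = Vt.T
        return self

    def transform(self, X, r):
        # Any data set (train or test) is standardized with the *training*
        # statistics and projected onto the *training* principal directions.
        return ((X - self.mean) / self.std).dot(self.V[:, :r])

    def variance_explained_ratio(self):
        # Variance along each component is proportional to the squared singular value.
        var = self.s ** 2
        return np.cumsum(var) / np.sum(var)

# Synthetic stand-in for MNIST (shapes and seed are assumptions for illustration).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 10))
X_test = rng.normal(size=(80, 10))

pca = TrainFittedPCA().fit(X_train)
X_train_proj = pca.transform(X_train, r=4)
X_test_proj = pca.transform(X_test, r=4)
print(X_train_proj.shape, X_test_proj.shape)
```

Both projections live in the same coordinate system, so a classifier trained on X_train_proj can be evaluated on X_test_proj. Fitting a second PCA on the test set, as in the question, generally yields a different basis (components can come out reordered or flipped in sign), which would be consistent with the low accuracy reported.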
