Incremental PCA & partial_fit - number of components

Time: 2018-02-22 13:49:18

Tags: python machine-learning scikit-learn

I am using Python and about 4000 images of watches (examples: watch_1, watch_2). The images are RGB with a resolution of 450x450. My goal is to find the most similar watches among them. For this reason, I am using IncrementalPCA and partial_fit from scikit_learn to handle this large dataset with my 26 GB of RAM (see also: SO_Link_1). My source code is as follows:

```python
import cv2
import numpy as np
import os
from glob import glob
from sklearn.decomposition import IncrementalPCA
from sklearn import neighbors
from sklearn import preprocessing

data = []

# Read images from file #
for filename in glob('Watches/*.jpg'):

    img = cv2.imread(filename)
    height, width = img.shape[:2]
    img = np.array(img)

    # Check that all my images are of the same resolution
    if height == 450 and width == 450:

        # Reshape each image so that it is stored in one line
        img = np.concatenate(img, axis=0)
        img = np.concatenate(img, axis=0)
        data.append(img)

# Normalise data #
data = np.array(data)
Norm = preprocessing.Normalizer()
Norm.fit(data)
data = Norm.transform(data)

# IncrementalPCA model #
ipca = IncrementalPCA(n_components=6)

length = len(data)
chunk_size = 4
pca_data = np.zeros(shape=(length, ipca.n_components))

for i in range(0, length // chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
    pca_data[i * chunk_size: (i + 1) * chunk_size] = ipca.transform(data[i*chunk_size : (i+1)*chunk_size])

# K-Nearest neighbours #
knn = neighbors.NearestNeighbors(n_neighbors=4, algorithm='ball_tree', metric='minkowski').fit(data)
distances, indices = knn.kneighbors(data)
print(indices)
```

However, when I run this program, starting with 40 watch images, I get the following error when i = 1:

ValueError: Number of input features has changed from 4 to 6 between calls to partial_fit! Try setting n_components to a fixed value.

However, I am clearly setting n_components to 6 with ipca = IncrementalPCA(n_components=6), yet for some reason ipca treats chunk_size = 4 as the number of components when i = 0 and then changes it to 6 when i = 1.

Why is this happening?

How can I fix it?

1 answer:

Answer 0: (score: 2)

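This seems to follow from the math behind PCA, which is ill-conditioned when n_components > n_samples. Here each call to partial_fit sees only chunk_size = 4 samples, which is fewer than the 6 components requested.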
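To illustrate the point about n_components > n_samples, here is a minimal NumPy sketch. It assumes synthetic random data (4 samples with 50 features, both numbers chosen only for illustration) in place of the watch images, and it uses a plain SVD rather than scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# One small chunk: 4 samples (like chunk_size = 4) with 50 made-up features.
n_samples, n_features = 4, 50
batch = rng.normal(size=(n_samples, n_features))

# PCA operates on centered data; the principal axes are the right
# singular vectors of the centered matrix.
centered = batch - batch.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# The thin SVD of a 4 x 50 matrix yields at most 4 component vectors,
# so 6 components cannot be extracted from this chunk alone.
max_components = components.shape[0]

# Centering removes one more degree of freedom, so the numerical rank
# is at most n_samples - 1.
rank = np.linalg.matrix_rank(centered)

print(max_components, rank)
```

A chunk therefore needs at least as many samples as the number of components requested from it.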

You may be interested in reading this (the change that introduced the error message) and some discussion behind it.

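Try increasing your batch size / chunk size (or lowering n_components), so that every chunk contains at least n_components samples.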
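The suggestion above could look roughly like the following sketch. It assumes random data standing in for 40 flattened images (300 features is an arbitrary illustrative size) and a chunk size of 10. Projecting in a second pass, after all chunks have been fitted, is a design choice of this sketch and not something stated in the question or the answer.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)

# Assumption: synthetic stand-in for 40 flattened, normalised images.
data = rng.normal(size=(40, 300))

n_components = 6
chunk_size = 10  # chosen so that chunk_size >= n_components

ipca = IncrementalPCA(n_components=n_components)

# First pass: fit the model chunk by chunk.
for start in range(0, len(data), chunk_size):
    chunk = data[start:start + chunk_size]
    if len(chunk) < n_components:
        # A trailing chunk that is too small is skipped in this sketch.
        continue
    ipca.partial_fit(chunk)

# Second pass: project every chunk with the fully fitted model.
pca_data = np.vstack([
    ipca.transform(data[start:start + chunk_size])
    for start in range(0, len(data), chunk_size)
])

print(pca_data.shape)
```

Transforming only after the last partial_fit call means every row is projected onto the same final components, whereas transforming inside the fitting loop would project early chunks with a partially fitted model.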

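(In general, I am also sceptical about this approach. I hope you have tested it with batch PCA on a small example dataset. It looks like your watches are not preprocessed with regard to geometry, for example cropping, and perhaps histogram/colour normalisation.)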
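A sanity check of the kind mentioned above might be sketched as follows. It assumes a small synthetic dataset with a low-rank structure plus noise, since the real images are not available here, and it compares the explained variance ratios of batch PCA and IncrementalPCA. How closely the two agree depends on the data and the chunk size.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)

# Assumption: small synthetic dataset, 6 latent factors plus a little noise.
latent = rng.normal(size=(200, 6))
mixing = rng.normal(size=(6, 50))
data = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

n_components = 6

# Batch PCA sees all samples at once.
batch_pca = PCA(n_components=n_components).fit(data)

# IncrementalPCA.fit splits the data into batches of batch_size internally.
inc_pca = IncrementalPCA(n_components=n_components, batch_size=20).fit(data)

# Largest difference between the per-component explained variance ratios.
gap = np.abs(
    batch_pca.explained_variance_ratio_ - inc_pca.explained_variance_ratio_
).max()

print(gap)
```

If the gap is large on a small dataset where batch PCA is feasible, that is a hint to revisit the chunk size or the preprocessing before scaling up to all images.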