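I have handwriting samples from two writers, and I am using a feature extractor to extract features from both.
I want to show the similarity between the classes, that is, how alike they are and how hard it is for a classifier to tell them apart correctly.
I have read papers that use PCA to demonstrate this. I tried PCA, but I don't think I implemented it correctly. This is what I use to show the similarity: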
[COEFF,SCORE] = princomp(features_extracted);
plot(COEFF,'.')
But I get exactly the same plot for every class and every sample. They should be similar, not exactly identical. What am I doing wrong?
Answer 0 (score: 2)
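With only 10 samples per class and more than 4000 features, you will have a hard time showing anything significant.
Nevertheless, the following code computes the PCA and plots the two classes against the first two principal components (the ones containing the "most" variance).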
% Truly indistinguishable data
dummy_data = randn(20, 4000);
% Uncomment this to make the data distinguishable
%dummy_data(1:10, :) = dummy_data(1:10, :) - 0.5;
% Normalise the data - this isn't technically required for the dummy data
% above, but is included for completeness.
dummy_data_normalised = dummy_data;
for f = 1:size(dummy_data_normalised, 2)
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) - nanmean(dummy_data_normalised(:, f));
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) / nanstd(dummy_data_normalised(:, f));
end
% Generate vector of 10 0's and 10 1's
class_labels = reshape(repmat([0 1], 10, 1), 20, 1);
% Perform PCA
pca_coeffs = pca(dummy_data_normalised);
% Calculate transformed data
dummy_data_pca = dummy_data_normalised * pca_coeffs;
figure;
hold on;
for class = unique(class_labels)'
    % Plot first two components of the current class
    scatter(dummy_data_pca(class_labels == class, 1), dummy_data_pca(class_labels == class, 2), 'filled')
end
legend(strcat({'Class '},int2str(unique(class_labels)))')
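For indistinguishable data, this displays a scatter plot similar to the following:
Clearly, it is not possible to draw a separating boundary between the two classes.
If you uncomment the line `dummy_data(1:10, :) = dummy_data(1:10, :) - 0.5;` to make the data distinguishable, the plot instead looks like this: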
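However, to reiterate what I wrote in the comments, PCA does not necessarily find the components that give the best separation. It is an unsupervised method and only finds the components with the largest variance. In some applications these happen to be the components that also separate the classes well. With only 10 samples per class, you will not be able to demonstrate anything with statistical significance. Also see this question for more detail on PCA and the number of samples per class.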
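To illustrate the point about variance versus separation, here is a minimal sketch on made-up two-dimensional data. It is not part of the original code: the sample count, the scales and the class offset are assumptions chosen for illustration, and PCA is computed through `svd` on the centred data so that no toolbox is needed. Both classes share a high-variance axis and differ only along a low-variance one, so the first component should carry almost no class information while the second one does. The exact numbers vary from run to run because the data is random.

```matlab
% Hypothetical toy data: two classes that differ only along a low-variance axis.
n = 200;                                  % samples per class (assumed value)
labels = [zeros(n, 1); ones(n, 1)];
x = 10 * randn(2 * n, 1);                 % high-variance axis, identical for both classes
y = 0.5 * randn(2 * n, 1);                % low-variance axis
y(labels == 1) = y(labels == 1) + 3;      % shift class 1 along the low-variance axis only
data = [x, y];

% PCA via SVD of the centred data.
centred = data - repmat(mean(data, 1), size(data, 1), 1);
[~, ~, coeffs] = svd(centred, 'econ');    % columns of coeffs are the principal directions
scores = centred * coeffs;                % data projected onto the components

% Gap between the class means on each component,
% measured in units of the pooled within-class standard deviation.
gap = zeros(1, 2);
for k = 1:2
    s0 = scores(labels == 0, k);
    s1 = scores(labels == 1, k);
    gap(k) = abs(mean(s0) - mean(s1)) / sqrt((var(s0) + var(s1)) / 2);
end
fprintf('Class gap on PC1: %.2f, on PC2: %.2f\n', gap(1), gap(2));
```

If the gap on the second component comes out much larger than on the first, a plot of the first component alone would suggest the classes are indistinguishable even though they are not. A supervised projection would be the usual alternative when separation is the goal.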
Edit: The two-class code above also extends naturally to more classes:
numer_of_classes = 10;
samples_per_class = 20;
% Truly indistinguishable data
dummy_data = randn(numer_of_classes * samples_per_class, 4000);
% Make the data distinguishable
for i = 1:numer_of_classes
ixd = (((i - 1) * samples_per_class) + 1):(i * samples_per_class);
dummy_data(ixd, :) = dummy_data(ixd, :) - (0.5 * (i - 1));
end
% Normalise the data
dummy_data_normalised = dummy_data;
for f = 1:size(dummy_data_normalised, 2)
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) - nanmean(dummy_data_normalised(:, f));
    dummy_data_normalised(:, f) = dummy_data_normalised(:, f) / nanstd(dummy_data_normalised(:, f));
end
% Generate vector of classes (1 to numer_of_classes)
class_labels = reshape(repmat(1:numer_of_classes, samples_per_class, 1), numer_of_classes * samples_per_class, 1);
% Perform PCA
pca_coeffs = pca(dummy_data_normalised);
% Calculate transformed data
dummy_data_pca = dummy_data_normalised * pca_coeffs;
figure;
hold on;
for class = unique(class_labels)'
    % Plot first two components of the current class
    scatter(dummy_data_pca(class_labels == class, 1), dummy_data_pca(class_labels == class, 2), 'filled')
end
legend(strcat({'Class '},int2str(unique(class_labels)))')
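Since the scatter plots above only use the first two components, it can help to check how much of the total variance those two components actually carry. The following is a rough sketch rather than part of the original answer: `features` is a hypothetical stand-in for the normalised feature matrix, and the variances are derived from the singular values of the centred data instead of calling `pca`, so no toolbox is assumed.

```matlab
% Hypothetical stand-in for the normalised feature matrix (20 samples, 4000 features).
features = randn(20, 4000);
centred = features - repmat(mean(features, 1), size(features, 1), 1);

% The squared singular values of the centred data are proportional
% to the variance along each principal component.
singular_values = svd(centred);
component_variance = singular_values .^ 2 / (size(centred, 1) - 1);
explained = component_variance / sum(component_variance);

fprintf('First two components explain %.1f%% of the variance\n', 100 * sum(explained(1:2)));
```

With very few samples and thousands of features, the first two components typically capture only a small share of the variance, which is one more reason to read such plots cautiously.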