I am trying to cluster a matrix X (size: 20057x2):
T = clusterdata(X,cutoff);
But I get this error:
??? Error using ==> pdistmex
Out of memory. Type HELP MEMORY for your options.
Error in ==> pdist at 211
Y = pdistmex(X',dist,additionalArg);
Error in ==> linkage at 139
Z = linkagemex(Y,method,pdistArg);
Error in ==> clusterdata at 88
Z = linkage(X,linkageargs{1},pdistargs);
Error in ==> kmeansTest at 2
T = clusterdata(X,1);
Can someone help me? I have 4 GB of RAM, but I think the problem comes from somewhere else.
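Answer 0 (score: 13)
As others have mentioned, hierarchical clustering needs to compute the pairwise distance matrix, which in your case is too large to fit in memory.
Try the K-Means algorithm instead: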
numClusters = 4;
T = kmeans(X, numClusters);
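As a slightly fuller sketch of the call above: kmeans can also return the centroids, and the 'Replicates' option reruns it from several random starts and keeps the best solution. The random X and the choice of 5 replicates are assumptions for illustration only, and kmeans requires the Statistics Toolbox.

```matlab
%# stand-in data with the same shape as in the question (assumption)
X = rand(20057, 2);
numClusters = 4;

%# T holds one cluster index per row of X; centroids is numClusters-by-2.
%# 'Replicates' repeats the clustering from different random starting points
%# and returns the solution with the lowest total within-cluster distance.
[T, centroids] = kmeans(X, numClusters, 'Replicates', 5);
```

Unlike clusterdata, this never forms the full pairwise distance vector; it only needs distances from the points to the numClusters centroids.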
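Alternatively, you can select a random subset of your data and use it as input to the clustering algorithm. Next, compute the cluster centers as the mean/median of each cluster group. Finally, for each instance that was not selected in the subset, compute its distance to each of the centroids and assign it to the closest one.
Here is sample code to illustrate the idea above: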
%# random data
X = rand(25000, 2);
%# pick a subset
SUBSET_SIZE = 1000; %# subset size
ind = randperm(size(X,1));
data = X(ind(1:SUBSET_SIZE), :);
%# cluster the subset data
D = pdist(data, 'euclidean');
T = linkage(D, 'ward');
CUTOFF = 0.6*max(T(:,3)); %# CUTOFF = 5;
C = cluster(T, 'criterion','distance', 'cutoff',CUTOFF);
K = length( unique(C) ); %# number of clusters found
%# visualize the hierarchy of clusters
figure(1)
h = dendrogram(T, 0, 'colorthreshold',CUTOFF);
set(h, 'LineWidth',2)
set(gca, 'XTickLabel',[], 'XTick',[])
%# plot the subset data colored by clusters
figure(2)
subplot(121), gscatter(data(:,1), data(:,2), C), axis tight
%# compute cluster centers
centers = zeros(K, size(data,2));
for i=1:size(data,2)
centers(:,i) = accumarray(C, data(:,i), [], @mean);
end
%# calculate distance of each instance to all cluster centers
D = zeros(size(X,1), K);
for k=1:K
D(:,k) = sum( bsxfun(@minus, X, centers(k,:)).^2, 2);
end
%# assign each instance to the closest cluster
[~,clustIDX] = min(D, [], 2);
%#clustIDX( ind(1:SUBSET_SIZE) ) = C;
%# plot the entire data colored by clusters
subplot(122), gscatter(X(:,1), X(:,2), clustIDX), axis tight
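The assignment step above (distance of every instance to every center, then the minimum) can also be sketched with pdist2, assuming a Statistics Toolbox version that provides it. The tiny X and centers below are made-up values for illustration, not data from the question.

```matlab
%# made-up example: four points and two cluster centers (assumption)
X = [0 0; 1 1; 10 10; 9 9];
centers = [0.5 0.5; 9.5 9.5];

%# pdist2 returns an N-by-K matrix of Euclidean distances,
%# one row per instance and one column per cluster center
D = pdist2(X, centers);

%# index of the closest center for each instance
[~, clustIDX] = min(D, [], 2);
```

The memory cost here is N-by-K rather than N*(N-1)/2, which is why this step stays cheap even for the full data set. pdist2 returns plain (not squared) Euclidean distances, but the closest center is the same either way.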
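Answer 1 (score: 2)
X is too big for a 32-bit machine. pdist is trying to make a 201,131,596-element row vector of doubles (clusterdata uses pdist), which would use up about 1609 MB (a double is 8 bytes)... if you are running it under Windows with the /3GB switch, your maximum matrix size is limited to 1536 MB (see here).
You will need to split the data up somehow instead of clustering all of it directly in one go.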
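Answer 2 (score: 1)
PDIST computes the distances between all possible pairs of rows. If your data contains N = 20057 rows, the number of pairs is N*(N-1)/2, which is 201131596 in your case. That may be too much for your machine.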
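The arithmetic behind these numbers can be checked with a short sketch. It assumes double-precision storage (8 bytes per distance) and counts only the distance vector itself, not any temporary copies made along the way.

```matlab
%# number of rows in X, taken from the question
N = 20057;

%# pdist stores one distance per unordered pair of rows
numPairs = N*(N-1)/2;

%# each distance is a double (8 bytes)
bytesNeeded = numPairs * 8;

fprintf('%d pairs, about %.0f MB\n', numPairs, bytesNeeded/1e6);
%# prints: 201131596 pairs, about 1609 MB
```

Because the pair count grows quadratically with N, halving the number of rows cuts the memory needed to roughly a quarter.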