使用EM算法,我想在给定数据集上训练具有四个分量的高斯混合模型。该集是三维的,包含300个样本。
问题在于,在大约6轮EM算法之后,根据matlab(rank(sigma) = 2
而不是3),协方差矩阵sigma变得接近于奇异。这反过来会导致不希望的结果,例如评估高斯分布的复杂值gm(k,i)
。
此外,我使用高斯的对数来解释下溢问题 - 参见E步骤。我不确定这是否正确,如果我必须在其他地方采取责任p(w_k | x ^(i),theta)的exp?
你能告诉我到目前为止我的EM算法实现是否正确吗? 如何解决接近奇异协方差西格玛的问题?
以下是我对EM算法的实现:
首先,我使用kmeans 初始化组件的均值和协方差:
load('data1.mat');
X = Data'; % 300x3 data set
D = size(X,2); % dimension
N = size(X,1); % number of samples
K = 4; % number of Gaussian Mixture components
% Initialization
p = [0.2, 0.3, 0.2, 0.3]; % arbitrary pi
[idx,mu] = kmeans(X,K); % initial means of the components
% compute the covariance of the components
sigma = zeros(D,D,K);
for k = 1:K
sigma(:,:,k) = cov(X(idx==k,:));
end
对于 E-step ,我使用以下公式计算责任。
w_k是k高斯分量。
x ^(i)是单个数据点(样本)
theta代表高斯混合模型的参数:mu,Sigma,pi。
以下是相应的代码:
% variables for convergence
converged = 0;
prevLoglikelihood = Inf;
prevMu = mu;
prevSigma = sigma;
prevPi = p;
round = 0;
while (converged ~= 1)
round = round +1
gm = zeros(K,N); % gaussian component in the nominator
sumGM = zeros(N,1); % denominator of responsibilities
% E-step: Evaluate the responsibilities using the current parameters
% compute the nominator and denominator of the responsibilities
for k = 1:K
for i = 1:N
Xmu = X-mu;
% I am using log to prevent underflow of the gaussian distribution (exp("small value"))
logPdf = log(1/sqrt(det(sigma(:,:,k))*(2*pi)^D)) + (-0.5*Xmu*(sigma(:,:,k)\Xmu'));
gm(k,i) = log(p(k)) * logPdf;
sumGM(i) = sumGM(i) + gm(k,i);
end
end
% calculate responsibilities
res = zeros(K,N); % responsibilities
Nk = zeros(4,1);
for k = 1:K
for i = 1:N
% I tried to use the exp(gm(k,i)/sumGM(i)) to compute res but this leads to sum(pi) > 1.
res(k,i) = gm(k,i)/sumGM(i);
end
Nk(k) = sum(res(k,:));
end
Nk(k)
使用M步骤中给出的公式计算,并在M步骤中用于计算新概率p(k)
。
M-步骤
% M-step: Re-estimate the parameters using the current responsibilities
for k = 1:K
for i = 1:N
mu(k,:) = mu(k,:) + res(k,i).*X(k,:);
sigma(:,:,k) = sigma(:,:,k) + res(k,i).*(X(k,:)-mu(k,:))*(X(k,:)-mu(k,:))';
end
mu(k,:) = mu(k,:)./Nk(k);
sigma(:,:,k) = sigma(:,:,k)./Nk(k);
p(k) = Nk(k)/N;
end
现在,为了检查收敛性,使用以下公式计算对数似然:
% Evaluate the log-likelihood and check for convergence of either
% the parameters or the log-likelihood. If not converged, go to E-step.
loglikelihood = 0;
for i = 1:N
loglikelihood = loglikelihood + log(sum(gm(:,i)));
end
% Check for convergence of parameters
errorLoglikelihood = abs(loglikelihood-prevLoglikelihood);
if (errorLoglikelihood <= eps)
converged = 1;
end
errorMu = abs(mu(:)-prevMu(:));
errorSigma = abs(sigma(:)-prevSigma(:));
errorPi = abs(p(:)-prevPi(:));
if (all(errorMu <= eps) && all(errorSigma <= eps) && all(errorPi <= eps))
converged = 1;
end
prevLoglikelihood = loglikelihood;
prevMu = mu;
prevSigma = sigma;
prevPi = p;
end % while
我的Matlab实现高斯混合模型的EM算法有什么问题吗?