群集数量

Question

您好我正在努力尝试从1999 darpa数据集中聚类网络数据。不幸的是，我并没有真正得到集群数据，与一些文献相比，使用相同的技术和方法。

我的数据是这样的：

Matlab Figure 1

正如您所看到的，它不是很集群。这是由于数据集中存在大量异常值（噪声）。我已经看过一些异常值去除技术但到目前为止我没有尝试过任何真正清除数据的方法。我尝试过的方法之一：

%% When an outlier is considered to be more than three standard deviations away from the mean, determine the number of outliers in each column of the count matrix:

    mu = mean(data)
    sigma = std(data)
    [n,p] = size(data);
    % Create a matrix of mean values by replicating the mu vector for n rows
    MeanMat = repmat(mu,n,1);
    % Create a matrix of standard deviation values by replicating the sigma vector for n rows
    SigmaMat = repmat(sigma,n,1);
    % Create a matrix of zeros and ones, where ones indicate the location of outliers
    outliers = abs(data - MeanMat) > 3*SigmaMat;
    % Calculate the number of outliers in each column
    nout = sum(outliers) 
    % To remove an entire row of data containing the outlier
    data(any(outliers,2),:) = [];

在第一次运行中，它从1000个标准化随机行中删除了48行，这些行是从完整数据集中选择的。

这是我在数据上使用的完整脚本：

    %% load data
        %# read the list of features
        fid = fopen('kddcup.names','rt');
        C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
        fclose(fid);

        %# determine type of features
        C{2} = regexprep(C{2}, '.$','');              %# remove "." at the end
        attribNom = [ismember(C{2},'symbolic');true]; %# nominal features

        %# build format string used to read/parse the actual data
        frmt = cell(1,numel(C{1}));
        frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
        frmt( ismember(C{2},'symbolic') ) = {'%s'};   %# nominal features: read as string
        frmt = [frmt{:}];
        frmt = [frmt '%s'];                           %# add the class attribute

        %# read dataset
        fid = fopen('kddcup.data_10_percent_corrected','rt');
        C = textscan(fid, frmt, 'Delimiter',',');
        fclose(fid);

        %# convert nominal attributes to numeric
        ind = find(attribNom);
        G = cell(numel(ind),1);
        for i=1:numel(ind)
            [C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
        end

        %# all numeric dataset
        fulldata = cell2mat(C);

%% dimensionality reduction 
columns = 6
[U,S,V]=svds(fulldata,columns);

%% randomly select dataset
rows = 1000;
columns = 6;

%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';

%# pick random columns
indY = indY(1:columns);

%# filter data
data = U(indX,indY);

% apply normalization method to every cell
maxData = max(max(data));
minData = min(min(data));
data = ((data-minData)./(maxData));

% output matching data
dataSample = fulldata(indX, :)

%% When an outlier is considered to be more than three standard deviations away from the mean, use the following syntax to determine the number of outliers in each column of the count matrix:

mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 2.5*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers) 
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];

%% generate sample data
K = 6;
numObservarations = size(data, 1);
dimensions = 3;

%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);

%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);

%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);

这是输出中的两个不同的集群：

enter image description here

正如您所看到的，数据看起来比原始数据更清晰，更集群。但是我仍然认为可以使用更好的方法。

例如，观察整体聚类，我仍然有很多来自数据集的噪音（异常值）。从这里可以看出：

enter image description here

我需要将异常值行放入单独的数据集中以便以后分类（仅从群集中删除）

这是darpa数据集的链接，请注意10％的数据集已经显着减少了列，大多数已经删除了0或1的列已经被删除（42列到6列）：

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

修改

数据集中保留的列是：

src_bytes: continuous.

dst_bytes: continuous.

count: continuous.

srv_count: continuous.  

dst_host_count: continuous.

dst_host_srv_count: continuous.

RE-编辑：

根据与Anony-Mousse及其答案的讨论，可能有一种方法可以使用K-Medoids http://en.wikipedia.org/wiki/K-medoids来降低聚类中的噪音。我希望我目前拥有的代码没有太大的变化，但是我还不知道如何实现它来测试这是否会显着降低噪音。因此，假设某人可以向我展示一个有效的例子，这将被接受为答案。

Answer 1

请注意，使用此数据集不鼓励：

该数据集有错误：KDD Cup '99 dataset (Network Intrusion) considered harmful

使用不同的算法重新考虑。 k-means并不适用于混合型数据，其中许多属性离散，并且具有非常不同的比例。 K-means需要能够计算明智的手段。而对于二元矢量“0.5”不是一个明智的平均值，它应该是0或1。

另外，k-means不太喜欢异常值。

绘图时，请确保均匀缩放，否则结果看起来不正确。你的X轴长度约为0.9，你的y轴只有0.2 - 难怪它们看起来被压扁了。

总的来说，也许数据集没有k-means风格的集群？你肯定应该尝试基于密度的方法（因为这些可以处理异常值，如DBSCAN。但从你添加的可视化来看，我会说它最多有4-5个集群，而且它们并不是很有趣。它们可能在某些维度上被捕获了许多阈值。

Parallel coordinates

这是z-标准化后的数据集的可视化，以平行坐标可视化，具有5000个样本。鲜绿色是正常的。

您可以清楚地看到数据集的特殊属性。所有攻击在属性3和4（count和srv_count）中明显不同，并且最集中在dst_host_count和dst_host_srv_count。

我也在这个数据集上运行了OPTICS。它发现了许多星团，其中大部分都是葡萄酒色的攻击模式。但他们并不是很有趣。如果你有10个不同的主机ping泛洪，它们将形成10个集群。

OPTICS Clusters

您可以很好地看到OPTICS设法将这些攻击集中在一起。它错过了所有橙色的东西（也许如果我将碎片设置得更低，它会分散开来），但它甚至在葡萄酒色的攻击中发现了*结构），将其分解为许多单独的事件。

要真正了解此数据集，您应该从功能提取开始，例如将此类ping泛滥连接尝试合并到聚合事件< / em>的

另请注意这是不切实际的情况。

攻击中涉及众所周知的模式，特别是端口扫描。最好用专门的端口扫描检测器检测这些，而不是学习。

模拟数据有很多完全没有意义的“攻击”模拟。例如，来自90年代的Smurf attack是> 50％的数据集，而Syn flood是另外的20％;正常流量<20％！

对于这类攻击，有众所周知的签名。

许多现代攻击（例如SQL注入）都会以通常的HTTP流量流动，并且不会在原始流量模式中出现异常。

只是不要将此数据用于分类或异常值检测。只是不要。

引用上面的KDNuggets链接：


因此，我们强烈建议

（1）所有研究人员停止使用KDD Cup '99数据集，

（2）KDD杯和UCI网站在KDD杯'99数据集网页上发出警告，告知研究人员数据集存在已知问题，

（3）会议和期刊的同行评审员（甚至直接拒绝他们，这在网络安全社区中很常见），结果仅来自KDD Cup '99数据集。

这既不是真实也不是现实数据。去获取其他东西。

Answer 2

首先要做的事情是：你在这里要求很多。供将来参考：尝试以较小的块分解您的问题，并发布几个问题。这会增加你获得答案的机会（并且不会让你获得400点声望！）。

幸运的是，我理解你的困境，只是喜欢这样的问题！

除了这个数据集可能存在k-means的问题之外，这个问题仍然足够通用，也可以应用于其他数据集（因此Google员工最终会在这里寻找类似的东西），所以让我们继续解决这个问题。 / p>

我的建议是我们编辑这个答案，直到你得到相当满意的结果。

群集数量

任何群集问题的第1步：要选择多少个群集？我知道有几种方法可以选择适当数量的聚类。关于此问题有一个很好的wiki page，其中包含以下所有方法（以及其他一些方法）。

目视检查

这可能看起来很愚蠢，但是如果你有完全分开的数据，一个简单的情节可以告诉你（大约）你需要多少个集群，只需看一下。

优点：

快速
简单
适用于相对较小的数据集中分离良好的群集

缺点：

和脏
需要用户互动
很容易错过较小的群集
通过这种方法很难将数据分离较少的群集或很多群体的数据
这一切都相当主观 - 下一个人可能会选择与你不同的金额。

剪影情节

如您的other questions之一所示，制作silhouettes图表可帮助您更好地决定数据中正确的群集数量。

优点：

相对简单
通过使用统计测量来降低主观性
直观的方式来表示选择的质量

缺点：

需要用户互动
在限制中，如果您采用与数据点一样多的聚类，剪影图将告诉您是最佳选择
它仍然相当主观，不是基于统计手段
的计算成本可能很高

肘法

与剪影图方法一样，您反复运行kmeans，每次都有更多的聚类，您可以看到数据中的总方差有多少由此{{ 1}}跑。将存在许多聚类，其中解释的方差量将突然增加少于先前任何聚类数量（“肘”）的选择。从统计学上讲，肘部是群集数量的最佳选择。

优点：

无需用户交互 - 可以自动选择肘部
统计上比任何上述方法更健全

缺点：

有点复杂
仍然主观，因为“肘部”的定义取决于主观选择的参数
的计算成本可能很高

离群值

使用上述任何方法选择群集数量后，就可以进行离群值检测，以查看群集的质量是否有所改善。

我将从两步迭代方法开始，使用elbow方法。在伪Matlab中：

kmeans

困难的部分显然是确定data = your initial dataset dataMod = your initial dataset MAX = the number of clusters chosen by visual inspection while (forever) for N = MAX-5 : MAX+5 if (N < 1), continue, end perform k-means with N clusters on dataMod if (variance explained shows a jump) break end if (you are satisfied) break end for i = 1:N extract all points from cluster i find the centroid (let k-means do that) calculate the standard deviation of distances to the centroid mark points further than 3 sigma as possible outliers end dataMod = data with marked points removed end。这是算法有效性的关键。粗糙的结构这部分

you are satisfied

会是这样的

if (you are satisfied)
    break
end

if (situation has improved) data = dataMod elseif (situation is same or worse) dataMod = data break end当异常值较少或方差较小时解释situation has improved的所有选择比N中的上一循环更好。这也是一个可以解决的问题。

无论如何，我不会称之为第一次尝试。如果有人在这里看到不完整，缺陷或漏洞，请评论或编辑。

k-means聚类，异常值去除的不规则图

2 个答案:

群集数量

目视检查

剪影情节

肘法

离群值