How do I detect and remove outliers in a large dataset?

Asked: 2016-11-01 03:29:46

Tags: matlab machine-learning pattern-matching octave outliers

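Suppose I have the following dataset (google drive link).

The leftmost column is the type/class of the card (clubs, spades, diamonds, hearts). The remaining columns are the features (Hu moments).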

 4.00000000e+000 1.81685834e-001 9.69817396e-006 1.38999809e-003 4.53935830e-006 -3.00925971e-010 -1.02459512e-008 -1.98644904e-010
 4.00000000e+000 1.84243083e-001 1.38222459e-005 1.40735374e-003 5.54632513e-006 -4.43889193e-010 -1.56489028e-008 -2.07550859e-010
 4.00000000e+000 1.82590649e-001 5.79561254e-005 1.39540810e-003 5.08169772e-006 -1.84162373e-010 -6.28655267e-009 -3.86265224e-010
 ... ... ... ... ... ... ... ... ...
 3.00000000e+000 1.82603791e-001 9.40113955e-005 2.03910312e-003 8.28822917e-006 -1.07466686e-009 -7.79983939e-008 7.79123931e-011
 3.00000000e+000 1.83689464e-001 1.04402426e-005 2.03314034e-003 8.07647097e-006 -1.01550111e-009 -1.72512940e-008 1.99657322e-010
 3.00000000e+000 1.80537920e-001 3.57786643e-005 1.76044988e-003 4.93065630e-006 -4.50792164e-010 -2.52193544e-008 8.83931179e-011
 ... ... ... ... ... ... ... ... ...
 2.00000000e+000 1.69366341e-001 1.04327615e-003 1.34561560e-006 8.41412130e-008 9.60997904e-015 2.07709872e-009 -2.66313560e-014
 2.00000000e+000 1.70623294e-001 1.52567078e-003 2.33145414e-005 1.91976774e-006 1.28281112e-011 7.49218536e-008 -6.30393351e-013
 2.00000000e+000 1.71039727e-001 1.75199006e-003 3.56406516e-007 2.25222892e-008 -1.80796663e-016 8.75703034e-010 -2.00974686e-015
 ... ... ... ... ... ... ... ... ...
 1.00000000e+000 2.03297227e-001 4.88342633e-004 2.30244914e-003 2.76274577e-006 -1.62641080e-010 -5.06416340e-008 -1.48662421e-010
 1.00000000e+000 2.02575326e-001 3.16058139e-004 2.03933434e-003 4.34776729e-007 -1.26636446e-011 -7.63543121e-009 2.69021091e-012
 1.00000000e+000 2.02239287e-001 3.21962233e-004 1.94963577e-003 1.92362659e-006 -2.34173299e-011 -1.78153951e-008 1.15452477e-010
 1.00000000e+000 2.02709157e-001 2.28613647e-004 1.89761073e-003 1.09923103e-006 1.25239064e-011 -3.87194855e-009 4.86166479e-011
 1.00000000e+000 1.99640647e-001 1.80163318e-004 1.66091127e-003 3.40914582e-007 6.26687530e-012 7.47151809e-010 5.15120878e-012
 ... ... ... ... ... ... ... ... ...
 4.00000000e+000 1.94974773e-001 1.02770938e-003 3.32021924e-005 7.56951250e-005 -3.21487967e-009 2.42373008e-006 -2.01613839e-009
 4.00000000e+000 1.91031757e-001 1.04421581e-003 1.30233680e-005 5.48067243e-005 -1.41634644e-009 1.76666840e-006 3.71433852e-010
 4.00000000e+000 1.94861863e-001 9.86215578e-004 4.27892747e-005 7.04495953e-005 -3.50245985e-009 2.21146739e-006 -1.64137532e-009
  ... ... ... ... ... ... ... ... ...
  • Which outlier detection and removal method is best suited to this kind of data?
  • How can I detect and remove the outliers in this dataset?

Edit

My teacher wrote this:

load train.txt
load test.txt

% comparing mean and median values
[mean(train); median(train)]

% You can compare here different parameters computed on train 
% and test sets - they should be roughly the same

% plot histogram - to check histogram plotting first show labels (1..4)
% we can use hist for just one dimension
hist(train(:,1))

% now plot histogram of the first feature
hist(train(:,2))

% plot 2-dimensional diagram of the first two features
% it's good to repeat plotting after each modification of the training set
plot2features(train, 2, 3);

% to find row in which outlier sits
[mv mi] = max(train)

% to remove outlier from the training set
train(186,:)=[];

% to find row in which outlier sits
[mv mi] = min(train)

% to remove outlier from the training set
train(641,:)=[];

I can't understand what he did or why he did it.

1 Answer:

Answer 0 (score: 0)

Here is MATLAB code for outlier detection and removal:

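function mat = removeOutlier(train)
    % Column 1 is the class label, so only columns 2..end are searched.
    % mi holds, per feature, the row index of that feature's maximum.
    [~, mi] = max(train(:, 2:end));
    x = mode(mi);          % row that holds the maximum most often
    train(x, :) = [];
    % Repeat for the minimum on the reduced matrix.
    [~, mi] = min(train(:, 2:end));
    x = mode(mi);          % use the row index, not the extreme value
    train(x, :) = [];
    mat = train;
end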
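
The teacher's script and the function above both delete the single row that holds a column-wise extreme, which means the row numbers have to be found again after every deletion. As a rough alternative, here is a sketch that flags rows class by class using the median and the scaled median absolute deviation (MAD) of each feature. This is not the method from the question or from the teacher's script. The name removeOutliersByClass, the threshold k with its default of 3.5, and the 1.4826 scaling factor are my own assumptions. It also assumes that each class forms a single cluster in every feature, which may not hold for this dataset.

```matlab
function [clean, removed] = removeOutliersByClass(data, k)
    % data: N-by-(1+F) matrix; column 1 = class label, columns 2..end = features
    % k:    cutoff in robust standard deviations (assumed default: 3.5)
    if nargin < 2
        k = 3.5;
    end
    labels = data(:, 1);
    isOut = false(size(data, 1), 1);
    classes = unique(labels);
    for c = classes'
        rows = find(labels == c);
        X = data(rows, 2:end);
        med = median(X, 1);                     % per-feature median of this class
        dev = abs(bsxfun(@minus, X, med));      % absolute deviation from the median
        madv = 1.4826 * median(dev, 1);         % scaled MAD per feature
        madv(madv == 0) = eps;                  % guard against division by zero
        z = bsxfun(@rdivide, dev, madv);        % robust z-score per entry
        isOut(rows) = any(z > k, 2);            % outlier if any feature exceeds k
    end
    removed = find(isOut);                      % row indices in the original matrix
    clean = data(~isOut, :);
end
```

Called as [clean, removed] = removeOutliersByClass(train), it returns the filtered matrix and the indices of the rows it dropped, so the dropped rows can be inspected (for example with hist or plot2features) before the result is trusted. Working per class matters here because the features differ by several orders of magnitude between classes, so a value that is ordinary for one suit can look extreme when all suits are pooled.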