How to remove outliers

Date: 2019-12-24 22:05:43

Tags: python-3.x machine-learning outliers

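I managed to apply the interquartile range (IQR) rule, but when I draw a box-and-whisker plot of the dataset with the outliers removed, I still see outliers. What is wrong? Here is the code: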

# Load libraries
import pandas as pd
from pandas import read_csv
from matplotlib import pyplot as plt

# Load dataset
filename = "/home/fogang/dataset/Regression/Housing Boston/housing.csv"
df = read_csv(filename, header=0)
df = df.drop('Unnamed: 0', axis=1)  # Drop the column 'Unnamed: 0'
one_dim = pd.DataFrame()
one_dim['rm'] = df['rm']

# Shape of the dataset
print(one_dim.shape)

# Peek at the dataset
print(one_dim.head(10))

# Check for NaN values
print(one_dim.isnull().sum())

# Box-and-whisker plot of the raw data
one_dim.plot(kind='box', subplots=True, layout=(1, 1), sharex=False, sharey=False, fontsize=12)
plt.show()

# Describe the dataset
print(one_dim.describe())

# Compute the interquartile range and the 1.5 * IQR fences
unidim = one_dim['rm']
unidim_Q1 = unidim.quantile(0.25)
unidim_Q3 = unidim.quantile(0.75)
unidim_IQR = unidim_Q3 - unidim_Q1
unidim_lower = unidim_Q1 - (1.5 * unidim_IQR)
unidim_upper = unidim_Q3 + (1.5 * unidim_IQR)

# Outliers: values outside the fences
unidim_outliers = pd.DataFrame()
unidim_outliers['outliers'] = unidim[(unidim < unidim_lower) | (unidim > unidim_upper)]
unidim_outliers.info()

# Good data: values inside the fences
unidim_good = pd.DataFrame()
unidim_good['good'] = unidim[(unidim >= unidim_lower) & (unidim <= unidim_upper)]
unidim_good.info()

# Box-and-whisker plot of the filtered data
unidim_good.plot(kind='box', subplots=True, layout=(1, 2), sharex=False, sharey=False, fontsize=12)
plt.show()

What should I do?

1 Answer:

Answer 0: (score: 0)

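Your distribution has long tails at both ends. After you cut out the first set of outliers and check again, the box plot recomputes the quartiles and the IQR on the trimmed data, so the fences usually tighten and some of the remaining points fall outside them as new outliers. If you want to get rid of the outliers in a single cut, you can try a stricter rule, although no fixed multiplier is guaranteed to leave none. For example: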

unidim_lower = unidim_Q1 - (1.3 * unidim_IQR)
unidim_upper = unidim_Q3 + (1.3 * unidim_IQR)
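Another option, instead of guessing a stricter multiplier, is to repeat the 1.5 * IQR filter until a pass removes nothing. The following is only a minimal sketch of that idea. It runs on a synthetic heavy-tailed sample standing in for `df['rm']` (an assumption, not the real housing data), and the helper names `iqr_bounds` and `trim_until_stable` are hypothetical, introduced here for illustration.

```python
import numpy as np
import pandas as pd


def iqr_bounds(s, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR of a Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trim_until_stable(s, k=1.5):
    """Re-apply the IQR filter until a pass removes no more points."""
    passes = 0
    while True:
        lower, upper = iqr_bounds(s, k)
        kept = s[(s >= lower) & (s <= upper)]
        if len(kept) == len(s):  # nothing removed: the fences are stable
            return s, passes
        s = kept
        passes += 1


# Synthetic heavy-tailed sample (assumption: stands in for df['rm'])
rng = np.random.default_rng(0)
rm = pd.Series(rng.standard_t(df=3, size=500) * 0.7 + 6.3, name="rm")

# Single pass: fences computed on the full data
lower, upper = iqr_bounds(rm)
once = rm[(rm >= lower) & (rm <= upper)]

# Fences recomputed on the trimmed data, as the second box plot does
lower2, upper2 = iqr_bounds(once)
print("fences on full data:   ", (lower, upper))
print("fences on trimmed data:", (lower2, upper2))
print("points outside the recomputed fences:",
      int(((once < lower2) | (once > upper2)).sum()))

# Repeated trimming until no point lies outside the fences
stable, passes = trim_until_stable(rm)
print("passes that removed points:", passes, "| rows kept:", len(stable))
```

On heavy-tailed data the repeated filter can discard a noticeable share of the rows, so it is worth comparing `len(stable)` with `len(rm)` before using the result.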

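But a word of warning: not every "outlier" is harmful to the model. You should decide carefully which values count as "nominal" and which are useful data regardless.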