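I applied the interquartile range (IQR) rule successfully, but when I draw a box-and-whisker plot of the dataset with the outliers removed, the plot still shows outliers. What is going wrong? Here is the code: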
# Load libraries
import pandas as pd
from pandas import read_csv, set_option
from matplotlib import pyplot as plt
# Load dataset
filename = "/home/fogang/dataset/Regression/Housing Boston/housing.csv"
df = read_csv(filename, header=0)
df = df.drop('Unnamed: 0', axis=1)  # Let's delete the column 'Unnamed: 0'
one_dim = pd.DataFrame()
one_dim['rm'] = df['rm']
# Shape of dataset
print(one_dim.shape)
# Peek at dataset
print(one_dim.head(10))
# Let's look whether there are NaN values
print(one_dim.isnull().sum())
# Box and whisker plots
one_dim.plot(kind='box', subplots=True, layout=(1, 1), sharex=False, sharey=False, fontsize=12)
plt.show()
# Describe dataset
print(one_dim.describe())
# Let's find the inter-quartile range
unidim = one_dim['rm']
unidim_Q1 = unidim.quantile(0.25)
unidim_Q3 = unidim.quantile(0.75)
unidim_IQR = unidim_Q3 - unidim_Q1
unidim_lower = unidim_Q1 - (1.5 * unidim_IQR)
unidim_upper = unidim_Q3 + (1.5 * unidim_IQR)
# Outliers
unidim_outliers = pd.DataFrame()
unidim_outliers['outliers'] = unidim[(unidim < unidim_lower) | (unidim > unidim_upper)]
unidim_outliers.info()
# Good data
unidim_good = pd.DataFrame()
unidim_good['good'] = unidim[(unidim >= unidim_lower) & (unidim <= unidim_upper)]
unidim_good.info()
unidim_good.plot(kind='box', subplots=True, layout=(1, 2), sharex=False, sharey=False, fontsize=12)
plt.show()
What should I do?
Answer 0 (score: 0)
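Your data has wide tails at both the lower and upper ends. After you cut out some outliers and check again, the box plot recomputes the quartiles and the IQR on the trimmed data, so the fences move inward and some of the remaining points now fall outside them as new outliers. If you want to eliminate the outliers in a single cut, you can use a stricter cutoff, for example: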
unidim_lower = unidim_Q1 - (1.3 * unidim_IQR)
unidim_upper = unidim_Q3 + (1.3 * unidim_IQR)
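To illustrate why a second box plot can still show outliers, here is a minimal sketch that trims once, recomputes the fences on the trimmed data, and then repeats the trimming until nothing falls outside. The helper names (`iqr_fences`, `trim_once`, `trim_until_stable`) and the sample values are assumptions made up for this example; they are not part of the original code or of pandas.

```python
import pandas as pd


def iqr_fences(s, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR of a Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trim_once(s, k=1.5):
    """Keep only the values that lie within the fences computed on s itself."""
    lower, upper = iqr_fences(s, k)
    return s[(s >= lower) & (s <= upper)]


def trim_until_stable(s, k=1.5):
    """Repeat the trimming until a pass removes nothing."""
    while True:
        trimmed = trim_once(s, k)
        if len(trimmed) == len(s):
            return trimmed
        s = trimmed


# Hypothetical values standing in for the 'rm' column
rm = pd.Series([0, 2, 9, 10, 11, 12, 13, 14, 15, 16, 17, 24, 30], dtype=float)

first_pass = trim_once(rm)
print(iqr_fences(rm))          # fences computed on the raw data
print(iqr_fences(first_pass))  # fences recomputed on the trimmed data are narrower

stable = trim_until_stable(rm)
print(len(rm), len(first_pass), len(stable))  # 13 11 9
```

In this toy data the first pass removes two points, yet two of the survivors lie outside the recomputed fences, which is the effect seen in the second box plot. Repeated trimming does converge, but it can discard a sizeable share of a heavy-tailed dataset, so it is worth checking how many rows remain.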
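But a word of caution: not all "outliers" are bad for your model. You should choose carefully what counts as a "nominal" value and which data points are useful regardless.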
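One possible way to act on that caution is to flag the points outside the fences instead of deleting them, so they can be inspected before deciding. The sketch below assumes a DataFrame with an `rm` column filled with made-up values; the `is_outlier` column name is a hypothetical choice for this example.

```python
import pandas as pd

# Hypothetical values standing in for the 'rm' column
df = pd.DataFrame(
    {"rm": [0, 2, 9, 10, 11, 12, 13, 14, 15, 16, 17, 24, 30]}, dtype=float
)

q1, q3 = df["rm"].quantile(0.25), df["rm"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Mark rows outside the fences instead of dropping them;
# Series.between is inclusive at both ends by default
df["is_outlier"] = ~df["rm"].between(lower, upper)

# Inspect the flagged rows before deciding what to do with them
flagged = df[df["is_outlier"]]
print(flagged)
```

All rows stay in `df`, so a model can later be fitted with and without the flagged rows to compare the results.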