I only want to remove the most extreme outliers from my dataset. I plotted the outliers with the following code:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def percentile_based_outlier(data, threshold):
    # Flag values that fall outside the central `threshold` percent of the data
    diff = (100 - threshold) / 2
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

col_names = ['Pts/75', 'PlayVal', 'FTA/100', 'rDREB%']
threshold = 99.5

# Data is the DataFrame holding these columns (loaded elsewhere)
fig, ax = plt.subplots(len(col_names), figsize=(8,40))
for i, col_val in enumerate(col_names):
    x = Data[col_val][:1346]
    sns.distplot(x, ax=ax[i], rug=True, hist=False)
    # Mark the flagged points in red along the x-axis
    outliers = x[percentile_based_outlier(x, threshold)]
    ax[i].plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)
    ax[i].set_title('Outlier detection - {}'.format(col_val), fontsize=15)
    ax[i].set_xlabel(col_val, fontsize=10)
plt.show()
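Now, how can I modify this code so that the outliers are removed after the plots are generated? If that is not possible, is there an alternative, such as a separate function that removes the outliers using the same threshold parameter? (I also have other columns that I want to clean with different thresholds.)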
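
To make the request concrete, here is a rough sketch of the kind of separate function I have in mind. It reuses `percentile_based_outlier` unchanged. The name `remove_outliers`, the column-to-threshold dictionary, and the toy `demo` frame are all made up for illustration and are not part of my real code or data.

```python
import numpy as np
import pandas as pd


def percentile_based_outlier(data, threshold):
    # Same mask as in the plotting code: True where a value lies
    # outside the central `threshold` percent of the data
    diff = (100 - threshold) / 2
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)


def remove_outliers(df, thresholds):
    # Hypothetical helper: `thresholds` maps column name -> threshold.
    # A row is dropped if it is flagged in any of the listed columns.
    keep = pd.Series(True, index=df.index)
    for col, threshold in thresholds.items():
        keep &= ~percentile_based_outlier(df[col], threshold)
    return df[keep].copy()


# Toy data (invented): one obviously extreme value in 'Pts/75'
demo = pd.DataFrame({
    'Pts/75': list(range(100)) + [10_000],
    'PlayVal': [1.0] * 100 + [2.0],
})
cleaned = remove_outliers(demo, {'Pts/75': 98})
```

With a dictionary like `{'Pts/75': 99.5, 'FTA/100': 99}` each column would get its own threshold. Note that this drops whole rows and, because the percentile cut is symmetric, also trims the low end of each listed column, which may or may not be what I want.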