How to remove outliers

Time: 2019-02-13 05:43:44

Tags: python scikit-learn outliers

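I am working on a regression problem with 10 independent variables, using SVR. Despite doing feature selection and tuning the SVR parameters with grid search, I still get a large MAPE of 15%. So I tried removing outliers, but after removing them I am unable to split the data. My question is: do outliers affect the accuracy of a regression?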

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import preprocessing
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer


def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


features = pd.read_csv('selectedData.csv')
target = features['SYSLoad']
features = features.drop('SYSLoad', axis=1)

# Keep only rows where every feature has |z-score| < 3
z = np.abs(stats.zscore(features))
print(z)
threshold = 3
print(np.where(z > 3))
features2 = features[(z < 3).all(axis=1)]

train_input, test_input, train_target, test_target = train_test_split(features2, target, test_size=0.25, random_state=42)

Running this code raises the following error:

  

    " samples: %r" % [int(l) for l in lengths])

ValueError: Found input variables with inconsistent numbers of samples: [33352, 35064]

1 answer:

Answer 0 (score: 1)

You get the error because your target variable has the same length as features (presumably 35064), since it is taken directly from it:

target = features['SYSLoad']

while your features2 variable is shorter (presumably 33352), since it is only a subset of features, because of

features2 = features[(z < 3).all(axis=1)]

so train_test_split reasonably complains that the features and the labels do not have the same length.
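
To see the mismatch in isolation, here is a minimal sketch that reproduces the same error on toy data. The DataFrame, the column names x1 and x2, and the simple `x1 < 8` filter are made up for illustration and only stand in for selectedData.csv and the z-score filter.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for selectedData.csv (10 rows)
features = pd.DataFrame({"x1": np.arange(10.0), "x2": np.arange(10.0) * 2})
target = pd.Series(np.arange(10.0), name="SYSLoad")

# Filter the features only, as in the question; target keeps all 10 rows
features2 = features[features["x1"] < 8]

try:
    train_test_split(features2, target, test_size=0.25, random_state=42)
    error_message = None
except ValueError as exc:
    # scikit-learn checks that all inputs have the same number of samples
    error_message = str(exc)

print(len(features2), len(target))
print(error_message)
```

The two lengths printed differ, which is exactly what the ValueError reports.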

So you should subset target in the same way, and pass the resulting target2 to train_test_split:

target2 = target[(z < 3).all(axis=1)]
train_input, test_input, train_target, test_target = train_test_split(features2, target2, test_size = 0.25, random_state = 42)
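
If it helps, the same idea can be written by computing the row mask once and applying it to both objects, so the two can never drift apart. This is only a sketch on synthetic data: the random DataFrame, the planted outlier, and the variable name mask are assumptions for illustration, not part of your dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split

# Synthetic stand-in for selectedData.csv (assumption: numeric columns only)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "SYSLoad"])
data.loc[0, "x1"] = 50.0  # plant one obvious outlier in the first row

target = data["SYSLoad"]
features = data.drop("SYSLoad", axis=1)

# Build the boolean row mask once: True where every feature has |z-score| < 3
z = np.abs(stats.zscore(features))
mask = np.asarray(z < 3).all(axis=1)

# Apply the same mask to features and target so their rows stay aligned
features2 = features[mask]
target2 = target[mask]

train_input, test_input, train_target, test_target = train_test_split(
    features2, target2, test_size=0.25, random_state=42
)
print(len(features), len(features2), len(target2))
```

Because features2 and target2 come from the same mask, they share the same index and the split goes through.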