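How do I correctly remove outliers and define the predictors for a linear model?

Asked: 2018-01-05 12:52:17

Tags: python pandas numpy scikit-learn

I am learning how to build a simple linear model that finds the price of a flat from its area in square meters and its number of rooms. I have a .csv dataset with several features. 'Price' is of course one of them, but it contains several suspicious values such as '1' or '4000'. I want to remove these values based on the mean and standard deviation, so I use the following function to remove the outliers: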

import numpy as np
import pandas as pd

def reject_outliers(data):
    u = np.mean(data)
    s = np.std(data)
    # keep only the values strictly within two standard deviations of the mean
    data_filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return data_filtered

Then I write the function that builds the linear regression:

def linear_regression(data):
    data_filtered = reject_outliers(data['Price'])
    print(len(data_filtered))  # based on the length, I can see that several outliers have been removed

The next step is to define the data/predictors. I set up my features:

features = data[['SqrMeters', 'Rooms']]
target = data_filtered

X = features
Y = target

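This is where my problem is. How can I get the same observations for my X and my Y? Right now the number of samples is inconsistent (5000 for my X, and 4995 for my Y after removing the outliers). Thanks for any help on this topic.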

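2 answers:

Answer 0: (score: 1)

The features and the labels should have the same length.

To get that, pass the entire data object to reject_outliers, so that whole rows are dropped and the features are filtered together with the price: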

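def reject_outliers(data):
    u = np.mean(data["Price"])
    s = np.std(data["Price"])
    # boolean mask: keep the rows whose Price is within two standard deviations of the mean
    data_filtered = data[(data["Price"] > (u - 2 * s)) & (data["Price"] < (u + 2 * s))]
    return data_filtered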

You can use it like this:

data_filtered = reject_outliers(data)
features = data_filtered[['SqrMeters', 'Rooms']]
target = data_filtered['Price']
X = features
y = target
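
To make the effect concrete, here is a minimal, self-contained sketch of the approach above. The twelve rows are made up purely for illustration (they are not the asker's dataset), and the row with Price equal to 1 stands in for the suspicious values mentioned in the question. Because whole rows are filtered, X and y keep the same length and the same index.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only; the row with Price == 1 plays the suspicious value.
data = pd.DataFrame({
    "SqrMeters": [80, 95, 90, 70, 85, 75, 82, 78, 88, 73, 60, 92],
    "Rooms":     [3, 4, 4, 2, 3, 3, 3, 3, 4, 2, 2, 4],
    "Price":     [250000, 260000, 255000, 245000, 252000, 248000,
                  251000, 249000, 253000, 247000, 1, 256000],
})

def reject_outliers(data):
    u = np.mean(data["Price"])
    s = np.std(data["Price"])
    # Filter rows of the DataFrame, not just the Price values.
    return data[(data["Price"] > u - 2 * s) & (data["Price"] < u + 2 * s)]

data_filtered = reject_outliers(data)
X = data_filtered[["SqrMeters", "Rooms"]]
y = data_filtered["Price"]

# Row counts before filtering, and for X and y after filtering.
print(len(data), len(X), len(y))
# Whether X and y still refer to the same rows.
print(X.index.equals(y.index))
```

X and y built this way describe the same observations, so they can be passed together to whatever estimator you fit afterwards.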

Answer 1: (score: 1)

The following works for Pandas DataFrames (data):

def reject_outliers(data):
    u = np.mean(data.Price)
    s = np.std(data.Price)
    data_filtered = data[(data.Price > u-2*s) & (data.Price < u+2*s)]
    return data_filtered
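
If you want to reuse the same idea for other columns or thresholds, one possible variation is sketched below. The `column` and `k` parameters are my own additions for illustration and are not part of either answer, and the sample rows are made up. It uses the pandas `mean` and `std` methods. Note that `Series.std` defaults to `ddof=1` while `np.std` defaults to `ddof=0`, so `ddof=0` is passed explicitly to match the NumPy calls used above.

```python
import pandas as pd

def reject_outliers(data, column="Price", k=2.0):
    """Keep the rows whose `column` value lies strictly within k standard
    deviations of the mean. `column` and `k` are hypothetical parameters
    added for this sketch."""
    u = data[column].mean()
    s = data[column].std(ddof=0)  # ddof=0 matches the default of np.std
    mask = (data[column] > u - k * s) & (data[column] < u + k * s)
    return data[mask]

# Made-up rows for illustration only.
data = pd.DataFrame({
    "SqrMeters": [80, 95, 90, 70, 85, 75, 82, 78, 88, 73, 60, 92],
    "Rooms":     [3, 4, 4, 2, 3, 3, 3, 3, 4, 2, 2, 4],
    "Price":     [250000, 260000, 255000, 245000, 252000, 248000,
                  251000, 249000, 253000, 247000, 1, 256000],
})

data_filtered = reject_outliers(data)          # defaults: Price, two standard deviations
X = data_filtered[["SqrMeters", "Rooms"]]
y = data_filtered["Price"]
```

The threshold of two standard deviations is the one from the question. Whether it suits your data is a judgment call, since a few extreme values also inflate the mean and standard deviation used to detect them.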