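I am learning how to build a simple linear model that predicts the price of a flat from its square meters and number of rooms. I have a .csv dataset with several features. 'Price' is of course one of them, but it contains several suspicious values such as '1' or '4000'. I want to remove these values based on the mean and standard deviation, so I use the following function to remove the outliers: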
import numpy as np
import pandas as pd

def reject_outliers(data):
    u = np.mean(data)
    s = np.std(data)
    data_filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return data_filtered
Then I write a function to build the linear regression:
def linear_regression(data):
    data_filtered = reject_outliers(data['Price'])
    print(len(data_filtered))  # based on the length I see that several outliers have been removed
The next step is to define the data/predictors. I set up my features:
    features = data[['SqrMeters', 'Rooms']]
    target = data_filtered
    X = features
    Y = target
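Here is my problem: how do I get the same observations for my X and Y? Right now the number of samples is inconsistent (5000 for my X, and 4995 for my Y after removing the outliers). Thanks for any help on this topic.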
Answer 0 (score: 1)
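The features and the labels should have the same length, so you should pass the whole data object to reject_outliers. That way each outlier row is dropped from the features and the target together: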
def reject_outliers(data):
    u = np.mean(data["Price"])
    s = np.std(data["Price"])
    data_filtered = data[(data["Price"] > (u - 2 * s)) & (data["Price"] < (u + 2 * s))]
    return data_filtered
You can use it like this:
data_filtered = reject_outliers(data)
features = data_filtered[['SqrMeters', 'Rooms']]
target = data_filtered['Price']
X = features
y = target
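To check that filtering whole rows keeps the features and the target aligned, here is a small self-contained sketch. The toy values are made up for illustration (they are not from the asker's dataset); only the column names follow the question.

```python
import numpy as np
import pandas as pd


def reject_outliers(data):
    """Keep only rows whose Price lies within 2 standard deviations of the mean."""
    u = np.mean(data["Price"])
    s = np.std(data["Price"])
    return data[(data["Price"] > u - 2 * s) & (data["Price"] < u + 2 * s)]


# Hypothetical toy data: nine plausible prices and one obvious outlier (4000).
data = pd.DataFrame({
    "SqrMeters": [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
    "Rooms": [2, 2, 3, 3, 3, 4, 4, 4, 5, 5],
    "Price": [100, 120, 140, 160, 180, 200, 220, 240, 260, 4000],
})

data_filtered = reject_outliers(data)

# Both X and y are taken from the same filtered frame, so they share rows.
X = data_filtered[["SqrMeters", "Rooms"]]
y = data_filtered["Price"]

print(len(data), len(X), len(y))
```

With this toy data the row with Price 4000 is dropped, so X and y both end up with 9 rows and the same index.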
Answer 1 (score: 1)
The following works for pandas DataFrames (data):
def reject_outliers(data):
    u = np.mean(data.Price)
    s = np.std(data.Price)
    data_filtered = data[(data.Price > u - 2 * s) & (data.Price < u + 2 * s)]
    return data_filtered
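As a rough sketch of how the filtered frame could then feed a fit: the question does not say which library is used for the regression, so this example assumes plain NumPy least squares (np.linalg.lstsq) as a stand-in. The toy data is hypothetical and constructed so that Price is exactly 2 * SqrMeters apart from one outlier.

```python
import numpy as np
import pandas as pd


def reject_outliers(data):
    # Same row-wise filter as in this answer, using attribute access to Price.
    u = np.mean(data.Price)
    s = np.std(data.Price)
    return data[(data.Price > u - 2 * s) & (data.Price < u + 2 * s)]


# Hypothetical toy data: Price = 2 * SqrMeters, except for the last row.
data = pd.DataFrame({
    "SqrMeters": [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
    "Rooms": [2, 2, 3, 3, 3, 4, 4, 4, 5, 5],
    "Price": [100, 120, 140, 160, 180, 200, 220, 240, 260, 4000],
})

data_filtered = reject_outliers(data)

# Features and target come from the same filtered rows, so their lengths match.
X = data_filtered[["SqrMeters", "Rooms"]].to_numpy(dtype=float)
y = data_filtered["Price"].to_numpy(dtype=float)

# Prepend a column of ones for the intercept and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, w_sqm, w_rooms = coef

print(X.shape, y.shape)
print(intercept, w_sqm, w_rooms)
```

Because X and y are sliced from the same data_filtered, no sample-count mismatch can occur at the fitting step. Note that attribute access such as data.Price only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; data["Price"] is the more general form.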