I am trying to impute missing values in a pandas DataFrame using linear regression.
`
for index in [missing_data_df.horsepower.index]:
    i = 0
    if pd.isnull(missing_data_df.horsepower[index[i]]):
        # linear regression equation
        a = (0.25743277 * missing_data_df.displacement[index[i]]
             + 0.00958711 * missing_data_df.weight[index[i]]
             + 25.874947903262651)
        # replacing "nan" values in dataframe using .set_value
        missing_data_df.set_value(index[i], "horsepower", a)
    i += 1
`
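It runs, but the missing values (NaN) in the DataFrame are not replaced by the regression prediction stored in the variable `a`. Any suggestions?
This is the DataFrame containing the missing data:
`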
>>> missing_data_df
      mpg  cylinders  displacement  horsepower  weight  acceleration  \
10    NaN        4.0         133.0       115.0  3090.0          17.5
11    NaN        8.0         350.0       165.0  4142.0          11.5
12    NaN        8.0         351.0       153.0  4034.0          11.0
13    NaN        8.0         383.0       175.0  4166.0          10.5
14    NaN        8.0         360.0       175.0  3850.0          11.0
17    NaN        8.0         302.0       140.0  3353.0           8.0
38   25.0        4.0          98.0         NaN  2046.0          19.0
39    NaN        4.0          97.0        48.0  1978.0          20.0
133  21.0        6.0         200.0         NaN  2875.0          17.0
337  40.9        4.0          85.0         NaN  1835.0          17.3
343  23.6        4.0         140.0         NaN  2905.0          14.3
361  34.5        4.0         100.0         NaN  2320.0          15.8
367   NaN        4.0         121.0       110.0  2800.0          15.4
382  23.0        4.0         151.0         NaN  3035.0          20.5

     model_year  origin  car_name
10         70.0     2.0  citroen ds-21 pallas
11         70.0     1.0  chevrolet chevelle concours (sw)
12         70.0     1.0  ford torino (sw)
13         70.0     1.0  plymouth satellite (sw)
14         70.0     1.0  amc rebel sst (sw)
17         70.0     1.0  ford mustang boss 302
38         71.0     1.0  ford pinto
39         71.0     2.0  volkswagen super beetle 117
133        74.0     1.0  ford maverick
337        80.0     2.0  renault lecar deluxe
343        80.0     1.0  ford mustang cobra
361        81.0     2.0  renault 18i
367        81.0     2.0  saab 900s
382        82.0     1.0  amc concord dl
`
Answer 0 (score: 1)
You can use `apply` with a lambda:
    missing_data_df['horsepower'] = missing_data_df.apply(
        lambda row:
            0.25743277 * row.displacement + 0.00958711 * row.weight + 25.874947903262651
            if np.isnan(row.horsepower) else row.horsepower,
        axis=1)
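As a possible alternative to the row-wise `apply`, the same fill can be sketched column-wise with `Series.fillna`. The small frame `df` below is a hypothetical stand-in built from three rows of the question's data, not the full `missing_data_df`; the coefficients are the ones given in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame built from values in the question's data.
df = pd.DataFrame(
    {
        "displacement": [133.0, 98.0, 200.0],
        "horsepower": [115.0, np.nan, np.nan],
        "weight": [3090.0, 2046.0, 2875.0],
    },
    index=[10, 38, 133],
)

# Regression prediction for every row, computed on whole columns at once.
predicted = (
    0.25743277 * df["displacement"]
    + 0.00958711 * df["weight"]
    + 25.874947903262651
)

# fillna with a Series aligns on the index, so only NaN entries are replaced
# and the existing horsepower values are left alone.
df["horsepower"] = df["horsepower"].fillna(predicted)
print(df)
```

This avoids calling a Python function once per row, which tends to matter only on larger frames.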
Answer 1 (score: 0)
A few things.
To compute the weight, try:
    for idx in missing_data_df.index:
        if pd.isnull(missing_data_df.loc[idx, "weight"]):
            disp = missing_data_df.loc[idx, "displacement"]
            hp = missing_data_df.loc[idx, "horsepower"]
            missing_data_df.loc[idx, "weight"] = (0.25743277 * disp + 25.874947903262651 - hp) / -0.00958711
In general, `.loc[]` and `.iloc[]` are the better way to look up or set values.
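Applied to the `horsepower` column from the question, the `.loc` approach might look like the sketch below. It also shows why the loop in the question appears to do nothing: `[missing_data_df.horsepower.index]` is a list with a single element (the whole Index), so the body runs once with `i = 0` and only the first label is ever checked. The frame `df` and the names `looped` and `masked` are hypothetical, built from three rows of the question's data. Separately, `DataFrame.set_value` was deprecated in pandas 0.21 and removed in 1.0, so `.loc` or `.at` is the usual replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame built from values in the question's data.
df = pd.DataFrame(
    {
        "displacement": [133.0, 98.0, 200.0],
        "horsepower": [115.0, np.nan, np.nan],
        "weight": [3090.0, 2046.0, 2875.0],
    },
    index=[10, 38, 133],
)

# Wrapping the Index in a list gives a one-element list, so a loop over
# it runs its body once and only ever inspects the first label.
print(len([df.horsepower.index]))

# Loop over the labels themselves and read/write through .loc.
looped = df.copy()
for idx in looped.index:
    if pd.isnull(looped.loc[idx, "horsepower"]):
        looped.loc[idx, "horsepower"] = (
            0.25743277 * looped.loc[idx, "displacement"]
            + 0.00958711 * looped.loc[idx, "weight"]
            + 25.874947903262651
        )

# Same result without a loop: select the NaN rows with a boolean mask.
masked = df.copy()
mask = masked["horsepower"].isna()
masked.loc[mask, "horsepower"] = (
    0.25743277 * masked.loc[mask, "displacement"]
    + 0.00958711 * masked.loc[mask, "weight"]
    + 25.874947903262651
)
```

The boolean-mask form is usually the more idiomatic of the two; the explicit loop is kept here because it mirrors the structure of the original code.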