在Python中的一些Dataframe列上使用Imputer

时间:2016-07-26 07:59:20

标签: python scikit-learn missing-data

我正在学习如何在Python上使用Imputer。

这是我的代码:

df=pd.DataFrame([["XXL", 8, "black", "class 1", 22],
["L", np.nan, "gray", "class 2", 20],
["XL", 10, "blue", "class 2", 19],
["M", np.nan, "orange", "class 1", 17],
["M", 11, "green", "class 3", np.nan],
["M", 7, "red", "class 1", 22]])

df.columns=["size", "price", "color", "class", "boh"]

from sklearn.preprocessing import Imputer

imp=Imputer(missing_values="NaN", strategy="mean" )
imp.fit(df["price"])

df["price"]=imp.transform(df["price"])

然而,这会引发以下错误: ValueError:值的长度与索引的长度

不匹配

我的代码有什么问题???

感谢您的帮助

4 个答案:

答案 0 :(得分:13)

这是因为Imputer通常用于DataFrames而不是Series。可能的解决方案是:

imp=Imputer(missing_values="NaN", strategy="mean" )
imp.fit(df[["price"]])
df["price"]=imp.transform(df[["price"]]).ravel()

# Or even 
imp=Imputer(missing_values="NaN", strategy="mean" )
df["price"]=imp.fit_transform(df[["price"]]).ravel()

答案 1 :(得分:2)

我认为您要为imputer指定轴,然后转置它返回的数组:

import pandas as pd
import numpy as np

df=pd.DataFrame([["XXL", 8, "black", "class 1", 22],
["L", np.nan, "gray", "class 2", 20],
["XL", 10, "blue", "class 2", 19],
["M", np.nan, "orange", "class 1", 17],
["M", 11, "green", "class 3", np.nan],
["M", 7, "red", "class 1", 22]])

df.columns=["size", "price", "color", "class", "boh"]

from sklearn.preprocessing import Imputer

imp=Imputer(missing_values="NaN", strategy="mean",axis=1 ) #specify axis
q = imp.fit_transform(df["price"]).T #perform a transpose operation


df["price"]=q
print df 

答案 2 :(得分:1)

简单的解决方案是提供2D阵列

balloon 65536

这是您的DataFrame

https://aws.amazon.com/blogs/compute/clean-up-your-container-images-with-amazon-ecr-lifecycle-policies/

答案 3 :(得分:0)

这里是 Simple Imputer 的文档。对于fit方法,它采用类似数组或稀疏的metrix作为输入参数。 您可以尝试:

imp.fit(df.iloc[:,1:2]) 
df['price']=imp.transform(df.iloc[:,1:2])

提供索引位置以适合方法,然后应用转换。

>>> df
   size  price   color    class   boh
 0  XXL    8.0   black  class 1  22.0
 1    L    9.0    gray  class 2  20.0
 2   XL   10.0    blue  class 2  19.0
 3    M    9.0  orange  class 1  17.0
 4    M   11.0   green  class 3   NaN
 5    M    7.0     red  class 1  22.0

您可以为boh

做同样的事情
imp.fit(df.iloc[:,4:5])
df['price']=imp.transform(df.iloc[:,4:5])
>>> df
    size  price   color    class   boh
 0  XXL    8.0   black  class 1  22.0
 1    L    9.0    gray  class 2  20.0
 2   XL   10.0    blue  class 2  19.0
 3    M    9.0  orange  class 1  17.0
 4    M   11.0   green  class 3  20.0
 5    M    7.0     red  class 1  22.0

如果我错了,请纠正我。建议将不胜感激。