My pandas data contains some text-type columns, and those text columns contain some NaN values. What I am trying to do is replace those NaNs using sklearn.preprocessing.Imputer (replacing each NaN with the most frequent value). The problem is the implementation.
Suppose there is a Pandas dataframe df with 30 columns, 10 of which are categorical in nature.
Once I run:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df)
Python raises the error: 'could not convert string to float: run1',
where 'run1' is an ordinary (non-missing) value from the first column, which holds categorical data.
Any help would be very welcome!
Answer 0 (score: 78)
To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like the following. You could further distinguish between integers and floats; I imagine it might make sense to use the median for integer columns.
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin
class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.

        Columns of other types are imputed with mean of column.
        """

    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]
X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)
print('before...')
print(X)
print('after...')
print(xt)
which prints:
before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
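The remark about integer columns can be sketched as a variant of the class above. This is an illustration, not part of the original answer: the `fill_value` helper and the class name are my own additions, and integer detection uses pandas' nullable Int64 dtype since classic int columns cannot hold NaN.

```python
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataFrameImputerWithMedian(TransformerMixin):
    """Variant of DataFrameImputer: object columns get the most frequent
    value, integer columns the median, other numeric columns the mean."""

    def fit(self, X, y=None):
        def fill_value(col):
            if col.dtype == np.dtype('O'):
                return col.value_counts().index[0]   # most frequent value
            if pd.api.types.is_integer_dtype(col):   # e.g. nullable Int64
                return col.median()
            return col.mean()
        self.fill = pd.Series([fill_value(X[c]) for c in X], index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

df = pd.DataFrame({'cat': ['a', 'b', 'b', np.nan],
                   'ints': pd.array([1, 2, 10, pd.NA], dtype='Int64'),
                   'floats': [1.0, 1.0, 2.0, np.nan]})
out = DataFrameImputerWithMedian().fit_transform(df)
```

Here the missing 'ints' entry is filled with the median (2) rather than the mean (4.33), which keeps the value representable as an integer.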
Answer 1 (score: 2)
You can use sklearn_pandas.CategoricalImputer for the categorical columns. In more detail:
First (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow) you can have sub-pipelines for numerical and string/categorical features, where each sub-pipeline's first transformer is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas DataFrame):
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
You can then combine these sub-pipelines with sklearn.pipeline.FeatureUnion, for example:
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])
Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.
Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.
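Putting the pieces together, here is a minimal runnable sketch of the two sub-pipelines. Since Imputer and CategoricalImputer are tied to older library versions, this sketch substitutes the modern sklearn.impute.SimpleImputer in both branches; the column names num and cat and the tiny dataframe are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns and return them as an array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

df = pd.DataFrame({'num': [1.0, np.nan, 3.0],
                   'cat': ['a', 'a', np.nan]})

num_pipeline = Pipeline([
    ('select', DataFrameSelector(['num'])),
    ('impute', SimpleImputer(strategy='mean')),
])
cat_pipeline = Pipeline([
    ('select', DataFrameSelector(['cat'])),
    ('impute', SimpleImputer(strategy='most_frequent')),
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])
# Columns: imputed numeric feature, then imputed categorical feature
result = full_pipeline.fit_transform(df)
```

The FeatureUnion stacks the two imputed blocks side by side, so the numeric NaN becomes the column mean and the categorical NaN becomes the most frequent string.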
Answer 2 (score: 2)
Copying and modifying sveitser's answer, I made an imputer for pandas.Series objects:
import numpy
import pandas
from sklearn.base import TransformerMixin
class SeriesImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        If the Series is of dtype Object, then impute with the most frequent object.
        If the Series is not of dtype Object, then impute with the mean.
        """

    def fit(self, X, y=None):
        if X.dtype == numpy.dtype('O'):
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
To use it you would do:
# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.NaN])
a = SeriesImputer() # Initialize the imputer
a.fit(s1) # Fit the imputer
s2 = a.transform(s1) # Get a new series
Answer 3 (score: 2)
Inspired by the answers here, and wanting a go-to Imputer for all use cases, I ended up writing this. It supports four imputation strategies: mean, mode, median, fill, and works on both pd.DataFrame and pd.Series.
mean and median only work for numeric data; mode and fill work for both numeric and categorical data.
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', filler='NA'):
        self.strategy = strategy
        self.fill = filler

    def fit(self, X, y=None):
        if self.strategy in ['mean', 'median']:
            if not all(X.dtypes == np.number):
                raise ValueError('dtypes mismatch np.number dtype is \
                                 required for ' + self.strategy)
        if self.strategy == 'mean':
            self.fill = X.mean()
        elif self.strategy == 'median':
            self.fill = X.median()
        elif self.strategy == 'mode':
            self.fill = X.mode().iloc[0]
        elif self.strategy == 'fill':
            if type(self.fill) is list and type(X) is pd.DataFrame:
                self.fill = dict([(cname, v) for cname, v in zip(X.columns, self.fill)])
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
Usage:
>> df
     MasVnrArea FireplaceQu
Id
1         196.0         NaN
974       196.0         NaN
21        380.0          Gd
5         350.0          TA
651         NaN          Gd

>> CustomImputer(strategy='mode').fit_transform(df)
     MasVnrArea FireplaceQu
Id
1         196.0          Gd
974       196.0          Gd
21        380.0          Gd
5         350.0          TA
651       196.0          Gd

>> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
     MasVnrArea FireplaceQu
Id
1         196.0          NA
974       196.0          NA
21        380.0          Gd
5         350.0          TA
651         0.0          Gd
Answer 4 (score: 2)
strategy='most_frequent' can only be used with quantitative features, not with qualitative ones. This custom imputer can be used for both qualitative and quantitative features. Also, with the scikit-learn imputer we can either use it on the whole dataframe (if all features are quantitative), or use a for loop over a list of similarly-typed features/columns (see the example below). But the custom imputer can be used with any combination.
from sklearn.preprocessing import Imputer

impute = Imputer(strategy='mean')
for cols in ['quantitative_column', 'quant']:  # here both are quantitative features
    xx[cols] = impute.fit_transform(xx[[cols]])
Custom imputer:
from sklearn.preprocessing import Imputer
from sklearn.base import TransformerMixin

class CustomImputer(TransformerMixin):
    def __init__(self, cols=None, strategy='mean'):
        self.cols = cols
        self.strategy = strategy

    def transform(self, df):
        X = df.copy()
        impute = Imputer(strategy=self.strategy)
        if self.cols == None:
            self.cols = list(X.columns)
        for col in self.cols:
            if X[col].dtype == np.dtype('O'):
                X[col].fillna(X[col].value_counts().index[0], inplace=True)
            else:
                X[col] = impute.fit_transform(X[[col]])
        return X

    def fit(self, *_):
        return self
Dataframe:
X = pd.DataFrame({'city': ['tokyo', np.NaN, 'london', 'seattle',
                           'san francisco', 'tokyo'],
                  'boolean': ['yes', 'no', np.NaN, 'no', 'no', 'yes'],
                  'ordinal_column': ['somewhat like', 'like', 'somewhat like',
                                     'like', 'somewhat like', 'dislike'],
                  'quantitative_column': [1, 11, -.5, 10, np.NaN, 20]})

            city boolean ordinal_column  quantitative_column
0          tokyo     yes  somewhat like                  1.0
1            NaN      no           like                 11.0
2         london     NaN  somewhat like                 -0.5
3        seattle      no           like                 10.0
4  san francisco      no  somewhat like                  NaN
5          tokyo     yes        dislike                 20.0
1) Can be used with a list of similarly-typed features.
cci = CustomImputer(cols=['city', 'boolean']) # here default strategy = mean
cci.fit_transform(X)
2) Can be used with strategy='median':
sd = CustomImputer(['quantitative_column'], strategy = 'median')
sd.fit_transform(X)
3) Can be used on the whole dataframe; it will use the default mean (or we can also change it to median). For qualitative features it uses strategy='most_frequent', and for quantitative ones the mean/median.
call = CustomImputer()
call.fit_transform(X)
Answer 5 (score: 1)
This code fills in a series with the most frequent category:
import pandas as pd
import numpy as np
# create fake data
m = pd.Series(list('abca'))
m.iloc[1] = np.nan #artificially introduce nan
print('m = ')
print(m)
#make dummy variables, count and sort descending:
most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0]
def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x
new_m = m.map(replace_most_common) #apply function to original data
print('new_m = ')
print(new_m)
Output:
m =
0 a
1 NaN
2 c
3 a
dtype: object
new_m =
0 a
1 a
2 c
3 a
dtype: object
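As an aside (not part of the original answer), pandas can do the same thing more directly with Series.mode plus fillna, without building dummy variables:

```python
import numpy as np
import pandas as pd

# Same fake data as above
m = pd.Series(list('abca'))
m.iloc[1] = np.nan  # artificially introduce nan

# Series.mode() returns the most frequent value(s); take the first
new_m = m.fillna(m.mode()[0])
```

This replaces the NaN at position 1 with 'a', matching the get_dummies approach above.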
Answer 6 (score: 1)
There is a package sklearn-pandas which has an imputer for categorical variables:
https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer
>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)
Answer 7 (score: 0)
Similar. Modify Imputer for strategy='most_frequent':

import pandas as pd
from sklearn.preprocessing import Imputer

class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)

    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)

    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

where pandas.DataFrame.mode() finds the most frequent value for each column, and pandas.DataFrame.fillna() then fills missing values with those values. Other strategy values are still handled the same way by Imputer.
Answer 8 (score: 0)
You can try the following:
"%__APPDIR__%timeout.exe"
答案 9 :(得分:0)
sklearn.impute.SimpleImputer而不是Imputer可以轻松解决此问题,该问题可以处理分类变量。
根据Sklearn文档: 如果为“ most_frequent”,则使用每一列中的最频繁值替换“ missing”。可以与字符串或数字数据一起使用。
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
from sklearn.impute import SimpleImputer

impute_size = SimpleImputer(strategy="most_frequent")
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']])
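As a self-contained check (the dataframe and its values here are made up for illustration; the answer's own data refers to the poster's dataset), SimpleImputer fills a string NaN with the column's most frequent value:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({'Outlet_Size': ['Medium', 'Small', np.nan, 'Medium']})

impute_size = SimpleImputer(strategy="most_frequent")
# fit_transform returns a 2-D array; ravel it back into one column
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']]).ravel()
```

The NaN becomes 'Medium', the most frequent value in the column.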
Answer 10 (score: 0)
MissForest can be used to impute missing values in categorical variables together with other categorical features. It works in an iterative fashion similar to IterativeImputer, with random forest as the base model.
Below is the code to label-encode the features and target variable, fit the model to impute the nan values, and encode the features back:
import sys

import pandas as pd
import sklearn.neighbors._base
# missingpy still imports the old sklearn.neighbors.base path, so alias it
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
from sklearn.preprocessing import LabelEncoder
def label_encoding(df, columns):
    """
    Label encodes the set of the features to be used for imputation

    Args:
        df: data frame (processed data)
        columns: list (features to be encoded)

    Returns: dictionary
    """
    encoders = dict()
    for col_name in columns:
        series = df[col_name]
        label_encoder = LabelEncoder()
        df[col_name] = pd.Series(
            label_encoder.fit_transform(series[series.notnull()]),
            index=series[series.notnull()].index
        )
        encoders[col_name] = label_encoder
    return encoders
# adding to be imputed global category along with features
features = ['feature_1', 'feature_2', 'target_variable']
# label encoding features
encoders = label_encoding(data, features)
# categorical imputation using random forest
# parameters can be tuned accordingly
imp_cat = MissForest(n_estimators=50, max_depth=80)
data[features] = imp_cat.fit_transform(data[features], cat_vars=[0, 1, 2])
# decoding features
for variable in features:
    data[variable] = encoders[variable].inverse_transform(data[variable].astype(int))