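I am applying a min-max scaler to a DataFrame that contains numeric columns, but if any cell in those columns holds a string or a null value, I get an exception. To avoid this, I would like to convert string or empty cells to 0. How can I do that? My function: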
def min_max_scaler(df_sub, col_names):
    """
    import the following:
    from sklearn import preprocessing
    from sklearn.preprocessing import StandardScaler
    df_sub : Expecting a subset of the data frame in which every column is a number field
    (It contains all the columns on which you want to perform the operation)
    example : df_subset = df.filter(['latitude','longitude','order.id'], axis=1)
    col_names : All column names of the subset
    """
    scaler = preprocessing.MinMaxScaler()
    scaled_df = scaler.fit_transform(df_sub)
    scaled_df = pd.DataFrame(scaled_df, columns=col_names)
    return scaled_df
Dataset:
day  phone_calls  received
7    180          NaN
8    8            NaN
9    -240         qbb
How can I validate the data before running this function? Please help.
Answer 0 (score: 3)
I would do it this way:
Find the columns with object dtype:
obj_cols = df[col_names].columns[df[col_names].dtypes.eq('object')]
Convert them to numeric dtypes, replacing NaN with 0 (zero):
df[obj_cols] = df[obj_cols].apply(pd.to_numeric, errors='coerce').fillna(0)
Scale:
df[obj_cols] = scaler.fit_transform(df[obj_cols])
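The three steps above can be tried on the sample data from the question. This is only a sketch under a few assumptions: the `received` column is forced to object dtype so it behaves the same across pandas versions, and the last step scales every column in `col_names` (as the question's function does) rather than only the converted ones.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Sample data from the question; 'received' mixes NaN with a string.
df = pd.DataFrame({
    "day": [7, 8, 9],
    "phone_calls": [180, 8, -240],
    "received": pd.Series([np.nan, np.nan, "qbb"], dtype="object"),
})
col_names = ["day", "phone_calls", "received"]

# Step 1: find the columns stored with object dtype.
obj_cols = df[col_names].columns[df[col_names].dtypes.eq("object")]

# Step 2: strings that cannot be parsed become NaN, then NaN becomes 0.
df[obj_cols] = df[obj_cols].apply(pd.to_numeric, errors="coerce").fillna(0)

# Step 3 (assumption: scale all requested columns, not only obj_cols).
scaler = preprocessing.MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[col_names]), columns=col_names, index=df.index
)
print(scaled)
```

Because 'qbb' cannot be parsed as a number, every cell of `received` ends up as 0 before scaling, so that column stays at 0 afterwards.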
As a function:
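def min_max_scaler(df_sub, col_names):
    scaler = preprocessing.MinMaxScaler()
    # columns stored as object dtype (strings or mixed values)
    obj_cols = df_sub[col_names].columns[df_sub[col_names].dtypes.eq('object')]
    # non-numeric strings become NaN, then NaN becomes 0
    df_sub[obj_cols] = df_sub[obj_cols].apply(pd.to_numeric, errors='coerce').fillna(0)
    # scale the requested columns to the [0, 1] range
    df_sub[col_names] = scaler.fit_transform(df_sub[col_names])
    return df_sub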
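For the validation part of the question, one possible check is to see which cells would be coerced before anything is replaced. The helper name `find_non_numeric` below is hypothetical (it is not part of pandas or scikit-learn); it is a minimal sketch that reuses `pd.to_numeric(..., errors='coerce')` and flags every cell that is missing or cannot be parsed as a number.

```python
import pandas as pd

def find_non_numeric(df, col_names):
    """Hypothetical helper: True where a cell is missing or not parseable as a number."""
    coerced = df[col_names].apply(pd.to_numeric, errors="coerce")
    return coerced.isna()

# Sample data from the question; 'received' is kept as object dtype explicitly.
df = pd.DataFrame({
    "day": [7, 8, 9],
    "phone_calls": [180, 8, -240],
    "received": pd.Series([None, None, "qbb"], dtype="object"),
})

bad = find_non_numeric(df, ["day", "phone_calls", "received"])
print(bad.sum())              # number of problem cells per column
print(df[bad.any(axis=1)])    # rows that contain at least one problem cell
```

Running such a check first makes it possible to decide per column whether replacing with 0 is acceptable, since a 0 also shifts the minimum or maximum that the scaler uses.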