I'm working in Python and trying to scale a DataFrame:
subject_id  hour_measure  urinecolor  blood pressure
3           1.00          red         40
            1.15          red         high
4           2.00          yellow      low
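Because it contains both numeric and text columns, the following code gives me an error:
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler for the data
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
X = scaler.fit_transform(X)
This fails because the DataFrame contains strings. How can I tell Python to scale only the columns that contain numbers, and also to scale the numeric values inside the string columns?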
Answer 0 (score: 1)
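Convert the non-numeric values to missing values, scale with an alternative solution (a min-max formula written in plain pandas), and finally replace the missing values with the original values: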
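import pandas as pd

print(df)
   subject_id  hour_measure urinecolor blood pressure
0           3          1.00        red             40
1           3          1.15        red           high
2           4          2.00     yellow            low
3           5          5.00     yellow            100

# move subject_id into the index so it is not scaled
df = df.set_index('subject_id')
# non-numeric cells become NaN
df1 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
# min-max scaling per column; min() and max() skip the NaN cells
df2 = (df1 - df1.min()) / (df1.max() - df1.min())
# fill the NaN cells back in with the original values
df = df2.combine_first(df)

print(df)
            hour_measure urinecolor blood pressure
subject_id
3                 0.0000        red              0
3                 0.0375        red           high
4                 0.2500     yellow            low
5                 1.0000     yellow              1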
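If you want this as a reusable helper, here is a minimal sketch of the same idea. The function name `scale_numeric_cells` and the sample data are my own, and skipping constant columns (to avoid dividing by zero) is an assumption that goes beyond the snippet above:

```python
import pandas as pd


def scale_numeric_cells(df: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale every numeric cell and leave text cells untouched."""
    out = df.copy()
    for col in df.columns:
        # Text cells become NaN, so min() and max() ignore them.
        numeric = pd.to_numeric(df[col], errors="coerce")
        lo, hi = numeric.min(), numeric.max()
        # Assumption: columns with no numbers, or a single distinct
        # number, are left as they are.
        if numeric.notna().sum() == 0 or hi == lo:
            continue
        scaled = (numeric - lo) / (hi - lo)
        if numeric.notna().all():
            out[col] = scaled
        else:
            # Keep the original cell where it was text, else the scaled value.
            out[col] = df[col].where(numeric.isna(), scaled)
    return out


df = pd.DataFrame({
    "hour_measure": [1.00, 1.15, 2.00, 5.00],
    "urinecolor": ["red", "red", "yellow", "yellow"],
    "blood pressure": [40, "high", "low", 100],
})
result = scale_numeric_cells(df)
print(result)
```

Mixed columns such as blood pressure stay `object` dtype, because they still hold both floats and strings.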
First solution:
I suggest using a dictionary to replace the text values with numbers, for example:
dbp = {'high': 150, 'low': 60}
df['blood pressure'] = df['blood pressure'].replace(dbp)
All together:
# if the subject_id values are numeric, move them into the index
df = df.set_index('subject_id')
dbp = {'high': 150, 'low': 60}
# replace the labels with numbers and convert to integers
df['blood pressure'] = df['blood pressure'].replace(dbp).astype(int)
print(df)
            hour_measure urinecolor  blood pressure
subject_id
3                   1.00        red              40
3                   1.15        red             150
4                   2.00     yellow              60

print(df.dtypes)
hour_measure      float64
urinecolor         object
blood pressure      int32
dtype: object
import numpy as np
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
# select only the numeric columns
X = scaler.fit_transform(df.select_dtypes(np.number))
print(X)
[[0.         0.        ]
 [0.15       1.        ]
 [1.         0.18181818]]
Details:
print(df.select_dtypes(np.number))
            hour_measure  blood pressure
subject_id
3                   1.00              40
3                   1.15             150
4                   2.00              60
Answer 1 (score: 0)
Another approach is as follows (I added a numeric blood pressure value in a new row):
   hour_measure urinecolor blood pressure  temp_column
0          1.00        red             40           40
1          1.15        red           high            0
2          2.00     yellow            low            0
3          3.00     yellow             20           20
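df['temp_column'] = df['blood pressure'].values
# replace purely alphabetic values with 0, keep everything else
df['temp_column'] = df['temp_column'].apply(lambda x: 0 if str(x).isalpha() else x)
This creates a new temp_column that holds the numeric values of the blood pressure column, with 0 in place of the text values.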
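from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
df['hour_measure'] = scaler.fit_transform(df['hour_measure'].values.reshape(-1, 1))
df['temp_column'] = scaler.fit_transform(df['temp_column'].values.reshape(-1, 1))
I applied MinMaxScaler to temp_column, which holds the numeric blood pressure values. Then I simply put the scaled values back into the blood pressure column.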
# index labels of the rows whose blood pressure value is numeric
numeric_rows = pd.to_numeric(df['blood pressure'], errors='coerce').dropna().index.tolist()
print('Index of numeric values in blood pressure column: ', numeric_rows)
# copy the scaled values back; .loc avoids chained assignment
df.loc[numeric_rows, 'blood pressure'] = df.loc[numeric_rows, 'temp_column']
df = df.drop(['temp_column'], axis=1)
Result:
   hour_measure urinecolor blood pressure
0         0.000        red              1
1         0.075        red           high
2         0.500     yellow            low
3         1.000     yellow            0.5
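Note that the 0 placeholders take part in the fit, so the column minimum becomes 0 and the reading 20 ends up at 0.5 instead of 0. If you want the range to come only from the real readings, one possible variant is to use NaN as the placeholder, since MinMaxScaler disregards NaN in fit and keeps it in transform (scikit-learn 0.20 or later). A sketch under that assumption, with the sample data rebuilt by hand:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "hour_measure": [1.00, 1.15, 2.00, 3.00],
    "urinecolor": ["red", "red", "yellow", "yellow"],
    "blood pressure": [40, "high", "low", 20],
})

# Text values become NaN instead of 0, so they do not affect min and max.
numeric = pd.to_numeric(df["blood pressure"], errors="coerce")
scaled = MinMaxScaler().fit_transform(numeric.to_frame()).ravel()
scaled = pd.Series(scaled, index=df.index)

# Keep the original text where the value was not numeric.
df["blood pressure"] = df["blood pressure"].where(numeric.isna(), scaled)
print(df)
```

Here 40 maps to 1 and 20 maps to 0, because the range is taken from the numeric readings only. Which behaviour you want depends on what the text labels are supposed to mean.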