Scaling only the numeric values in a dataframe that contains strings

Time: 2019-12-07 08:51:29

Tags: python python-3.x pandas scikit-learn

I am working in Python and trying to scale a dataframe:

subject_id hour_measure         urinecolor   blood pressure                  
3          1.00                 red          40
           1.15                 red          high
4          2.00              yellow          low

Because it contains both numeric and text columns, the following code raises an error:

from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler for the data
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
X = scaler.fit_transform(X)

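The error occurs because the dataframe contains strings. How can I tell Python to scale only the columns that contain numbers, and also scale the numeric values inside the string columns?

2 answers:

Answer 0: (score: 1)

Convert the non-numeric values to missing values, scale with an alternative solution (min-max scaling in plain pandas), and finally replace the missing values with the original values: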

print (df)
   subject_id  hour_measure urinecolor blood pressure
0           3          1.00        red             40
1           3          1.15        red           high
2           4          2.00     yellow            low
3           5          5.00     yellow            100

df = df.set_index('subject_id')

df1 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df2 = (df1 - df1.min()) / (df1.max() - df1.min())

df = df2.combine_first(df)
print (df)
            hour_measure urinecolor blood pressure
subject_id                                        
3                 0.0000        red              0
3                 0.0375        red           high
4                 0.2500     yellow            low
5                 1.0000     yellow              1
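
The same idea can be wrapped in a small helper. This is only a sketch: the function name `scale_numeric_cells` is hypothetical, and it uses `where` instead of `combine_first` to restore the text cells. It assumes every column should be scaled independently to the range 0 to 1.

```python
import pandas as pd


def scale_numeric_cells(df):
    """Min-max scale numeric-looking cells per column; keep text cells as they are."""
    # Text cells become NaN, numeric-looking strings become numbers
    numeric = df.apply(lambda col: pd.to_numeric(col, errors="coerce"))
    # Column-wise min-max scaling; NaN cells stay NaN
    scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
    # Use the scaled value where one exists, otherwise the original cell
    result = scaled.astype(object).where(scaled.notna(), df)
    # Restore numeric dtypes for columns that ended up fully numeric
    return result.infer_objects()


# Sample data from the answer (subject_id left out for brevity)
df = pd.DataFrame({
    "hour_measure": [1.00, 1.15, 2.00, 5.00],
    "urinecolor": ["red", "red", "yellow", "yellow"],
    "blood pressure": ["40", "high", "low", "100"],
})
result = scale_numeric_cells(df)
print(result)
```

A column whose numeric values are all identical has a zero range, so its cells fall back to their original values instead of being scaled.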

First solution

I suggest replacing the text values with numbers using a dictionary, for example:

dbp = {'high': 150, 'low': 60}

df['blood pressure'] = df['blood pressure'].replace(dbp)

All together:

#if subject_id are numeric convert them to index
df = df.set_index('subject_id')

dbp = {'high': 150, 'low': 60}
#replace to numbers and convert to integers
df['blood pressure'] = df['blood pressure'].replace(dbp).astype(int)

print (df)
            hour_measure urinecolor  blood pressure
subject_id                                         
3                   1.00        red              40
3                   1.15        red             150
4                   2.00     yellow              60

print (df.dtypes)
hour_measure      float64
urinecolor         object
blood pressure      int32
dtype: object

import numpy as np
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
#select only numeric columns
X = scaler.fit_transform(df.select_dtypes(np.number))
print (X)
[[0.         0.        ]
 [0.15       1.        ]
 [1.         0.18181818]]

Details:

print (df.select_dtypes(np.number))
            hour_measure  blood pressure
subject_id                              
3                   1.00              40
3                   1.15             150
4                   2.00              60
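
`fit_transform` returns a plain NumPy array, so the column labels are lost. A minimal sketch of writing the scaled values back into the dataframe follows. It reuses the illustrative mapping `dbp` from above; the `map` call and the cast to float are my own choices, not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "hour_measure": [1.00, 1.15, 2.00],
    "urinecolor": ["red", "red", "yellow"],
    "blood pressure": ["40", "high", "low"],
})

# Illustrative numeric stand-ins for the text labels (values taken from the answer)
dbp = {"high": 150, "low": 60}
mapped = df["blood pressure"].map(lambda v: dbp.get(v, v))
df["blood pressure"] = pd.to_numeric(mapped).astype(float)

# Scale only the numeric columns and assign the result back under the same labels
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
print(df)
```

The text column `urinecolor` is left untouched because `select_dtypes("number")` skips it.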

Answer 1: (score: 0)

Another approach is as follows (I put the numeric blood pressure values in a new column):

       hour_measure urinecolor blood pressure  temp_column
0          1.00        red             40           40
1          1.15        red           high            0
2          2.00     yellow            low            0
3          3.00     yellow             20           20

df['temp_column'] = df['blood pressure'].values
df['temp_column'] = df['temp_column'].apply(lambda x: 0 if str(x).isalpha() else x)

This creates a new temp_column holding the numeric values of the blood pressure column, with 0 in place of the text values.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
df['hour_measure'] = scaler.fit_transform(df['hour_measure'].values.reshape(-1, 1))
df['temp_column'] = scaler.fit_transform(df['temp_column'].values.reshape(-1 ,1))

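I applied MinMaxScaler to temp_column, which holds the numeric blood pressure values. Then I simply put the scaled values back into the blood pressure column.

numeric_rows = pd.to_numeric(df['blood pressure'], errors='coerce').dropna().index.tolist()
print('Index of numeric values in blood pressure column: ', numeric_rows)
# assign through .loc; chained indexing (df[col].iloc[i] = ...) may not write back to df
df.loc[numeric_rows, 'blood pressure'] = df.loc[numeric_rows, 'temp_column']
df = df.drop(['temp_column'], axis=1)

Result: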

   hour_measure urinecolor blood pressure
0         0.000        red              1
1         0.075        red           high
2         0.500     yellow            low
3         1.000     yellow            0.5
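
Note that the placeholder 0 takes part in the scaling above, which is why 20 maps to 0.5 and not 0. A possible variant, sketched here under the assumption that the text cells should not influence the range, coerces them to NaN instead. `MinMaxScaler` disregards NaN values when fitting and keeps them as NaN in the output. The variable names below are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "hour_measure": [1.00, 1.15, 2.00, 3.00],
    "urinecolor": ["red", "red", "yellow", "yellow"],
    "blood pressure": ["40", "high", "low", "20"],
})

# Text cells become NaN, so they do not affect the fitted min and max
numeric_bp = pd.to_numeric(df["blood pressure"], errors="coerce")
scaled_bp = MinMaxScaler().fit_transform(numeric_bp.to_frame())[:, 0]

# Write the scaled values back only where the original cell was numeric
is_numeric = numeric_bp.notna()
df["blood pressure"] = df["blood pressure"].astype(object)
df.loc[is_numeric, "blood pressure"] = scaled_bp[is_numeric.to_numpy()]
print(df)
```

With this variant the numeric cells 40 and 20 are scaled against each other only, so they become 1.0 and 0.0, which differs from the result shown above.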