I want to factorize a column of a pandas DataFrame and add the result as a new column. The column's values are strings.
For example:
COL_1
'TRY A TEST'
'TRY A TEST'
'PLAY Q'
'PLAY Q'
I would like to convert it to numbers, like this:
COL_1 NEW_COL
'TRY A TEST' 0
'TRY A TEST' 0
'PLAY Q' 1
'PLAY Q' 1
However, this is what I get:
x = 'TRY A TEST'
my_df['NEW_COL'] = my_df['COL_1'].apply(lambda x: pd.factorize(x)[0])
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64), array(['TRY A TEST'], dtype=object))
It looks like every character is being converted to a number.
I also get this error:
TypeError: 'float' object is not iterable
There are no floats in COL_1; it contains strings.
Any suggestions?
Answer 0 (score: 1)
An alternative approach, using the Categorical dtype:
my_df['NEW_COL'] = my_df['COL_1'].astype('category').cat.codes
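For comparison, here is a minimal sketch (not part of the answer above) that calls pd.factorize on the whole column instead of on each string through apply. The sample frame, including the extra NaN row, is an assumption made for illustration. NaN is a float, so a missing value is one plausible source of the TypeError in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question, plus one missing value.
my_df = pd.DataFrame(
    {'COL_1': ['TRY A TEST', 'TRY A TEST', 'PLAY Q', 'PLAY Q', np.nan]}
)

# Factorize the whole column in one call; codes follow order of first appearance.
codes, uniques = pd.factorize(my_df['COL_1'])
my_df['NEW_COL'] = codes

# Categorical codes follow the sorted order of the categories instead.
my_df['CAT_CODE'] = my_df['COL_1'].astype('category').cat.codes

print(my_df)
print(uniques)
```

The two approaches number the labels differently: factorize uses the order of first appearance, while cat.codes uses the sorted category order. Both assign -1 to missing values.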
Answer 1 (score: 1)
A simple solution:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
my_df['NEW_COL'] = le.fit_transform(my_df['COL_1'].astype(str))
my_df
COL_1 NEW_COL
0 TRY A TEST 1
1 TRY A TEST 1
2 PLAY Q 0
3 PLAY Q 0
For large DataFrames or multiple columns, you can simply use a for loop.
For example:
my_df
pets owner location
0 cat Champ San_Diego
1 dog Ron New_York
2 cat Brick New_York
3 monkey Champ San_Diego
4 dog Veronica San_Diego
5 dog Ron New_York
############
for column in ['pets','owner','location']:
    le = preprocessing.LabelEncoder()
    my_df[str(column+'_num')] = le.fit_transform(my_df[column].astype(str))
############
my_df
pets owner location pets_num owner_num location_num
0 cat Champ San_Diego 0 1 1
1 dog Ron New_York 1 2 0
2 cat Brick New_York 0 0 0
3 monkey Champ San_Diego 2 1 1
4 dog Veronica San_Diego 1 3 1
5 dog Ron New_York 1 2 0
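If scikit-learn is not available, the same loop can be sketched with pandas alone. This is a swapped-in technique, not the answer's method: it uses pd.factorize with sort=True, which numbers labels in sorted order much like LabelEncoder does. The label_maps dictionary is a hypothetical helper introduced here so the codes can be mapped back to the original strings.

```python
import pandas as pd

# Sample data copied from the example above.
my_df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York',
                 'San_Diego', 'San_Diego', 'New_York'],
})

# Hypothetical helper: column name -> Index of the original labels.
label_maps = {}
for column in ['pets', 'owner', 'location']:
    # sort=True assigns codes in sorted label order.
    codes, uniques = pd.factorize(my_df[column], sort=True)
    my_df[column + '_num'] = codes
    label_maps[column] = uniques

# Decode the integer codes back to the original strings.
decoded_pets = label_maps['pets'].take(my_df['pets_num'].to_numpy())

print(my_df)
print(list(decoded_pets))
```

Keeping the uniques per column plays the same role as keeping each fitted LabelEncoder: without them, the numeric columns cannot be translated back.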