I have a data frame in Python that looks like this:
INSTRUMENT_TYPE_CD RISK_START_DT ... FIN_POS_IND PL_FINAL_IND
0 Physical Index 01-03-2017 00:00 ... 0 No
1 Fin Basis Swap 01-09-2018 00:00 ... 0 No
2 Physical Index 01-09-2017 00:00 ... 0 No
3 Physical Index 01-12-2016 00:00 ... 0 No
4 Fin Basis Swap 01-02-2018 00:00 ... 0 No
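As you can see, the values in the columns repeat and are usually strings. I want to convert this data frame into an integer-encoded data frame, in which each unique string in a column is mapped to a unique integer.
So far I have come up with this (the normalise method), but it does not work.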
normalise(dataframe)
def normalise(dataframe):
    for column in dataframe:
        dataframe[column] = dataframe.apply(unique_code_mapper(dataframe[column]))
    return dataframe
def unique_code_mapper(column):
    unique_array = []
    for val in column:
        if val in unique_array:
            column.loc[val] = unique_array.index(val)
        else:
            unique_array.append(val)
            column.loc[val] = unique_array.index(val)
    return column
It returns the following error:
TypeError: ("'Series' object is not callable", 'occurred at index INSTRUMENT_TYPE_CD')
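The traceback points at the apply call: unique_code_mapper(dataframe[column]) runs immediately and returns a Series, and DataFrame.apply then tries to call that Series as if it were a function. A minimal sketch of a corrected version of the same idea follows, assuming the goal is to number each distinct value in order of first appearance. The helper name encode_column and the sample data are hypothetical, not part of the original code.

```python
import pandas as pd

def encode_column(column):
    """Map each distinct value to an integer in order of first appearance."""
    codes = {}
    for val in column:
        if val not in codes:
            codes[val] = len(codes)
    # Series.map with a dict replaces each value by its assigned code.
    return column.map(codes)

def normalise(dataframe):
    # Pass the function itself to apply; pandas calls it once per column.
    return dataframe.apply(encode_column)

# Hypothetical sample data modelled on the question's columns.
sample = pd.DataFrame({
    "INSTRUMENT_TYPE_CD": ["Physical Index", "Fin Basis Swap", "Physical Index"],
    "PL_FINAL_IND": ["No", "No", "No"],
})
encoded = normalise(sample)
```

Note that this sketch encodes every column it is given, including dates and numbers, so it is probably best applied to a selection of the string columns only.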
Answer 0 (score: 1)
You can use factorize:
print (df.dtypes)
INSTRUMENT_TYPE_CD object
RISK_START_DT datetime64[ns]
FIN_POS_IND int64
PL_FINAL_IND object
dtype: object
#select only object columns (obviously strings)
#cols = df.select_dtypes('object').columns
#select columns by names
cols = ['INSTRUMENT_TYPE_CD','PL_FINAL_IND']
for c in cols:
    df[c] = pd.factorize(df[c])[0]
print (df)
INSTRUMENT_TYPE_CD RISK_START_DT FIN_POS_IND PL_FINAL_IND
0 0 01-03-2017 00:00 0 0
1 1 01-09-2018 00:00 0 0
2 0 01-09-2017 00:00 0 0
3 0 01-12-2016 00:00 0 0
4 1 01-02-2018 00:00 0 0
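pd.factorize returns a pair: the integer codes and the unique values they stand for. The loop above keeps only the codes (element [0]). If the original strings may be needed later, the second element can be stored as well. Below is a minimal sketch of that idea; the mappings dict and the sample data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's columns.
df = pd.DataFrame({
    "INSTRUMENT_TYPE_CD": ["Physical Index", "Fin Basis Swap", "Physical Index"],
    "PL_FINAL_IND": ["No", "No", "No"],
})

mappings = {}  # column name -> unique values, positioned by their code
for c in ["INSTRUMENT_TYPE_CD", "PL_FINAL_IND"]:
    codes, uniques = pd.factorize(df[c])
    df[c] = codes
    mappings[c] = uniques

# Decode one column by mapping each code back to its original value.
lookup = dict(enumerate(mappings["INSTRUMENT_TYPE_CD"]))
decoded = df["INSTRUMENT_TYPE_CD"].map(lookup)
```

By default, factorize assigns the code -1 to missing values, so those rows would need separate handling when decoding.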