I'm trying to use LabelEncoder to encode some text values. To do that, I wrote:
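import numpy as np
import pandas as pd
from sklearn import preprocessing

onehot = pd.DataFrame()
encoders = []
# Encode every column whose dtype is not int64 or int32.
for column in df_resolved.loc[:, ((df_resolved.dtypes != np.int64) & (df_resolved.dtypes != np.int32))]:
    enc = preprocessing.LabelEncoder()
    encoders.append(enc)  # keep each fitted encoder for later reuse
    onehot[column] = enc.fit_transform(df_resolved[column])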
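I need to reproduce the same encoding on new data, so I have to store the encoders, which is why I do it this way. However, I get an error:
TypeError: '>' not supported between instances of 'str' and 'int'
I don't understand why this happens. According to the documentation, the encoder should be able to encode strings. What am I missing?
Full stack trace: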
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-330-f9a564c7c9ab> in <module>()
      8     enc = preprocessing.LabelEncoder()
      9     encoders.append(enc)
---> 10     onehot[column] = enc.fit_transform(df_resolved[column])

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
    129         y = column_or_1d(y, warn=True)
    130         _check_numpy_unicode_bug(y)
--> 131         self.classes_, y = np.unique(y, return_inverse=True)
    132         return y
    133

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts)
    209
    210     if optional_indices:
--> 211         perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    212         aux = ar[perm]
    213     else:

TypeError: '>' not supported between instances of 'str' and 'int'
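The trace ends inside np.unique, which sorts the values it receives. As a minimal sketch of how that call can fail, assuming (this is a guess, not something the trace proves) that one of the object columns holds both str and int values, the same kind of TypeError can be triggered with NumPy alone. The array contents below are hypothetical:

```python
import numpy as np

# Hypothetical column contents: an object array mixing str and int values.
mixed = np.array(["high", 2, "low"], dtype=object)

try:
    # np.unique has to sort the values, which means comparing str with int.
    np.unique(mixed, return_inverse=True)
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```

The exact operator named in the message may differ between NumPy versions, but the exception type should be the same.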
Update:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1706
Data columns (total 26 columns):
u_category 1436 non-null object
caller_id.country 1436 non-null object
number 1436 non-null object
priority 1436 non-null object
urgency 1436 non-null object
incident_state 1436 non-null object
u_subcategory 1436 non-null object
assigned_to 1436 non-null object
short_description 1436 non-null object
sys_created_on 1436 non-null datetime64[ns]
business_duration 1436 non-null int64
u_resolved_time 1436 non-null datetime64[ns]
u_reopen_count 1436 non-null int64
sys_created_by 1436 non-null int64
caller_id.u_display_name 1436 non-null object
u_on_behalf_of.u_display_name 1436 non-null object
u_on_behalf_of.email 1436 non-null object
u_actual_time_to_resolve 1436 non-null int64
comments 1436 non-null object
u_comments_and_work_notes 1436 non-null object
description 1436 non-null object
impact 1436 non-null object
u_problem_classification 1436 non-null object
resolution_time 1436 non-null float64
rawtext 1436 non-null object
cluster 1436 non-null int32
dtypes: datetime64[ns](2), float64(1), int32(1), int64(4), object(18)
memory usage: 337.3+ KB
This is the output of df.info(). My scikit-learn version is 0.18.1.
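An object dtype only says that a column stores Python objects, not that every value is a string, so the listing above cannot rule out mixed columns. The sketch below shows one possible way to check for them and to work around the error by casting to str before encoding. The small DataFrame is a hypothetical stand-in for df_resolved, and keeping the encoders in a dict keyed by column name is an illustrative choice, not the original code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for df_resolved: "priority" mixes str and int values.
df = pd.DataFrame({
    "priority": ["1 - High", 2, "3 - Low", 2],
    "impact": ["a", "b", "a", "c"],
    "cluster": [0, 1, 0, 1],
})

# Select the text-like columns (everything that is not numeric or datetime).
text_columns = df.select_dtypes(exclude=["number", "datetime"]).columns

# Count distinct Python types per column; more than one means mixed values.
type_counts = {col: df[col].map(type).nunique() for col in text_columns}
mixed_columns = [col for col, n in type_counts.items() if n > 1]
print("mixed-type columns:", mixed_columns)

encoded = pd.DataFrame(index=df.index)
encoders = {}
for col in text_columns:
    enc = LabelEncoder()
    # Casting to str makes all values comparable, so sorting no longer fails.
    encoded[col] = enc.fit_transform(df[col].astype(str))
    encoders[col] = enc  # keyed by column name for reuse on new data
```

New data would need the same astype(str) cast before calling encoders[col].transform, and transform raises a ValueError for labels that were not seen during fitting.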