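I am one-hot encoding a variable that has more than 3k categories, and I run into a MemoryError. I one-hot encode other variables as well, but they have fewer categories. The largest number of categories I have been able to one-hot encode successfully is 935.
I am using the following code: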
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
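def onehot(featurename):
    # One-hot encode one column of df, then split the rows into train/validation using msk.
    onehot_encoder = OneHotEncoder(sparse=False)
    # .values gives a NumPy array; a pandas Series has no reshape() in recent pandas versions.
    onehot_encoded = onehot_encoder.fit_transform(df[featurename].values.reshape(-1, 1))
    trn_onehot_encoded = onehot_encoded[msk]
    val_onehot_encoded = onehot_encoded[~msk]
    return trn_onehot_encoded, val_onehot_encoded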
trn_onehot_encoded_mt, val_onehot_encoded_mt = onehot('modality_type')
trn_onehot_encoded_mr, val_onehot_encoded_mr = onehot('roleid')
trn_onehot_encoded_sub, val_onehot_encoded_sub = onehot('subject')
trn_onehot_encoded_quartile, val_onehot_encoded_quartile = onehot('quartile')
trn_onehot_encoded_country, val_onehot_encoded_country = onehot('country_short')
trn_onehot_encoded_region, val_onehot_encoded_region = onehot('region')
trn_onehot_encoded_groupmemberornot, val_onehot_encoded_groupmemberornot = onehot('groupmemberornot')
trn_onehot_encoded_highlight, val_onehot_encoded_highlight = onehot('highlight_bin_new')
trn_onehot_encoded_note, val_onehot_encoded_note = onehot('note_bin_new')
trn_onehot_encoded_eid, val_onehot_encoded_eid = onehot('new_eid')
The last line of code, where I encode the variable new_eid, is where I get the MemoryError or a dead kernel.
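For a sense of scale, here is a back-of-the-envelope sketch of how much memory a dense one-hot matrix needs compared with a sparse one. The row count is a made-up assumption (the question does not state it), and the byte sizes assume the encoder's default float64 output and 4-byte CSR indices.

```python
# Rough size of a dense one-hot matrix: rows * categories * bytes per value.
# Assumes float64 output (8 bytes per entry).
def dense_onehot_bytes(n_rows, n_categories, bytes_per_value=8):
    return n_rows * n_categories * bytes_per_value

# A CSR one-hot matrix stores one nonzero per row: a value and a column index,
# plus one row pointer per row (and one extra pointer at the end).
def sparse_onehot_bytes(n_rows, value_bytes=8, index_bytes=4):
    return n_rows * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes

n_rows = 1_000_000  # hypothetical row count, NOT taken from the question
for n_categories in (935, 3000):
    gib = dense_onehot_bytes(n_rows, n_categories) / 2**30
    print(f"{n_categories} categories: {gib:.1f} GiB as a dense array")
print(f"any category count: {sparse_onehot_bytes(n_rows) / 2**20:.1f} MiB as CSR")
```

Note that onehot() also indexes the full encoded array with msk and ~msk, and boolean indexing copies the data, so the peak usage is roughly twice the dense figure.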
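To try to work around this error, I set the sparse parameter of OneHotEncoder to True inside onehot().
The code that fits the model is below. When I try to fit the model with the sparse arrays, I get an error.
<All the code above with sparse=True>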
mt = Input(shape=(trn_onehot_encoded_mt.shape[1],))
mr = Input(shape=(trn_onehot_encoded_mr.shape[1],))
sub = Input(shape=(trn_onehot_encoded_sub.shape[1],))
gmon = Input(shape=(trn_onehot_encoded_groupmemberornot.shape[1],))
region = Input(shape=(trn_onehot_encoded_region.shape[1],))
country = Input(shape=(trn_onehot_encoded_country.shape[1],))
highlight = Input(shape=(trn_onehot_encoded_highlight.shape[1],))
note = Input(shape=(trn_onehot_encoded_note.shape[1],))
#Model definition
x = merge([u, a], mode='concat')
x = Flatten()(x)
x = merge([x, mt], mode='concat')
x = merge([x, mr], mode='concat')
x = merge([x, sub], mode='concat')
x = merge([x, gmon], mode='concat')
x = merge([x, region], mode='concat')
x = merge([x, country], mode='concat')
x = merge([x, highlight], mode='concat')
x = merge([x, note], mode='concat')
x = Dense(1000, activation='relu')(x)
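x = BatchNormalization()(x)  # apply the layer to x; a bare BatchNormalization() call is never added to the model
x = Dropout(0.5)(x)
x = Dense(200, activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(0.5)(x)
x = Dense(50, activation='relu')(x)
x = BatchNormalization()(x)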
x = Dense(2, activation='softmax')(x)
nn = Model([user_in, artifact_in, mt, mr, sub, gmon, region, country, highlight, note], x)
nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
def fit_nn(lr, bs):
    nn.optimizer.lr = lr
    nn.fit([trn.member_id,
            trn.artifact_id,
            trn_onehot_encoded_mt,
            trn_onehot_encoded_mr,
            trn_onehot_encoded_sub,
            trn_onehot_encoded_groupmemberornot,
            trn_onehot_encoded_region,
            trn_onehot_encoded_country,
            trn_onehot_encoded_highlight,
            trn_onehot_encoded_note], trn_onehot_encoded_quartile,
           batch_size=bs,
           epochs=1,
           validation_data=([val.member_id,
                             val.artifact_id,
                             val_onehot_encoded_mt,
                             val_onehot_encoded_mr,
                             val_onehot_encoded_sub,
                             val_onehot_encoded_groupmemberornot,
                             val_onehot_encoded_region,
                             val_onehot_encoded_country,
                             val_onehot_encoded_highlight,
                             val_onehot_encoded_note], val_onehot_encoded_quartile)
           )
bs = 10000
fit_nn(0.001, bs)
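I cannot fit the model with either sparse or non-sparse arrays. The non-sparse arrays give me a MemoryError, and the sparse arrays give me the error mentioned above.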
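To make the sparse case concrete: a sparse encoder returns a SciPy sparse matrix rather than a NumPy array. A minimal sketch, assuming the model only accepts dense arrays, is to keep the matrix sparse and convert one batch at a time. The helper name dense_batches and the toy data are hypothetical, and how such batches would be passed to nn depends on the Keras version, so that part is not shown.

```python
import numpy as np
from scipy import sparse

def dense_batches(sparse_matrix, batch_size):
    """Yield dense NumPy batches from a SciPy sparse matrix, one row slice at a time."""
    csr = sparse.csr_matrix(sparse_matrix)  # CSR supports efficient row slicing
    for start in range(0, csr.shape[0], batch_size):
        yield csr[start:start + batch_size].toarray()

# Toy stand-in for a sparse one-hot matrix: 5 rows, 4 categories.
codes = np.array([0, 3, 1, 3, 2])
onehot = sparse.csr_matrix(
    (np.ones(len(codes)), (np.arange(len(codes)), codes)), shape=(5, 4)
)

msk = np.array([True, True, False, True, False])  # hypothetical train mask
trn, val = onehot[msk], onehot[~msk]  # boolean row masks also work on CSR matrices

# Only batch_size rows are ever held in dense form at once.
batches = list(dense_batches(trn, batch_size=2))
```

With this approach the full dense matrix is never materialised; the cost is one small conversion per batch.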
How can I resolve this error? There must be a way to encode a variable with a large number of categories.
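One commonly used alternative to a wide one-hot vector is to store a single integer code per row. The model above already receives trn.member_id and trn.artifact_id as plain ID columns, presumably through embedding layers that are not shown, so new_eid might be handled the same way. That is an assumption about the model, not something the question confirms. A minimal sketch of the encoding step, with made-up values and a hypothetical helper name:

```python
import numpy as np

def integer_codes(values):
    """Map each distinct value to an integer in [0, n_categories)."""
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    return categories, codes

# Toy stand-in for df['new_eid']; these values are invented for illustration.
new_eid = ["e17", "e03", "e17", "e99", "e03"]
categories, codes = integer_codes(new_eid)

# codes holds one integer per row, so memory grows with the number of rows only,
# not with rows * categories. len(categories) would be the input size of an
# embedding layer if the column were fed to the model that way.
n_categories = len(categories)
```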