How to one-hot encode a variable with 3k categories without a MemoryError

Time: 2018-01-05 18:21:40

Tags: python pandas deep-learning one-hot-encoding

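I am one-hot encoding a variable that has more than 3k categories, and I am running into a MemoryError. I have other variables that I also one-hot encode, but they have fewer categories. The largest number of categories in a variable that I can one-hot encode successfully is 935.

I use the following code: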

from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

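def onehot(featurename):
    # df is the full DataFrame; msk is assumed to be a boolean train/validation mask.
    onehot_encoder = OneHotEncoder(sparse=False)  # dense output
    # OneHotEncoder expects a 2-D array, so reshape the column to (n_rows, 1).
    onehot_encoded = onehot_encoder.fit_transform(df[featurename].values.reshape(-1, 1))
    trn_onehot_encoded = onehot_encoded[msk]
    val_onehot_encoded = onehot_encoded[~msk]
    return trn_onehot_encoded, val_onehot_encoded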

trn_onehot_encoded_mt, val_onehot_encoded_mt = onehot('modality_type')
trn_onehot_encoded_mr, val_onehot_encoded_mr = onehot('roleid')
trn_onehot_encoded_sub, val_onehot_encoded_sub = onehot('subject')
trn_onehot_encoded_quartile, val_onehot_encoded_quartile = onehot('quartile')
trn_onehot_encoded_country, val_onehot_encoded_country = onehot('country_short')
trn_onehot_encoded_region, val_onehot_encoded_region = onehot('region')
trn_onehot_encoded_groupmemberornot, val_onehot_encoded_groupmemberornot = onehot('groupmemberornot')
trn_onehot_encoded_highlight, val_onehot_encoded_highlight = onehot('highlight_bin_new')
trn_onehot_encoded_note, val_onehot_encoded_note = onehot('note_bin_new')
trn_onehot_encoded_eid, val_onehot_encoded_eid = onehot('new_eid')

The last line of code, where I encode the variable new_eid, is where I get the MemoryError or a dead kernel.
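A rough back-of-the-envelope estimate helps explain why the dense encoding fails at this point. The sketch below assumes the encoder returns float64 values (8 bytes each, the default dtype of OneHotEncoder). The row count of 1,000,000 is purely hypothetical, since the size of df is not stated here, and dense_onehot_bytes is a helper name introduced only for illustration.

```python
def dense_onehot_bytes(n_rows, n_categories, bytes_per_value=8):
    """Bytes needed for a dense one-hot matrix of shape (n_rows, n_categories)."""
    return n_rows * n_categories * bytes_per_value

# Hypothetical row count; the real size of df is not given in the question.
n_rows = 1_000_000

for n_categories in (935, 3000):
    gib = dense_onehot_bytes(n_rows, n_categories) / 2**30
    print(f"{n_categories} categories: {gib:.1f} GiB")
```

The cost grows linearly with the number of categories, so a column with roughly 3k categories needs about three times the memory of the 935-category column that still fits.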

To try to resolve this error, I set the sparse field to True in OneHotEncoder inside the onehot() function:

sparse=True

But when I try to fit the model with the following code, I get an error:

# <All the code above, with sparse=True>
mt = Input(shape=(trn_onehot_encoded_mt.shape[1],))
mr = Input(shape=(trn_onehot_encoded_mr.shape[1],))
sub = Input(shape=(trn_onehot_encoded_sub.shape[1],))
gmon = Input(shape=(trn_onehot_encoded_groupmemberornot.shape[1],))
region = Input(shape=(trn_onehot_encoded_region.shape[1],))
country = Input(shape=(trn_onehot_encoded_country.shape[1],))
highlight = Input(shape=(trn_onehot_encoded_highlight.shape[1],))
note = Input(shape=(trn_onehot_encoded_note.shape[1],))

#Model definition
x = merge([u, a], mode='concat')
x = Flatten()(x)
x = merge([x, mt], mode='concat')
x = merge([x, mr], mode='concat')
x = merge([x, sub], mode='concat')
x = merge([x, gmon], mode='concat')
x = merge([x, region], mode='concat')
x = merge([x, country], mode='concat')
x = merge([x, highlight], mode='concat')
x = merge([x, note], mode='concat')
x = Dense(1000, activation='relu')(x)
x = BatchNormalization()(x)  # layers must be called on x to become part of the graph
x = Dropout(0.5)(x)
x = Dense(200, activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(0.5)(x)
x = Dense(50, activation='relu')(x)
x = BatchNormalization()(x)
x = Dense(2, activation='softmax')(x)
nn = Model([user_in, artifact_in, mt, mr, sub, gmon, region, country, highlight, note], x)
nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

def fit_nn(lr, bs):
    nn.optimizer.lr = lr        
    nn.fit([trn.member_id, 
        trn.artifact_id, 
        trn_onehot_encoded_mt, 
        trn_onehot_encoded_mr, 
        trn_onehot_encoded_sub, 
        trn_onehot_encoded_groupmemberornot, 
        trn_onehot_encoded_region, 
        trn_onehot_encoded_country,
        trn_onehot_encoded_highlight,
        trn_onehot_encoded_note], trn_onehot_encoded_quartile, 
       batch_size=bs, 
       epochs=1, 
       validation_data=([val.member_id, 
                         val.artifact_id, 
                         val_onehot_encoded_mt, 
                         val_onehot_encoded_mr, 
                         val_onehot_encoded_sub, 
                         val_onehot_encoded_groupmemberornot, 
                         val_onehot_encoded_region, 
                         val_onehot_encoded_country,
                         val_onehot_encoded_highlight,
                         val_onehot_encoded_note], val_onehot_encoded_quartile)
           )


bs = 10000
fit_nn(0.001, bs)

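I cannot fit the model with either sparse or non-sparse arrays. The non-sparse arrays give me a MemoryError, and the sparse arrays give me the error mentioned above.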

How can I resolve this error? There must be a way to encode a variable with a large number of categories.
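One direction that might be worth trying is to keep only integer category codes in memory and expand them to one-hot rows one batch at a time, so the full dense matrix is never materialized. The following is a minimal NumPy sketch of that idea, not a drop-in fix for the model above. The function onehot_batches and the toy values are hypothetical, and feeding such batches to Keras (for example through a generator-based training call) is an assumption that has not been tested against this model.

```python
import numpy as np

def onehot_batches(codes, n_categories, batch_size):
    """Yield dense one-hot blocks of at most batch_size rows from integer codes."""
    for start in range(0, len(codes), batch_size):
        batch = codes[start:start + batch_size]
        block = np.zeros((len(batch), n_categories), dtype=np.float32)
        # Set a single 1 per row, in the column given by that row's code.
        block[np.arange(len(batch)), batch] = 1.0
        yield block

# Toy stand-in for a high-cardinality column such as new_eid.
values = np.array(["e17", "e03", "e17", "e42", "e03"])
# np.unique returns the sorted categories and, per row, the index into them.
categories, codes = np.unique(values, return_inverse=True)

for block in onehot_batches(codes, len(categories), batch_size=2):
    print(block.shape)
```

With this layout, peak memory for the encoded column is roughly batch_size times the number of categories, rather than the number of rows times the number of categories.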

0 answers:

No answers