Proper hashing of categorical variables for an RNN - binary classification

Time: 2019-04-11 20:04:55

Tags: python tensorflow keras

I am currently working on a project that takes all-categorical data as input and produces a binary output (1 = yes, 0 = no).

A little about the data: there are 3,127,854 rows, and every feature column is categorical in nature. Here is how many unique values each column contains:

attribute_1-87

attribute_2-2

attribute_3-202

attribute_4-3

attribute_5-3

attribute_6-367
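The per-column cardinalities listed above can be computed with pandas. A minimal sketch, assuming the data has been loaded into a DataFrame (the column values here are invented stand-ins, not the real data):

```python
import pandas as pd

# Toy DataFrame standing in for the real 3,127,854-row data; values are made up.
df = pd.DataFrame({
    "attribute_2": ["a", "b", "a", "b"],
    "attribute_4": ["x", "y", "z", "x"],
})

# nunique() counts distinct values per column, which is what the list above reports.
cardinalities = {col: df[col].nunique() for col in df.columns}
print(cardinalities)  # {'attribute_2': 2, 'attribute_4': 3}
```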

I keep running into problems with how to hash/embed the columns in a way that lets me feed them into an RNN. Ideally, what I would like to do is hash the columns, embed them, pass them through an LSTM layer, concatenate, flatten, apply a dense layer, and then output a binary prediction.

Apologies if the code is a bit sloppy/repetitive, as I am still reworking it.

import numpy as np
import pandas as pd
import tensorflow as tf

data = pd.read_csv(".....")


#Creating hash buckets for categorical data

attribute_1_hashed = tf.feature_column.categorical_column_with_hash_bucket("attribute_1", data["attribute_1"].nunique())
attribute_2_hashed = tf.feature_column.categorical_column_with_hash_bucket("attribute_2", data["attribute_2"].nunique())
attribute_3_hashed = tf.feature_column.categorical_column_with_hash_bucket("attribute_3", data["attribute_3"].nunique())
attribute_4_hashed = tf.feature_column.categorical_column_with_hash_bucket("attribute_4", data["attribute_4"].nunique())
attribute_5_hashed = tf.feature_column.categorical_column_with_hash_bucket("attribute_5", data["attribute_5"].nunique())
attribute_6_hashed = tf.feature_column.categorical_column_with_hash_bucket("attribute_6", data["attribute_6"].nunique())


#Input layer
attribute_1_input = tf.keras.Input(shape=(1,), name='attribute_1')
attribute_2_input = tf.keras.Input(shape=(1,), name='attribute_2')
attribute_3_input = tf.keras.Input(shape=(1,), name='attribute_3')
attribute_4_input = tf.keras.Input(shape=(1,), name='attribute_4')
attribute_5_input = tf.keras.Input(shape=(1,), name='attribute_5')
attribute_6_input = tf.keras.Input(shape=(1,), name='attribute_6')

#Embedding Layer
#input_dim is the vocabulary size: number of unique categories per column, plus one
embed_size = 10
attribute_1_embedded = tf.keras.layers.Embedding(data["attribute_1"].nunique()+1, embed_size,
                                       input_length=1, name='attribute_1_embedding')(attribute_1_input)

attribute_2_embedded = tf.keras.layers.Embedding(data["attribute_2"].nunique()+1, embed_size,
                                       input_length=1, name='attribute_2_embedding')(attribute_2_input)

attribute_3_embedded = tf.keras.layers.Embedding(data["attribute_3"].nunique()+1, embed_size,
                                       input_length=1, name='attribute_3_embedding')(attribute_3_input)

attribute_4_embedded = tf.keras.layers.Embedding(data["attribute_4"].nunique()+1, embed_size,
                                       input_length=1, name='attribute_4_embedding')(attribute_4_input)

attribute_5_embedded = tf.keras.layers.Embedding(data["attribute_5"].nunique()+1, embed_size,
                                       input_length=1, name='attribute_5_embedding')(attribute_5_input)

attribute_6_embedded = tf.keras.layers.Embedding(data["attribute_6"].nunique()+1, embed_size,
                                       input_length=1, name='attribute_6_embedding')(attribute_6_input)


#LSTM Layer
num_units = 64
attribute_1_lstm = tf.keras.layers.LSTM(units=num_units)(attribute_1_embedded)
attribute_2_lstm = tf.keras.layers.LSTM(units=num_units)(attribute_2_embedded)
attribute_3_lstm = tf.keras.layers.LSTM(units=num_units)(attribute_3_embedded)
attribute_4_lstm = tf.keras.layers.LSTM(units=num_units)(attribute_4_embedded)
attribute_5_lstm = tf.keras.layers.LSTM(units=num_units)(attribute_5_embedded)
attribute_6_lstm = tf.keras.layers.LSTM(units=num_units)(attribute_6_embedded)


#Concatenate LSTM's output
concatenated = tf.keras.layers.Concatenate()([attribute_1_lstm, 
                                           attribute_2_lstm,
                                           attribute_3_lstm,
                                           attribute_4_lstm,
                                           attribute_5_lstm,
                                           attribute_6_lstm])

flatten = tf.keras.layers.Flatten()(concatenated)

#Dense layer
dense = tf.keras.layers.Dense(num_units, activation='relu')(flatten)

#Output Layer
out = tf.keras.layers.Dense(1, activation="sigmoid", name="main_output")(dense)

model = tf.keras.Model(
    inputs = [attribute_1_input,attribute_2_input,attribute_3_input,attribute_4_input,attribute_5_input,attribute_6_input],
    outputs = out,
)

model.compile(
    tf.train.AdamOptimizer(0.1),
    loss='binary_crossentropy',
    metrics=['accuracy'],
)


history = model.fit(
    [attribute_1_hashed, attribute_2_hashed,attribute_3_hashed,attribute_4_hashed,attribute_5_hashed,attribute_6_hashed],
    data.y,
    batch_size=10,
    epochs=1,
    steps_per_epoch = 1,
    verbose=0
)

What ends up happening is that the model.fit call fails:

ValueError: Input arrays should have the same number of samples as target arrays. Found 3 input samples and 3127854 target samples.
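The shape requirement behind this error can be illustrated without the model: each input passed to model.fit must be an array with one entry per row, so its length matches the target array's. Feature-column objects are not arrays. A minimal sketch of row-aligned integer encoding using pd.factorize (a toy frame with invented values, not the real data):

```python
import pandas as pd

# Toy frame standing in for the real 3,127,854-row data; values are made up.
df = pd.DataFrame({
    "attribute_2": ["yes", "no", "yes", "no"],
    "attribute_4": ["a", "b", "c", "a"],
    "y": [1, 0, 1, 0],
})

# factorize() maps each category to an integer code; each resulting array has
# one code per row, so its length matches the length of the target array.
inputs = [pd.factorize(df[col])[0] for col in ["attribute_2", "attribute_4"]]
targets = df["y"].values

# One array per model input, each the same length as the targets.
assert all(len(arr) == len(targets) for arr in inputs)
print([arr.shape for arr in inputs], targets.shape)
```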

Any input or guidance would be helpful, and as always, let me know if anything needs clarification.

Thanks!

0 Answers:

There are no answers