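I want to use Universal Sentence Encoder embeddings with a recurrent network.
With conventional word embeddings, an RNN encodes each word as a vector, and the number of time steps is the number of words in the sentence.
What I want to do instead is use sentence embeddings to encode each sentence as a 512-dimensional vector. The number of RNN time steps would then be the number of sentences in the text, which in my case is an IMDB review.
I am trying this on IMDB binary classification. The problem is that no matter how I tune the hyperparameters, the model does not learn. Training and test accuracy stay at 50%, which means the model only ever predicts one of the two classes.
Any help would be appreciated!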
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 128)               131584
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258
=================================================================
Total params: 131,842
Trainable params: 131,842
Non-trainable params: 0
_________________________________________________________________
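As a cross-check on the summary above, the parameter counts can be reproduced from the standard formulas for an LSTM layer (four gates, one bias vector each) and a Dense layer. This is only a sketch: the helper names `lstm_params` and `dense_params` are illustrative, and the input width is solved for from the reported count rather than taken from the post. With 128 units, 131,584 parameters corresponds to 128 features per time step rather than 512, so it may be worth confirming the shape of the array that is actually fed to the LSTM.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


units = 128        # from the output shape (None, 128) in the summary
reported = 131584  # lstm_1 parameter count in the summary

# Solve 4 * units * (input_dim + units + 1) = reported for input_dim.
implied_input_dim = reported // (4 * units) - units - 1

print(implied_input_dim)          # 128
print(lstm_params(512, units))    # 328192, the count expected for 512-d inputs
print(dense_params(units, 2))     # 258, matches dense_1
```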
WARNING:tensorflow:From C:\Users\shaggyday\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/10
249/249 [==============================] - 55s 220ms/step - loss: 0.6937 - acc: 0.5004 - val_loss: 0.6931 - val_acc: 0.5061
Epoch 2/10
249/249 [==============================] - 68s 274ms/step - loss: 0.6970 - acc: 0.5002 - val_loss: 0.6942 - val_acc: 0.5009
Epoch 3/10
249/249 [==============================] - 71s 285ms/step - loss: 0.6947 - acc: 0.4961 - val_loss: 0.6980 - val_acc: 0.5009
Epoch 4/10
249/249 [==============================] - 70s 279ms/step - loss: 0.6938 - acc: 0.4998 - val_loss: 0.6956 - val_acc: 0.5033
Epoch 5/10
249/249 [==============================] - 66s 267ms/step - loss: 0.6936 - acc: 0.5018 - val_loss: 0.6939 - val_acc: 0.5046
Epoch 6/10
249/249 [==============================] - 63s 251ms/step - loss: 0.6931 - acc: 0.5003 - val_loss: 0.6933 - val_acc: 0.5058
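The loss values in this log hover around 0.693, which is ln 2: the cross-entropy of a two-class classifier that assigns probability 0.5 to the true class on every example. The short check below uses plain Python and assumes nothing about the model itself. A loss pinned at this value is consistent with the network producing an essentially constant output regardless of its input.

```python
import math


def cross_entropy(p_true_class):
    # Per-example cross-entropy given the probability assigned to the true class.
    return -math.log(p_true_class)


chance_loss = cross_entropy(0.5)
print(round(chance_loss, 4))  # 0.6931
```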
The code used to pre-embed the text is:
import os
import pickle

import pandas as pd
import tensorflow as tf  # TF 1.x API (tf.Session, hub.Module)
import tensorflow_hub as hub
from nltk.tokenize import sent_tokenize  # assumed to be NLTK's sentence tokenizer

file = 'train.csv'
df = pd.read_csv(file)
# df['sentiment'] = [1 if sentiment == 'positive' else 0 for sentiment in df['sentiment'].values]
x = df['review'].values
y = df['sentiment'].values

# Split every review into a list of sentences.
x_sent = [sent_tokenize(review) for review in x]

num_sample = len(x)
val_split = int(num_sample * 0.5)
# First half for training, second half for testing, so the two do not overlap.
x_train, y_train = x_sent[:val_split], y[:val_split]
x_test, y_test = x_sent[val_split:], y[val_split:]

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/2"
out_dir = 'use(dan)'
os.makedirs(out_dir, exist_ok=True)
embed = hub.Module(module_url)
config = tf.ConfigProto()  # assumption: default session config (not defined in the post)
num_files = 10


def batch_embed(batch, labels, lens, set_, n):
    """
    batch:  1-D list of sentences (all reviews of the chunk, flattened)
    labels: one label per review
    lens:   number of sentences in each review
    set_:   'train' | 'test'
    n:      index of the output file
    """
    path = os.path.join(out_dir, 'embed_{}_{}.bin'.format(set_, n))
    if os.path.exists(path):
        return
    print('Getting embeddings for the {} data'.format(set_))
    with tf.Session(config=config) as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        embeddings = session.run(embed(batch))  # shape: (num_sentences, 512)

    # Regroup the flat sentence embeddings into one array per review.
    offset = 0
    review_embeddings = []
    for i, l in enumerate(lens):
        if l == 0:
            print('review', i, 'has no sentences')
        review_embeddings.append(embeddings[offset:offset + l])
        offset += l
    with open(path, 'wb') as f:
        pickle.dump((review_embeddings, labels), f)


def embed_split(reviews, labels, set_):
    """Embed one split in up to `num_files` chunks of reviews."""
    n_file = -(-len(reviews) // num_files)  # ceiling division, so no review is dropped
    for n in range(num_files):
        review_batch = reviews[n * n_file:(n + 1) * n_file]
        if not review_batch:
            continue
        label_batch = labels[n * n_file:(n + 1) * n_file]
        lens = [len(review) for review in review_batch]
        sent_batch = [sent for review in review_batch for sent in review]
        print(len(sent_batch))
        batch_embed(sent_batch, label_batch, lens, set_, n)


embed_split(x_train, y_train, 'train')
embed_split(x_test, y_test, 'test')
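The embedding code above flattens all sentences of a chunk into one list and then cuts the result back into reviews using per-review sentence counts. Since a mismatch between those counts and the number of embeddings would silently misalign reviews and labels, a small check like the following may help. It is a sketch with a hypothetical helper name (`split_by_lengths`) and toy strings standing in for the 512-dimensional vectors; the same slicing works on a NumPy array.

```python
def split_by_lengths(flat, lens):
    """Split a flat sequence into consecutive chunks of the given lengths."""
    if sum(lens) != len(flat):
        raise ValueError(
            'sum(lens)={} but got {} items'.format(sum(lens), len(flat)))
    chunks, offset = [], 0
    for l in lens:
        chunks.append(flat[offset:offset + l])
        offset += l
    return chunks


# Toy stand-in: six "sentence embeddings" from reviews of 2, 1 and 3 sentences.
flat = ['s0', 's1', 's2', 's3', 's4', 's5']
reviews = split_by_lengths(flat, [2, 1, 3])
print(reviews)  # [['s0', 's1'], ['s2'], ['s3', 's4', 's5']]
```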
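The model is a very simple LSTM with one layer and 256 units. Because the number of sentences differs from one IMDB review to the next, each batch is padded.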
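The padding code is not shown in the post, so the following is only a sketch of one way to do it with NumPy: each review is a `(num_sentences, 512)` array, and the batch is zero-padded at the end up to the longest review. The function name `pad_reviews`, the choice of post-padding, and the random stand-in vectors are all assumptions made for illustration.

```python
import numpy as np


def pad_reviews(review_embeddings, dim=512, pad_value=0.0):
    """Stack variable-length (num_sentences, dim) arrays into one padded batch."""
    max_len = max(len(r) for r in review_embeddings)
    batch = np.full((len(review_embeddings), max_len, dim), pad_value,
                    dtype=np.float32)
    mask = np.zeros((len(review_embeddings), max_len), dtype=bool)
    for i, r in enumerate(review_embeddings):
        n = len(r)
        if n:
            batch[i, :n] = r   # real sentences first
            mask[i, :n] = True  # True marks real (non-padded) time steps
    return batch, mask


# Toy batch: three "reviews" with 2, 0 and 3 sentences (random stand-ins for USE vectors).
rng = np.random.default_rng(0)
reviews = [rng.normal(size=(n, 512)).astype(np.float32) for n in (2, 0, 3)]
batch, mask = pad_reviews(reviews)
print(batch.shape)       # (3, 3, 512)
print(mask.sum(axis=1))  # [2 0 3]
```

If the model is built with Keras, placing a `Masking(mask_value=0.0)` layer in front of the LSTM is one common way to make it skip the padded time steps. Whether the original model does this is not shown in the post.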