Question

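I have been trying to use the Keras Tokenizer to limit my NLP model's vocabulary to vocab_size = 2000; however, I don't understand why I get a total of 690960 tokens when I count the unique tokens. My code uses the Sentiment140 dataset from Kaggle: https://www.kaggle.com/kazanova/sentiment140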

My code is:

# coding=latin-1
import pandas as pd
import numpy as np
import os
from timeit import default_timer as timer
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

#Hyperparameters
max_sequence_length = 200
vocab_size = 2000

#Load data
Base_Dir = ''
Text_Data_Dir = os.path.join(Base_Dir, 'Sentiment140.csv')
df = pd.read_csv(Text_Data_Dir, encoding='latin-1', header=None)

#Organize columns
df.columns = ['sentiment', 'id', 'date', 'q', 'user', 'text']
df = df[df.sentiment != 2]
df['sentiment'] = df['sentiment'].map({0: 0, 4: 1})
df = df.drop(['id', 'date', 'q', 'user'], axis=1)
df = df[['text','sentiment']]

#Preprocessing
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(df.text)
sequences = tokenizer.texts_to_sequences(df.text)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

preprocessed_text = pad_sequences(sequences, maxlen=max_sequence_length)
labels = df.sentiment
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', preprocessed_text.shape)
print('Shape of label tensor:', labels.shape)

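At this point I'm really stuck with this huge number of distinct tokens, which makes my model overfit far too much. Can anyone give me a hint about what I'm misunderstanding?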

Thanks a lot!
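A note on the likely cause, as a minimal sketch rather than the actual Keras source: as far as I understand the Tokenizer, num_words is applied only when texts are converted to sequences, while word_index always keeps every word seen during fit_on_texts. The pure-Python stand-in below mimics that behavior; the helper names fit_word_index and texts_to_sequences and the toy texts are hypothetical and exist only for illustration.

```python
from collections import Counter


def fit_word_index(texts):
    """Build a frequency-ranked index over every word seen (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Encode texts, keeping only words whose index is below num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


# Toy corpus (hypothetical), standing in for the tweet column
texts = ["the cat sat", "the cat ran", "the dog barked loudly"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=3)

print(len(word_index))  # size of the full vocabulary, unaffected by num_words
print(sequences)        # only indices below num_words appear here
```

If the same holds for the code above, len(word_index) reports the full vocabulary of the dataset, and the cap should show up in sequences instead. Checking the largest index that actually occurs in sequences (it should be below vocab_size) would confirm whether the limit is being applied.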

Problem limiting the number of tokens in Keras

0 answers: