I have tried splitting the data into tokens; all of the data is lowercased. I want to build a Manhattan LSTM model.
I tried adding some parameters to Tokenizer(), for example:
num_words=max_nb_words
filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~'
lower=True
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_nb_words = 50000
max_seq_length = max(max(len(s) for s in x_left), max(len(s) for s in x_right))

tokenizer_left = Tokenizer(num_words=max_nb_words, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer_left.fit_on_texts(data_train['Data_Name_left'].values)
x_left_tokens = tokenizer_left.texts_to_sequences(x_left)
# pad_sequences takes the token sequences, not the tokenizer itself
x_left_pad = pad_sequences(x_left_tokens, maxlen=max_seq_length)

tokenizer_right = Tokenizer()
tokenizer_right.fit_on_texts(data_train['Data_Name_right'].values)
x_right_tokens = tokenizer_right.texts_to_sequences(x_right)
# the keyword is maxlen, not xlen
x_right_pad = pad_sequences(x_right_tokens, maxlen=max_seq_length)

vocab_size = max(len(tokenizer_left.word_index) + 1, len(tokenizer_right.word_index) + 1)
I expect texts_to_sequences to return sequences of token indices.
Answer (score: 0)
The answer is: pass Tokenizer(lower=False).
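To see why the lower flag matters, here is a minimal pure-Python sketch (not Keras itself, just an illustration mimicking how fit_on_texts builds its word_index): with lower=True, differently cased words such as "Apple" and "apple" collapse into one vocabulary entry, while lower=False keeps them distinct.

```python
from collections import OrderedDict

def build_word_index(texts, lower=True):
    """Toy imitation of Tokenizer.fit_on_texts: index words by first appearance."""
    index = OrderedDict()
    for text in texts:
        if lower:
            text = text.lower()
        for word in text.split():
            if word not in index:
                index[word] = len(index) + 1  # Keras word indices start at 1
    return dict(index)

texts = ["Apple pie", "apple tart"]
print(build_word_index(texts, lower=True))   # {'apple': 1, 'pie': 2, 'tart': 3}
print(build_word_index(texts, lower=False))  # {'Apple': 1, 'pie': 2, 'apple': 3, 'tart': 4}
```

So if case carries signal in your data (e.g. proper names), lowercasing in the Tokenizer discards it before the LSTM ever sees the sequences.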