My text classifier model doesn't improve across multiple classes

Asked: 2019-11-18 04:39:27

Tags: python pandas tensorflow machine-learning text-classification

I am trying to train a model for text classification. The model takes a list of up to 300 integers embedded from articles. The model trains without any problem, but the accuracy barely improves.

The target consists of 41 categories, encoded as ints from 0 to 41 and then normalized.

The table looks like this:

Table1

Also, I don't know what my model should look like, since I have been referring to the two different examples below:

  • A binary classifier with one input column and one output column (Example 1)
  • A multi-class classifier with multiple columns as input (Example 2)

I have tried modifying my model based on both of them, but the accuracy does not change and even decreases every epoch.

Should I add more layers to my model, or have I done something stupid that I haven't realized?

Note: If the "df.pickle" download link is broken, use this link

from sklearn.model_selection import train_test_split
from urllib.request import urlopen
from os.path import exists
from os import mkdir
import tensorflow as tf
import pandas as pd
import pickle

# Define dataframe path
df_path = 'df.pickle'

# Check if local dataframe exists
if not exists(df_path):
  # Download binary from dropbox
  content = urlopen('https://ucd92a22d5e0d4d29b8edb608305.dl.dropboxusercontent.com/cd/0/get/Askx_25n3JI-jmnZsWXmMmRgd4O2EH1w9l0U6zCMq7xdSXs_IN_i2zuUviseqa9N7-WrReFbGhQi8CeseV5cNsFTO8dzRmSdxjr-MWEDQNpPaZ8Ik29E_58YAjY57qTc4CA/file#').read()

  # Write to file
  with open(df_path, 'wb') as file: file.write(content)

  # Load the dataframe from bytes
  df = pickle.loads(content)
# If the file exists (aka. downloaded)
else:
  # Load the dataframe from file
  df = pickle.load(open(df_path, 'rb'))

# Normalize the category
df['Category_Code'] = df['Category_Code'].apply(lambda x: x / 41)

train_df, test_df = [pd.DataFrame() for _ in range(2)]

x_train, x_test, y_train, y_test = train_test_split(df['Content_Parsed'], df['Category_Code'], test_size=0.15, random_state=8)
train_df['Content_Parsed'], train_df['Category_Code'] = x_train, y_train
test_df['Content_Parsed'], test_df['Category_Code'] = x_test, y_test

# Variable containing the number of words we want to keep in our vocabulary
NUM_WORDS = 10000
# Input/Token length
SEQ_LEN = 300

# Create tokenizer for our data
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=NUM_WORDS, oov_token='<UNK>')
tokenizer.fit_on_texts(train_df['Content_Parsed'])

# Convert text data to numerical indexes
train_seqs=tokenizer.texts_to_sequences(train_df['Content_Parsed'])
test_seqs=tokenizer.texts_to_sequences(test_df['Content_Parsed'])

# Pad data up to SEQ_LEN (note that we truncate if there are more than SEQ_LEN tokens)
train_seqs=tf.keras.preprocessing.sequence.pad_sequences(train_seqs, maxlen=SEQ_LEN, padding="post")
test_seqs=tf.keras.preprocessing.sequence.pad_sequences(test_seqs, maxlen=SEQ_LEN, padding="post")

# Create Models folder if not exists
if not exists('Models'): mkdir('Models')

# Define local model path
model_path = 'Models/model.pickle'

# Check if model exists/pre-trained
if not exists(model_path):
  # Define word embedding size
  EMBEDDING_SIZE = 16

  # Create new model
  '''
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
    # tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  '''
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_SIZE),
      # tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
      tf.keras.layers.GlobalAveragePooling1D(),
      tf.keras.layers.Dense(EMBEDDING_SIZE, activation='relu'),
      tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  # Compile the model
  model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
  )

  # Stop training when a monitored quantity has stopped improving.
  es = tf.keras.callbacks.EarlyStopping(monitor='val_acc', mode='max', patience=1)

  # Define batch size (Can be tuned to improve model accuracy)
  BATCH_SIZE = 16
  # Define the number of epochs to train
  EPOCHS = 20

  # Use the GPU (if this raises an error, you don't have a GPU; train on the CPU instead)
  with tf.device('/GPU:0'):
    # Train/Fit the model
    history = model.fit(
      train_seqs, 
      train_df['Category_Code'].values, 
      batch_size=BATCH_SIZE, 
      epochs=EPOCHS, 
      validation_split=0.2,
      validation_steps=30,
      callbacks=[es]
    )

  # Evaluate the model
  model.evaluate(test_seqs, test_df['Category_Code'].values)

  # Save the model into a file
  with open(model_path, 'wb') as file: file.write(pickle.dumps(model))
else:
  # Load the model
  model = pickle.load(open(model_path, 'rb'))

# Check the model
model.summary()

1 Answer:

Answer 0 (Score: 0)

After 2 days of tweaking and studying more examples, I found this website, which explains multi-class classification well.

The details of the changes I made are:

  1. Since I am building a model for multiple classes, the model should be compiled with categorical_crossentropy as its loss function instead of binary_crossentropy (a short sketch of this change follows the list).

  2. The model should produce a number of outputs equal to the total number of classes you are classifying, which in my case is 41.

  3. Furthermore, you only need an activation function on the last layer of the model, and it should be "softmax".

  4. You will need to adjust the layers accordingly based on the number of classes you want to classify. See here for information on how to improve the model.
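
For reference, here is a minimal sketch of the core change described in points 1 to 3 (the full final code follows further down). The layer sizes are placeholders, and it assumes integer labels from 0 to 40, in which case sparse_categorical_crossentropy can be used instead of one-hot encoding the labels and using categorical_crossentropy.

import tensorflow as tf

NUM_WORDS = 10000    # vocabulary size (placeholder)
SEQ_LEN = 300        # padded sequence length (placeholder)
NUM_CLASSES = 41     # one output unit per category

# Multi-class setup: the last layer has NUM_CLASSES units and a softmax activation
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(NUM_WORDS, 64, input_length=SEQ_LEN),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])

# With integer labels (0..40) use sparse_categorical_crossentropy;
# with one-hot labels (e.g. from pd.get_dummies) use categorical_crossentropy instead.
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)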

The final code is as follows:

from sklearn.model_selection import train_test_split
from urllib.request import urlopen
from functools import reduce
from os.path import exists
from os import listdir
from sys import exit
import tensorflow as tf
import pandas as pd
import pickle
import re

# Specify dataframe path
df_path = 'df.pickle'
# Check if the file exists
if not exists(df_path):
  # Specify url of the dataframe binary
  url = 'https://www.dropbox.com/s/76hibe24hmpz3bk/df.pickle?dl=1'
  # Read the byte content from url
  content = urlopen(url).read()
  # Write the raw pickle bytes to a file to save time on future runs
  with open(df_path, 'wb') as file: file.write(content)
  # Unpickle the dataframe
  df = pickle.loads(content)
else:
  # Load the pickle dataframe
  df = pickle.load(open(df_path, 'rb'))

# Useful variables
MAX_NUM_WORDS = 50000                        # Vocabulary size for our tokenizer
MAX_SEQ_LENGTH = 600                         # Maximum length of tokens (for padding later)
EMBEDDING_SIZE = 256                         # Embedding size (Tweak to improve accuracy)
OUTPUT_LENGTH = len(df['Category'].unique()) # Number of classes to be classified

# Create our tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_NUM_WORDS, lower=True)
# Fit our tokenizer with words/tokens
tokenizer.fit_on_texts(df['Content_Parsed'].values)
# Get our token vocabulary
word_index = tokenizer.word_index
print('Found {} unique tokens'.format(len(word_index)))

# Parse our text into sequence of numbers using our tokenizer
X = tokenizer.texts_to_sequences(df['Content_Parsed'].values)
# Pad the sequence up to the MAX_SEQ_LENGTH
X = tf.keras.preprocessing.sequence.pad_sequences(X, maxlen=MAX_SEQ_LENGTH)
print('Shape of feature tensor: {}'.format(X.shape))

# Convert our labels into dummy variable (More info on the link provided above)
Y = pd.get_dummies(df['Category']).values
print('Shape of label tensor: {}'.format(Y.shape))

# Split our features and labels into test and train dataset
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=42)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Creating our model
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(MAX_NUM_WORDS, EMBEDDING_SIZE, input_length=MAX_SEQ_LENGTH))
model.add(tf.keras.layers.SpatialDropout1D(0.2))
# The number 64 could be changed based on your model performance
model.add(tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2))
# Output layer with OUTPUT_LENGTH units (one per class)
model.add(tf.keras.layers.Dense(OUTPUT_LENGTH, activation='softmax'))
# Compile our model with "categorical_crossentropy" loss function
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Model variables
EPOCHS = 100                          # Number of epochs to run (early stopping may end training sooner)
BATCH_SIZE = 64                       # Batch size (Tweaking this may improve model performance a bit)
checkpoint_path = 'model_checkpoints' # Checkpoint path of our model

# Use GPU if available
with tf.device('/GPU:0'):
  # Fit/Train our model
  history = model.fit(
    x_train, y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.1,
    callbacks=[
      tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.0001),
      tf.keras.callbacks.ModelCheckpoint(
        checkpoint_path, 
        monitor='val_acc',  # on newer TF/Keras versions this metric is named 'val_accuracy'
        save_best_only=True, 
        save_weights_only=False
      )
    ],
    verbose=1
  )

Now my model's accuracy is good and improves every epoch, but since the validation accuracy (val_acc of 76% to 77%) is not performing as well, I will probably need to tweak the model a bit.
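
Below is a minimal sketch of how the best checkpoint saved by the ModelCheckpoint callback could be reloaded and used for prediction, assuming the same checkpoint_path ('model_checkpoints') and reusing the df and x_test variables defined above; the label mapping is recovered from the same pd.get_dummies call used to build Y.

import tensorflow as tf
import pandas as pd

# Reload the best model saved by the ModelCheckpoint callback
model = tf.keras.models.load_model('model_checkpoints')

# Column order of pd.get_dummies matches the one-hot encoding used for Y above
categories = list(pd.get_dummies(df['Category']).columns)

# Predict class probabilities for the held-out test set and map each argmax back to a category name
probs = model.predict(x_test)
predicted_labels = [categories[i] for i in probs.argmax(axis=1)]
print(predicted_labels[:5])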

A snapshot of the output is provided below:

Output snapshot.png