Preprocessing: lemmatization and POS tagging

Time: 2018-09-12 20:27:18

Tags: python-3.x keras nlp pos-tagger lemmatization

I am new to deep learning, and I have found that lemmatization and POS tagging might be a good way to improve my Twitter sentiment analyzer. I am using the following Kaggle dataset: https://www.kaggle.com/kazanova/sentiment140
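As background, this is my understanding of what the two NLTK tools return on their own (assuming the punkt, wordnet, and averaged_perceptron_tagger NLTK data packages are downloaded):

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cats'))           # 'cat'  (POS defaults to noun)
print(lemmatizer.lemmatize('running', 'v'))   # 'run'  (verb POS passed explicitly)
print(pos_tag(word_tokenize('I love this movie')))
# [('I', 'PRP'), ('love', 'VBP'), ('this', 'DT'), ('movie', 'NN')]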

Here is my code:

import pandas as pd
import numpy as np
import os
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

# Hyperparameters
max_sequence_length = 100

# Load data
Base_Dir = ''
Text_Data_Dir = os.path.join(Base_Dir, 'Sentiment140.csv')
df = pd.read_csv(Text_Data_Dir, encoding='latin-1', header=None)

# Organize columns: keep only the text and a binary sentiment label
df.columns = ['sentiment', 'id', 'date', 'q', 'user', 'text']
df = df[df.sentiment != 2]                           # drop neutral tweets
df['sentiment'] = df['sentiment'].map({0: 0, 4: 1})  # remap labels 0/4 to 0/1
df = df.drop(['id', 'date', 'q', 'user'], axis=1)
df = df[['text', 'sentiment']]

# Preprocessing
tokenizer = Tokenizer(num_words=max_sequence_length)  # num_words caps the vocabulary size
lemmatizer = WordNetLemmatizer()
tokenizer.fit_on_texts(df.text)
lexicon = [lemmatizer.lemmatize(i) for i in df.text]  # lemmatize each whole tweet string
pos_tag(lexicon)                                      # POS-tag the list (return value is not stored)
sequences = tokenizer.texts_to_sequences(lexicon)
word_index = tokenizer.word_index

# Pad to a fixed length and one-hot encode the labels
preprocessed_text = pad_sequences(sequences, maxlen=max_sequence_length)
labels = df.sentiment
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', preprocessed_text.shape)
print('Shape of label tensor:', labels.shape)
print(labels)
print(preprocessed_text)

Is the output I am getting correctly lemmatized and POS-tagged, and ready to feed into an LSTM neural network, or am I missing something?
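For comparison, here is a rough sketch of my understanding of per-token, POS-aware lemmatization (get_wordnet_pos and lemmatize_tweet are helpers I wrote for illustration; get_wordnet_pos maps the Penn Treebank tags returned by pos_tag to the WordNet constants that lemmatize() accepts):

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (e.g. 'VBD', 'JJ') to the WordNet POS
    # constants that WordNetLemmatizer.lemmatize() expects.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # lemmatize() defaults to noun anyway

def lemmatize_tweet(tweet):
    # Tokenize the tweet into words, POS-tag them, then lemmatize
    # each word together with its mapped POS tag.
    tagged = pos_tag(word_tokenize(tweet))
    return ' '.join(lemmatizer.lemmatize(word, get_wordnet_pos(tag))
                    for word, tag in tagged)

lexicon = [lemmatize_tweet(t) for t in df.text]

Would something along these lines be the right way to wire the POS tags into the lemmatizer?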

Many thanks

0 Answers:

No answers