Keras: tweet classification

Posted: 2018-11-14 18:29:09

Tags: python machine-learning keras text-classification tweets

Hello, dear forum members,

I have a data set of 20 million randomly collected individual tweets (no two tweets come from the same account). Let me call this the "general" data set. I also have another, "specific" data set of 100,000 tweets collected from drug (opioid) abusers. Each tweet has at least one tag associated with it, e.g. opioids, addiction, overdose, hydrocodone, etc. (up to 25 tags).

My goal is to use the "specific" data set to train a model with Keras, and then use it to tag tweets in the "general" data set in order to identify tweets that may have been written by drug abusers.

Following the examples in source1 and source2, I managed to build a simple working version of such a model:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.preprocessing import text

# load opioid-specific data set, where post is a tweet and tags is a single tag associated with a tweet
# how would I include multiple tags to be used in training?
data = pd.read_csv("filename.csv")

# make sure columns are strings (before splitting and tokenizing)
data['post'] = data['post'].astype(str)
data['tags'] = data['tags'].astype(str)

train_size = int(len(data) * .8)
train_posts = data['post'][:train_size]
train_tags = data['tags'][:train_size]
test_posts = data['post'][train_size:]
test_tags = data['tags'][train_size:]

# tokenize tweets
# num_words caps the vocabulary at the N most frequent words; rarer words are dropped
vocab_size = 100000
tokenize = text.Tokenizer(num_words=vocab_size)
tokenize.fit_on_texts(train_posts)
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)

# labeling
# is this where I add more columns with tags for training?
encoder = LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)

# model building
batch_size = 32
num_labels = y_train.shape[1]  # one output unit per class in the one-hot encoding

model = Sequential()
model.add(Dense(512, input_shape=(vocab_size,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_labels))
model.add(Activation('softmax'))
# y_train is one-hot encoded, so use categorical_crossentropy
# (sparse_categorical_crossentropy expects integer class indices instead)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=5, verbose=1, validation_split=0.1)

# test prediction accuracy
score = model.evaluate(x_test, y_test, 
                       batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

# make predictions using a test set
text_labels = encoder.classes_
for i in range(1000):
    prediction = model.predict(np.array([x_test[i]]))
    predicted_label = text_labels[np.argmax(prediction[0])]
    print(test_posts.iloc[i][:50], "...")
    print('Actual label: ' + test_tags.iloc[i])
    print("Predicted label: " + predicted_label)
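The comments in the code above ask how several tags per tweet could be used. One possible approach (not part of the original code) is scikit-learn's MultiLabelBinarizer, which turns a list of tags per tweet into a 0/1 matrix with one column per tag; a minimal sketch, assuming tags are stored as a comma-separated string:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical toy data: each tweet can carry several tags at once
df = pd.DataFrame({
    'post': ['tweet about opioids and overdose', 'tweet about addiction'],
    'tags': ['opioids,overdose', 'addiction'],
})

# split the comma-separated tag string into a list of tags per tweet
tag_lists = df['tags'].str.split(',')

# one column per distinct tag; a row may contain several 1s
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tag_lists)

print(mlb.classes_)  # ['addiction' 'opioids' 'overdose']
print(y)             # [[0 1 1], [1 0 0]]
```

With such multi-label targets, the last layer would use a `sigmoid` activation instead of `softmax`, and the loss would be `binary_crossentropy`, so each tag is predicted independently.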

To move forward, I would like to clarify a few things:

  1. Let's say all of my training tweets carry a single tag, opioids. If I then pass unlabeled tweets through the model, isn't it likely to simply label all of them as opioids, since it knows nothing else? Should I use a variety of different tweets/tags for learning purposes? Are there any general guidelines for selecting tweets/tags for training?
  2. How do I add more columns with tags for training (nothing like that is used in the code)?
  3. Once I have trained the model and reached adequate accuracy, how do I pass unlabeled tweets through it to make predictions?
  4. How do I add a confusion matrix?
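On the confusion matrix: scikit-learn's confusion_matrix (already imported in the code above) takes the actual and predicted label arrays directly. A minimal, self-contained sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# hypothetical actual vs. predicted tags for a handful of test tweets
actual    = ['opioids', 'addiction', 'opioids', 'overdose', 'addiction']
predicted = ['opioids', 'opioids',   'opioids', 'overdose', 'addiction']

labels = ['addiction', 'opioids', 'overdose']
cm = confusion_matrix(actual, predicted, labels=labels)

# rows = actual class, columns = predicted class
print(cm)
```

With the model above, the `predicted` array would be built from `text_labels[np.argmax(prediction[0])]` for each test tweet, and `actual` would be `test_tags`.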

Any other relevant feedback is also greatly appreciated.

Thanks!

Examples of "general" tweets:

everybody messages me when im in class but never communicates on the weekends like this when im free. feels like that anyway lol.
i woke up late, and now i look like shit. im the type of person who will still be early to whatever, ill just look like i just woke up.

Examples of "specific" tweets:

$2 million grant to educate clinicians who prescribe opioids
early and regular marijuana use is associated with use of other illicit drugs, including opioids

1 answer:

Answer 0: (score: 1)

My take on this:

  1. Create a new data set with tweets from the general + specific data. Say 200K-250K, where 100K is your specific data set and the rest is general data.

  2. Take the 25 keywords/tags and write a rule: if a tweet contains one or more of them, it is DA (drug abuser), otherwise NDA (non drug abuser). This will be your dependent variable.

  3. Your new data set will have one column containing all the tweets and another column containing the dependent variable, telling whether it is DA or NDA.

  4. Now split into train/test and train with Keras or any other algorithm, so that it can learn.

  5. Then test the model by plotting a confusion matrix.

  6. Pass the other, remaining "general" data to this model and check it.

Even if the tweets contain new words beyond the 25 from the specific data set, the model you built will still try to make an intelligent guess at the correct class based on word groupings, tone, etc.
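The rule described in steps 1-3 can be sketched as a simple keyword check; a minimal sketch, with a hypothetical `KEYWORDS` list standing in for the actual 25 tags:

```python
import pandas as pd

# hypothetical keyword list standing in for the 25 opioid-related tags
KEYWORDS = ['opioid', 'opioids', 'overdose', 'hydrocodone', 'addiction']

def label_tweet(tweet: str) -> str:
    """Rule from step 2: DA if the tweet contains any keyword, else NDA."""
    words = tweet.lower().split()
    return 'DA' if any(k in words for k in KEYWORDS) else 'NDA'

# combined general + specific tweets (step 1), labeled per the rule (steps 2-3)
df = pd.DataFrame({'post': [
    'regular marijuana use is associated with opioids',
    'i woke up late, and now i look like shit',
]})
df['label'] = df['post'].apply(label_tweet)

print(df['label'].tolist())  # ['DA', 'NDA']
```

The resulting `label` column is the binary dependent variable, and the data set can then be split into train/test as in step 4.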