I want to use Keras for authorship attribution. I have a list of (text, label) pairs. I am trying to use Keras's built-in vectorizer, but I get the following error:
Vectorizing sequence data...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/angelo/org/courses/corpusling/finalproject/src/neuralnet.py", line 46, in <module>
    X_train = tokenizer.texts_to_matrix(X_train, mode='binary')
  File "/home/angelo/org/courses/corpusling/finalproject/venv0/lib/python3.5/site-packages/keras/preprocessing/text.py", line 166, in texts_to_matrix
    sequences = self.texts_to_sequences(texts)
  File "/home/angelo/org/courses/corpusling/finalproject/venv0/lib/python3.5/site-packages/keras/preprocessing/text.py", line 131, in texts_to_sequences
    for vect in self.texts_to_sequences_generator(texts):
  File "/home/angelo/org/courses/corpusling/finalproject/venv0/lib/python3.5/site-packages/keras/preprocessing/text.py", line 150, in texts_to_sequences_generator
    i = self.word_index.get(w)
AttributeError: 'Tokenizer' object has no attribute 'word_index'
Here is my current code:
import glob
import os
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.preprocessing.text import Tokenizer
from keras.utils import np_utils
def get_label(filename):
    # The author label is the name of the directory containing the file.
    tmp = os.path.split(filename)[0]
    label = os.path.basename(tmp)
    return label

def read_file(filename):
    # Read a document's full text.
    with open(filename) as f:
        text = f.read()
    return text
traindocs = "../data/C50/C50train/*/*.txt"
testdocs = "../data/C50/C50test/*/*.txt"
documents_train = (read_file(f) for f in glob.iglob(traindocs))
labels_train = (get_label(f) for f in glob.iglob(traindocs))
documents_test = (read_file(f) for f in glob.iglob(testdocs))
labels_test = (get_label(f) for f in glob.iglob(testdocs))
df_train = pd.DataFrame([documents_train, labels_train])
df_train = df_train.transpose()
df_train.rename(columns={0: 'text', 1: 'author'}, inplace=True)
df_test = pd.DataFrame([documents_test, labels_test])
df_test = df_test.transpose()
df_test.rename(columns={0: 'text', 1: 'author'}, inplace=True)
max_words = 1000
print('Vectorizing sequence data...')
tokenizer = Tokenizer(nb_words=max_words)
X_train, Y_train = df_train.text, df_train.author
X_test, Y_test = df_test.text, df_test.author
X_train = tokenizer.texts_to_matrix(X_train, mode='binary')
X_test = tokenizer.texts_to_matrix(X_test, mode='binary')
nb_classes = np.max(Y_train) + 1
print('Convert class vector to binary class matrix (for use with categorical_crossentropy)')
Y_train = np_utils.to_categorical(Y_train, nb_classes)
Y_test = np_utils.to_categorical(Y_test, nb_classes)
model = Sequential()
model.add(Dense(output_dim=512, input_dim=(max_words,)))
model.add(Activation("relu"))
model.add(Dense(output_dim=(np.max(Y_train)+1)))
model.add(Activation("softmax"))
model.compile(loss='categorical_crossentropy',
              optimizer='sgd', metrics=['accuracy'])
model.fit(X_train, Y_train, nb_epoch=5, batch_size=32)
loss_and_metrics = model.evaluate(X_test, Y_test, batch_size=32)
Answer (score: 12):
Call tokenizer.fit_on_texts(texts) before using tokenizer.texts_to_matrix(). Here texts is the list of text data (train and test). fit_on_texts() uses it to build word_index, which is simply a mapping from each unique word to an integer; this mapping is later used to generate the matrix.
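For concreteness, here is a minimal sketch of the fix, using the question's own variables (df_train, df_test, max_words) and the same Keras 1.x Tokenizer API; all_texts is a name introduced here purely for illustration:

# Build the vocabulary first: fit_on_texts() creates word_index,
# the word-to-integer mapping that texts_to_matrix() relies on.
tokenizer = Tokenizer(nb_words=max_words)
all_texts = list(df_train.text) + list(df_test.text)
tokenizer.fit_on_texts(all_texts)

# Now vectorization works: each row is a binary bag-of-words vector.
X_train = tokenizer.texts_to_matrix(df_train.text, mode='binary')
X_test = tokenizer.texts_to_matrix(df_test.text, mode='binary')

Fitting on the combined train and test texts ensures that words appearing only at test time still have an index; fitting on the training texts alone is the stricter choice if you want to avoid any test-set leakage.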