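Below is the example code for the IMDB dataset. I am a beginner following a tutorial, and I am trying to load my own dataset in Keras. How should I modify the code? I would be very grateful for any help.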
import keras
from keras.datasets import imdb
from keras.preprocessing import sequence
#Using keras to load the dataset with the top_words
max_features = 10000 #max number of words to include, words are ranked by how often they occur (in training set)
max_review_length = 1600
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print('loaded dataset...')
#Pad the sequence to the same length
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
index_dict = keras.datasets.imdb.get_word_index()
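Answer 0 (score: 0)
Here is a simple solution using Pandas and CountVectorizer. Afterwards, you need to pad the data and split it into test and training sets as shown above.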
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
data = {
'label': [0, 1, 0, 1],
'text': ['first bit of text', 'second bit of text', 'third text', 'text number four']
}
data = pd.DataFrame.from_dict(data)
# Form vocab dictionary
vectorizer = CountVectorizer()
vectorizer.fit_transform(data['text'].tolist())
vocab_text = vectorizer.vocabulary_
# Convert text
def convert_text(text):
    text_list = text.split(' ')
    return [vocab_text[t]+1 for t in text_list]
data['text'] = data['text'].apply(convert_text)
# Get X and y matrices
y = np.array(data['label'])
X = np.array(data['text'])
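The remaining pad-and-split step can be sketched roughly as follows. This is a minimal, library-free illustration and not the answerer's code: the helper names `pad_sequence` and `pad_and_split`, the `test_fraction` and `seed` parameters, and the toy index values are all assumptions made for the example. The padding mirrors the default behaviour of Keras `pad_sequences` (zeros added on the left, overlong sequences cut from the left); in practice you would probably call `sequence.pad_sequences` and something like scikit-learn's `train_test_split` instead.

```python
import random

def pad_sequence(seq, maxlen, value=0):
    """Left-pad or left-truncate one list of word indices to exactly maxlen."""
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]      # keep only the last maxlen indices
    return [value] * (maxlen - len(seq)) + seq

def pad_and_split(X, y, maxlen, test_fraction=0.25, seed=0):
    """Pad every sequence, shuffle, and split into train and test parts."""
    padded = [pad_sequence(seq, maxlen) for seq in X]
    order = list(range(len(padded)))
    random.Random(seed).shuffle(order)     # reproducible shuffle
    n_test = max(1, int(round(len(order) * test_fraction)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    X_train = [padded[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [padded[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return (X_train, y_train), (X_test, y_test)

# Toy input with the same shape as X and y above (hypothetical index values)
X = [[2, 1, 5, 7], [6, 1, 5, 7], [8, 7], [7, 4, 3]]
y = [0, 1, 0, 1]
(X_train, y_train), (X_test, y_test) = pad_and_split(X, y, maxlen=5)
```

The result has the same `(X_train, y_train), (X_test, y_test)` layout that `imdb.load_data` returns, so the rest of the tutorial code should be able to consume it after converting the lists to NumPy arrays. Index 0 is used only for padding here, which is why the answer shifts every vocabulary index by 1.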