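I am currently working on building a video game recommendation system with an LSTM neural network.
I am new to deep learning and natural language processing. So far I have done the following:
Collecting data - I am using Amazon's video-Games.json dataset. Because the downloaded dataset is not split into training and test sets, I had to write some custom functions to do that, and they work correctly.
Preprocessing data - With the help of Keras, I managed to tokenize my dataset in a few lines of code.
Here is a code snippet of my DataLoader class: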
import ast
import gzip

import pandas as pd


class DataLoader:
    def __init__(self, path):
        self.path = path

    def parse(self):
        # Open the gzipped file in text mode; the with-block closes it when done.
        with gzip.open(self.path, 'rt', encoding='utf-8') as g:
            for line in g:
                # ast.literal_eval is used here in place of eval: it reads the
                # same dict literals but cannot execute code found in the file.
                yield ast.literal_eval(line.strip())

    def convert_to_dataframe(self):
        # One row per review, indexed 0..n-1.
        df = {i: d for i, d in enumerate(self.parse())}
        return pd.DataFrame.from_dict(df, orient='index')

    def create_train_test_set(self, data, labels, ratio=75):
        # Sequential split (no shuffling): the first ratio% of the examples
        # go to the training set, the remainder to the test set.
        data, labels = list(data), list(labels)
        split = round(len(data) * (ratio / 100))
        return data[:split], labels[:split], data[split:], labels[split:]
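Note that the split above takes the first 75% of the rows in file order. If the file happens to be sorted (for example by product or by date), the test set may not resemble the training set. A minimal sketch of a shuffled alternative follows, using only the standard library; `shuffled_split` and its `seed` parameter are hypothetical names introduced for illustration and are not part of the class above.

```python
import random


def shuffled_split(data, labels, ratio=75, seed=0):
    """Shuffle (data, label) pairs together, then split them ratio/(100 - ratio)."""
    pairs = list(zip(data, labels))
    # A seeded Random instance keeps the split reproducible between runs.
    random.Random(seed).shuffle(pairs)
    split_index = round(len(pairs) * ratio / 100)
    train, test = pairs[:split_index], pairs[split_index:]
    train_x = [x for x, _ in train]
    train_y = [y for _, y in train]
    test_x = [x for x, _ in test]
    test_y = [y for _, y in test]
    return train_x, train_y, test_x, test_y


if __name__ == "__main__":
    # Toy data, made up for the example.
    reviews = ["great game", "boring", "fun shooter", "too short"]
    ratings = [5, 2, 4, 3]
    train_x, train_y, test_x, test_y = shuffled_split(reviews, ratings)
    print(len(train_x), len(test_x))
```

Shuffling the pairs together, rather than shuffling the two lists separately, keeps every review attached to its own label.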
Here is a snippet of my DataPreprocessing class:
# Import paths below assume Keras 2.x; they may differ in other versions.
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer


class DataPreprocessing:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def create_train_test_token_set(self, train_set_x, test_set_x):
        # Convert each review into a list of integer word indices.
        train_token_set_x = self.tokenizer.texts_to_sequences(train_set_x)
        test_token_set_x = self.tokenizer.texts_to_sequences(test_set_x)
        return train_token_set_x, test_token_set_x

    def pad(self, train_token_set_x, test_token_set_x, maxlen, padding, truncating):
        # Keyword arguments are required here: the third positional
        # parameter of pad_sequences is dtype, not padding.
        train_pad_x = pad_sequences(train_token_set_x, maxlen=maxlen,
                                    padding=padding, truncating=truncating)
        test_pad_x = pad_sequences(test_token_set_x, maxlen=maxlen,
                                   padding=padding, truncating=truncating)
        return train_pad_x, test_pad_x


tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(reviews)
data_preprocessor = DataPreprocessing(tokenizer)
train_token_set_x, test_token_set_x = data_preprocessor.create_train_test_token_set(train_set_x, test_set_x)
print('Training example : ', train_set_x[1])
print('Tokenized training example : ', train_token_set_x[1])
After printing, I get the following output, which is correct (I have checked it):
Training example : I want to start off by saying I have never played the Call of Duty games. This is only the second first person shooter game that I have own. I think it is a lot of fun. Has good graphics and nice story line. It does take some skill to get through the levels. I think all players can enjoy this game. There are three levels to choose from based on your skill level. If your looking for first person shooter game that has current military type play than this is a good buy.
Tokenized training example : [6, 128, 4, 258, 145, 70, 758, 6, 19, 119, 67, 1, 854, 8, 2829, 22, 11, 9, 58, 1, 325, 56, 382, 991, 5, 13, 6, 19, 179, 6, 111, 7, 9, 3, 139, 8, 55, 43, 42, 47, 2, 296, 86, 357, 7, 194, 200, 57, 1073, 4, 32, 120, 1, 135, 6, 111, 23, 343, 31, 291, 11, 5, 41, 16, 297, 135, 4, 466, 48, 469, 20, 37, 1073, 132, 26, 37, 244, 14, 56, 382, 991, 5, 13, 43, 1517, 3046, 499, 34, 75, 11, 9, 3, 42, 87]
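To make the integer output above concrete, here is a simplified pure-Python sketch of the idea behind the tokenizing and padding steps: each word gets an integer index in order of descending frequency (index 1 is the most frequent word, 0 is reserved for padding), and every sequence is then padded or truncated to a common length. This is an illustration only, not Keras's implementation. `tokenize`, `build_word_index`, `texts_to_ids` and `pad` are hypothetical helper names, and the regex-based word splitting is an assumption that differs from Keras's default filtering.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters, digits and apostrophes as words.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_word_index(texts, num_words=None):
    """Map each word to an integer; more frequent words get smaller indices."""
    counts = Counter(word for text in texts for word in tokenize(text))
    # most_common sorts by descending count; index 0 is left free for padding.
    ranked = [word for word, _ in counts.most_common(num_words)]
    return {word: i for i, word in enumerate(ranked, start=1)}


def texts_to_ids(texts, word_index):
    # Words that are not in the index are dropped.
    return [[word_index[w] for w in tokenize(t) if w in word_index] for t in texts]


def pad(sequences, maxlen, padding="pre", truncating="pre"):
    """Pad with zeros or truncate so every sequence has length maxlen (maxlen >= 1)."""
    padded = []
    for seq in sequences:
        # "pre" truncation keeps the last maxlen items, "post" keeps the first.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [0] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


if __name__ == "__main__":
    # Toy reviews, made up for the example.
    texts = ["the game is fun", "the game is long", "the end"]
    word_index = build_word_index(texts)
    ids = texts_to_ids(texts, word_index)
    print(ids)
    print(pad(ids, maxlen=3, padding="post", truncating="post"))
```

The padded output is a rectangular list of integers, which is the shape a network input layer expects; the integers themselves are only identifiers and carry no notion of similarity between words.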
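At this point I am having some difficulty with vectorizing the data. I know that I have to vectorize the data before passing it as input to the LSTM network, but I am not sure whether I should use:
What are the advantages and disadvantages of the first option, and of the second?
Note: Once I have the vectors, I plan to use the K-means clustering algorithm to find users with similar reviews, so that my model's predictions are more effective.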
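As a rough illustration of the clustering step described in the note, the sketch below runs a plain K-means (Lloyd's algorithm) over small fixed-length vectors using only the standard library. The helper names, the choice of the first k vectors as initial centroids, and the toy two-dimensional vectors are all assumptions made for the example; in practice each user would be represented by whatever vector the vectorization step produces.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, max_iterations=100):
    """Lloyd's algorithm; returns (centroids, cluster index for each vector).

    Assumes len(vectors) >= k and that all vectors have the same length.
    """
    # Deterministic initialisation: use the first k vectors as centroids.
    centroids = [list(v) for v in vectors[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach every vector to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # No vector changed cluster, so the algorithm has converged.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # Keep the old centroid if a cluster ends up empty.
                centroids[c] = mean_vector(members)
    return centroids, assignments


if __name__ == "__main__":
    # Toy "user vectors", made up for the example.
    users = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
    centroids, clusters = kmeans(users, k=2)
    print(clusters)
```

K-means needs every input to be a vector of the same fixed length, so variable-length token sequences would first have to be reduced to one vector per review or per user.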
If anyone has suggestions on how to solve this, or can point out anything I am missing, I would really appreciate the help.