tf.nn.embedding_lookup is not working and throws an error

Date: 2018-09-10 07:34:42

Tags: python numpy tensorflow rnn

Recently I started running some experiments on the IMDB movie review dataset. I tried running my code, but unfortunately I get the following error:

TypeError: Value passed to parameter 'indices' has DataType float32 not in list of allowed values: int32, int64
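
For reference, the TensorFlow documentation says the indices argument of embedding_lookup must be int32 or int64; this tiny standalone snippet (TF 1.x, with made-up toy names, not my real data) reproduces the same kind of error:

import numpy as np
import tensorflow as tf

toy_embeddings = tf.constant(np.zeros((1000, 50), dtype=np.float32))  # small toy embedding matrix
float_ids = tf.placeholder(tf.float32, [24, 250])                     # indices declared as float32
# Uncommenting the next line fails at graph-construction time with the TypeError quoted above,
# because embedding_lookup only accepts int32/int64 for its 'indices' argument:
# vectors = tf.nn.embedding_lookup(toy_embeddings, float_ids)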

I downloaded my dataset from https://www.kaggle.com/utathya/imdb-review-dataset/version/1#.

My code is as follows:

import numpy as np
import tensorflow as tf
from os import listdir
from os.path import isfile, join
import csv
import matplotlib.pyplot as plt
import re
tf.reset_default_graph()

wordsList = open('glove.6B.50d.txt', "r", encoding="utf-8").read().splitlines()
print('Loaded the word list!')

NewwordsList = [word.split()[0] for word in wordsList]
NewwordsVector = [vector.split()[1:] for vector in wordsList]
NewwordsVector = np.asarray(NewwordsVector)
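# NewwordsList now holds the 400K GloVe tokens and NewwordsVector the matching
# 50-d vectors; note that np.asarray keeps the vector entries as strings, since
# the values read from the text file are never cast to a numeric type.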

#print(len(NewwordsList))
#print(NewwordsVector.shape)

f = open("imdb_master.csv", "r", encoding="utf8", errors="replace")
csvfiles = csv.reader(f)
numWords = []
positiveFiles = []
negativeFiles = []
for file in csvfiles:
    if file[3] in "pos":
        numWords.append(len(file[2].split()))
        positiveFiles.append(file[2])

    if file[3] in "neg":
        numWords.append(len(file[2].split()))
        negativeFiles.append(file[2])

numFiles = len(numWords)


strip_special_chars = re.compile("[^A-Za-z0-9 ]+")

def cleanSentences(string):
    string = string.lower().replace("<br />", " ")
    return re.sub(strip_special_chars, "", string.lower())

maxSeqLength = 250


totalFiles = np.zeros((numFiles,maxSeqLength), dtype="int32")
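# totalFiles[i][j] will hold the GloVe vocabulary index of the j-th word of review i;
# index 399999 (the last row of the 400K vocabulary) is used below as a fallback for
# words that are not found in the GloVe word list.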
fileCounter = 0
for pf in positiveFiles[0:4]:
    indexCounter = 0
    cleanedLine = cleanSentences(pf)
    split = cleanedLine.split()
    for word in split:
        try:
            totalFiles[fileCounter][indexCounter] = NewwordsList.index(word)
        except ValueError:
            totalFiles[fileCounter][indexCounter] = 399999 
        indexCounter = indexCounter + 1
        if indexCounter >= maxSeqLength:
            break
    fileCounter = fileCounter + 1 

for nf in negativeFiles[0:4]:
    indexCounter = 0
    cleanedLine = cleanSentences(nf)
    split = cleanedLine.split()
    for word in split:
        try:
            totalFiles[fileCounter][indexCounter] = NewwordsList.index(word)
        except ValueError:
            totalFiles[fileCounter][indexCounter] = 399999 
        indexCounter = indexCounter + 1
        if indexCounter >= maxSeqLength:
            break
    fileCounter = fileCounter + 1 

#Here we define our model for Tensorflow

batchSize = 24
lstmUnits = 64
numClasses = 2
iterations = 100000
maxDimensionLength = 300

#Placeholders for both input to network and labels
input_data = tf.placeholder(tf.float32,[batchSize, maxSeqLength])
labels = tf.placeholder(tf.float32,[batchSize,numClasses])

data = tf.Variable(tf.zeros([batchSize, maxSeqLength, maxDimensionLength]),dtype=tf.float32)
data = tf.nn.embedding_lookup(NewwordsVector,input_data)

I am trying to understand RNN models, so I attempted to build one using the dataset I downloaded. I have followed a lot of guides online, but nothing has helped.

My code works as follows:

1) I load the pretrained word vectors from GloVe (https://nlp.stanford.edu/projects/glove/). The file I use is Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d and 300d vectors, 822 MB download): glove.6B.zip.

2) I then split it into a list of words and a list of vectors.

3) I then load the IMDB dataset and build separate lists of the positive and negative reviews for sentiment analysis.

4) After that, I strip out all brackets and unwanted characters.

5) I then look up the index of every word in each positive and negative sentence and build a vector of indices for each sentence.

6) My "totalFiles" variable holds these index vectors for both the positive and negative reviews from the IMDB dataset.

7) Once I have these vectors, my "problem" starts: I try to use tf.nn.embedding_lookup(vector_argument, input_argument) to map my indices onto the list of vectors. My vectors contain indices, and tf.nn.embedding_lookup is supposed to map each index to the corresponding GloVe vector (a short sketch of this call pattern follows the list).
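
From what I understand, the lookup is supposed to be called with a numeric embedding matrix and integer-typed indices, roughly like this standalone sketch (TF 1.x, toy names and placeholder values, not my actual variables):

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 400000, 50                        # sizes of the glove.6B.50d file
params = tf.constant(np.zeros((vocab_size, embedding_dim), dtype=np.float32))
word_ids = tf.placeholder(tf.int32, [None, 250])              # integer word indices, one row per review
review_vectors = tf.nn.embedding_lookup(params, word_ids)     # shape (?, 250, 50): one 50-d vector per index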

My problem starts at line 93 of my code; everything in lines 1-92 runs correctly. The error appears as soon as the tf.nn.embedding_lookup() call is reached. Can anyone point out what is wrong with my code? Thanks in advance.

0 Answers:

There are no answers yet.