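I am trying to create a dictionary for NLP, where the final output should look something like [{text: "blah blah blah"}, "positive"].
However, when I try to build the "text: blah blah blah" dictionaries, the output I get has only one entry, even though I am working with a list.
Here is the setup code.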
import csv

training_text = []
training_tag = []
with open("training.csv", encoding="ISO-8859-1") as csvfile:
    list_reader = csv.reader(csvfile)
    for row in list_reader:
        text = row[0]
        tag = row[1]
        training_text.append(text)
        training_tag.append(tag)
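As a side note on the loading step: the loop above also stores the header row in both lists, so any later slicing has to be applied to the texts and the tags alike. A minimal sketch of skipping the header while reading is shown here. It uses an in-memory string instead of training.csv, and the sample rows and the helper name `load_rows` are made up for illustration.

```python
import csv
import io

def load_rows(csvfile):
    """Return (texts, tags) from a two-column CSV, skipping the header row."""
    reader = csv.reader(csvfile)
    next(reader, None)  # consume the header; the default avoids an error on an empty file
    texts, tags = [], []
    for row in reader:
        texts.append(row[0])
        tags.append(row[1])
    return texts, tags

# Hypothetical sample data standing in for training.csv
sample = io.StringIO("text,tag\nI love this,positive\nI hate this,negative\n")
texts, tags = load_rows(sample)
print(texts)
print(tags)
```

With the header consumed up front, the two lists stay aligned index by index and no `[1:]` slice is needed later.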
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# load the stopword list and the stemmer once, outside the loop
stop_words = stopwords.words('english')
p_stemmer = PorterStemmer()

training_text_stem = []
for doc in training_text[1:]:  # skip first row, which is header
    # tokenize text
    tok = nltk.word_tokenize(doc)
    text = nltk.Text(tok)
    # normalize words
    words = [w.lower() for w in text if w.isalpha()]
    # build vocabulary
    vocab = sorted(set(words))
    # remove stopwords
    vocab_redux = [w for w in vocab if w not in stop_words]
    # stemming to reduce topically similar words to their root
    vocab_stem = [p_stemmer.stem(i) for i in vocab_redux]
    training_text_stem.append(vocab_stem)
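This is where it breaks. I have tried it two ways: with a dict/zip construction and with a for loop. In both cases the output is just one entry rather than the whole list.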
key = ['text']*len(training_text_stem)
training_dictionary = dict(zip(training_text_stem, key))
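A small sketch of how `dict(zip(...))` behaves here may help. Dictionary keys must be hashable and unique, so lists cannot serve as keys, and if the arguments are swapped so that every pair has the key 'text', each pair overwrites the previous one. The `docs` list below is a hypothetical stand-in for training_text_stem.

```python
# Hypothetical stand-in for training_text_stem: one list of stems per document
docs = [["love", "movi"], ["hate", "plot"], ["great", "act"]]
key = ["text"] * len(docs)

# Lists are unhashable, so they cannot be used as dictionary keys
try:
    dict(zip(docs, key))
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)  # TypeError

# With the arguments swapped, every pair shares the key 'text',
# so each one overwrites the last and only the final document survives
collapsed = dict(zip(key, docs))
print(collapsed)  # {'text': ['great', 'act']}

# One dictionary per document keeps every entry
per_doc = [{"text": doc} for doc in docs]
print(len(per_doc))  # 3
```

A single dictionary can hold the key 'text' only once, which is why a list of one-entry dictionaries is the shape that matches the desired output.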
def makeadictionary(document):
    dictionarylist = []
    for doc in document:
        dictionarylist.append({'text': doc})
    return dictionarylist

makeadictionary(training_text_stem)
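For the loop version, one plausible explanation of the single-entry result, assuming the `return` was indented inside the loop in the version that was run, is that the function exits after the first document. The sketch below keeps the `return` after the loop and then pairs each dictionary with its label to get the target shape. The names `make_dictionaries`, `docs` and `tags` are hypothetical; in the setup code the labels would be `training_tag[1:]`, since that list also contains the header row.

```python
def make_dictionaries(documents):
    """Wrap each document in its own {'text': ...} dictionary."""
    result = []
    for doc in documents:
        result.append({"text": doc})
    # The return sits after the loop; indented inside the loop,
    # it would hand back a list holding only the first document
    return result

# Hypothetical stand-ins for training_text_stem and training_tag[1:]
docs = [["love", "movi"], ["hate", "plot"]]
tags = ["positive", "negative"]

features = make_dictionaries(docs)
# Pair each feature dictionary with its label
training_data = list(zip(features, tags))
print(training_data)
```

Note that the return value has to be assigned or printed; calling the function on its own line in a script discards the result.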