TypeError when lemmatizing words with NLTK

Asked: 2018-06-21 19:48:13

Tags: python-3.x nltk tokenize wordnet lemmatization

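I have parsed 30 Excel files and built a pandas DataFrame from them. I have tokenized the words, removed the stop words, and formed bigrams. However, when I try to lemmatize the result, I get the following error: TypeError: unhashable type: 'list'. Here is my code: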

import gensim
from gensim.utils import simple_preprocess

# Use gensim's simple_preprocess to clean up the data and tokenize it
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

# Define function for removing stop words
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

# Define function for bigrams
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# Remove stop words
data_words_nostops = remove_stopwords(data_words)
# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Define function for lemmatizing
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma(word):
    return WordNetLemmatizer().lemmatize(word)

# Lemmatize words
data_lemmatized = get_lemma(data_words_bigrams)

This last line is exactly where I get the error. How should I adjust my code to fix this? Thanks in advance.
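For context, the likely cause is that `WordNetLemmatizer.lemmatize` expects a single string, while `data_words_bigrams` is a list of token lists, so the whole nested list reaches WordNet's lookup and fails as unhashable. A minimal sketch of one possible adjustment follows, assuming each document is a list of string tokens. The helper names `lemmatize_docs` and `wordnet_lemmatize_docs` are hypothetical and not part of the original code.

```python
def lemmatize_docs(docs, lemmatize_word):
    """Apply a single-word lemmatizer to every token of every document.

    docs is assumed to be a list of token lists, e.g. [["cats", "dogs"], ...].
    lemmatize_word is any callable that maps one string to one string.
    """
    return [[lemmatize_word(token) for token in doc] for doc in docs]


def wordnet_lemmatize_docs(docs):
    """Hypothetical wrapper around NLTK's WordNetLemmatizer.

    Requires nltk to be installed and the 'wordnet' corpus to be
    available, e.g. after running nltk.download('wordnet') once.
    """
    # Imported lazily so lemmatize_docs can be used without nltk installed.
    from nltk.stem.wordnet import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()  # create once, reuse for every token
    return lemmatize_docs(docs, lemmatizer.lemmatize)
```

With this in place, the failing call would become `data_lemmatized = wordnet_lemmatize_docs(data_words_bigrams)`. Note that `lemmatize` treats every word as a noun unless a `pos` argument is passed, and tokens it cannot find in WordNet (which may include joined bigram tokens) are returned unchanged.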

As suggested, here are the first few rows of the DataFrame:

df.head()

[image: dataframe snap]

0 answers:

No answers yet