Lemmatizing a tokenized column in pandas

Time: 2020-01-02 17:15:48

Tags: pandas nltk lemmatization

I am trying to lemmatize the tokenized column comment_tokenized.


I did:

import nltk
from nltk.stem import WordNetLemmatizer 

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]

df1['comments_lemmatized'] = df1["comments_tokenized"].apply(lemmatize_text)

But I get:

TypeError: unhashable type: 'list'

What can I do to lemmatize a column that holds a bag of words?

Also, how can I avoid the problem of tokenization splitting contractions like [don't] into [do, n't]?

1 Answer:

Answer 0 (score: 1)

Your function is close! Since you are using apply on the Series, you don't need to refer to the column explicitly inside the function. You also never actually use the function's input text. So change

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]

to:

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]  # note the use of text here
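
One caveat worth adding (not part of the original answer): WordNetLemmatizer relies on the WordNet corpus, so on a fresh NLTK installation the call may raise a LookupError until the data has been downloaded once:

import nltk
nltk.download('wordnet')  # one-time download of the WordNet data used by WordNetLemmatizer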

An example:

import pandas as pd
from nltk.stem import WordNetLemmatizer

df = pd.DataFrame({'A': [["cats", "cacti", "geese", "rocks"]]})

                             A
0  [cats, cacti, geese, rocks]

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]

df['A'].apply(lemmatize_text)

0    [cat, cactus, goose, rock]
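
To write the result back into a new column as the question intends (df1 and comments_tokenized are the names used in the question), the fixed function can be assigned through the same apply call:

df1['comments_lemmatized'] = df1['comments_tokenized'].apply(lemmatize_text)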