I am trying to lemmatize the tokenized column comments_tokenized.
I do:
import nltk
from nltk.stem import WordNetLemmatizer
# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]
df1['comments_lemmatized'] = df1["comments_tokenized"].apply(lemmatize_text)
but I get
TypeError: unhashable type: 'list'
What can I do to lemmatize a column that holds a bag of words? And how do I avoid the problem of tokenization splitting [don't] into [do, n't]?
Answer 0: (score: 1)
Your function is close! Since you are using apply on the series, you do not need to reference the column inside the function. You also never use the function's input, text. So change
def lemmatize_text(text):
return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]
to
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]  # Notice the use of text
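To see why the original version raised the error, here is a minimal stdlib sketch. stemish is a hypothetical stand-in for lemmatizer.lemmatize, which expects a single string; the broken version ignores its argument and loops over the whole column, so each w is a list rather than a word:

```python
rows = [["cats", "rocks"], ["geese"]]  # stand-in for df1["comments_tokenized"]

def stemish(word):
    # Illustrative only: strip a trailing plural "s" (not real lemmatization)
    return word[:-1] if word.endswith("s") else word

# Wrong: ignores the argument and iterates over every row of the column,
# so the "lemmatizer" receives whole lists instead of words -- in NLTK this
# surfaces as TypeError: unhashable type: 'list'.
def lemmatize_wrong(text):
    return [stemish(w) for w in rows]  # w is a list here

# Right: operate on the single row that apply() actually passes in.
def lemmatize_right(text):
    return [stemish(w) for w in text]

print([lemmatize_right(r) for r in rows])  # [['cat', 'rock'], ['geese']]
```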
An example:
import pandas as pd
from nltk.stem import WordNetLemmatizer

df = pd.DataFrame({'A': [["cats", "cacti", "geese", "rocks"]]})

                             A
0  [cats, cacti, geese, rocks]

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]

df['A'].apply(lemmatize_text)

0    [cat, cactus, goose, rock]
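As for the second question: NLTK's word_tokenize follows Penn Treebank conventions and deliberately splits contractions ("don't" → "do", "n't"). If you want contractions kept whole, one option is nltk.tokenize.TweetTokenizer, which preserves them; another is a simple regex tokenizer. A minimal sketch of the regex approach (tokenize_keep_contractions is a hypothetical helper, not part of NLTK):

```python
import re

def tokenize_keep_contractions(text):
    # \w+ optionally followed by an apostrophe part, so "don't" stays one token
    return re.findall(r"\w+(?:'\w+)?", text)

print(tokenize_keep_contractions("I don't like it"))
# ['I', "don't", 'like', 'it']
```

Note that a regex this simple drops punctuation entirely, which may or may not be what you want before lemmatizing.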