I'm trying to apply a RegexpTokenizer to a column of a dataframe.
Dataframe:
all_cols
0 who is your hero and why
1 what do you do to relax
2 can't stop to eat
4 how many hours of sleep do you get a night
5 describe the last time you were relax
Script:
import re
import nltk
import pandas as pd
from nltk import RegexpTokenizer
# tokenize the data and drop missing values (NaN)
df['all_cols'].dropna(inplace=True)
tokenizer = RegexpTokenizer("[\w']+")
df['all_cols'] = df['all_cols'].apply(tokenizer)
Error:
TypeError: 'RegexpTokenizer' object is not callable
But I don't understand why. When I use another nltk tokenization method, word_tokenize, it works fine...
Answer 0 (score: 2)
Note that when you call RegexpTokenizer, you are only creating an instance of the class with a given set of arguments (i.e., invoking its __init__ method).
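A minimal sketch of the distinction:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"[\w']+")   # builds the tokenizer; nothing is tokenized yet
print(callable(tokenizer))               # False: the instance defines no __call__
#tokenizer("can't stop")                 # uncommenting this reproduces the TypeError
print(tokenizer.tokenize("can't stop"))  # ["can't", 'stop']: the bound method does the work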
To actually tokenize the dataframe column with the specified pattern, you have to call its RegexpTokenizer.tokenize method:
tokenizer = RegexpTokenizer(r"[\w']+")  # raw string avoids the invalid \w escape warning
df['all_cols'] = df['all_cols'].map(tokenizer.tokenize)
all_cols
0 [who, is, your, hero, and, why]
1 [what, do, you, do, to, relax]
...
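Note that map and apply are interchangeable here; the key point is passing the bound method tokenizer.tokenize rather than the instance itself. This also explains why word_tokenize worked for you: it is a module-level function, hence callable, while a RegexpTokenizer instance is not.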
Answer 1 (score: 1)
First, to remove the missing values you have to use DataFrame.dropna with the column name specified, and only then apply the tokenizer.tokenize method, because your solution does not actually remove the missing values:
import numpy as np
import pandas as pd
from nltk.tokenize import RegexpTokenizer

df = pd.DataFrame({'all_cols':['who is your hero and why',
                               'what do you do to relax',
                               "can't stop to eat", np.nan]})
print (df)
all_cols
0 who is your hero and why
1 what do you do to relax
2 can't stop to eat
3 NaN
#this only drops values from the extracted Series, not rows from df
df['all_cols'].dropna(inplace=True)
print (df)
all_cols
0 who is your hero and why
1 what do you do to relax
2 can't stop to eat
3 NaN
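#NaN is still there: df['all_cols'] returns a temporary Series, so the
#inplace dropna above modified that temporary object, not the DataFrame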
#correct solution: remove rows that contain missing values
df.dropna(subset=['all_cols'], inplace=True)
print (df)
all_cols
0 who is your hero and why
1 what do you do to relax
2 can't stop to eat
tokenizer = RegexpTokenizer(r"[\w']+")
df['all_cols'] = df['all_cols'].apply(tokenizer.tokenize)
print (df)
all_cols
0 [who, is, your, hero, and, why]
1 [what, do, you, do, to, relax]
2 [can't, stop, to, eat]
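As an aside, for a pattern this simple you do not strictly need nltk: pandas' own Series.str.findall returns the same token lists (a sketch, not part of the original answers):
import numpy as np
import pandas as pd

df = pd.DataFrame({'all_cols':['who is your hero and why',
                               "can't stop to eat", np.nan]})
#drop rows with missing values, then collect every [\w']+ run per row
df = df.dropna(subset=['all_cols'])
df['all_cols'] = df['all_cols'].str.findall(r"[\w']+")
print (df)
                          all_cols
0  [who, is, your, hero, and, why]
1           [can't, stop, to, eat]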