I added lemmatization to my CountVectorizer, as explained on the sklearn page.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
                                strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                                max_df = 0.5,
                                min_df = 10)
However, when creating the DTM with fit_transform, I get the following error (which I can't make sense of). Before I added lemmatization to my vectorizer, the DTM code always worked. I dug deeper into the manual and tried a few things with the code, but couldn't find any solution.
dtm_tf = tf_vectorizer.fit_transform(articles)
UPDATE
After following @MaxU's advice below, the code ran without errors, however numbers and punctuation were not omitted from my output. I ran individual tests to see which of the other functions do and do not work after LemmaTokenizer(). The results are as follows:
strip_accents = 'unicode', # works
stop_words = 'english', # works
lowercase = True, # works
token_pattern = r'\b[a-zA-Z]{3,}\b', # does not work
max_df = 0.5, # works
min_df = 10 # works
Apparently, it is only token_pattern that has become inactive. Here is the updated, working code without token_pattern (I just needed to install the 'punkt' and 'wordnet' packages first):
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                                strip_accents = 'unicode', # works
                                stop_words = 'english', # works
                                lowercase = True, # works
                                max_df = 0.5, # works
                                min_df = 10) # works
For anyone who wants to remove digits, punctuation, and words of fewer than 3 characters (but has no idea how), here is one way that does it for me when working from a Pandas dataframe:
# when working from a Pandas dataframe
df['TEXT'] = df['TEXT'].str.replace(r'\d+', '', regex=True)         # remove digits
df['TEXT'] = df['TEXT'].str.replace(r'\b\w{1,2}\b', '', regex=True) # remove words of 1-2 characters
df['TEXT'] = df['TEXT'].str.replace(r'[^\w\s]', '', regex=True)     # remove punctuation
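Since CountVectorizer ignores token_pattern once a custom tokenizer is supplied, another option (a sketch of my own, not from the original post) is to do the same filtering inside the tokenizer itself, keeping only alphabetic tokens of three or more characters:

import re
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

class FilteringLemmaTokenizer(object):
    # Hypothetical variant of LemmaTokenizer: lemmatizes each token and also
    # drops digits, punctuation and tokens shorter than 3 characters.
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)
                if re.fullmatch(r'[a-zA-Z]{3,}', t)]

An instance of this class would then be passed as tokenizer=FilteringLemmaTokenizer() in place of LemmaTokenizer().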
Answer 0 (score: 6)
It should be:
tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
# NOTE: ----------------------> ^^
instead of:
tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
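The tokenizer argument expects a callable that turns one document string into a list of tokens. Passing the class itself makes CountVectorizer call LemmaTokenizer(doc) on each document, i.e. try to construct a new instance with the document as an argument, which fails because __init__ takes no arguments; passing an instance means its __call__ method does the tokenization. A quick sanity check (my own sketch, not part of the answer):

tokenizer = LemmaTokenizer()
# The instance is the callable that CountVectorizer applies to each document.
print(tokenizer("The cats were running"))  # ['The', 'cat', 'were', 'running']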