CountVectorizer converts words to lowercase

Asked: 2018-03-20 09:52:02

Tags: python scikit-learn countvectorizer

In my classification model I need to keep the uppercase letters, but when I use sklearn's CountVectorizer to build the vocabulary, the uppercase letters are converted to lowercase!

To rule out the implicit tokenization, I built a tokenizer that just passes the text through without doing anything else.

My code:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

co = dict()

def tokenizeManu(txt):
    # Pass-through tokenizer: split on whitespace only, no other processing.
    return txt.split()

def corpDict(x):
    print('1: ', x)
    count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu)
    countFit = count.fit_transform(x)
    vocab = count.get_feature_names()
    dist = np.sum(countFit.toarray(), axis=0)
    for tag, count in zip(vocab, dist):
        co[str(tag)] = count

x = ['I\'m John Dev', 'We are the only']

corpDict(x)
print(co)

Output:

1:  ["I'm John Dev", 'We are the only'] #<- before building the vocab.
{'john': 1, 'the': 1, 'we': 1, 'only': 1, 'dev': 1, "i'm": 1, 'are': 1} #<- after

2 Answers:

Answer 0 (score: 3):

As mentioned in the documentation here, the lowercase parameter is True by default. To disable this behavior, you need to set it as follows:

count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu, lowercase=False)

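As a quick check (a minimal sketch added here, not part of the original answer), refitting the same toy corpus with lowercase=False keeps the capitalization in the vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

def tokenizeManu(txt):
    # Same pass-through tokenizer as in the question: split on whitespace only.
    return txt.split()

x = ["I'm John Dev", 'We are the only']

# lowercase=False disables the default lowercasing preprocessing step.
count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu, lowercase=False)
count.fit_transform(x)
print(count.get_feature_names())
# Case is preserved, e.g.: ["Dev", "I'm", "John", "We", 'are', 'only', 'the']
# Note: newer scikit-learn releases use get_feature_names_out() instead of get_feature_names().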

Answer 1 (score: 1):

You can set the lowercase parameter to False:

count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu, lowercase=False)

Here are the parameters of CountVectorizer:
CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
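The defaults listed above come from an older scikit-learn release (parameters such as charset and charset_error no longer exist). A minimal sketch, assuming only the standard get_params() estimator method, for inspecting the defaults of whichever version is installed:

from sklearn.feature_extraction.text import CountVectorizer

# Print the current default parameters; 'lowercase' is True by default,
# which is why the vocabulary in the question came out lowercased.
for name, value in sorted(CountVectorizer().get_params().items()):
    print(name, '=', repr(value))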