Custom tokenizer for scikit-learn vectorizers

Date: 2018-02-23 00:18:53

Tags: python scikit-learn

Given the following list of documents:

docs = [
'feature one`feature two`feature three',
'feature one`feature two`feature four',
'feature one'
]

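I would like to use any one of the scikit vectorizer classes (CountVectorizer, TfidfVectorizer) so that 'feature one', 'feature two', 'feature three', and 'feature four' are the four features represented in the matrix.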

I tried:

vec = CountVectorizer(token_pattern='(?u)\w+\s.\w.`')

But this does not return the four features listed above.
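
For reference, CountVectorizer builds its tokens by applying token_pattern with findall to each lowercased document, so the attempted pattern can be checked with the standard re module alone. The sketch below is only an illustration (the variable names are mine, not from the question). It suggests the pattern matches only a feature that is followed by a backtick, so the last feature of each document is dropped.

```python
import re

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]

# Same pattern as in the attempt, written as a raw string.
attempted = re.compile(r'(?u)\w+\s.\w.`')

# Mimic the default lowercasing, then tokenize with findall.
tokens_per_doc = [attempted.findall(doc.lower()) for doc in docs]

for doc, tokens in zip(docs, tokens_per_doc):
    print(repr(doc), '->', tokens)
```

The trailing backtick in the pattern is required for a match, and the `.\w.` part only fits a three-letter second word, which is why 'feature three', 'feature four', and the lone 'feature one' in the last document are not picked up.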

2 answers:

Answer 0 (score: 2)

If your features are fixed as

'feature one', 'feature two', 'feature three', and 'feature four'

then you can also use the vocabulary param.

vocab = ['feature one', 'feature two', 'feature three', 'feature four']
vec = CountVectorizer(vocabulary=vocab)

X = vec.fit_transform(docs)
vec.get_feature_names()
Out[310]:
['feature one',
 'feature two',
 'feature three',
 'feature four']
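
One point worth checking when the fixed vocabulary contains two-word entries: the feature names come straight from the vocabulary list, but the counts depend on what the analyzer emits. With the default ngram_range=(1, 1) the analyzer produces single words, which never equal a two-word entry. The sketch below, under that assumption, compares the default with ngram_range=(2, 2); the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]
vocab = ['feature one', 'feature two', 'feature three', 'feature four']

# Default unigrams: tokens such as 'feature' or 'one' are not in vocab,
# so every count stays at zero.
unigram_counts = CountVectorizer(vocabulary=vocab).fit_transform(docs).toarray()

# Word bigrams: 'feature one', 'one feature', ... are generated, and only
# those present in vocab are counted. Columns follow the order of vocab.
bigram_counts = CountVectorizer(
    vocabulary=vocab, ngram_range=(2, 2)
).fit_transform(docs).toarray()

print(unigram_counts)
print(bigram_counts)
```

Because the vocabulary is fixed, the cross-boundary bigrams such as 'one feature' are simply ignored rather than added as extra columns.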

Answer 1 (score: 1)

In [295]: vec = CountVectorizer(token_pattern=r'(?u)\w+[\s\`]\w+')

In [296]: X = vec.fit_transform(docs)

In [297]: vec.get_feature_names()
Out[297]: ['feature four', 'feature one', 'feature three', 'feature two']
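
This pattern pairs up any two words separated by whitespace or a backtick, so it works here because every feature is exactly two words. A quick check with the standard re module (the second input string is a made-up example, not from the question) shows how it behaves when that assumption does not hold:

```python
import re

pattern = re.compile(r'(?u)\w+[\s\`]\w+')

# The documents from the question: every feature has exactly two words.
two_word = pattern.findall('feature one`feature two`feature three')

# Hypothetical input mixing a one-word and a three-word feature.
mixed = pattern.findall('single`feature five six')

print(two_word)
print(mixed)
```

In the second case the pattern joins words across the backtick and splits the three-word feature, so it is best treated as a fit for this specific data rather than a general delimiter-based tokenizer.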

You may also want to consider using ngram_range=(2,2), which produces the following:

In [308]: vec = CountVectorizer(ngram_range=(2,2))

In [309]: X = vec.fit_transform(docs)

In [310]: vec.get_feature_names()
Out[310]:
['feature four',
 'feature one',
 'feature three',
 'feature two',
 'one feature',
 'two feature']
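
Note that the bigram output includes 'one feature' and 'two feature', which span the backtick delimiter. If the features are always separated by a backtick, another option in the spirit of the question title is to pass a callable through the tokenizer parameter of CountVectorizer and split on the delimiter directly. This is a minimal sketch, assuming that delimiter convention; split_on_backtick is a hypothetical helper name, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]

def split_on_backtick(doc):
    """Split a document on backticks and drop empty pieces."""
    return [piece.strip() for piece in doc.split('`') if piece.strip()]

# A custom tokenizer replaces the regex-based token_pattern step.
# Some scikit-learn versions warn that token_pattern is then unused.
vec = CountVectorizer(tokenizer=split_on_backtick)
X = vec.fit_transform(docs)

# vocabulary_ maps each feature to its column index; sort by index
# to list the features in column order.
features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(features)
print(X.toarray())
```

The sketch reads the fitted vocabulary_ attribute rather than calling get_feature_names(), because that method was replaced by get_feature_names_out() in newer scikit-learn releases. The same tokenizer argument is accepted by TfidfVectorizer.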