Given the following list of documents:
docs = [
'feature one`feature two`feature three',
'feature one`feature two`feature four',
'feature one'
]
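I want to use one of scikit-learn's vectorizer classes (CountVectorizer or TfidfVectorizer) so that 'feature one', 'feature two', 'feature three', and 'feature four' are the four features represented in the matrix.
I tried:
vec = CountVectorizer(token_pattern='(?u)\w+\s.\w.`')
But this does not return the four features listed above.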
Answer 0 (score: 2)
If your features are fixed to 'feature one', 'feature two', 'feature three', and 'feature four', you can also use the vocabulary parameter.
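from sklearn.feature_extraction.text import CountVectorizer

vocab = ['feature one', 'feature two', 'feature three', 'feature four']
# ngram_range=(2, 2) is needed so the two-word vocabulary entries can match
vec = CountVectorizer(vocabulary=vocab, ngram_range=(2, 2))
X = vec.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
vec.get_feature_names()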
Out[310]:
['feature one',
'feature two',
'feature three',
'feature four']
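The feature names alone do not show whether anything was counted, so it may help to look at the matrix itself. The following is a minimal sketch, assuming ngram_range=(2, 2) is passed alongside the fixed vocabulary. With the default single-word analyzer, two-word vocabulary entries are never produced as tokens and every count stays at zero.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]
vocab = ['feature one', 'feature two', 'feature three', 'feature four']

# Assumption: bigrams only, so that "feature one" etc. are emitted as tokens.
# Bigrams outside the vocabulary (e.g. "one feature") are silently ignored.
vec = CountVectorizer(vocabulary=vocab, ngram_range=(2, 2))
X = vec.fit_transform(docs)

# Columns follow the order of the vocab list; rows follow docs.
counts = X.toarray().tolist()
print(counts)
```

Each row should then mark which of the four features occur in the corresponding document.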
Answer 1 (score: 1)
In [295]: vec = CountVectorizer(token_pattern=r'(?u)\w+[\s\`]\w+')
In [296]: X = vec.fit_transform(docs)
In [297]: vec.get_feature_names()
Out[297]: ['feature four', 'feature one', 'feature three', 'feature two']
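To see why this pattern works while the one in the question does not, both can be run directly with re.findall, which is how CountVectorizer applies token_pattern to each lowercased document. This is a small sketch using only the standard library; the variable names are illustrative.

```python
import re

doc = 'feature one`feature two`feature three'.lower()

# Pattern from the question: it requires a trailing backtick and allows only
# three characters after the space, so "feature three" cannot match.
attempted = re.findall(r'(?u)\w+\s.\w.`', doc)

# Pattern from this answer: word, one space or backtick, then another word.
tokens = re.findall(r'(?u)\w+[\s\`]\w+', doc)

print(attempted)  # ['feature one`', 'feature two`']
print(tokens)     # ['feature one', 'feature two', 'feature three']
```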
You may also want to consider using ngram_range=(2,2), which produces the following:
In [308]: vec = CountVectorizer(ngram_range=(2,2))
In [309]: X = vec.fit_transform(docs)
In [310]: vec.get_feature_names()
Out[310]:
['feature four',
'feature one',
'feature three',
'feature two',
'one feature',
'two feature']
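Since ngram_range=(2,2) also picks up bigrams that span the backtick ('one feature', 'two feature'), another option is to split on the delimiter directly. The sketch below is not taken from either answer. It assumes a callable analyzer, and the helper name split_on_backtick is hypothetical. The same callable can be passed to TfidfVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]

def split_on_backtick(doc):
    # Hypothetical helper: each backtick-separated chunk becomes one feature.
    return [part.strip() for part in doc.split('`') if part.strip()]

# A callable analyzer replaces tokenization entirely (no lowercasing, no regex).
count_vec = CountVectorizer(analyzer=split_on_backtick)
X = count_vec.fit_transform(docs)

# vocabulary_ maps each feature to its column index; columns are sorted by name.
features = sorted(count_vec.vocabulary_)
counts = X.toarray().tolist()
print(features)
print(counts)

# The same analyzer works for TF-IDF weighting.
X_tfidf = TfidfVectorizer(analyzer=split_on_backtick).fit_transform(docs)
print(X_tfidf.shape)
```

This keeps exactly the four delimited features and avoids the cross-delimiter bigrams, at the cost of bypassing the vectorizer's built-in preprocessing.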