I am trying to create a term-document matrix with a custom analyzer in order to extract features from the documents. Here is the code:
import re
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
analyzer = vectorizer.build_analyzer()

def customAnalyzer(text):
    grams = analyzer(text)
    # drop n-grams that consist only of digits and whitespace
    tgrams = [gram for gram in grams if not re.match(r"^[0-9\s]+$", gram)]
    return tgrams
This function is passed to CountVectorizer as the custom analyzer it uses to extract features.
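As a quick illustration (using the defaults above, which lowercase the text and keep tokens of two or more characters; the sample string is made up), the analyzer should return unigrams and bigrams with purely numeric grams filtered out:

customAnalyzer("Sample 42 text")
# expected: ['sample', 'text', 'sample 42', '42 text']  (the purely numeric gram '42' is dropped)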
for i in xrange(0, num_rows):
    clean_query.append(review_to_words(inp["keyword"][i], units))

vectorizer = CountVectorizer(analyzer=customAnalyzer,
                             tokenizer=None,
                             ngram_range=(1, 2),
                             preprocessor=None,
                             stop_words=None,
                             max_features=n)
features = vectorizer.fit_transform(clean_query)
z = vectorizer.get_feature_names()
This call raises the following error:
(<type 'exceptions.NotImplementedError'>, 'python.py', 128,NotImplementedError('adding a nonzero scalar to a sparse matrix is not supported',))
The error occurs when we call fit_transform on the vectorizer, yet the value of the variable clean_query is not a scalar. I am using sklearn 0.17.1.
np.isscalar(clean_query)
False
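For completeness, checking the individual elements (not just the container) would confirm that each entry is a raw string, which is what fit_transform expects; this is a hypothetical check written for Python 2 to match the xrange usage above:

type(clean_query)
# list
all(isinstance(doc, basestring) for doc in clean_query)
# should be True if every document is a plain or unicode string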
Answer 0 (score: -1)
Here is a small test I put together to try to reproduce the error, but it does not give me the same error. (The example is taken from: scikit-learn Feature extraction.)
scikit-learn version : 0.19.dev0
In [1]: corpus = [
   ...:     'This is the first document.',
   ...:     'This is the second second document.',
   ...:     'And the third one.',
   ...:     'Is this the first document?',
   ...: ]
In [2]: from sklearn.feature_extraction.text import TfidfVectorizer
In [3]: vectorizer = TfidfVectorizer(min_df=1)
In [4]: vectorizer.fit_transform(corpus)
Out[4]:
<4x9 sparse matrix of type '<type 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
In [5]: import numpy as np
In [6]: np.isscalar(corpus)
Out[6]: False
In [7]: type(corpus)
Out[7]: list
As the code above shows, corpus is not a scalar and its type is list.
I think your solution lies in how you build the clean_query variable, so that it is what the vectorizer.fit_transform function expects.
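For reference, a minimal end-to-end sketch of the custom-analyzer setup applied to a plain list of strings (the two sample documents are made up; review_to_words is assumed to produce strings like these):

import re
from sklearn.feature_extraction.text import CountVectorizer

base = CountVectorizer(ngram_range=(1, 2))
analyzer = base.build_analyzer()

def customAnalyzer(text):
    # reuse the default analyzer, then drop grams made only of digits/whitespace
    return [g for g in analyzer(text) if not re.match(r"^[0-9\s]+$", g)]

# clean_query must be an iterable of raw text documents (strings), e.g.:
clean_query = ['first test document 42', 'second test document']

vectorizer = CountVectorizer(analyzer=customAnalyzer, max_features=10)
features = vectorizer.fit_transform(clean_query)   # sparse term-document matrix
print(vectorizer.get_feature_names())

If clean_query really is such a list of strings, this pattern should not raise the sparse-matrix error, so it is worth inspecting what review_to_words actually returns for your data.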