I have a set of web pages and I am trying to build a count matrix for them. I tried sklearn's standard CountVectorizer, but did not get the desired result. Sample code:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['www.google.com www.google.com', 'www.google.com www.facebook.com', 'www.google.com', 'www.facebook.com']
vocab = {'www.google.com':0, 'www.facebook.com':1}
vectorizer = CountVectorizer(vocabulary=vocab)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())
It gives
['www.google.com', 'www.facebook.com']
[[0 0]
 [0 0]
 [0 0]
 [0 0]]
But the desired result is
['www.google.com', 'www.facebook.com']
[[2 0]
 [1 1]
 [1 0]
 [0 1]]
How can I apply CountVectorizer to a custom vocabulary like this?
Answer 0 (score: 0)
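Based on input from a related question, the issue occurred because of the tokenizer. CountVectorizer's default token pattern only keeps runs of word characters, so 'www.google.com' is broken into 'www', 'google' and 'com', none of which is in the vocabulary. With a custom tokenizer that splits on whitespace only, it now works.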
from sklearn.feature_extraction.text import CountVectorizer
def mytokenizer(text):
    # Split on whitespace only, so each URL stays a single token
    return text.split()
corpus = ['www.google.com www.google.com', 'www.google.com www.facebook.com', 'www.google.com', 'www.facebook.com']
vocab = {'www.google.com':0, 'www.facebook.com':1}
vectorizer = CountVectorizer(vocabulary=vocab, tokenizer=mytokenizer)
X = vectorizer.fit_transform(corpus)
# On scikit-learn 1.0 and later use get_feature_names_out(); get_feature_names() was removed in 1.2
print(vectorizer.get_feature_names())
print(X.toarray())
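To see why the original attempt produced all zeros, it can help to inspect what the default analyzer does to a single URL. The sketch below is not part of the original answer: it uses build_analyzer() to show the default tokens, and then tries the token_pattern parameter with the regex r'\S+' as one possible alternative to writing a custom tokenizer. The names default_tokens and counts are just illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['www.google.com www.google.com', 'www.google.com www.facebook.com',
          'www.google.com', 'www.facebook.com']
vocab = {'www.google.com': 0, 'www.facebook.com': 1}

# Default analyzer: the built-in token pattern keeps only runs of two or
# more word characters, so the URL is broken apart at the dots.
default_tokens = CountVectorizer().build_analyzer()('www.google.com')
print(default_tokens)  # ['www', 'google', 'com']

# Alternative (an assumption, not the answer's method): treat every
# whitespace-delimited chunk as one token via token_pattern.
vectorizer = CountVectorizer(vocabulary=vocab, token_pattern=r'\S+')
counts = vectorizer.fit_transform(corpus).toarray()
print(counts)
```

With this pattern the count matrix should match the desired result from the question. Note that CountVectorizer still lowercases the input by default, so a vocabulary containing uppercase characters would also need lowercase=False.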