Question

I am trying to understand the following code:

from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer() 

corpus = ['This is the first document.','This is the second second document.','And the third one.','Is this the first document?'] 

X = vectorizer.fit_transform(corpus)

When I print X to see what is returned, I get this result:

(0, 1)  1

(0, 2)  1

(0, 6)  1

(0, 3)  1

(0, 8)  1

(1, 5)  2

(1, 1)  1

(1, 6)  1

(1, 3)  1

(1, 8)  1

(2, 4)  1

(2, 7)  1

(2, 0)  1

(2, 6)  1

(3, 1)  1

(3, 2)  1

(3, 6)  1

(3, 3)  1

(3, 8)  1

However, I don't understand what this result means.

Answer 1

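You can read it as "(sentence index, feature index) count".

There are 4 sentences, so the sentence index starts at 0 and ends at 3.

The feature index is the word index, which you can get from vectorizer.vocabulary_

-> vocabulary_ is a dictionary {word: feature_index, ...}

For example, (0, 1) 1

-> 0 : row [the sentence index]

-> 1 : feature index; the word is the key in vectorizer.vocabulary_ whose value is 1 (here, "document")

-> 1 : count/tfidf (as you have used a count vectorizer, it will give you the count)

If you use a TF-IDF vectorizer (TfidfVectorizer) instead of a count vectorizer, it will give you TF-IDF values instead. I hope that makes it clear.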
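
The lookup described above can be sketched as follows, assuming scikit-learn is installed. Since vocabulary_ maps word -> feature index, the sketch inverts it to go from an index back to a word; the name index_to_word is only for illustration and is not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> feature index, so invert it for index -> word lookups.
# (index_to_word is a hypothetical helper name used only in this sketch.)
index_to_word = {int(index): word for word, index in vectorizer.vocabulary_.items()}

# Walk over the stored (row, column, value) entries of the sparse matrix.
coo = X.tocoo()
for row, col, count in zip(coo.row, coo.col, coo.data):
    word = index_to_word[int(col)]
    print(f"sentence {row}: {word!r} (feature {col}) appears {count} time(s)")
```

Each printed line corresponds to one "(sentence index, feature index) count" entry from the question, with the feature index resolved to its word.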

Answer 2

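As @Himanshu wrote, this is "(sentence index, feature index) count".

Here, the count part is "the number of times the word appears in the document".

For example,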

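(0, 1)  1

(0, 2)  1

(0, 6)  1

(0, 3)  1

(0, 8)  1

(1, 5)  2   Only here in this example, the count "2" shows that the word "second" appears twice in this document/sentence

(1, 1)  1

(1, 6)  1

(1, 3)  1

(1, 8)  1

(2, 4)  1

(2, 7)  1

(2, 0)  1

(2, 6)  1

(3, 1)  1

(3, 2)  1

(3, 6)  1

(3, 3)  1

(3, 8)  1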

Let's change the corpus in the code. Basically, I added the word "second" two more times to the second sentence in the corpus list.

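    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()

    corpus = ['This is the first document.','This is the second second second second document.','And the third one.','Is this the first document?']

    X = vectorizer.fit_transform(corpus)

    print(X)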

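(0, 1)  1

(0, 2)  1

(0, 6)  1

(0, 3)  1

(0, 8)  1

(1, 5)  4   For the modified corpus, the count "4" shows that the word "second" appears four times in this document/sentence

(1, 1)  1

(1, 6)  1

(1, 3)  1

(1, 8)  1

(2, 4)  1

(2, 7)  1

(2, 0)  1

(2, 6)  1

(3, 1)  1

(3, 2)  1

(3, 6)  1

(3, 3)  1

(3, 8)  1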
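
To check the count by word rather than by reading the printout, here is a small sketch, assuming scikit-learn is installed. The modified sentence is my reconstruction from the description (four occurrences of "second"), and count_of is a hypothetical helper, not a scikit-learn function.

```python
from sklearn.feature_extraction.text import CountVectorizer

original = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Assumed modified corpus: "second" now occurs four times in the second sentence.
modified = list(original)
modified[1] = 'This is the second second second second document.'


def count_of(word, corpus, doc_index):
    """Return how often `word` occurs in corpus[doc_index] according to CountVectorizer."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    # vocabulary_[word] is the column (feature index) for that word.
    return int(X[doc_index, vectorizer.vocabulary_[word]])


print(count_of('second', original, 1))  # prints 2
print(count_of('second', modified, 1))  # prints 4
```

The feature index of "second" stays 5 in both cases because the set of distinct words did not change; only the count stored at (1, 5) does.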

Answer 3

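It converts text into numbers. So, with the help of other functions, you will be able to count how many times each word occurs in a given dataset. I am new to programming, so maybe there are other areas where it can be used.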
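
To make "converts text into numbers" concrete, here is a simplified pure-Python stand-in for what fit_transform does conceptually. It is not scikit-learn's implementation; it only mirrors the default behaviour of CountVectorizer (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary), and the names tokenize, vocabulary and rows are chosen for illustration.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # which mirrors CountVectorizer's default settings.
    return re.findall(r"\b\w\w+\b", text.lower())


# "fit": collect every distinct token and number the tokens in sorted order.
all_tokens = {token for doc in corpus for token in tokenize(doc)}
vocabulary = {word: index for index, word in enumerate(sorted(all_tokens))}

# "transform": one row of counts per document, one column per vocabulary word.
rows = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    # Iterating the dict yields words in insertion order, i.e. by feature index.
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

The sparse output in the question is the same table with the zeros left out: each "(row, column) value" line is one non-zero cell.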

How does the vectorizer's fit_transform work in sklearn?
