Question

Here is a minimal example of what I am talking about:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

data = fetch_20newsgroups()
x = data.data

vec = TfidfVectorizer(min_df=0.01, max_df=0.5)
mat = vec.fit_transform(x).astype('bool')

vec.set_params(binary=True)
print(np.array_equal(mat, vec.fit_transform(x)))

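This prints False. What is the fundamental difference between setting binary=True and setting all nonzero values to True?

Edit: @juanpa.arrivillaga answered that TfidfVectorizer(binary=True) still performs the inverse document frequency computation. However, I also noticed that CountVectorizer(binary=True) does not produce the same output as .astype('bool') either. Here is an example: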

In [1]: import numpy as np
   ...: from sklearn.datasets import fetch_20newsgroups
   ...: from sklearn.feature_extraction.text import CountVectorizer
   ...:
   ...: data = fetch_20newsgroups()
   ...: x = data.data
   ...:
   ...: vec = CountVectorizer(min_df=0.01, max_df=0.5)
   ...: a = vec.fit_transform(x).astype('bool')
   ...:
   ...: vec.set_params(binary=True)
   ...: b = vec.fit_transform(x).astype('bool')
   ...: print(np.array_equal(a, b))
   ...:
False

In [2]: a
Out[2]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
        with 950068 stored elements in Compressed Sparse Row format>

In [3]: b
Out[3]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
        with 950068 stored elements in Compressed Sparse Row format>

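The dimensions and dtype are the same, which leads me to believe that the contents of these matrices differ. Yet just from eyeballing the output of print(a) and print(b), they look identical.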

Answer 1

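You are fundamentally confusing two things.

One is casting to the boolean numpy dtype, which is equivalent to the Python data type that takes the two values True and False, except that it is represented as a single byte in the underlying primitive array.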
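As a small illustration of that cast, the sketch below converts a float array to the boolean dtype. The array values are made up for the example and are not taken from the question's data.

```python
import numpy as np

# Hypothetical weights, chosen only for illustration.
weights = np.array([0.0, 0.3, 2.0, 0.0])

# Casting to bool maps every nonzero entry to True and every zero to False.
as_bool = weights.astype('bool')

print(as_bool)           # the boolean view of the data
print(as_bool.dtype)     # numpy's boolean dtype
print(as_bool.itemsize)  # bytes used per element
```

The cast only looks at whether each stored value is zero; it knows nothing about how the values were computed.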

The other is passing True to the binary parameter, which changes how the data is modeled. In short, if you use TfidfVectorizer, the term counts will be binary, i.e. seen or not seen. The usual tf-idf transformation is then applied. From the docs:

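If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.)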
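To make the parenthetical in that quote concrete, here is a minimal sketch that compares binary=True on its own with binary=True plus idf and normalization switched off. It reuses the two sentences from this answer's example, and it assumes use_idf=False and norm=None are the switches the docs are referring to.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'The quick brown fox jumped over the lazy dog',
    'how much wood could a woodchuck chuck if a woodchuck could chuck wood',
]

# binary=True alone: tf is clipped to 1, but idf weighting and L2
# normalization still run, so the output holds non-integer floats.
weighted = TfidfVectorizer(binary=True).fit_transform(docs)

# Assumption: use_idf=False and norm=None are the "idf and normalization"
# settings the docs mention. Only the clipped tf is left.
zero_one = TfidfVectorizer(binary=True, use_idf=False, norm=None).fit_transform(docs)

print(weighted.dtype, np.unique(weighted.toarray()))
print(zero_one.dtype, np.unique(zero_one.toarray()))
```

Even in the second case the matrix is stored as floats; it just happens to contain only 0.0 and 1.0.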

So you don't even get a boolean output.

So consider:

In [10]: import numpy as np
    ...: from sklearn.feature_extraction.text import TfidfVectorizer
    ...:

In [11]: data = [
    ...:     'The quick brown fox jumped over the lazy dog',
    ...:     'how much wood could a woodchuck chuck if a woodchuck could chuck wood'
    ...: ]

In [12]: TfidfVectorizer().fit_transform(data).todense()
Out[12]:
matrix([[ 0.30151134,  0.        ,  0.        ,  0.30151134,  0.30151134,
          0.        ,  0.        ,  0.30151134,  0.30151134,  0.        ,
          0.30151134,  0.30151134,  0.60302269,  0.        ,  0.        ],
        [ 0.        ,  0.45883147,  0.45883147,  0.        ,  0.        ,
          0.22941573,  0.22941573,  0.        ,  0.        ,  0.22941573,
          0.        ,  0.        ,  0.        ,  0.45883147,  0.45883147]])

In [13]: TfidfVectorizer().fit_transform(data).todense().astype('bool')
Out[13]:
matrix([[ True, False, False,  True,  True, False, False,  True,  True,
         False,  True,  True,  True, False, False],
        [False,  True,  True, False, False,  True,  True, False, False,
          True, False, False, False,  True,  True]], dtype=bool)

Now note that using binary=True still returns a floating-point type:

TfidfVectorizer(binary=True).fit_transform(data).todense()

It just changes the result.
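To see how the result changes, the following pure-Python sketch recomputes the weights for the two example sentences with and without clipping the term counts. It is not scikit-learn's implementation: the tokenizer, the smoothed idf formula ln((1 + n) / (1 + df)) + 1, and the L2 normalization are assumptions meant to mirror TfidfVectorizer's defaults, and the helper names tokenize and tfidf are hypothetical.

```python
import math
import re
from collections import Counter

docs = [
    'The quick brown fox jumped over the lazy dog',
    'how much wood could a woodchuck chuck if a woodchuck could chuck wood',
]

def tokenize(text):
    # Assumption: lowercase and keep tokens of two or more word characters,
    # which drops the single-letter word "a".
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf(docs, binary=False):
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    # Document frequency and smoothed idf for each term.
    df = {t: sum(t in c for c in counts) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        # binary=True clips each count to 1 *before* idf and normalization.
        row = [(min(c[t], 1) if binary else c[t]) * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, plain = tfidf(docs)
_, clipped = tfidf(docs, binary=True)

the = vocab.index('the')
print(round(plain[0][the], 8))    # 0.60302269, "the" occurs twice in the first sentence
print(round(clipped[0][the], 8))  # 0.35355339, the count is clipped to 1 before weighting
```

Under these assumptions the unclipped rows reproduce the values in Out[12], while the clipped rows are still floats: every term present in a sentence ends up with the same weight, which is why casting to bool afterwards is a different operation from passing binary=True.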

Why is vectorizer.fit_transform(x).astype('bool') different from vectorizer.set_params(binary=True).fit_transform(x)?
