Why is vectorizer.fit_transform(x).astype('bool') different from vectorizer.set_params(binary=True).fit_transform(x)?

Time: 2018-11-04 20:57:34

Tags: python scikit-learn

Here is a minimal example of what I am talking about:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

data = fetch_20newsgroups()
x = data.data

vec = TfidfVectorizer(min_df=0.01, max_df=0.5)
mat = vec.fit_transform(x).astype('bool')

vec.set_params(binary=True)
print(np.array_equal(mat, vec.fit_transform(x)))

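This prints False. What is the fundamental difference between setting binary=True and setting all nonzero values to True?

Edit: As @juanpa.arrivillaga answered, TfidfVectorizer(binary=True) still performs the inverse document frequency calculation. However, I also noticed that CountVectorizer(binary=True) does not produce the same output as .astype('bool') either. Here is an example: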

In [1]: import numpy as np
   ...: from sklearn.datasets import fetch_20newsgroups
   ...: from sklearn.feature_extraction.text import CountVectorizer
   ...:
   ...: data = fetch_20newsgroups()
   ...: x = data.data
   ...:
   ...: vec = CountVectorizer(min_df=0.01, max_df=0.5)
   ...: a = vec.fit_transform(x).astype('bool')
   ...:
   ...: vec.set_params(binary=True)
   ...: b = vec.fit_transform(x).astype('bool')
   ...: print(np.array_equal(a, b))
   ...:
False

In [2]: a
Out[2]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
        with 950068 stored elements in Compressed Sparse Row format>

In [3]: b
Out[3]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
        with 950068 stored elements in Compressed Sparse Row format>

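The dimensions and dtype are the same, which leads me to believe that the contents of these matrices differ. Yet just eyeballing the output of print(a) and print(b), they look the same.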
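One way to test whether the contents really differ is to compare the sparse matrices element-wise instead of through np.array_equal, which is written for dense arrays and converts its inputs with np.asarray. The sketch below is only an illustration: `sparse_equal` is a hypothetical helper name, and the two-sentence corpus stands in for the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption; stands in for fetch_20newsgroups().data).
data = [
    'The quick brown fox jumped over the lazy dog',
    'how much wood could a woodchuck chuck if a woodchuck could chuck wood',
]


def sparse_equal(a, b):
    """Hypothetical helper: True if two sparse matrices hold the same values."""
    if a.shape != b.shape:
        return False
    # a != b is itself sparse; it stores an entry wherever the inputs differ.
    return (a != b).nnz == 0


counts = CountVectorizer().fit_transform(data)
binary_counts = CountVectorizer(binary=True).fit_transform(data)

# Raw counts differ from clipped counts ("the" occurs twice in the first sentence).
print(sparse_equal(counts, binary_counts))
# After casting both to bool, only the sparsity pattern is left to compare.
print(sparse_equal(counts.astype('bool'), binary_counts.astype('bool')))
```

If the second comparison reports equality on the real data as well, the False in the session above would come from how the comparison was made rather than from the matrices themselves.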

1 Answer:

Answer 0: (score: 3)

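You are fundamentally confusing two things.

One is converting to the boolean numpy data type, which is equivalent to the Python data type that takes two values, True and False, except that it is represented as a single byte in the underlying primitive array.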

Passing the binary argument to TfidfVectorizer changes the way the data is modeled. In short, if you use binary=True, the term counts will be binary, i.e. seen or not seen. The usual tf-idf transformation is then applied. From the docs:

If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.)

So you are not even getting a boolean output.

So consider:

In [10]: import numpy as np
    ...: from sklearn.feature_extraction.text import TfidfVectorizer
    ...:

In [11]: data = [
    ...:     'The quick brown fox jumped over the lazy dog',
    ...:     'how much wood could a woodchuck chuck if a woodchuck could chuck wood'
    ...: ]

In [12]: TfidfVectorizer().fit_transform(data).todense()
Out[12]:
matrix([[ 0.30151134,  0.        ,  0.        ,  0.30151134,  0.30151134,
          0.        ,  0.        ,  0.30151134,  0.30151134,  0.        ,
          0.30151134,  0.30151134,  0.60302269,  0.        ,  0.        ],
        [ 0.        ,  0.45883147,  0.45883147,  0.        ,  0.        ,
          0.22941573,  0.22941573,  0.        ,  0.        ,  0.22941573,
          0.        ,  0.        ,  0.        ,  0.45883147,  0.45883147]])

In [13]: TfidfVectorizer().fit_transform(data).todense().astype('bool')
Out[13]:
matrix([[ True, False, False,  True,  True, False, False,  True,  True,
         False,  True,  True,  True, False, False],
        [False,  True,  True, False, False,  True,  True, False, False,
          True, False, False, False,  True,  True]], dtype=bool)

Now note that using binary=True still returns a floating-point type; it just changes the result.
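A minimal sketch of that last point, reusing the two sentences from the session above. The variable names (`weighted`, `plain`, `as_bool`) are illustrative only. It also tries the route the docs mention for getting 0/1 output, which in TfidfVectorizer means use_idf=False and norm=None.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    'The quick brown fox jumped over the lazy dog',
    'how much wood could a woodchuck chuck if a woodchuck could chuck wood',
]

# binary=True clips the term counts to 0/1 first; idf weighting and
# L2 normalization are still applied afterwards, so the result is float.
weighted = TfidfVectorizer(binary=True).fit_transform(data)
print(weighted.dtype)
print(np.unique(weighted.data))

# Turning off idf and normalization leaves only the clipped counts.
plain = TfidfVectorizer(binary=True, use_idf=False, norm=None).fit_transform(data)
print(np.unique(plain.data))

# Those clipped counts share their nonzero pattern with the .astype('bool') result.
as_bool = TfidfVectorizer().fit_transform(data).astype('bool')
print((plain.astype('bool') != as_bool).nnz == 0)
```

The comparison on the last line uses the sparse `!=` operator and counts the differing entries, since both results are scipy sparse matrices rather than dense arrays.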