Here is a minimal example of what I'm talking about:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
data = fetch_20newsgroups()
x = data.data
vec = TfidfVectorizer(min_df=0.01, max_df=0.5)
mat = vec.fit_transform(x).astype('bool')
vec.set_params(binary=True)
print(np.array_equal(mat, vec.fit_transform(x)))
This prints False. What is the fundamental difference between setting binary=True and setting all non-zero values to True?
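Edit: As @juanpa.arrivillaga answered, TfidfVectorizer(binary=True) still performs the inverse document frequency calculation. However, I also noticed that CountVectorizer(binary=True) does not produce the same output as .astype('bool') either. Here is an example: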
In [1]: import numpy as np
   ...: from sklearn.datasets import fetch_20newsgroups
   ...: from sklearn.feature_extraction.text import CountVectorizer
   ...:
   ...: data = fetch_20newsgroups()
   ...: x = data.data
   ...:
   ...: vec = CountVectorizer(min_df=0.01, max_df=0.5)
   ...: a = vec.fit_transform(x).astype('bool')
   ...:
   ...: vec.set_params(binary=True)
   ...: b = vec.fit_transform(x).astype('bool')
   ...: print(np.array_equal(a, b))
   ...:
False

In [2]: a
Out[2]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
        with 950068 stored elements in Compressed Sparse Row format>

In [3]: b
Out[3]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
        with 950068 stored elements in Compressed Sparse Row format>
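The dimensions and dtype are the same, which leads me to believe that the contents of these matrices differ. Yet from just eyeballing the output of print(a) and print(b), they look the same.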
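One thing worth ruling out before concluding that the contents differ: np.array_equal is written for dense arrays, and as far as I know it does not perform a meaningful element-wise comparison when handed SciPy sparse matrices. The sketch below compares two sparse matrices in a sparse-aware way. The helper name sparse_equal and the toy matrices are made up for illustration and are not part of the original example.

```python
import numpy as np
from scipy import sparse


def sparse_equal(a, b):
    """Return True if two SciPy sparse matrices have the same shape and values."""
    if a.shape != b.shape:
        return False
    # (a != b) is itself sparse; it stores an entry only where the values differ.
    return (a != b).nnz == 0


# Toy stand-ins for the two vectorizer outputs (hypothetical data).
counts = sparse.csr_matrix(np.array([[2, 0, 1], [0, 3, 0]]))
binary_counts = sparse.csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))

a = counts.astype(bool)
b = binary_counts.astype(bool)

print(sparse_equal(a, b))
# Densifying first is also fine for small matrices.
print(np.array_equal(a.toarray(), b.toarray()))
```

If the same check on the real a and b reports equality, the False above would come from the comparison function rather than from CountVectorizer.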
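Answer 0 (score: 3)
You are fundamentally confusing two things.
One is conversion to the boolean numpy data type, which is equivalent to the Python data type that takes the two values True and False, except that it is represented as a single byte in the underlying primitive array.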
The other is passing the binary argument to TfidfVectorizer, which changes the way the data is modeled. In short, if you use binary=True, the term counts become binary, i.e. either seen or not seen. The usual tf-idf transformation is then applied on top of that. From the docs:
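If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.)
So you don't even get boolean output.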
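As a minimal sketch of the parenthetical in that docstring, the snippet below switches off both idf weighting and normalization via the use_idf and norm parameters. The two example sentences and the variable names are only illustrative. Even then the values are 0.0 and 1.0 stored as floats, not booleans.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'The quick brown fox jumped over the lazy dog',
    'how much wood could a woodchuck chuck if a woodchuck could chuck wood',
]

# binary=True clips term counts to 1; use_idf=False and norm=None
# skip the reweighting and scaling steps that would otherwise follow.
vec = TfidfVectorizer(binary=True, use_idf=False, norm=None)
zero_one = vec.fit_transform(docs)

print(zero_one.dtype)                  # a floating-point dtype, not bool
print(np.unique(zero_one.toarray()))   # only the values 0 and 1 appear
```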
So consider:
In [10]: import numpy as np
    ...: from sklearn.feature_extraction.text import TfidfVectorizer
    ...:
In [11]: data = [
    ...:     'The quick brown fox jumped over the lazy dog',
    ...:     'how much wood could a woodchuck chuck if a woodchuck could chuck wood'
    ...: ]
In [12]: TfidfVectorizer().fit_transform(data).todense()
Out[12]:
matrix([[ 0.30151134,  0.        ,  0.        ,  0.30151134,  0.30151134,
          0.        ,  0.        ,  0.30151134,  0.30151134,  0.        ,
          0.30151134,  0.30151134,  0.60302269,  0.        ,  0.        ],
        [ 0.        ,  0.45883147,  0.45883147,  0.        ,  0.        ,
          0.22941573,  0.22941573,  0.        ,  0.        ,  0.22941573,
          0.        ,  0.        ,  0.        ,  0.45883147,  0.45883147]])
In [13]: TfidfVectorizer().fit_transform(data).todense().astype('bool')
Out[13]:
matrix([[ True, False, False,  True,  True, False, False,  True,  True,
         False,  True,  True,  True, False, False],
        [False,  True,  True, False, False,  True,  True, False, False,
          True, False, False, False,  True,  True]], dtype=bool)
Now note that using binary=True still returns a floating-point type. It merely changes the resulting values.
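To make that last point concrete, here is a small sketch that reuses the two sentences from the session above and compares the default vectorizer with binary=True. The variable names are mine. The printed values are left for the reader to inspect rather than quoted here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    'The quick brown fox jumped over the lazy dog',
    'how much wood could a woodchuck chuck if a woodchuck could chuck wood',
]

plain = TfidfVectorizer().fit_transform(data).toarray()
binary = TfidfVectorizer(binary=True).fit_transform(data).toarray()

# The output is still floating point, not boolean.
print(binary.dtype)

# Both variants are non-zero in exactly the same positions...
same_pattern = np.array_equal(plain != 0, binary != 0)
# ...but the weights differ, because repeated terms no longer count extra.
same_values = np.allclose(plain, binary)
print(same_pattern, same_values)

print(binary.round(4))
```

In the first sentence "the" appears twice, so with the default settings it gets twice the weight of the other terms, whereas with binary=True every term present in that sentence ends up with the same weight.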