Question

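I am working with a TFIDF sparse matrix for document classification and want to keep only the top n (say 50) terms for each document, ranked by TFIDF score. See the EDIT below.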

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english', 
                              token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
n = 50

df = pd.read_pickle('my_df.pickle')
df_t = tfidfvectorizer.fit_transform(df['text'])

df_t
Out[15]: 
<21175x201380 sparse matrix of type '<class 'numpy.float64'>'
    with 6055621 stored elements in Compressed Sparse Row format>

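I tried following the example in this post, although my aim is not to display the features but only to select the top n for each document before training. However, I get a memory error because my data is too large to be converted into a dense matrix.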

df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
Traceback (most recent call last):

  File "<ipython-input-16-e0a74c393ca5>", line 1, in <module>
    df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]

  File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 943, in toarray
    out = self._process_toarray_args(order, out)

  File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 1130, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError

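Is there any way to do what I want without working with a dense representation (i.e. without the toarray() call) and without reducing the feature space much further than I already have (with min_df)?

Note: max_features is not the parameter I want, because it only considers "the top max_features ordered by term frequency across the corpus" (docs here), whereas what I want is a document-level ranking.

EDIT: I wonder whether the best way to address this is to set the values of all features except the n best to zero. I say this because the vocabulary has already been computed, so the feature indices must stay the same, as I may want to use them for other purposes (e.g. to visualize the actual words that correspond to the n-best features).

A colleague wrote some code to retrieve the indices of the n highest-ranked features: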

n = 2
tops = np.zeros((df_t.shape[0], n), dtype=int) # store the top indices in a new array
for ind in range(df_t.shape[0]):
    tops[ind,] = np.argsort(-df_t[ind].toarray())[0, 0:n] # for each row (i.e. document) sort the (inversed, as argsort is ascending) list and slice top n

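But from there, I would need to either:

get a list of the remaining (i.e. lowest-ranked) indices and modify their values in place, or
loop through the original matrix (df_t) and set all values to 0 except for the n best indices in tops.

There is a post here explaining how to work with csr_matrix, but I am not sure how to put it into practice to get what I want.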

Answer 1

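As you have already mentioned, the max_features parameter of TfidfVectorizer is one way of selecting features.

If you are looking for an alternative that takes the relationship to the target variable into account, you can use sklearn's SelectKBest. Setting k=50 filters your data down to the 50 best-scoring features. The metric used for selection can be specified through the score_func parameter.

Example: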

from sklearn.feature_selection import SelectKBest

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english', 
                          token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)

df_t = tfidfvectorizer.fit_transform(df['text'])
df_t_reduced = SelectKBest(k=50).fit_transform(df_t, df['target'])

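You can also chain it in a pipeline:

from sklearn.pipeline import Pipeline

# classifier: any scikit-learn estimator of your choice
pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("feature_reduction", SelectKBest(k=50)),
                     ("classifier", classifier)])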

Answer 2

You can split the matrix into chunks of rows so that only one chunk is dense at a time, and then concatenate the results:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train').data

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english', 
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(data)

n = 10

# Densify 500 rows at a time; negate the scores because argsort sorts in
# ascending order, so the first n columns are the n highest-scoring terms
df_top = [np.argsort(-df_t[i: i+500, :].toarray(), axis=1)[:, :n]
          for i in range(0, df_t.shape[0], 500)]

np.concatenate(df_top, axis=0).shape
>> (11314, 10)
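To go one step further and zero out everything except the n best terms per document, as the question's EDIT proposes, one option is to work directly on the CSR arrays and never build a dense row. The following is only a sketch under stated assumptions: keep_top_n_per_row is a hypothetical helper name, the small matrix is made-up demo data, and ties between equal scores are broken arbitrarily.

```python
import numpy as np
from scipy.sparse import csr_matrix


def keep_top_n_per_row(matrix, n):
    """Return a CSR copy of `matrix` that keeps only the n largest stored values per row.

    Hypothetical helper for illustration; assumes n >= 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    result = csr_matrix(matrix, copy=True)
    for row in range(result.shape[0]):
        # indptr delimits the slice of `data` that belongs to this row
        start, end = result.indptr[row], result.indptr[row + 1]
        if end - start <= n:
            continue  # the row already has at most n stored values
        row_values = result.data[start:end]  # a view, so writes modify `result`
        # positions (within the row) of all stored values except the n largest
        drop = np.argpartition(row_values, -n)[:-n]
        row_values[drop] = 0.0
    result.eliminate_zeros()  # remove the zeroed entries from the sparse structure
    return result


# Made-up demo data: 3 "documents", 4 "terms"
demo = csr_matrix(np.array([[0.1, 0.0, 0.5, 0.3],
                            [0.0, 0.2, 0.0, 0.0],
                            [0.4, 0.9, 0.1, 0.7]]))
top2 = keep_top_n_per_row(demo, 2)
print(top2.toarray())
```

The shape and the column indices are unchanged, so the vocabulary fitted by the vectorizer still maps columns to words. With the question's matrix the call would look like keep_top_n_per_row(df_t, 50).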

Answer 3

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(tokenizer=word_tokenize,ngram_range=(1,2), binary=True, max_features=50)
TFIDF=vect.fit_transform(df['processed_cv_data'])

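The max_features parameter passed to TfidfVectorizer keeps only the top 50 features, ranked by term frequency across the whole corpus. You can view the selected features with:

print(vect.get_feature_names_out())  # get_feature_names() in scikit-learn versions before 1.0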

Selecting the top n TFIDF features for a given document
