Question

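I have a sparse matrix of 0s and 1s that holds my training data (a NumPy 2-D array).

I want to keep only the top K features to describe my data.

I want to pick the top K features by frequency, i.e. by how often each one occurs across the training samples in the whole matrix.

However, I don't have names for these features. They are just columns.

How can I compute their frequencies and, most importantly, how can I select the top K features in the matrix and drop the others?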

Answer 1

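SciPy sparse matrices can, despite their annoying tendency to return matrix rather than array objects, be used like arrays in many respects. So, to extract the feature frequencies and find the top, say, 4: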

>>> import numpy as np
>>> from scipy import sparse
>>>
>>> # column indices of the features present in each sample
>>> features_present_in_sample = [[1,5], [0,3,7], [1,2], [0,4,6], [2,6]]
>>> features_per_sample = [len(s) for s in features_present_in_sample]
>>> features_flat = np.r_[tuple(features_present_in_sample)]
>>> # CSR row pointers: running total of features per sample
>>> boundaries = np.r_[0, np.add.accumulate(features_per_sample)]
>>> nsamples = len(features_present_in_sample)
>>> nfeatures = np.max(features_flat) + 1
>>> data = sparse.csr_matrix((np.ones_like(features_flat), features_flat, boundaries), (nsamples, nfeatures))
>>>
>>> data
<5x8 sparse matrix of type '<class 'numpy.int64'>'
        with 12 stored elements in Compressed Sparse Row format>
>>> data.todense()
matrix([[0, 1, 0, 0, 0, 1, 0, 0],
        [1, 0, 0, 1, 0, 0, 0, 1],
        [0, 1, 1, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 1, 0, 1, 0],
        [0, 0, 1, 0, 0, 0, 1, 0]])
>>> frequencies = data.mean(axis=0)
>>> frequencies
matrix([[ 0.4,  0.4,  0.4,  0.2,  0.2,  0.2,  0.4,  0.2]])
>>> top4 = np.argpartition(-frequencies.A.ravel(), 4)[:4]
>>> top4
array([6, 0, 2, 1])
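
Note that np.argpartition returns the top 4 in no particular order. If a ranked list is wanted instead, one possible sketch (not part of the original answer) is to count per column and use a stable argsort. The helper name top_k_columns is hypothetical, and the code assumes the matrix holds only 0s and 1s, so that column sums equal occurrence counts:

```python
import numpy as np
from scipy import sparse

def top_k_columns(data, k):
    """Return the indices of the k most frequent columns, most frequent first."""
    # For 0/1 data the column sum is the number of samples containing the feature.
    counts = np.asarray(data.sum(axis=0)).ravel()
    # Stable sort on negated counts: ties keep ascending column order.
    order = np.argsort(-counts, kind="stable")
    return order[:k]

# Same toy data as in the answer: feature indices present in each sample.
rows = [[1, 5], [0, 3, 7], [1, 2], [0, 4, 6], [2, 6]]
indices = np.concatenate(rows)
indptr = np.r_[0, np.cumsum([len(r) for r in rows])]
data = sparse.csr_matrix((np.ones_like(indices), indices, indptr), shape=(5, 8))

top4 = top_k_columns(data, 4)
```

On the toy data this selects the same four columns as above (0, 1, 2 and 6), only in a deterministic order. Sorting all columns costs more than argpartition, which may matter when there are very many features.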

To drop the others:

>>> one_hot_top4 = np.zeros((nfeatures, 4), dtype=int)
>>> one_hot_top4[top4, np.arange(4)] = 1
>>> data @ one_hot_top4
array([[0, 0, 0, 1],
       [0, 1, 0, 0],
       [0, 0, 1, 1],
       [1, 1, 0, 0],
       [1, 0, 1, 0]], dtype=int64)

Or (better), keeping everything sparse:

>>> one_hot_top4_sparse = sparse.csc_matrix((np.ones((4,), dtype=int), top4, np.arange(4+1)), (nfeatures, 4))
>>> data @ one_hot_top4_sparse
<5x4 sparse matrix of type '<class 'numpy.int64'>'
        with 8 stored elements in Compressed Sparse Row format>
>>> (data @ one_hot_top4_sparse).todense()
matrix([[0, 0, 0, 1],
        [0, 1, 0, 0],
        [0, 0, 1, 1],
        [1, 1, 0, 0],
        [1, 0, 1, 0]], dtype=int64)
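
As an alternative to multiplying by a one-hot selector, SciPy sparse matrices also accept an integer array as a column index, so the chosen columns can be taken directly. The following is a minimal sketch under that assumption, not the method from the answer above. The conversion to CSC is an optional choice made here because column access is usually cheaper in that format:

```python
import numpy as np
from scipy import sparse

# Same toy data as in the answer: feature indices present in each sample.
rows = [[1, 5], [0, 3, 7], [1, 2], [0, 4, 6], [2, 6]]
indices = np.concatenate(rows)
indptr = np.r_[0, np.cumsum([len(r) for r in rows])]
data = sparse.csr_matrix((np.ones_like(indices), indices, indptr), shape=(5, 8))

# Column indices of the selected features, as found earlier.
top4 = np.array([6, 0, 2, 1])

# Index the columns directly; the result stays sparse.
reduced = data.tocsc()[:, top4]
reduced_dense = reduced.toarray()  # dense copy, only for inspection
```

The columns of reduced follow the order of top4, so the result should match the one-hot product shown above.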
