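Filter only certain words from sklearn CountVectorizer sparse matrix

Time: 2016-04-06 14:49:55

Tags: python pandas scikit-learn sparse-matrix

I have a pandas Series full of text. Using the CountVectorizer class from the sklearn package, I have calculated the sparse matrix. I have also identified the top words. Now I want to filter my sparse matrix for only those top words.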

The original data contains more than 7,000 rows and more than 75,000 words, so I am creating sample data here:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

words = pd.Series(['This is first row of the text column',
                   'This is second row of the text column',
                   'This is third row of the text column',
                   'This is fourth row of the text column',
                   'This is fifth row of the text column'])
count_vec = CountVectorizer(stop_words='english')
sparse_matrix = count_vec.fit_transform(words)

I have created the sparse matrix for all the words in that column. Just to print the sparse matrix here, I convert it to an array using the .toarray() function.

print count_vec.get_feature_names()
print sparse_matrix.toarray()

[u'column', u'fifth', u'fourth', u'row', u'second', u'text']
[[1 0 0 1 0 1]
 [1 0 0 1 1 1]
 [1 0 0 1 0 1]
 [1 0 1 1 0 1]
 [1 1 0 1 0 1]]

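Now I am looking for the frequently appearing words using the following:

# Get frequency count of all features
features_count = sparse_matrix.sum(axis=0).tolist()[0]
features_names = count_vec.get_feature_names()
features = pd.DataFrame(list(zip(features_names, features_count)),
                        columns=['features', 'count']
                       ).sort_values(by=['count'], ascending=False)

  features  count
0   column      5
3      row      5
5     text      5
1    fifth      1
2   fourth      1
4   second      1

From the above result we know that the frequently appearing words are column, row & text. Now I want to filter my sparse matrix for only these words. I do not want to convert my sparse matrix to an array and then filter, because that gives a memory error on my original data, where the number of words is very high.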

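The only way I have been able to get the sparse matrix is to repeat the steps for those specific words using the vocabulary parameter, like this:

countvec_subset = CountVectorizer(vocabulary=['column', 'text', 'row'])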

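Instead, I am looking for a better solution where I can filter the sparse matrix directly for those words, rather than creating it again from scratch.

1 Answer:

Answer 0: (score: 4)

You can slice the sparse matrix. You need to derive the columns to slice on, then use sparse_matrix[:, columns].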

In [56]: feature_count = sparse_matrix.sum(axis=0)

In [57]: columns = tuple(np.where(feature_count == feature_count.max())[1])

In [58]: columns
Out[58]: (0, 3, 5)

In [59]: sparse_matrix[:, columns].toarray()
Out[59]:
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]], dtype=int64)

In [60]: type(sparse_matrix[:, columns])
Out[60]: scipy.sparse.csr.csr_matrix

In [71]: np.array(features_names)[list(columns)]
Out[71]:
array([u'column', u'row', u'text'],
      dtype='<U6')

The sliced subset is still a scipy.sparse.csr.csr_matrix, so it is never converted to a dense array.
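The slicing approach above picks the columns whose count equals the maximum. As a minimal sketch of two variations, the code below selects columns for an explicit list of words through the fitted vectorizer's vocabulary_ mapping, and selects the k most frequent columns with np.argsort. The helper name filter_columns and the variable top_k are hypothetical names introduced for illustration; they are not part of the original post or of scikit-learn. The sketch is written for Python 3.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

words = [
    'This is first row of the text column',
    'This is second row of the text column',
    'This is third row of the text column',
    'This is fourth row of the text column',
    'This is fifth row of the text column',
]

count_vec = CountVectorizer(stop_words='english')
sparse_matrix = count_vec.fit_transform(words)


def filter_columns(matrix, vocabulary, keep_words):
    """Hypothetical helper: slice `matrix` down to the columns of `keep_words`.

    `vocabulary` maps each term to its column index (the fitted
    vectorizer's `vocabulary_` attribute). Words that are not in the
    vocabulary are skipped. The result stays sparse.
    """
    kept = [w for w in keep_words if w in vocabulary]
    cols = [vocabulary[w] for w in kept]
    return matrix[:, cols], kept


# Variation 1: filter by an explicit list of words.
subset, kept = filter_columns(
    sparse_matrix, count_vec.vocabulary_, ['column', 'text', 'row']
)

# Variation 2: keep the top_k most frequent columns.
counts = np.asarray(sparse_matrix.sum(axis=0)).ravel()  # 1-D count per column
top_k = 3
top_cols = np.sort(np.argsort(counts)[::-1][:top_k])    # restore column order
top_subset = sparse_matrix[:, top_cols]

print(kept, subset.shape)
print(top_cols, top_subset.shape)
```

In the first variation the columns of the result follow the order of the requested words rather than the original column order, so it helps to keep the returned list of kept words alongside the matrix.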