Convert CountVectorizer and TfidfTransformer sparse matrices into separate Pandas DataFrame rows

Asked: 2017-05-13 20:25:07

Tags: python pandas dataframe scikit-learn sparse-matrix

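Question: What is the best way to convert the sparse matrices produced by sklearn's CountVectorizer and TfidfTransformer into Pandas DataFrame columns, with a separate row for each bigram and its corresponding frequency and tf-idf score?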

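Pipeline: pull text data from a SQL DB, split the text into bigrams, compute the frequency and the tf-idf score of each bigram per document, and load the results back into the SQL DB.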
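For orientation, the two ends of that pipeline might look like the sketch below. It is only an illustration under assumptions: an in-memory SQLite database stands in for the real SQL DB, the table names documents and bigram_scores are hypothetical, and the single result row is a placeholder for whatever the feature-extraction step produces.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the real SQL DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (number INTEGER, text TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        (123, "The farmer plants grain"),
        (234, "The farmer and his son go fishing"),
        (345, "The fisher catches tuna"),
    ],
)

# Step 1: pull the text data into a DataFrame.
data = pd.read_sql_query("SELECT number, text FROM documents", conn)

# Step 2 (placeholder): the long-format result, one row per document and bigram.
results = pd.DataFrame(
    {
        "number": [123],
        "bigram": ["farmer plants"],
        "frequency": [1],
        "tfidf_score": [0.707107],
    }
)

# Step 3: load the results back into the database (hypothetical table name).
results.to_sql("bigram_scores", conn, index=False, if_exists="replace")
stored = pd.read_sql_query("SELECT * FROM bigram_scores", conn)
conn.close()
```

With a long-format result, each (document, bigram) pair maps onto one database row, which is why the one-row-per-bigram layout matters here.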

Current state:

Bring in two columns of data (number and text). Clean text to produce a third column, cleanText:

   number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna
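The cleaning step itself is not shown in the question. A minimal sketch that reproduces the cleanText column above could look like this, assuming the cleaning amounts to lowercasing and dropping stop words; the clean_text helper and the tiny STOP_WORDS set are hypothetical and chosen only to match the three sample rows.

```python
import re

import pandas as pd

# Hypothetical, deliberately tiny stop-word list; just enough for the sample rows.
STOP_WORDS = {"the", "and", "his"}


def clean_text(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)


data = pd.DataFrame(
    {
        "number": [123, 234, 345],
        "text": [
            "The farmer plants grain",
            "The farmer and his son go fishing",
            "The fisher catches tuna",
        ],
    }
)
data["cleanText"] = data["text"].apply(clean_text)
```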

This DataFrame is then fed into sklearn's feature extraction:

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)

tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)

The matrices are then converted to arrays and fed back into the original DataFrame:

data['frequency'] = list(dt_mat.toarray())
data['tfidf_score']=list(tfidf_mat.toarray())

Output:

   number                               text              cleanText  \
0     123            The farmer plants grain    farmer plants grain   
1     234  The farmer and his son go fishing  farmer son go fishing   
2     345            The fisher catches tuna    fisher catches tuna   

               frequency                                        tfidf_score
0  [0, 1, 0, 0, 0, 1, 0]  [0.0, 0.707106781187, 0.0, 0.0, 0.0, 0.7071067...  
1  [0, 0, 1, 0, 1, 0, 1]  [0.0, 0.0, 0.57735026919, 0.0, 0.57735026919, ...  
2  [1, 0, 0, 1, 0, 0, 0]  [0.707106781187, 0.0, 0.0, 0.707106781187, 0.0... 

Problems:

  1. The feature names (i.e. the bigrams) are not in the DataFrame
  2. frequency and tfidf_score are not on separate rows, one row per bigram

Desired output:

           number                    bigram         frequency      tfidf_score
    0     123            farmer plants                 1              0.70  
    0     123            plants grain                  1              0.56
    1     234            farmer son                    1              0.72
    1     234            son go                        1              0.63
    1     234            go fishing                    1              0.34
    2     345            fisher catches                1              0.43
    2     345            catches tuna                  1              0.43
    

I managed to get one of the numeric columns spread across separate rows of a DataFrame with the following code:

    data.reset_index(inplace=True)
    rows = []
    _ = data.apply(lambda row: [rows.append([row['number'], nn]) 
                             for nn in row.tfidf_score], axis=1)
    df_new = pd.DataFrame(rows, columns=['number', 'tfidf_score'])
    

Output:

        number  tfidf_score
    0      123     0.000000
    1      123     0.707107
    2      123     0.000000
    3      123     0.000000
    4      123     0.000000
    5      123     0.707107
    6      123     0.000000
    7      234     0.000000
    8      234     0.000000
    9      234     0.577350
    10     234     0.000000
    11     234     0.577350
    12     234     0.000000
    13     234     0.577350
    14     345     0.707107
    15     345     0.000000
    16     345     0.000000
    17     345     0.707107
    18     345     0.000000
    19     345     0.000000
    20     345     0.000000
    

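However, I am not sure how to do this for both numeric columns, and this does not bring in the bigrams (the feature names) themselves. Also, this approach requires an array (which is why I converted the sparse matrix to an array in the first place). I would like to avoid that if possible, both for performance reasons and because I then have to strip out the meaningless zero rows.

Any insight is greatly appreciated! Thank you very much for taking the time to read this question, and I apologize for how long it is. Please let me know if there is anything I can do to improve the question or clarify my process.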

1 answer:

Answer 0 (score: 3)

The bigram names can be captured with CountVectorizer's get_feature_names(). From there, it is just a series of melt() and merge() operations:

print(data)

   number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)

tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)

In this case, the CountVectorizer feature names are the bigrams:

print(cv.get_feature_names())

[u'catches tuna',
 u'farmer plants',
 u'farmer son',
 u'fisher catches',
 u'go fishing',
 u'plants grain',
 u'son go']
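One version note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, where get_feature_names_out() takes its place. The sketch below shows one way to stay compatible with both; the feature_names helper is a hypothetical name, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["farmer plants grain", "farmer son go fishing", "fisher catches tuna"]

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2))
cv.fit(docs)


def feature_names(vectorizer):
    """Return the feature names as a list on both old and new scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # scikit-learn >= 1.0 returns a NumPy array of strings here.
        return list(vectorizer.get_feature_names_out())
    # Older releases only provide get_feature_names(), which returns a list.
    return list(vectorizer.get_feature_names())


names = feature_names(cv)
print(names)
```

On a recent scikit-learn, the remaining snippets in this answer should work if cv.get_feature_names() is replaced with cv.get_feature_names_out().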

CountVectorizer.fit_transform() returns a sparse matrix. We can convert it to a dense representation, wrap it in a DataFrame, and then use the feature names as the column labels:

bigrams = pd.DataFrame(dt_mat.todense(), index=data.index, columns=cv.get_feature_names())
bigrams['number'] = data.number
print(bigrams)

   catches tuna  farmer plants  farmer son  fisher catches  go fishing  \
0             0              1           0               0           0   
1             0              0           1               0           1   
2             1              0           0               1           0   

   plants grain  son go  number  
0             1       0     123  
1             0       1     234  
2             0       0     345  

To convert from wide to long format, use melt(), then restrict the result to the bigrams that actually occur in each document (query() is useful here):

bigrams_long = (pd.melt(bigrams.reset_index(), 
                       id_vars=['index','number'],
                       value_name='bigram_ct')
                 .query('bigram_ct > 0')
                 .sort_values(['index','number']))

    index  number        variable  bigram_ct
3       0     123   farmer plants          1
15      0     123    plants grain          1
7       1     234      farmer son          1
13      1     234      go fishing          1
19      1     234          son go          1
2       2     345    catches tuna          1
11      2     345  fisher catches          1

Now repeat the process for tfidf:

tfidf = pd.DataFrame(tfidf_mat.todense(), index=data.index, columns=cv.get_feature_names())
tfidf['number'] = data.number

tfidf_long = pd.melt(tfidf.reset_index(), 
                     id_vars=['index','number'], 
                     value_name='tfidf').query('tfidf > 0')

Finally, merge bigrams_long and tfidf_long:

fulldf = (bigrams_long.merge(tfidf_long, 
                             on=['index','number','variable'])
                      .set_index('index'))

       number        variable  bigram_ct     tfidf
index                                             
0         123   farmer plants          1  0.707107
0         123    plants grain          1  0.707107
1         234      farmer son          1  0.577350
1         234      go fishing          1  0.577350
1         234          son go          1  0.577350
2         345    catches tuna          1  0.707107
2         345  fisher catches          1  0.707107
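The question also asked whether the dense conversion can be avoided. The approach above calls todense(), so as an alternative (not the method used in this answer) here is a sketch that reads the non-zero entries directly from the sparse matrices in SciPy's COO format. The sparse_to_long helper and the output column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

data = pd.DataFrame(
    {
        "number": [123, 234, 345],
        "cleanText": [
            "farmer plants grain",
            "farmer son go fishing",
            "fisher catches tuna",
        ],
    }
)

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2))
dt_mat = cv.fit_transform(data["cleanText"])
tfidf_mat = TfidfTransformer().fit_transform(dt_mat)

# get_feature_names_out() exists from scikit-learn 1.0; fall back for older releases.
if hasattr(cv, "get_feature_names_out"):
    names = np.asarray(cv.get_feature_names_out())
else:
    names = np.asarray(cv.get_feature_names())


def sparse_to_long(mat, value_name):
    """Hypothetical helper: one row per stored entry of a SciPy sparse matrix."""
    coo = mat.tocoo()  # COO exposes parallel row, col and data arrays
    return pd.DataFrame({"row": coo.row, "col": coo.col, value_name: coo.data})


# Only stored (non-zero) entries appear, so no zero rows need filtering out.
long_df = (
    sparse_to_long(dt_mat, "frequency")
    .merge(sparse_to_long(tfidf_mat, "tfidf_score"), on=["row", "col"])
    .sort_values(["row", "col"])
    .reset_index(drop=True)
)

# Map matrix positions back to document numbers and bigram names.
long_df["number"] = data["number"].to_numpy()[long_df["row"].to_numpy()]
long_df["bigram"] = names[long_df["col"].to_numpy()]
long_df = long_df[["number", "bigram", "frequency", "tfidf_score"]]
print(long_df)
```

Merging on the (row, col) position rather than relying on both matrices storing their entries in the same order keeps the pairing of counts and scores explicit. Memory use then scales with the number of non-zero entries instead of documents times vocabulary size.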