Memory usage when creating a Term Density Matrix from a pandas DataFrame

Time: 2014-03-06 07:45:50

Tags: python memory pandas scikit-learn

I have a DataFrame that I save to / read from a csv file, and from which I want to create a Term Density Matrix DataFrame. Following herrfz's suggestion here, I use sklearn's CountVectorizer. I wrapped that code in a function:

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from scipy.sparse import coo_matrix, csc_matrix, hstack

    countvec = CountVectorizer()

    def df2tdm(df,titleColumn,placementColumn):
        '''
        Takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
        of the words appearing in the titleColumn

        Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
        Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn

        Credits: 
        https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
        '''
        tdm_df = pd.DataFrame(countvec.fit_transform(df[titleColumn]).toarray(), columns=countvec.get_feature_names())
        tdm_df = tdm_df.join(pd.DataFrame(df[placementColumn]))
        return tdm_df

which returns the TDM as a DataFrame, for example:

    df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ', 'Potato salad', 'Split orange','Something else'], 'page':[1, 1, 2, 3, 4]})
    print(df.head())
    tdm_df = df2tdm(df,'title','page')
    tdm_df.head()

       boiled  delicious  egg  else  fried  orange  potato  salad  something  \
    0       1          1    1     0      0       0       0      0          0   
    1       0          0    1     0      1       0       0      0          0   
    2       0          0    0     0      0       0       1      1          0   
    3       0          0    0     0      0       1       0      0          0   
    4       0          0    0     1      0       0       0      0          1   

       split  page  
    0      0     1  
    1      0     1  
    2      0     2  
    3      1     3  
    4      0     4  

This implementation suffers from terrible memory scaling: when I save a DataFrame that takes up 190 kB as utf8, the function uses ~200 MB to create the TDM DataFrame. When the csv file is 600 kB, the function uses 700 MB, and when the csv is 3.8 MB the function exhausts all of my memory and swap (8 GB) and crashes.
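The dense conversion is a likely culprit. A minimal sketch (the corpus dimensions below are made up, not data from this post) comparing the footprint of a sparse document-term matrix with what `.toarray()` would materialize:

```python
import scipy.sparse

n_docs, n_terms = 5000, 20000  # hypothetical corpus size
m = scipy.sparse.random(n_docs, n_terms, density=0.0005, format='csr')

# CSR stores only the non-zero values plus two index arrays...
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes
# ...while .toarray() materializes every cell as a float64.
dense_bytes = n_docs * n_terms * 8

print(sparse_bytes, dense_bytes)  # dense is orders of magnitude larger
```

Since real title text is extremely sparse (each title contributes only a handful of the vocabulary's words), the dense array grows with rows × vocabulary size rather than with the amount of text.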

I also made an implementation using sparse matrices and sparse DataFrames (below), but memory usage is pretty much the same, only considerably slower:

    def df2tdm_sparse(df,titleColumn,placementColumn):
        '''
        Takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
        of the words appearing in the titleColumn. This implementation uses sparse DataFrames.

        Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
        Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn

        Credits: 
        https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
        https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix
        https://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices
        '''
        pm = df[[placementColumn]].values
        tm = countvec.fit_transform(df[titleColumn])#.toarray()
        m = csc_matrix(hstack([pm,tm]))
        dfout = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel()) for i in np.arange(m.shape[0]) ])
        dfout.columns = [placementColumn]+countvec.get_feature_names()
        return dfout

Any suggestions on how to improve memory usage? I wonder if this is related to the memory issues in scikit-learn, e.g. here.

1 answer:

Answer 0: (score: 0)

I also think the problem may be in the conversion from sparse matrix to sparse DataFrame.

Try this function (or something similar):

    def SparseMatrixToSparseDF(xSparseMatrix):
        import numpy as np
        import pandas as pd

        def ElementsToNA(x):
            # Replace explicit zeros with NaN so the sparse container can drop them
            x[x == 0] = np.nan
            return x

        xdf1 = pd.SparseDataFrame([pd.SparseSeries(ElementsToNA(xSparseMatrix[i].toarray().ravel()))
                                   for i in np.arange(xSparseMatrix.shape[0])])
        return xdf1

You can see that it reduces the size by checking the density attribute:

    df1.density
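Why the 0 → NaN replacement matters can be sketched with the modern pandas sparse API (`SparseDataFrame`/`SparseSeries` were removed in pandas 1.0): sparse storage only omits the fill value, which defaults to NaN for float data, so explicit zeros are otherwise stored in full. This is an illustrative sketch, not code from the answer:

```python
import numpy as np
import pandas as pd

row = np.array([0.0, 0.0, 3.0, 0.0, 1.0])

# The default fill value for float data is NaN, so the zeros are all stored:
kept = pd.arrays.SparseArray(row)
# Mapping 0 -> NaN first lets the sparse container drop them:
dropped = pd.arrays.SparseArray(np.where(row == 0, np.nan, row))

print(kept.density, dropped.density)  # 1.0 vs 0.4
```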

I hope it helps.