Adding a sparse matrix from CountVectorizer to a DataFrame with complementary information for a classifier, while keeping it in sparse format

Date: 2017-04-23 23:21:53

Tags: python pandas dataframe machine-learning scikit-learn

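I have the following problem. I am building a classifier that takes text plus some additional complementary information as input. The complementary information is stored in a pandas DataFrame. I transform the text with CountVectorizer and get a sparse matrix. To train the classifier, I need both inputs in the same DataFrame. The problem is that when I merge the DataFrame with the CountVectorizer output I get a dense matrix, so I run out of memory very quickly. Is there a way to avoid this and merge the two inputs into a single DataFrame without ending up with a dense matrix?

Sample code: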

import datetime

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# number of most frequent words to keep
n_features = 5000

# pd.DataFrame.from_csv was removed in pandas 1.0; read_csv is the replacement
df = pd.read_csv('DataWithSentimentAndTopics.csv', index_col=None)

# vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

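# getting the TF matrix (scipy sparse)
tf = tf_vectorizer.fit_transform(df['reviewText'])

# tf.toarray() (the same as tf.A) builds a dense array; this is the step that exhausts memory
df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1), pd.DataFrame(tf.toarray())], axis=1)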

#binning target variable into 4 bins.
df['helpful'] = pd.cut(df['helpful'],[-1,0,10,50,100000], labels = [0,1,2,3])


#creating X and Y variables
train = df.drop(['helpful'], axis=1)
Y = df['helpful']

#splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.1)


# creating the gradient boosting classifier (GBC)
gbc = GradientBoostingClassifier(max_depth = 7, n_estimators=1500, min_samples_leaf=10)

print('Training GBC')
print(datetime.datetime.now())
#fit classifier, look for best
gbc.fit(X_train, y_train)

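As you can see, I set CountVectorizer to keep 5000 words. My original DataFrame has only 50000 rows, but I already get a matrix of 50000x5000 cells, which is 250 million values. That already requires a lot of memory.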
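To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes 8-byte int64 counts for the dense case and, purely for illustration, a 1% fill rate for the sparse case; the real density of the review data is not given in the question.

```python
import numpy as np
from scipy import sparse

# Size of the dense matrix from the question: 50000 rows x 5000 columns of int64.
n_rows, n_cols = 50_000, 5_000
dense_bytes = n_rows * n_cols * np.dtype(np.int64).itemsize
print(f"dense int64 matrix: {dense_bytes / 1e9:.1f} GB")

# A smaller random stand-in (1000 x 5000, 1% non-zero: an assumed density)
# to compare CSR storage with the dense equivalent of the same shape.
small = sparse.random(1_000, n_cols, density=0.01, format="csr",
                      dtype=np.float64, random_state=0)
sparse_bytes = small.data.nbytes + small.indices.nbytes + small.indptr.nbytes
dense_small_bytes = small.shape[0] * small.shape[1] * small.dtype.itemsize
print(f"CSR: {sparse_bytes / 1e6:.2f} MB vs dense: {dense_small_bytes / 1e6:.2f} MB")
```

CSR stores only the non-zero values plus their column indices and row pointers, so the saving grows with the share of zeros in the matrix.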

3 answers:

Answer 0 (score: 6)

You don't need to use a DataFrame.

Convert the numeric features from the DataFrame to a numpy array:

num_feats = df[cols].values  # cols: list of the numeric column names

from scipy import sparse

training_data = sparse.hstack((count_vectorizer_features, num_feats))

Then you can use any scikit-learn algorithm that supports sparse data.

For GBM, you can use xgboost, which supports sparse input.
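The approach in this answer can be sketched end to end as follows. The toy DataFrame and column names are hypothetical stand-ins for the asker's CSV, and LogisticRegression is used only as an example of a scikit-learn estimator that accepts sparse input; it is not the model from the question. The numeric block is wrapped in `csr_matrix` explicitly so that every block passed to `hstack` is sparse.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the asker's CSV.
df = pd.DataFrame({
    "reviewText": ["great book loved it", "boring book",
                   "loved the story", "boring and tedious story"],
    "price": [8.48, 8.48, 12.0, 5.5],
    "overall": [5, 2, 4, 1],
    "helpful": [1, 0, 1, 0],
})

# Sparse bag-of-words features.
count_vectorizer_features = CountVectorizer().fit_transform(df["reviewText"])

# Numeric features as a numpy array, then as a sparse block.
cols = ["price", "overall"]
num_feats = sparse.csr_matrix(df[cols].values)

# Column-wise stack; the result stays sparse and is never densified.
training_data = sparse.hstack((count_vectorizer_features, num_feats), format="csr")

# Any estimator that accepts sparse input can be trained on it directly.
clf = LogisticRegression(max_iter=1000).fit(training_data, df["helpful"])
print(training_data.shape, clf.predict(training_data))
```

The row order of the stacked matrix matches the DataFrame, so the target column can be taken from `df` unchanged.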

Answer 1 (score: 3)

As @AbhishekThakur already said, you don't have to put your one-hot-encoded data into a DataFrame.

But if you want to do so, you can add the columns as pandas.SparseSeries:

# vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

#getting the TF matrix
tf = tf_vectorizer.fit_transform(df.pop('reviewText'))

# adding "features" columns as SparseSeries
for i, col in enumerate(tf_vectorizer.get_feature_names()):
    df[col] = pd.SparseSeries(tf[:, i].toarray().ravel(), fill_value=0)

Result:

In [107]: df.head(3)
Out[107]:
        asin  price      reviewerID  LenReview                  Summary  LenSummary  overall  helpful  reviewSentiment         0  \
0  151972036   8.48  A14NU55NQZXML2        199  really a difficult read          23        3        2          -0.7203  0.002632
1  151972036   8.48  A1CSBLAPMYV8Y0         77                      wha           3        4        0          -0.1260  0.005556
2  151972036   8.48  A1DDECXCGHDYZK        114       wordy and drags on          18        1        4           0.5707  0.004545

   ...    think  thought  trailers  trying  wanted  words  worth  wouldn  writing  young
0  ...        0        0         0       0       1      0      0       0        0      0
1  ...        0        0         0       1       0      0      0       0        0      0
2  ...        0        0         0       0       1      0      1       0        0      0

[3 rows x 78 columns]

Note the memory usage:

In [108]: df.memory_usage()
Out[108]:
Index               80
asin               112
price              112
reviewerID         112
LenReview          112
Summary            112
LenSummary         112
overall            112
helpful            112
reviewSentiment    112
0                  112
1                  112
2                  112
3                  112
4                  112
5                  112
6                  112
7                  112
8                  112
9                  112
10                 112
11                 112
12                 112
13                 112
14                 112
                  ...
parts               16   # memory used: # of ones multiplied by 8 (np.int64)
peter               16
picked              16
point               16
quick               16
rating              16
reader              16
reading             24
really              24
reviews             16
stars               16
start               16
story               32
tedious             16
things              16
think               16
thought             16
trailers            16
trying              16
wanted              24
words               16
worth               16
wouldn              16
writing             24
young               16
dtype: int64

Answer 2 (score: 1)

pandas also supports importing sparse matrices directly, storing them with its SparseDtype:

import pandas as pd
import scipy.sparse

pd.DataFrame.sparse.from_spmatrix(Your_Sparse_Data)

You can then concatenate it with the rest of your DataFrame.