I have the following problem. I'm building a classifier system that will use text plus some additional complementary information as input. I store the complementary information in a pandas DataFrame. I transform the text with CountVectorizer and get a sparse matrix. Now, in order to train the classifier, I need both inputs in the same dataframe. The problem is that when I merge the dataframe with the output of CountVectorizer, I get a dense matrix, which means I run out of memory really fast. Is there any way to avoid this and properly merge these two inputs into a single dataframe without getting a dense matrix?
Example code:
import datetime

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# how many of the most popular words we consider
n_features = 5000

# pd.DataFrame.from_csv was removed from pandas; use pd.read_csv
df = pd.read_csv('DataWithSentimentAndTopics.csv', index_col=None)

# vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

# getting the TF matrix
tf = tf_vectorizer.fit_transform(df['reviewText'])

# tf.A densifies the sparse matrix; this is the step that exhausts memory
df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1), pd.DataFrame(tf.A)], axis=1)

# binning the target variable into 4 bins
df['helpful'] = pd.cut(df['helpful'], [-1, 0, 10, 50, 100000], labels=[0, 1, 2, 3])

# creating X and Y variables
train = df.drop(['helpful'], axis=1)
Y = df['helpful']

# splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.1)

# creating the GBC
gbc = GradientBoostingClassifier(max_depth=7, n_estimators=1500, min_samples_leaf=10)

print('Training GBC')
print(datetime.datetime.now())

# fit the classifier
gbc.fit(X_train, y_train)
As you can see, I limit the CountVectorizer to 5,000 words. I have only 50,000 rows in my original dataframe, yet the merged matrix is already 50,000 x 5,000 cells, which is 250 million values. It already requires a huge amount of memory.
Answer 0 (score: 6)
You don't need to use a dataframe. Convert the numerical features from the dataframe to a numpy array:
# cols is a placeholder: the list of your numeric feature column names
num_feats = df[cols].values

from scipy import sparse
training_data = sparse.hstack((count_vectorizer_features, num_feats))
Then you can use scikit-learn algorithms that support sparse data. For a GBM, you can use xgboost, which supports sparse input.
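A minimal end-to-end sketch of this approach, reusing the question's data and column names (LogisticRegression stands in here as a sparse-capable classifier; swap in xgboost's XGBClassifier if you want a GBM):

import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('DataWithSentimentAndTopics.csv', index_col=None)

# sparse bag-of-words matrix, as in the question
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=5000,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(df['reviewText'])

# numeric side features as a plain numpy array (column names from the question's data)
num_feats = df[['LenReview', 'reviewSentiment']].values

# hstack keeps the combined matrix sparse, so memory stays manageable
X = sparse.hstack((tf, num_feats)).tocsr()
y = pd.cut(df['helpful'], [-1, 0, 10, 50, 100000], labels=[0, 1, 2, 3])

# estimators that accept scipy sparse input train directly on X
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)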
Answer 1 (score: 3)
As @AbhishekThakur has said, you don't have to put your one-hot-encoded data into a DataFrame. But if you want to, you can add pandas SparseSeries as columns:
# vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

# getting the TF matrix
tf = tf_vectorizer.fit_transform(df.pop('reviewText'))

# adding "features" columns as SparseSeries
# (note: pd.SparseSeries was removed in pandas 1.0; see the last answer for the modern API)
for i, col in enumerate(tf_vectorizer.get_feature_names()):
    df[col] = pd.SparseSeries(tf[:, i].toarray().ravel(), fill_value=0)
Result:
In [107]: df.head(3)
Out[107]:
asin price reviewerID LenReview Summary LenSummary overall helpful reviewSentiment 0 \
0 151972036 8.48 A14NU55NQZXML2 199 really a difficult read 23 3 2 -0.7203 0.002632
1 151972036 8.48 A1CSBLAPMYV8Y0 77 wha 3 4 0 -0.1260 0.005556
2 151972036 8.48 A1DDECXCGHDYZK 114 wordy and drags on 18 1 4 0.5707 0.004545
... think thought trailers trying wanted words worth wouldn writing young
0 ... 0 0 0 0 1 0 0 0 0 0
1 ... 0 0 0 1 0 0 0 0 0 0
2 ... 0 0 0 0 1 0 1 0 0 0
[3 rows x 78 columns]
Note the memory usage:
In [108]: df.memory_usage()
Out[108]:
Index 80
asin 112
price 112
reviewerID 112
LenReview 112
Summary 112
LenSummary 112
overall 112
helpful 112
reviewSentiment 112
0 112
1 112
2 112
3 112
4 112
5 112
6 112
7 112
8 112
9 112
10 112
11 112
12 112
13 112
14 112
...
parts 16 # memory used: # of ones multiplied by 8 (np.int64)
peter 16
picked 16
point 16
quick 16
rating 16
reader 16
reading 24
really 24
reviews 16
stars 16
start 16
story 32
tedious 16
things 16
think 16
thought 16
trailers 16
trying 16
wanted 24
words 16
worth 16
wouldn 16
writing 24
young 16
dtype: int64
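As a sanity check on that comment, here is a tiny standalone example (hypothetical data, and using the modern pd.arrays.SparseArray API instead of the now-removed SparseSeries):

import numpy as np
import pandas as pd

dense = np.array([0, 0, 1, 0, 1])

# only the two non-zero values (and their positions) are stored
s = pd.Series(pd.arrays.SparseArray(dense, fill_value=0))

print(s.memory_usage(index=False))  # a few dozen bytes: the stored values plus their indices
print(dense.nbytes)                 # 40 bytes dense (5 values * 8 bytes for int64)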
Answer 2 (score: 1)
Pandas also supports importing sparse matrices, which it stores using its SparseDtype:
import scipy.sparse
pd.DataFrame.sparse.from_spmatrix(Your_Sparse_Data)
You can then concatenate it to the rest of your dataframe.
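A short sketch of that route, reusing the tf matrix and vectorizer from the question (this assumes pandas >= 0.25 for the sparse accessor and scikit-learn >= 1.0 for get_feature_names_out):

import pandas as pd

# wrap the scipy sparse TF matrix in a DataFrame with SparseDtype columns
tf_df = pd.DataFrame.sparse.from_spmatrix(
    tf, columns=tf_vectorizer.get_feature_names_out())

# concat keeps the sparse columns sparse; no densification happens here
df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1), tf_df], axis=1)

print(df.dtypes.value_counts())  # the TF columns now carry a sparse dtype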