How do I add a column to a sparse matrix and save it to a text file?

Time: 2016-03-09 01:35:29

Tags: python pandas scikit-learn

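TfidfVectorizer returns a sparse matrix as its output, which can easily be converted to a SparseDataFrame (not a regular DataFrame). But I cannot figure out how to add a column to it and save the result to a CSV file.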

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def get_features(data, train=False):
    tfv = TfidfVectorizer()
    if train:
        features = tfv.fit_transform(data["text"])
    else:
        features = tfv.transform(data["text"])

    features_pd = pd.SparseDataFrame([ pd.SparseSeries(features[i].toarray().ravel())
                                 for i in np.arange(features.shape[0]) ], columns = tfv.get_feature_names() )
    # Replacing the statement above with the next two lines produces empty (commas-only) output:
    # features_pd = pd.DataFrame([ pd.Series(features[i].toarray().ravel())
    #                              for i in np.arange(features.shape[0]) ], columns = tfv.get_feature_names() )
    # The next line raises TypeError: ufunc 'isnan' not supported for the input types ...
    # features_pd['_class_'] = pd.SparseSeries(data["class"])

    print "F:",features_pd.iloc[[0]]
    return features_pd

if __name__ == '__main__':

    train = pd.read_csv('train.csv', header=None, names = ["class", "text"]).fillna("")
    features = get_features(train, train=True)
    features.to_csv('out.csv', index=False)

1 answer:

Answer 0 (score: 0)

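A sparse matrix can be converted to a dense array, after which every operation works on a regular DataFrame. The core change is features.toarray(). Note that this materializes every zero entry, so it assumes the dense matrix fits in memory.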

features_pd = pd.DataFrame(data=features.toarray(),
                           columns = tfv.get_feature_names() )

features_pd['_class_'] = pd.Series(data["class"], index = features_pd.index)
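
For reference, the answer's approach can be put together as a minimal, self-contained sketch. It is an illustration under stated assumptions, not the answerer's exact code: the helper name tfidf_with_class and the three-row toy frame are hypothetical stand-ins for get_features and train.csv, an in-memory buffer stands in for out.csv, and the get_feature_names_out fallback is added because newer scikit-learn releases no longer provide get_feature_names.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_with_class(data):
    """Return a dense DataFrame of TF-IDF features plus a '_class_' column.

    Hypothetical helper: assumes `data` has 'text' and 'class' columns.
    """
    tfv = TfidfVectorizer()
    features = tfv.fit_transform(data["text"])  # SciPy sparse matrix

    # Newer scikit-learn exposes get_feature_names_out(); older releases
    # only have get_feature_names(), as used in the question.
    if hasattr(tfv, "get_feature_names_out"):
        columns = tfv.get_feature_names_out()
    else:
        columns = tfv.get_feature_names()

    # toarray() stores every zero explicitly, so this assumes the dense
    # matrix fits in memory.
    features_pd = pd.DataFrame(features.toarray(), columns=columns, index=data.index)

    # Both frames share data.index, so the labels line up row by row.
    features_pd["_class_"] = data["class"]
    return features_pd


# Hypothetical toy data standing in for train.csv.
train = pd.DataFrame(
    {"class": [0, 1, 0], "text": ["red apple", "green pear", "red pear"]}
)
result = tfidf_with_class(train)

# An in-memory buffer stands in for 'out.csv'.
buffer = io.StringIO()
result.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Building the DataFrame on data.index keeps the class labels aligned even when the input frame does not use the default 0..n-1 index. If the vocabulary is too large for a dense array, this approach probably will not scale, and a sparse-aware route would be needed instead.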