Going from CountVectorizer to TfidfTransformer in sklearn

Time: 2016-07-30 17:04:36

Tags: python scikit-learn vectorization tf-idf

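I am processing a large amount of text data in sklearn. First I need to vectorize the text (word counts), and then apply TfidfTransformer. My code does not seem to pass the output of CountVectorizer to the input of TfidfTransformer.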


Running the code below raises an error:

TEXT = [data[i].values()[3] for i in range(len(data))]

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer(min_df=0.01,max_df = 2.5, lowercase = False, stop_words = 'english')

X = vectorizer(TEXT)
transformer = TfidfTransformer(X)
X = transformer.fit_transform()

I thought I had already vectorized the text and that it was now in a matrix. Is there a transition step I am missing? Thanks!

2 Answers:

Answer 0 (score: 7)

This line

X = vectorizer(TEXT)

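does not produce the vectorizer's output. It is the line that raises the exception, and it has nothing to do with TfIdf itself. You should call fit_transform instead. Your next call is also wrong: the data has to be passed as an argument to fit_transform, not to the constructor.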

X = vectorizer.fit_transform(TEXT)
transformer = TfidfTransformer()
X = transformer.fit_transform(X)
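
To make the difference concrete, here is a minimal sketch on a toy corpus. The `docs` list is a hypothetical stand-in for `TEXT` from the question, and the vectorizer uses default settings rather than the question's parameters. It shows that calling the vectorizer object directly fails, while the two `fit_transform` calls yield matrices of the same shape.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for TEXT from the question.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vectorizer = CountVectorizer()

# An estimator instance is not callable, so vectorizer(docs) raises TypeError.
try:
    vectorizer(docs)
    call_failed = False
except TypeError:
    call_failed = True

# Step 1: learn the vocabulary and build a sparse matrix of raw term counts.
counts = vectorizer.fit_transform(docs)

# Step 2: learn idf weights from the counts and rescale them to tf-idf.
tfidf = TfidfTransformer().fit_transform(counts)

print(call_failed, counts.shape, tfidf.shape)
```

Both results are sparse matrices with one row per document and one column per vocabulary term. With the default settings of TfidfTransformer, each row is also normalized to unit length.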

Answer 1 (score: 3)

You are probably looking for a pipeline, perhaps something like this:

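from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])

# or, more compactly (step names are generated automatically):
pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())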

On this pipeline, perform the usual operations (fit, fit_transform, and so on).
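
As a rough illustration of those operations, the sketch below fits a pipeline on a hypothetical toy corpus (`docs` is made up for this example) and then transforms an unseen sentence with the vocabulary and idf weights learned during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with your own list of strings.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())

# Fits both steps in order and returns the tf-idf matrix for docs.
X_train = pipeline.fit_transform(docs)

# New text reuses the fitted vocabulary and idf weights; nothing is refit.
X_new = pipeline.transform(["the cat chased the dog"])

# make_pipeline names each step after its lowercased class name.
vocab_size = len(pipeline.named_steps["countvectorizer"].vocabulary_)

print(X_train.shape, X_new.shape, vocab_size)
```

The main benefit is that the count step and the tf-idf step are always applied together and in the same order, so the same object can be reused on training and new data.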

See also this example.