Vectorizing files with sklearn

Time: 2015-10-18 08:29:17

Tags: python-2.7 scikit-learn vectorization pythonxy

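I am trying to read 100 training files and vectorize them with sklearn. The files contain words that represent system calls. Once they are vectorized, I want to print the vectors. My first attempt was as follows: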

import re
import os
import sys
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import numpy.linalg as LA

trainingdataDir = 'C:\data\Training data'

def readfile():
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles):
         data = open(trainingfiles, "rb").read()

    return data 

train_set = [readfile()]

vectorizer = CountVectorizer()
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray

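However, this only returns the vector for the last file. I concluded that the print statement should go inside the for loop, so my second attempt was: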

def readfile():
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles):
         data = open(trainingfiles, "rb").read()
    trainVectorizerArray = vectorizer.fit_transform(data).toarray()
    print 'Fit Vectorizer to train set', trainVectorizerArray          

However, this does not return anything. Can you help me fix this? Why can't I see the vectors being printed?

1 answer:

Answer 0: (score: 0)

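The problem was that the dataset list used for vectorization was empty. I managed to vectorize a set of 100 files: I first open the files, then read each one, and finally append its contents to a list. The 'tfidf_vectorizer' then uses that dataset list.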
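
The steps described in this answer (open each file, read it, append its contents to a list, then hand the list to a TF-IDF vectorizer) might look roughly like the sketch below. It is not the answerer's original code. The helper names `read_documents`, `vectorize` and `print_vectors` are hypothetical, and the files are assumed to be UTF-8 text. `TfidfVectorizer` is scikit-learn's combined counting and TF-IDF class, used here in place of the separate `CountVectorizer` and `TfidfTransformer` from the question.

```python
import io
import os


def read_documents(directory):
    """Return the contents of every regular file in `directory`, one string per file."""
    documents = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            # Assumption: files are UTF-8 text; undecodable bytes are replaced.
            with io.open(path, encoding="utf-8", errors="replace") as handle:
                # Append inside the loop so no file overwrites the previous one.
                documents.append(handle.read())
    return documents


def vectorize(documents):
    """Fit TF-IDF on a list of strings and return (vectorizer, dense array)."""
    # Imported lazily so read_documents() works without scikit-learn installed.
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf_vectorizer = TfidfVectorizer()
    matrix = tfidf_vectorizer.fit_transform(documents)  # one row per file
    return tfidf_vectorizer, matrix.toarray()


def print_vectors(directory):
    """Read all files in `directory`, vectorize them, and print the result."""
    train_set = read_documents(directory)
    if not train_set:
        raise ValueError("no files found in %r" % directory)
    _, vectors = vectorize(train_set)
    print(vectors)
    return vectors
```

Calling `print_vectors(r'C:\data\Training data')` with the directory from the question should print one row per training file. The key difference from the first attempt is that every file's contents end up in the list, rather than only the last value assigned to `data`.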