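I am processing a large amount of text data in sklearn. First I need to vectorize the text (word counts), and then apply TfidfTransformer. I have the following code, which does not seem to pass the output of CountVectorizer to the input of TfidfTransformer.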
Running the code below raises an error:
TEXT = [data[i].values()[3] for i in range(len(data))]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
vectorizer = CountVectorizer(min_df=0.01,max_df = 2.5, lowercase = False, stop_words = 'english')
X = vectorizer(TEXT)
transformer = TfidfTransformer(X)
X = transformer.fit_transform()
I thought I had already vectorized the text and that it was now in a matrix. Is there a transition step that I am missing? Thanks!!
Answer 0 (score: 7)
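This line
X = vectorizer(TEXT)
does not produce the vectorizer's output. It is the call that raises the exception, which has nothing to do with TF-IDF itself; you should call fit_transform instead. Your next call is also wrong: the data must be passed as an argument to fit_transform, not to the constructor.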
X = vectorizer.fit_transform(TEXT)
transformer = TfidfTransformer()
X = transformer.fit_transform(X)
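As a rough, self-contained sketch of the fix above: the three-sentence corpus below is made up and merely stands in for the asker's TEXT list, and the vectorizer uses default settings rather than the options from the question. It shows that calling the estimator object directly raises a TypeError, while chaining the two fit_transform calls works.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for the asker's TEXT list.
TEXT = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = CountVectorizer()

# An estimator object is not callable, so vectorizer(TEXT) raises TypeError.
try:
    vectorizer(TEXT)
    call_failed = False
except TypeError:
    call_failed = True

# Step 1: learn the vocabulary and build the sparse term-count matrix.
counts = vectorizer.fit_transform(TEXT)

# Step 2: build the transformer without data, then pass the counts to fit_transform.
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)

print(call_failed)
print(counts.shape, tfidf.shape)
```

Both matrices have one row per document and one column per vocabulary term; only the values differ (raw counts versus TF-IDF weights).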
Answer 1 (score: 3)
You are probably looking for a pipeline, perhaps something like this:
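from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])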
or
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())
On this pipeline, perform the usual operations (e.g. fit, fit_transform, and so on).
See also this example.
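For illustration, here is a minimal sketch of using such a pipeline end to end. The documents are invented for the example, and the step name "countvectorizer" is the one make_pipeline derives automatically from the class name.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical documents, used only to exercise the pipeline.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())

# fit_transform runs CountVectorizer, then feeds its counts to TfidfTransformer.
tfidf = pipeline.fit_transform(docs)

# transform reuses the fitted vocabulary and IDF weights on unseen text.
new_tfidf = pipeline.transform(["the cat chased the dog"])

# Individual steps stay accessible by name, e.g. to inspect the vocabulary.
vocabulary = pipeline.named_steps["countvectorizer"].vocabulary_

print(tfidf.shape, new_tfidf.shape)
```

The pipeline keeps the two steps in sync, so new text is always transformed with the vocabulary learned during fitting.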