Multi-class text classification: training a classifier with a TF-IDF vectorizer

Time: 2019-11-25 21:03:46

Tags: python scikit-learn nlp text-processing

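I'm still new to NLP, but we have been given a multi-class text classification task in class. The dataset contains the first page of each document, and every document has to be classified into one or more of 24 topics. Each document's text is one row in a table.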

I tried to implement a TF-IDF vectorizer, and now the classifier raises the error "ValueError: setting an array element with a sequence."

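Even though the message is trying to tell me exactly what went wrong, I can't wrap my head around it. I'm not sure what to do, or whether my TF-IDF step is correct at all. It got me thinking: the vectorizer will produce a matrix with a different number of columns for each entry, right? How is the classifier supposed to work with that?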
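To make the concern about column counts concrete, here is a minimal sketch that mirrors the loop in the code below: one fresh TfidfVectorizer is fitted per document, on that document's token list. The two token lists are made-up examples, not rows from the real dataset. Because each vectorizer learns its own vocabulary, and treats every token as a separate "document", the resulting matrices do not share a shape.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical token lists standing in for two rows of X_train['doc_text'].
token_lists = [
    ["the", "quick", "brown", "fox"],
    ["jumps", "over", "the", "lazy", "dog", "again", "and", "again"],
]

shapes = []
for tokens in token_lists:
    # A new vectorizer per document learns a vocabulary from that document only.
    vectorizer = TfidfVectorizer(use_idf=True)
    # Passing a list of tokens makes each token its own one-word "document".
    matrix = vectorizer.fit_transform(tokens)
    shapes.append(matrix.shape)

# Rows = number of tokens, columns = number of distinct tokens in that document.
print(shapes)
```

A list of such differently shaped sparse matrices cannot be stacked into a single 2-D feature array, which is consistent with NumPy complaining about "setting an array element with a sequence".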

Here is my code:

import pandas as pd
import sklearn.model_selection as ms
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

X_train = pd.read_csv('train_values.csv', nrows=3, delimiter=',', engine='c')

#tokenize 
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
X_train['doc_text'] = X_train['doc_text'].apply(lambda x: tokenizer.tokenize(x.lower()))

#remove words
stopwords = set(stopwords.words('english'))
def remove_stopwords(text):
    words = [w for w in text if w not in stopwords]
    return words

#convert to list - otherwise vectorizer returns "'list' object has no attribute 'lower'"
X_train_list = X_train['doc_text'].tolist()

# compute TF/IDF
from sklearn.feature_extraction.text import TfidfVectorizer

X_train_n = []

for i in X_train_list:  
    vectorizer=TfidfVectorizer(use_idf=True)
    fitted_vectorizer=vectorizer.fit(i)
    vectorizer_vectors=fitted_vectorizer.transform(i)
    X_train_n.append(vectorizer_vectors)

y_train = pd.read_csv('train_labels.csv', nrows=3, delimiter=',', engine='c')
y_train_n = y_train.drop('row_id', axis=1)

y_train_n = np.array(y_train_n.as_matrix(columns = None), dtype=bool).astype(np.int) # I tried this as a test

#build classifier
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train_n, y_train_n)

  • How can I use the classifier with this vectorizer? Or is my implementation of the vectorizer wrong?

Any help would be greatly appreciated.
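For reference, the usual scikit-learn pattern for this kind of setup is to fit a single vectorizer on all documents at once, so that every row shares one vocabulary and one set of columns. The following is only a sketch under assumptions: the four documents and the two-topic label matrix are hypothetical stand-ins for train_values.csv and train_labels.csv (the real task has 24 topics), and raw strings are passed in rather than token lists, since TfidfVectorizer lowercases and tokenizes on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: one raw string per document.
docs = [
    "stock markets fell sharply today",
    "the team won the football match",
    "markets react to football club sale",
    "new football season starts",
]

# Multi-label indicator matrix: one row per document, one 0/1 column per topic.
labels = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 1],
])

# One vectorizer for the whole corpus feeds one classifier per topic.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    OneVsRestClassifier(LinearSVC()),
)
model.fit(docs, labels)

# Predictions come back as an indicator matrix with one column per topic.
pred = model.predict(["football match report"])
print(pred.shape)
```

Wrapping both steps in a pipeline means the vocabulary learned during fit is reused unchanged at prediction time, so new documents are mapped onto the same columns as the training data.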

0 answers:

No answers yet