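I'm still new to NLP, but we've been given a multi-class text classification task in class. The dataset contains the first page of each document, and every document has to be assigned to one or more of 24 topics. Each document's text is one row in the table.
I tried to implement a TF-IDF vectorizer, and now the classifier returns the error "ValueError: setting an array element with a sequence."
Even though the message is trying to tell me exactly what is wrong, I can't wrap my head around it, and I'm not sure what to do or whether my TF-IDF step is even correct. It got me thinking: the vectorizer will produce a different number of columns in the matrix for each entry, right? How is the classifier supposed to work with that?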
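To make the column-count concern concrete, here is a minimal sketch using three made-up documents (not the course data). It assumes only scikit-learn's `TfidfVectorizer` with default settings and compares fitting one vectorizer per document against fitting a single vectorizer on the whole collection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, purely for illustration.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock markets fell sharply again today",
]

# One vectorizer per document: each one learns its own vocabulary,
# so the matrices have different widths and their columns are not comparable.
per_doc_shapes = [TfidfVectorizer().fit_transform([d]).shape for d in docs]

# One vectorizer for the whole collection: a single shared vocabulary,
# so every document becomes one row with the same number of columns.
shared = TfidfVectorizer().fit_transform(docs)

print(per_doc_shapes)  # widths differ from document to document
print(shared.shape)    # (number of documents, size of shared vocabulary)
```

In this sketch the per-document matrices cannot be stacked into one array, which seems to be the kind of ragged input the error message complains about.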
Here is my code:
import pandas as pd
import sklearn.model_selection as ms
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
X_train = pd.read_csv('train_values.csv', nrows=3, delimiter=',', engine='c')
#tokenize
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
X_train['doc_text'] = X_train['doc_text'].apply(lambda x: tokenizer.tokenize(x.lower()))
#remove words
stopwords = set(stopwords.words('english'))
def remove_stopwords(text):
    words = [w for w in text if w not in stopwords]
    return words
#convert to list - otherwise vectorizer returns "'list' object has no attribute 'lower'"
X_train_list = X_train['doc_text'].tolist()
# compute TF/IDF
from sklearn.feature_extraction.text import TfidfVectorizer
X_train_n = []
for i in X_train_list:
    vectorizer = TfidfVectorizer(use_idf=True)
    fitted_vectorizer = vectorizer.fit(i)
    vectorizer_vectors = fitted_vectorizer.transform(i)
    X_train_n.append(vectorizer_vectors)
y_train = pd.read_csv('train_labels.csv', nrows=3, delimiter=',', engine='c')
y_train_n = y_train.drop('row_id', axis=1)
y_train_n = np.array(y_train_n.values, dtype=bool).astype(int) # I tried this as a test; .values and int replace as_matrix() and np.int, which newer pandas/NumPy releases no longer provide
#build classifier
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train_n, y_train_n)
Any help would be greatly appreciated.
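For reference, a minimal sketch of how this kind of setup is commonly wired together in scikit-learn. It is not a verified fix for the code above. The documents and the two-topic label matrix are made up, standing in for `X_train['doc_text']` (as raw strings rather than token lists) and the 24-column label table. It also assumes that `TfidfVectorizer`'s built-in lowercasing, tokenizing, and `stop_words="english"` option (scikit-learn's own list, not NLTK's) are an acceptable replacement for the manual preprocessing.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Made-up stand-ins for the raw document strings.
docs = [
    "interest rates and inflation",
    "football match results",
    "inflation hits football clubs",
    "central bank raises interest rates",
]

# Made-up label indicator matrix: one row per document, one 0/1 column per topic.
# A row may contain several 1s, since a document can belong to more than one topic.
y = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
    [1, 0],
])

# Fit ONE vectorizer on all documents so every row shares the same columns.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

# OneVsRestClassifier trains one LinearSVC per label column.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, y)

# New text must go through the SAME fitted vectorizer (transform, not fit).
pred = clf.predict(vectorizer.transform(["inflation and interest rates"]))
print(X.shape, pred.shape)
```

The design point is that the classifier receives a single two-dimensional feature matrix with one row per document, instead of a Python list of separately fitted matrices.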