I'm doing text classification for sentiment analysis, and I have a large training set and test set:
test_data_df.shape   # (46346, 2)
train_data_df.shape  # (69518, 2)
The first column of train_data_df is the label, which is 1 if the email is a personal attack and 0 otherwise. The second column is comment, i.e. the email content. However, when I try to use fit_transform to convert the corpus into feature vectors, I get:
corpus_data_features = vectorizer.fit_transform(train.comment.tolist() + test.comment.tolist())
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 762, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 241, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "assignment3.py", line 23, in tokenize
    stems = stem_tokens(tokens, stemmer)
  File "assignment3.py", line 14, in stem_tokens
    stemmed.append(stemmer.stem(item))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range
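For what it's worth, I think I can reproduce the error with the stemmer alone, outside the vectorizer. This is just my guess at a minimal trigger; 'oed' is a made-up short token whose 'ed' ending gets stripped down to a single character:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# My guess at what happens: step 1b strips 'ed', leaving 'o', and
# _ends_double_consonant then indexes word[-2] on a one-character string.
stemmer.stem('oed')  # raises IndexError: string index out of range for me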
Below is my code, which I adapted from https://www.codementor.io/jadianes/data-science-python-r-sentiment-classification-machine-learning-du107otfg. Can anyone help me figure out where I went wrong? Thanks a lot!
import numpy as np
import pandas as pd
import re, nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer

test_data_df = pd.read_csv('test.tsv', delimiter="\t", quoting=3)
train_data_df = pd.read_csv('train.tsv', delimiter="\t", quoting=3)

stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed
def tokenize(comment):
    comment = re.sub("[^a-zA-Z]", " ", comment)
    tokens = nltk.word_tokenize(comment)
    stems = stem_tokens(tokens, stemmer)
    return stems
vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = 'english',
    max_features = 200
)
corpus_data_features = vectorizer.fit_transform(train_data_df.comment.tolist() + test_data_df.comment.tolist())
That last line is where the error comes from. Thanks!
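In case it helps, one workaround I've been considering (I'm not sure it's the right fix) is to keep the raw token whenever the stemmer fails on it:

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        try:
            stemmed.append(stemmer.stem(item))
        except IndexError:
            # Workaround only: keep tokens the stemmer chokes on as-is
            stemmed.append(item)
    return stemmed

I've also seen suggestions that newer nltk versions add a length check in _ends_double_consonant, so upgrading nltk might be another option, but I haven't verified that.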