Python: IndexError from fit_transform on a large dataset

Asked: 2017-03-30 00:24:15

Tags: python machine-learning scikit-learn text-classification

I am doing text classification for sentiment analysis, with a large training set and test set:

test_data_df.shape (46346, 2)  train_data_df.shape (69518, 2)

The first column of train_data_df is the label, which is 1 if the email is a personal attack and 0 otherwise. The second column of train_data_df is the comment, i.e. the email content. However, when I try to use fit_transform to convert the corpus into feature vectors, I get the following traceback:

 corpus_data_features = vectorizer.fit_transform(train.comment.tolist() +
     test.comment.tolist())
   File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
     self.fixed_vocabulary_)
   File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 762, in _count_vocab
     for feature in analyze(doc):
   File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 241, in <lambda>
     tokenize(preprocess(self.decode(doc))), stop_words)
   File "assignment3.py", line 23, in tokenize
     stems = stem_tokens(tokens, stemmer)
   File "assignment3.py", line 14, in stem_tokens
     stemmed.append(stemmer.stem(item))
   File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/stem/porter.py", line 665, in stem
     stem = self._step1b(stem)
   File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/stem/porter.py", line 376, in _step1b
     lambda stem: (self._measure(stem) == 1 and
   File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
     if suffix == '*d' and self._ends_double_consonant(word):
   File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
     word[-1] == word[-2] and
 IndexError: string index out of range
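
The traceback ends inside NLTK's own porter.py, where _ends_double_consonant reads word[-2], so it looks like some token becomes shorter than two characters at that point. To narrow down which comment triggers it, I imagine a quick check like this could help (a minimal sketch that reuses the tokenize function defined in the code below; the loop variable names are just for illustration):

# Minimal sketch: run tokenize() on each comment individually so the
# failing document can be identified instead of aborting fit_transform.
for i, comment in enumerate(train_data_df.comment.tolist() +
                            test_data_df.comment.tolist()):
    try:
        tokenize(comment)
    except IndexError:
        print(i, repr(comment))  # show the offending comment
        break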

Below is my code, which I learned from https://www.codementor.io/jadianes/data-science-python-r-sentiment-classification-machine-learning-du107otfg. Can anyone help me figure out where I might have gone wrong? Thank you very much!

import numpy as np
import pandas as pd  # needed for read_csv below
import re, nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer

test_data_df = pd.read_csv('test.tsv', delimiter="\t", quoting=3)
train_data_df = pd.read_csv('train.tsv', delimiter="\t", quoting=3)

stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    # Stem each token with the Porter stemmer
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(comment):
    # Keep letters only, then tokenize and stem
    comment = re.sub("[^a-zA-Z]", " ", comment)
    tokens = nltk.word_tokenize(comment)
    stems = stem_tokens(tokens, stemmer)
    return stems

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = 'english',
    max_features = 200
)

corpus_data_features = vectorizer.fit_transform(
    train_data_df.comment.tolist() + test_data_df.comment.tolist())

The last line of code is where the problem comes from. Thanks!
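
Since the failure happens inside porter.py itself rather than in my code, one workaround I am considering (a sketch, not a confirmed fix; upgrading nltk might also resolve it) is a defensive version of stem_tokens that skips any token the stemmer cannot handle:

def stem_tokens(tokens, stemmer):
    # Workaround sketch: keep a token unstemmed if the stemmer raises
    # IndexError on it, instead of letting the error abort fit_transform.
    stemmed = []
    for item in tokens:
        try:
            stemmed.append(stemmer.stem(item))
        except IndexError:
            stemmed.append(item)  # fall back to the raw token
    return stemmed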

0 Answers:

No answers yet.