Using CountVectorizer in a Python Mapper/Reducer

Asked: 2014-04-01 22:33:14

Tags: python scikit-learn tokenize mapper reducers

I am trying to apply a tokenizer using Python mapper/reducer functions. I have the following code, but I keep getting an error. The reducer yields its values as a list, and I pass those values to the vectorizer.

from mrjob.job import MRJob
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

class bagOfWords(MRJob):

    def mapper(self, _, line):
        cat, phrase, phraseid, sentiment = line.split(',')
        yield (cat, phraseid, sentiment), phrase

    def reducer(self, keys, values):
        yield keys, list(values)

    # Output: ["Train", "--", "2"] ["A series of escapades demonstrating the adage that what is good for the goose", "A series", "A", "series"]

    def mapper(self, keys, values):
        vectorizer = CountVectorizer(min_df=0)
        vectorizer.fit(values)
        x = vectorizer.transform(values)
        x = x.toarray()
        yield keys, x


if __name__ == '__main__':
    bagOfWords.run()

ValueError: empty vocabulary; perhaps the documents only contain stop words

Thanks for any help.

1 answer:

Answer 0 (score: 0)

CountVectorizer is stateful: you need to fit one and the same instance on the full dataset to build the vocabulary, so it is not well suited to parallel processing.
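A minimal sketch of why this matters (the shard strings here are invented for illustration): a CountVectorizer fitted on one shard of the data only learns that shard's vocabulary, so tokens that appear only in other shards are silently dropped at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical data shards processed by different workers.
shard_a = ["good for the goose"]
shard_b = ["good for the gander"]

vec = CountVectorizer()
vec.fit(shard_a)                 # vocabulary built from shard_a only
print(sorted(vec.vocabulary_))   # ['for', 'good', 'goose', 'the']

x = vec.transform(shard_b)       # "gander" is not in the vocabulary: dropped
print(x.sum())                   # 3 -- only 'for', 'good', 'the' are counted
```

This is why each worker fitting its own CountVectorizer produces inconsistent (and here, for some keys, empty) vocabularies.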

Instead, you can use the stateless HashingVectorizer: there is no fitting step, so you can call transform directly.
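A short sketch of that stateless pattern (the feature count and input strings are arbitrary choices): HashingVectorizer maps each token to a column index with a hash function, so every worker produces consistent features without sharing a fitted vocabulary.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# No fit step: the hash function replaces the learned vocabulary,
# so any worker can transform its shard independently.
vec = HashingVectorizer(n_features=2**10)
x = vec.transform(["A series of escapades",
                   "what is good for the goose"])
print(x.shape)  # (2, 1024)
```

The trade-off is that hashing is one-way: you cannot map column indices back to tokens, and distinct tokens may collide, which is usually acceptable for bag-of-words classification.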