sklearn CountVectorizer returns all zeros - string conversion problem?

Asked: 2017-07-28 05:06:28

Tags: python python-2.7 pandas scikit-learn countvectorizer

I am trying to use sklearn's CountVectorizer with a given vocabulary. My vocabulary is:

['humanitarian crisis', 'vacations for the anti-cruise crowd', 'school textbook', "b'cruise vacations for the anti-cruise", 'budget deal', "b'public school", 'u.n. announces', 'wrong petrol', 'vacations for the anti-cruise', "b'cruise vacations for the anti-cruise crowd"]

The input to be vectorized comes from a pandas DataFrame, which I read from a CSV with pd.read_csv and encoding='utf8':

29371            b'9 quirky and brilliant paris boutiques'
20525    b'public school textbook filled with muslim bi...
2871     b'congress focuses on averting shutdown, but t...
29902    b'yarmouk siege: u.n. announces trip to syria ...
45596    b'fracking protesters arrested for gluing them...
6266         b'cruise vacations for the anti-cruise crowd'

After calling CountVectorizer(vocabulary=vocabulary).fit_transform(), I get a matrix of all zeros:

(<6x10 sparse matrix of type '<type 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>, <class 'scipy.sparse.csr.csr_matrix'>)

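Is this a problem with the string type, or with how I am calling CountVectorizer? I am not sure how to convert the string type; I have tried many different calls to encode and decode in Python 2.7 and pandas. Any suggestions would be appreciated.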

1 Answer:

Answer 0: (score: 1)

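When calling CountVectorizer, pass ngram_range=(min_word_count, max_word_count), where the two values are the smallest and largest number of words in any vocabulary entry. The default is ngram_range=(1, 1), which only counts single words, so multi-word vocabulary entries are never matched.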
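
To illustrate, here is a minimal sketch of the difference. It assumes a recent scikit-learn on Python 3, a made-up three-entry vocabulary rather than the full list from the question, and headlines that have already had the b'...' wrapper removed; the third headline is a shortened stand-in, not the real row.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vocabulary of two-word phrases (not the asker's full list).
vocabulary = ["school textbook", "cruise vacations", "paris boutiques"]

# Headlines without the b'...' wrapper; the last one is a shortened stand-in.
docs = [
    "9 quirky and brilliant paris boutiques",
    "cruise vacations for the anti-cruise crowd",
    "public school textbook",
]

# Default ngram_range=(1, 1): only single words are generated,
# so none of the two-word vocabulary entries can match.
default_counts = CountVectorizer(vocabulary=vocabulary).fit_transform(docs)
print(default_counts.nnz)  # number of non-zero entries

# ngram_range=(1, 2): one- and two-word n-grams are generated,
# so the two-word vocabulary entries can be counted.
ngram_counts = CountVectorizer(
    vocabulary=vocabulary, ngram_range=(1, 2)
).fit_transform(docs)
print(ngram_counts.toarray())
```

One caveat: with the default token_pattern, tokens are runs of two or more word characters and punctuation acts as a separator. Vocabulary entries such as 'u.n. announces' or anything starting with b' will therefore still not match, even with a wider ngram_range, unless the vocabulary and the text are normalized the same way.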