Question

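For my thesis, I am working on a machine learning project in Python that involves extracting features from text. As a first step, I am trying to implement bi-grams with scikit-learn.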

Now, when I run my data through CountVectorizer, I get an array that contains only 1s, with an occasional larger count. For example:

`[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]`
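For context, a row of mostly 1s is what one would expect when a vectorizer is fitted on a single short text: most bi-grams occur only once in it. The sketch below uses only the standard library and a hypothetical helper, `bigram_counts`, to approximate the counting; it is not scikit-learn's actual tokenizer, only an illustration under that assumption.

```python
from collections import Counter
import re


def bigram_counts(text):
    """Count word bi-grams in one text (hypothetical helper, rough approximation)."""
    # Lowercase and keep tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Pair each token with its successor and count the resulting bi-grams
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))


counts = bigram_counts("the old man and the sea and the old man")
for bigram, n in sorted(counts.items()):
    print(bigram, n)
```

Only bi-grams that repeat within the same text get a count above 1, which matches the occasional 2 or 3 in the output above.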

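I want to use these bi-grams to predict my target variable, which is categorical. When I run my code now, Python reports that my two arrays do not have the same shape.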

`[[1 3 2 ..., 1 1 1]] [ 0.  0.  1.  0.  0.]`

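Can anyone tell me what I am doing wrong? I am using the code below for the bi-grams. The first part runs inside a loop over every text (movie plot) in the dataset.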

        plottext = [ row[8] ]
        wordvec = CountVectorizer(ngram_range=(2,2), analyzer='word')
        plotvec = wordvec.fit_transform(plottext).toarray()
        matrix_terms = np.array(wordvec.get_feature_names())
        matrix_freq = np.asarray(plotvec.sum(axis=0)).ravel()
        final_matrix = np.array([matrix_terms,matrix_freq])
        target = { 'Age': row[4] }
        data.append((final_matrix, target))
    # Convert categorical target variable to Y
    (X, Ycat) = zip(*data)
    vec = DictVectorizer(sparse=False)
    Y = vec.fit_transform(Ycat)
    # Extract textual features from plot
    return (X, Y)

I get the error message:

ValueError: could not broadcast input array from shape (2,830) into shape (2)
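A likely cause, offered as an assumption rather than a confirmed diagnosis: a new CountVectorizer is fitted for every plot, so each text gets its own vocabulary and its own `(2, n)` array, and arrays with different `n` cannot be stacked into one matrix. A minimal sketch of one possible alternative is to fit a single vectorizer on all plots at once, so every row shares the same columns. The `rows` list here is hypothetical stand-in data; only the indices 4 (age) and 8 (plot) come from the code above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in rows: only index 4 (age) and index 8 (plot) are used
rows = [
    [None, None, None, None, "PG", None, None, None, "a boy finds a lost dog"],
    [None, None, None, None, "R", None, None, None, "a detective hunts a killer"],
    [None, None, None, None, "PG", None, None, None, "a lost dog finds a boy"],
]

# Collect all texts and targets first instead of vectorizing inside the loop
plots = [row[8] for row in rows]
targets = [{"Age": row[4]} for row in rows]

# One vectorizer for the whole corpus: one shared bi-gram vocabulary
wordvec = CountVectorizer(ngram_range=(2, 2), analyzer="word")
X = wordvec.fit_transform(plots)  # sparse matrix, one row per plot

# One-hot encode the categorical target
vec = DictVectorizer(sparse=False)
Y = vec.fit_transform(targets)

# Both now have the same number of rows
print(X.shape, Y.shape)
```

With this layout `X` has one row per plot and one column per bi-gram seen anywhere in the corpus, so its first dimension matches that of `Y`. Most classifiers expect a 1-D vector of class labels rather than a one-hot matrix, so passing the raw age values as the target may be simpler, depending on the model.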

N-grams to array
