Question

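I have 3 columns in my dataset:

Review: the product review

Type: the category or product type

Cost: the product cost

This is a multi-class problem, with Type as the target variable. There are 64 different product types in this dataset.

Review and Cost are my two features.

I dropped the Type variable and split the data into 4 sets: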

X = data.drop('type', axis = 1)
y = data.type
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

For the review, I am using the following to vectorize:

vect = CountVectorizer(stop_words = stop)
X_train_dtm = vect.fit_transform(X_train.review)

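And here is where I am stuck!

To run the model I need both of my features in the training set, but since X_train_dtm is a sparse matrix, I am not sure how to concatenate my pandas Series cost feature to that sparse matrix. Since the cost data is already numeric, I don't think I need to transform it, which is why I have not used something like "FeatureUnion", which combines 2 transformed features.

Any help would be appreciated!!

Sample data: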

| Review           | Cost | Type        |
|:-----------------|-----:|:-----------:|
| This is a review |  200 | Toy         |
| This is a review |  100 | Toy         |
| This is a review |  800 | Electronics |
| This is a review |   35 | Home        |

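Update

After applying tarashypka's solution, I was able to add the second feature to X_train_dtm. However, I get an error when I try to run the same steps on the test set:

from scipy.sparse import hstack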

vect = CountVectorizer(stop_words = stop)
X_train_dtm = vect.fit_transform(X_train.review)
prices = X_train.prices.values[:,None]
X_train_dtm = hstack((X_train_dtm, prices))

#Works perfectly for the training set above
#But when I run with test set I get the following error
X_test_dtm = vect.transform(X_test)
prices_test = X_test.prices.values[:,None]
X_test_dtm = hstack((X_test_dtm, prices_test))

Traceback (most recent call last):

  File "<ipython-input-10-b2861d63b847>", line 8, in <module>
    X_test_dtm = hstack((X_test_dtm, points_test))

  File "C:\Users\k\Anaconda3\lib\site-packages\scipy\sparse\construct.py", line 464, in hstack
    return bmat([blocks], format=format, dtype=dtype)

  File "C:\Users\k\Anaconda3\lib\site-packages\scipy\sparse\construct.py", line 581, in bmat
    'row dimensions' % i)

ValueError: blocks[0,:] has incompatible row dimensions

Answer 1

The result of CountVectorizer, X_train_dtm, is of type scipy.sparse.csr_matrix. If you don't want to convert it to a numpy array, then scipy.sparse.hstack is the way to add another column:

from scipy.sparse import hstack
prices = X_train['Cost'].values[:, None]
X_train_dtm = hstack((X_train_dtm, prices))
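As a minimal sketch of the same idea applied to both splits: the toy DataFrame below is hypothetical, and the column names `review` and `prices` are taken from the update in the question. It also illustrates one likely cause of the row-dimension error there: calling `vect.transform(X_test)` on the whole DataFrame iterates over the column labels rather than the reviews, so the result has one row per column instead of one row per sample.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data; only the column names come from the question.
data = pd.DataFrame({
    "review": [
        "great toy for kids", "fun toy robot", "loud speaker with deep bass",
        "soft pillow for the home", "cheap toy car", "bright lamp for the home",
        "fast phone charger", "wooden toy blocks", "wireless headphones",
        "cotton towel set", "plush toy bear", "smart watch with gps",
    ],
    "prices": [200, 100, 800, 35, 20, 60, 45, 30, 150, 25, 15, 300],
    "type": ["Toy", "Toy", "Electronics", "Home", "Toy", "Home",
             "Electronics", "Toy", "Electronics", "Home", "Toy", "Electronics"],
})

X = data.drop("type", axis=1)
y = data["type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit the vocabulary on the training reviews only.
vect = CountVectorizer(stop_words="english")
X_train_dtm = vect.fit_transform(X_train["review"])
X_train_full = hstack((X_train_dtm, X_train["prices"].values[:, None])).tocsr()

# Transform the test reviews: pass the text column, not the whole DataFrame.
X_test_dtm = vect.transform(X_test["review"])
X_test_full = hstack((X_test_dtm, X_test["prices"].values[:, None])).tocsr()

# For comparison: iterating over a DataFrame yields its column labels,
# so this matrix has one row per column and cannot be stacked with the prices.
wrong = vect.transform(X_test)

print(X_train_full.shape, X_test_full.shape, wrong.shape)
```

Both stacked matrices end up with the same number of columns (the vocabulary size plus one for the price), which is what a classifier fitted on the training matrix expects at prediction time.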


Answer 2

Use FeatureUnion, which hides some of this for you. The example on heterogeneous data is very similar to your problem.
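A rough sketch of what that could look like, assuming the column names `review` and `prices` from the question's update. The toy data, the helper functions `select_review` and `select_prices`, and the choice of `LogisticRegression` are all illustrative assumptions, not part of the original answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def select_review(df):
    # Hypothetical helper: hand the text column to the vectorizer.
    return df["review"]


def select_prices(df):
    # Hypothetical helper: return the numeric column as an (n_samples, 1) array.
    return df[["prices"]].to_numpy()


features = FeatureUnion([
    ("text", Pipeline([
        ("select", FunctionTransformer(select_review)),
        ("vect", CountVectorizer(stop_words="english")),
    ])),
    ("cost", FunctionTransformer(select_prices)),
])

# The classifier is an arbitrary choice for illustration.
model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical toy data.
train = pd.DataFrame({
    "review": ["great toy for kids", "fun toy robot", "loud speaker with bass",
               "soft pillow for the home", "fast phone charger", "bright lamp"],
    "prices": [200, 100, 800, 35, 45, 60],
    "type": ["Toy", "Toy", "Electronics", "Home", "Electronics", "Home"],
})
test = pd.DataFrame({
    "review": ["wooden toy blocks", "wireless speaker"],
    "prices": [30, 400],
})

model.fit(train[["review", "prices"]], train["type"])
pred = model.predict(test)
print(pred)
```

Because the vectorizer and the column stacking live inside one pipeline, the same transformation is applied to the training and test data automatically, which avoids the kind of mismatch shown in the question's update.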

How do I add a second feature to a CountVectorizer feature using sklearn?
