Question

As part of an NLP pipeline, I transformed a text corpus with TF-IDF, which produced a scipy.sparse.csr.csr_matrix.

I then split the data into training and test corpora, and resampled the training corpus to deal with class imbalance.

The problem I am facing is that when I resample the sparse matrix using the resampled index (taken from the labels, which are a pandas.Series), like this:

tfs[Ytr_resample.index]

it takes a long time and then raises this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-dd1413907d77> in <module>()
----> 1 tfs[Ytr_cat_resample.index]

/usr/local/lib/python3.5/dist-packages/scipy/sparse/csr.py in __getitem__(self, key)
    348         csr_sample_values(self.shape[0], self.shape[1],
    349                           self.indptr, self.indices, self.data,
--> 350                           num_samples, row.ravel(), col.ravel(), val)
    351         if row.ndim == 1:
    352             # row and col are 1d

ValueError: could not convert integer scalar

Following this thread, I checked that the largest element in the index is not greater than the number of rows in the sparse matrix.
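A minimal sketch of that bounds check on toy data. The matrix and Series below are hypothetical stand-ins that reuse the names tfs and Ytr_resample from this question; they are not the real data.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins for the real TF-IDF matrix and resampled labels.
tfs = sparse.csr_matrix(np.eye(6))
Ytr_resample = pd.Series(["x", "y", "x"], index=[4, 0, 4])

idx = Ytr_resample.index

# Every label used as a row position must lie in [0, number of rows).
in_bounds = (idx.min() >= 0) and (idx.max() < tfs.shape[0])

print("in bounds:", in_bounds)
print("index dtype:", idx.dtype)
```

Note that this only verifies the range of the values; it says nothing about their dtype, which is what the rest of the question is about.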

The problem seems to come from the fact that the indices are encoded as np.int64 rather than np.int32. Indeed, the following works:

Xtr_resample = tfs[[np.int32(ind) for ind in Ytr_resample.index]]

So I have two questions:

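Is the error actually caused by the int32/int64 conversion?
Is there a more pythonic way to convert the index type? (Ytr_resample.index.astype(np.int32) does not seem to change the type, for some reason.)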
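For the second question, one possible approach (a sketch, not a confirmed fix for the error above) is to convert the index to a plain NumPy array and cast that array, instead of casting the pandas index object itself. The toy matrix and labels below are hypothetical; only the names tfs, Ytr_resample and Xtr_resample come from this question.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins: a 10x5 CSR matrix and a resampled label Series
# whose index holds (repeated) row positions of that matrix.
dense = np.arange(50).reshape(10, 5)
tfs = sparse.csr_matrix(dense)
Ytr_resample = pd.Series(list("aabba"), index=[3, 7, 3, 1, 9])

# Build an int32 NumPy array from the index; the result is an ndarray,
# so its dtype is not tied to how pandas stores the index.
row_ids = np.asarray(Ytr_resample.index, dtype=np.int32)

# Fancy-index the sparse matrix with the array in a single call.
Xtr_resample = tfs[row_ids]

print(row_ids.dtype, Xtr_resample.shape)
```

This avoids the Python-level list comprehension while producing the same kind of int32 positions as the workaround shown earlier.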

Edit:

Ytr_resample.index is of type pandas.core.indexes.numeric.Int64Index:

Int64Index([1484,  753, 1587, 1494,  357, 1484,   84,  823,  424,  424,
        ...
        1558, 1619, 1317, 1635,  537, 1206, 1152, 1635, 1206,  131],
       dtype='int64', length=4840)

Ytr_resample is created by resampling Ytr (which is a pandas.Series) so that every category present in Ytr has the same number of elements (by oversampling):

n_samples = Ytr.value_counts(dropna=False).max()
Ytr_resample = pd.DataFrame(Ytr).groupby('cat').apply(
    lambda x: x.sample(n_samples, replace=True, random_state=42)).cat
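The oversampling step can also be illustrated on a self-contained toy Series. This sketch swaps in GroupBy.sample rather than the groupby/apply call above, and assumes pandas 1.1 or later, which may not match the environment in this question; the label values are made up.

```python
import pandas as pd

# Hypothetical labels: class "a" has 3 rows, "b" has 1, "c" has 2.
Ytr = pd.Series(["a", "a", "a", "b", "c", "c"], name="cat")

# Size of the largest class; every class is oversampled up to this size.
n_samples = Ytr.value_counts(dropna=False).max()

# Sample n_samples rows per class with replacement. The result keeps the
# original row labels as its index, so minority-class labels repeat.
Ytr_resample = Ytr.groupby(Ytr).sample(
    n=n_samples, replace=True, random_state=42
)

print(Ytr_resample.value_counts())
print(Ytr_resample.index.dtype)
```

Because the sampled Series keeps the original labels, its index can contain duplicates, which is the same shape of index that is later used to select rows from the sparse matrix.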

ValueError: could not convert integer scalar in scipy indexing
