KeyError (in pandas.index.IndexEngine.get_loc) when oversampling a dataset with the UnbalancedDataset package

Time: 2016-05-22 23:55:02

Tags: python pandas scikit-learn

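I am trying to use UnbalancedDataset to oversample my data. Following the sklearn convention, I have X and y as the feature matrix and the target vector. Both are of type pandas.core.frame.DataFrame, with shapes (200000,17) and (200000), respectively.

I first split the data with sklearn's train_test_split. I then applied the SMOTE method to oversample the training set, which produced the following error: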

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\Users\...\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_loc(self, key, method, tolerance)
   1944             try:
-> 1945                 return self._engine.get_loc(key)
   1946             except KeyError:

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)()

KeyError: 1143

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-99-1c5830417b3f> in <module>()
      6 # 'SMOTE'
      7 SM = SMOTE(ratio=ratio, verbose=verbose, kind='regular')
----> 8 smx, smy = SM.fit_transform(Xtrain, ytrain)

C:\Users\...\Anaconda3\lib\site-packages\unbalanceddataset-0.1-py3.5.egg\unbalanced_dataset\unbalanced_dataset.py in fit_transform(self, x, y)
    274             return self.out_x, self.out_y, self.out_idx
    275         else:
--> 276             self.out_x, self.out_y = self.resample()
    277 
    278             return self.out_x, self.out_y

C:\Users\...\Anaconda3\lib\site-packages\unbalanceddataset-0.1-py3.5.egg\unbalanced_dataset\over_sampling.py in resample(self)
    358                                        step_size=1.0,
    359                                        random_state=self.rs,
--> 360                                        verbose=self.verbose)
    361 
    362             if self.verbose:

C:\Users\...\Anaconda3\lib\site-packages\unbalanceddataset-0.1-py3.5.egg\unbalanced_dataset\unbalanced_dataset.py in make_samples(x, nn_data, y_type, nn_num, n_samples, step_size, random_state, verbose)
    388 
    389             # Construct synthetic sample
--> 390             new[i] = x[row] - step * (x[row] - nn_data[nn_num[row, col]])
    391 
    392         # The returned target vector is simply a repetition of the

C:\Users\...\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   1995             return self._getitem_multilevel(key)
   1996         else:
-> 1997             return self._getitem_column(key)
   1998 
   1999     def _getitem_column(self, key):

C:\Users\...\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   2002         # get column
   2003         if self.columns.is_unique:
-> 2004             return self._get_item_cache(key)
   2005 
   2006         # duplicate columns & possible reduce dimensionality

C:\Users\...\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   1348         res = cache.get(item)
   1349         if res is None:
-> 1350             values = self._data.get(item)
   1351             res = self._box_item_values(item, values)
   1352             cache[item] = res

C:\Users\...\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   3288 
   3289             if not isnull(item):
-> 3290                 loc = self.items.get_loc(item)
   3291             else:
   3292                 indexer = np.arange(len(self.items))[isnull(self.items)]

C:\Users\...\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_loc(self, key, method, tolerance)
   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
-> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)()

KeyError: 1143

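I get this error even though all of UnbalancedDataset's under-sampling methods work fine on the same data. Any suggestions for dealing with the over-sampling problem?

Update

As glemaitre mentioned, the fix is to convert the pandas DataFrame to a NumPy array. The following conversion therefore solves the problem: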

Xc = Xtrain.as_matrix()
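
A slightly fuller sketch of that conversion, assuming small stand-in frames in place of the real Xtrain and ytrain (the column names and values below are made up for illustration). It uses the .values attribute, which returns the same kind of NumPy array as as_matrix(); as_matrix() was later deprecated and then removed in pandas 1.0, so .values (or to_numpy() on pandas 0.24+) is the safer spelling on newer installs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for Xtrain / ytrain from the question.
# The index is deliberately non-contiguous, as it would be after train_test_split.
Xtrain = pd.DataFrame(
    {"f0": [0.1, 0.2, 0.3], "f1": [1.0, 2.0, 3.0]},
    index=[1143, 7, 42],
)
ytrain = pd.Series([0, 1, 0], index=[1143, 7, 42])

# Convert both inputs to plain NumPy arrays before resampling.
Xc = Xtrain.values  # 2-D ndarray, shape (n_samples, n_features)
yc = ytrain.values  # 1-D ndarray, shape (n_samples,)

# The arrays would then be passed in place of the DataFrames, e.g.
# smx, smy = SM.fit_transform(Xc, yc)
```

Converting the target as well as the features is an assumption on my part; the update above only shows the feature matrix being converted.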

1 answer:

Answer 0 (score: 1)

UnbalancedDataset expects NumPy arrays. Try passing those to the function and see if it works.

Cheers
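
Judging from the traceback, the failure comes from the expression x[row] inside make_samples: on a DataFrame, x[row] looks up a column labelled row rather than the row at that position, so an integer such as 1143 raises KeyError. Here is a minimal sketch of the difference, using a toy frame whose names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy feature matrix with named columns (hypothetical data).
x_df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["a", "b", "c"])
row = 2

# DataFrame[...] with a scalar key is a column lookup, so an integer
# that is not a column label raises KeyError.
try:
    x_df[row]
    df_lookup_failed = False
except KeyError:
    df_lookup_failed = True

# The same expression on a NumPy array selects the row at that position.
x_np = x_df.values
picked = x_np[row]

print(df_lookup_failed)  # True
print(picked)            # [6. 7. 8.]
```

This is only an illustration of the indexing behaviour, not the library's actual code path.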