Question

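I have two lists, Y_train and Y_test. At the moment they hold categorical data: each element is either Blue or Green. They will be the targets for a random forest classifier, so I need to encode them as 1.0 and 0.0.

Here is print(Y_train), to show you what the dataframe looks like. The numbers down the left-hand side are out of order because the data has been shuffled. (Y_test is the same, only smaller):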

183      Blue
126      Blue
1        Blue
409      Blue
575    Green
         ...   
396      Blue
192      Blue
578    Green
838    Green
222      Blue
Name: Colour, Length: 896, dtype: object

To encode this, I am simply going to iterate through them and change each element to its encoded value:

for i in range(len(Y_train)):
        if Y_train[i] == 'Blue':
            Y_train[i] = 0.0
        else:
            Y_train[i] = 1.0

However, when I do this, I get the following:

Traceback (most recent call last):
  File "G:\Work\Colours.py", line 90, in <module>
    Main()
  File "G:\Work\Colours.py", line 34, in Main
    RandForest(X_train, Y_train, X_test, Y_test)
  File "G:\Work\Colours.py.py", line 77, in RandForest
    if Y_train[i] == 'Blue':
  File "C:\Users\Me\AppData\Roaming\Python\Python37\site-packages\pandas\core\series.py", line 1068, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Users\Me\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexes\base.py", line 4730, in get_value
    return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 6

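The odd thing is that it raises this error at different points. I have used flags and print statements to see how far it gets. Sometimes it gets many iterations into the loop, and other times it only manages one or two iterations before breaking.

I assume I just don't quite understand how you are supposed to iterate through a dataframe properly. If someone with more experience in this could help me out, that would be great.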

Answer 1

Try:

Y_train[Y_train == 'Blue'] = 0.0
Y_train[Y_train == 'Green'] = 1.0

That should solve your problem.
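
For context on why the original loop fails: on a Series with an integer index, Y_train[i] looks up the label i, not the i-th position. After shuffling and splitting, some labels (such as 6 in the traceback) are likely no longer in the training set, which would also explain why the failure point changes from run to run. The sketch below illustrates this with a small made-up Series (the data and variable names are assumptions, not the asker's actual data) and shows positional access with .iloc plus a loop-free encoding with map:

```python
import pandas as pd

# Hypothetical stand-in for Y_train: shuffled rows that kept their original labels
y = pd.Series(["Blue", "Blue", "Blue", "Green"],
              index=[183, 126, 1, 575], name="Colour")

# y[6] is a label lookup, and no row carries the label 6
try:
    y[6]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# .iloc indexes by position, so it works whatever the labels are
first_by_position = y.iloc[0]

# Loop-free encoding with an explicit mapping; the result has dtype float64
encoded = y.map({"Blue": 0.0, "Green": 1.0})
```

Unlike assigning floats into the existing object column, map returns a new numeric Series and leaves the original untouched.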

Answer 2

If you have more labels than in the current example (Blue and Green), sklearn provides a label encoder that handles this easily:

from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()

# Transforms the 'column' in your dataframe df
df['column'] = label_encoder.fit_transform(df['column'])
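
As a rough illustration of what the encoder produces, here is a minimal sketch on made-up data (the sample values and variable names are assumptions). LabelEncoder assigns integer codes according to the sorted order of the labels, and the fitted encoder can map codes back to the original labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data with three classes
y = pd.Series(["Blue", "Green", "Blue", "Red"], index=[183, 575, 1, 838])

le = LabelEncoder()
codes = le.fit_transform(y)  # NumPy integer array, one code per row

# classes_ holds the labels in sorted order; a label's position is its code
classes = list(le.classes_)

# inverse_transform maps the integer codes back to the original labels
restored = list(le.inverse_transform(codes))
```

Keeping the fitted encoder around means the same mapping can be reused on Y_test with le.transform, rather than fitting a second encoder that might assign different codes.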

Answer 3

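If you use your own method for label encoding, it is better to create a separate encoded column rather than modify the original column. Afterwards you can assign the encoded column to the dataframe. For your case, for example: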

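import numpy as np

# Start with every row encoded as 1 (Green), then set the Blue rows to 0
encoded = np.ones((Y_train.shape[0], 1))
for i in range(Y_train.shape[0]):
    # .iloc indexes by position; Y_train[i] would look up the label i and can raise KeyError
    if Y_train.iloc[i] == 'Blue':
        encoded[i] = 0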
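Note that this only works when you have two categories.

For multiple categories, you can use the sklearn or pandas methods.

For multiple categories

Another approach is to use pandas cat.codes. You can convert the pandas Series to a categorical type and get the category codes.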

import pandas as pd
Y_train = pd.Series(Y_train)  # ensure Y_train is a Series
encoded = Y_train.astype("category").cat.codes  # one integer code per category
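
To see which code belongs to which category, the categories can be read back from the categorical Series. The following is a small sketch on made-up data (the sample values and variable names are assumptions); the codes follow the order of cat.categories, which for plain strings is sorted order:

```python
import pandas as pd

# Hypothetical sample data with three classes and shuffled labels
y = pd.Series(["Blue", "Green", "Blue", "Red"], index=[183, 575, 1, 838])

cat = y.astype("category")
codes = cat.cat.codes  # integer Series that keeps the index of y

# Code -> category lookup built from the category order
mapping = dict(enumerate(cat.cat.categories))

# Decode the integer codes back to the original labels
decoded = codes.map(mapping)
```

One thing to keep in mind is that missing values receive the code -1, so they may need handling before the codes are passed to a classifier.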

You can also use sklearn's LabelEncoder to encode categorical data.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded = le.fit_transform(Y_train)

Getting a "KeyError" when iterating over a pandas dataframe?
