Question

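I have a basic understanding of SettingWithCopyWarning, but I cannot figure out why I get the warning in this particular case.

I am following the code in https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb

When I run the code below (using .loc), I do not get a SettingWithCopyWarning.

However, if I run the code with .iloc instead, I do get the warning.

Can someone help me understand why?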

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Answer 1

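The issue here is not caused by indexing; iloc and loc work the same way for you here. The issue is in set_.drop("income_cat", axis=1, inplace=True). It looks like there is a weak reference between the set_ data frame and strat_train_set and strat_test_set.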

for set_ in (strat_train_set, strat_test_set):
    print(set_._is_copy)

With this, you get:

<weakref at 0x128b30598; to 'DataFrame' at 0x128b355c0>
<weakref at 0x128b30598; to 'DataFrame' at 0x128b355c0>

This can lead to a SettingWithCopyWarning, because the code is trying to transform a copy of the data frame and apply those changes to the original as well.

Answer 2

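I did a bit of exploring and, as far as I understand, this is what lies behind SettingWithCopyWarning: every time a data frame df is created from another frame df_orig, pandas applies some heuristics to determine whether the data may have been implicitly copied from df_orig, which a less experienced user may not be aware of. If so, the _is_copy field of df is set to a weak reference to df_orig. Later, when an in-place update of df is attempted, pandas decides whether to show SettingWithCopyWarning based on df._is_copy as well as some other fields of df (note that df._is_copy is not the only criterion). However, because some methods are shared between different scenarios, the heuristics are not perfect and some cases may be handled incorrectly.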

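In the code from the post, housing.loc[train_index] and housing.iloc[train_index] both return an implicit copy of the housing data frame.

for df in (housing.loc[train_index], housing.iloc[train_index]):
    print(df._is_view, df._is_copy)

The check above produces the following result:

False None
False <weakref at 0x0000019BFDF37958; to 'DataFrame' at 0x0000019BFDF26550>

Here, _is_view is another field, which shows whether an update on df would affect the original data frame housing. The result shows that the underlying data has been copied in both cases. However, the _is_copy field is not set for housing.loc[train_index], although I think it should be set in this case, so no SettingWithCopyWarning is raised when the statement df.drop("income_cat", axis=1, inplace=True) is executed on it.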
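The same comparison can be wrapped in a small helper so both indexers are inspected side by side. This is only a minimal sketch under stated assumptions: the toy housing frame and the train_index array are made up for illustration, copy_state is a hypothetical helper name, and _is_view and _is_copy are private pandas attributes whose presence and values depend on the pandas version (hence the getattr fallbacks), so no particular output is promised.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration).
housing = pd.DataFrame({
    "median_income": [1.5, 3.2, 4.8, 6.1, 2.7],
    "income_cat": [1, 2, 3, 4, 2],
})
# Hypothetical positions, standing in for the indices yielded by split.split(...).
train_index = np.array([0, 2, 3])


def copy_state(df):
    """Report pandas' private copy-tracking fields for df (version dependent)."""
    return {
        # None if this pandas version does not expose the attribute.
        "is_view": getattr(df, "_is_view", None),
        # True when pandas stored a weak reference to a parent frame.
        "has_parent_ref": getattr(df, "_is_copy", None) is not None,
    }


states = {
    "loc": copy_state(housing.loc[train_index]),
    "iloc": copy_state(housing.iloc[train_index]),
}
for name, state in states.items():
    print(name, state)
```

On a pandas version that behaves as described in this answer, has_parent_ref would differ between the two entries; on other versions both may be False.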

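To avoid SettingWithCopyWarning, you need to either (1) perform the in-place update before slicing; or (2) build the update logic into the slicing if possible; or (3) make an "explicit" copy of the data when an in-place update is needed after slicing. In your example, approach (1) looks like this:

# Updates the housing data frame in-place before slicing
income_cat = housing["income_cat"]
housing.drop("income_cat", axis=1, inplace=True)
for train_index, test_index in split.split(housing, income_cat):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Approach (2) moves the removal of income_cat into the slicing expression itself, so nothing has to be updated afterwards. Approach (3) takes the slices first and then makes an "explicit" copy of each one before updating it in place.

Besides changing the inplace setting of the update method, df.copy() is another way to make an "explicit" copy. If you want to change one or more columns of df, use df.copy() to create the copy first.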
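Approaches (2) and (3) could look roughly like the sketch below. It is not the answer's original code. The toy housing frame and the train_index / test_index arrays are hypothetical stand-ins for the real data and for the indices produced by split.split(...), and the boolean column mask is just one possible way to fold the update into the slice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration).
housing = pd.DataFrame({
    "median_income": [1.5, 3.2, 4.8, 6.1, 2.7, 5.0],
    "income_cat": [1, 2, 3, 4, 2, 3],
})
# Hypothetical indices, standing in for the output of split.split(...).
train_index = np.array([0, 2, 3, 5])
test_index = np.array([1, 4])

# Approach (2): fold the column removal into the slice itself.
keep = housing.columns != "income_cat"  # boolean mask over the columns
train_2 = housing.loc[train_index, keep]
test_2 = housing.loc[test_index, keep]

# Approach (3): slice first, copy explicitly, then update the copies in place.
train_3 = housing.loc[train_index].copy()
test_3 = housing.loc[test_index].copy()
for set_ in (train_3, test_3):
    set_.drop("income_cat", axis=1, inplace=True)

print(train_2.columns.tolist(), train_3.columns.tolist())
```

In both variants the original housing frame keeps its income_cat column; only the derived train and test frames lose it.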

SettingWithCopyWarning: iloc vs. loc, cannot figure out the reason
