Question

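Hello everybody, and happy new year.

I found some strange pandas behavior when I tried to encode some labels manually. It would be great if someone could explain why this happens.

So this is my code: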

import numpy as np
import pandas as pd
from seaborn import load_dataset as data

titanic = data('titanic')

# Set a random seed so that you get the same result as I do
np.random.seed(12345)

# Draw 91 observations as test data
df_test = titanic.sample(n=91)

# Take the remaining rows as training data
df_train = titanic.drop(df_test.index)

# Delete the old data to free the memory
del titanic

# Save references to the data in a list so that we can loop over them
frames = [df_train, df_test]

# Loop over the dataframes and the columns and set them as categorical
for df in frames:
    for col in 'sex embarked embark_town alone who'.split():
        # Save the distinct values of the training data
        # HERE IS THE MISTAKE
        uniques = df_train[col].unique()
        # Encode those unique values numerically
        for n, uni in enumerate(uniques):
            df.loc[df[col] == uni, col] = n

So my problem is that the labels of the embarked variable in the test data are not overwritten. I get

print(df_train.embarked.unique())
[0 1 2 nan]

print(df_test.embarked.unique())
['S' 'Q' 'C']

instead of:

print(df_train.embarked.unique())
[0 1 2 nan]

print(df_test.embarked.unique())
[0 1 2 ]

Through some experiments I found out that the line uniques = df_train[col].unique() is the reason. If it is changed to uniques = df[col].unique(), everything works fine.

My question is why this happens, since df_train and df_test are just two parts of the same data and contain the same labels (even if one label might not occur in the test data), so my version should work perfectly fine (but it does not).
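
A likely explanation, judging from the loop order: the outer loop handles df_train first, so by the time df is df_test, df_train[col] has already been overwritten with 0, 1, 2 and uniques no longer contains 'S', 'C', 'Q'. The comparison df[col] == uni then matches nothing in the test data. Below is a minimal sketch of one way around this, taking the labels from the training data once, before anything is overwritten. The toy frames and the helper name encode_with_train_labels are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the titanic split (hypothetical data, not the real dataset)
df_train = pd.DataFrame({"embarked": ["S", "C", "Q", np.nan, "S"]})
df_test = pd.DataFrame({"embarked": ["S", "Q", "C"]})


def encode_with_train_labels(train, test, columns):
    """Encode `columns` in both frames with codes learned once from `train`."""
    # Work on copies so the caller's frames are left untouched
    train, test = train.copy(), test.copy()
    for col in columns:
        # Snapshot the labels BEFORE any value is overwritten
        uniques = train[col].dropna().unique()
        mapping = {label: code for code, label in enumerate(uniques)}
        # The same mapping is applied to both frames; NaN stays NaN
        train[col] = train[col].map(mapping)
        test[col] = test[col].map(mapping)
    return train, test


enc_train, enc_test = encode_with_train_labels(df_train, df_test, ["embarked"])
print(enc_train.embarked.unique())
print(enc_test.embarked.unique())
```

With this approach a label that appears only in the test data is mapped to NaN rather than silently left as a string, which makes unseen labels easy to spot.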

Thanks in advance.

pandas manual label encoding

0 answers: