Pandas manual label encoding

Time: 2018-01-01 00:41:07

Tags: python pandas

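Hello everybody, and happy new year.

While trying to encode some labels manually, I ran into some strange pandas behaviour. It would be great if someone could explain why this happens.

Here is my code: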

import numpy as np
import pandas as pd
from seaborn import load_dataset as data

titanic = data('titanic')

# Set a random seed so that you get the same sample as I do
np.random.seed(12345)

# Draw 91 observations as test data
df_test = titanic.sample(n=91)

# Take the remaining rows as training data
df_train = titanic.drop(df_test.index)

# Delete the old data to free the memory
del titanic

# Save references to the data in a list so that we can loop over them
frames = [df_train, df_test]

# Loop over the columns and the dataframes and set them as categorical
for df in frames:
    for col in 'sex embarked embark_town alone who'.split():
        # Save the distinct values of the training data
        # HERE IS THE MISTAKE
        uniques = df_train[col].unique()
        # Encode those unique values numerically
        for n, uni in enumerate(uniques):
            df.loc[df[col] == uni, col] = n

So my problem is that the labels of the embarked variable in the test data are not overwritten. I get

print(df_train.embarked.unique())
[0 1 2 nan]

print(df_test.embarked.unique())
['S' 'Q' 'C']

instead of:

print(df_train.embarked.unique())
[0 1 2 nan]

print(df_test.embarked.unique())
[0 1 2]

With a bit of experimenting, I found that the line uniques = df_train[col].unique() is the cause. If it is changed to uniques = df[col].unique(), everything works fine.
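A minimal sketch of what may be going on, using a small hypothetical pair of frames in place of the titanic split (the toy values and the seen list are assumptions for illustration, not part of the original code). It records what uniques holds on each pass of the outer loop. Once df_train has been encoded in the first pass, df_train[col].unique() returns the integer codes rather than the original strings, so none of the comparisons against the string labels in df_test match.

```python
import pandas as pd

# Hypothetical toy data standing in for the titanic train/test split.
# dtype=object is set explicitly so integers can be written into the column.
df_train = pd.DataFrame({"embarked": pd.Series(["S", "C", "Q", "S"], dtype=object)})
df_test = pd.DataFrame({"embarked": pd.Series(["Q", "S", "C"], dtype=object)})

seen = []  # what `uniques` contains on each pass, kept for inspection

for df in [df_train, df_test]:
    # Same pattern as in the question: always read the labels from df_train
    uniques = df_train["embarked"].unique()
    seen.append(list(uniques))
    for n, uni in enumerate(uniques):
        df.loc[df["embarked"] == uni, "embarked"] = n

print(seen[0])                       # first pass: the original string labels
print(seen[1])                       # second pass: the codes written in the first pass
print(df_test["embarked"].tolist())  # still the untouched string labels
```

With uniques = df[col].unique() each frame is read before it is modified, which would explain why that variant appears to work, although the codes then depend on the order of appearance in each frame separately.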

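My question is why this happens. Since df_train and df_test are just two parts of the same data, they contain the same labels (even though there might be a label that does not occur in the test data), so my version should work just as well, but it does not.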
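One possible way to keep the training labels as the single source of the encoding is to capture the label-to-code mapping before any column is overwritten, and then apply it to both frames. The sketch below assumes small hypothetical frames and a helper dictionary named codes. Neither comes from the original code, and Series.map is swapped in for the df.loc loop.

```python
import pandas as pd

# Hypothetical toy data standing in for the titanic train/test split
df_train = pd.DataFrame({"embarked": pd.Series(["S", "C", "Q", "S"], dtype=object)})
df_test = pd.DataFrame({"embarked": pd.Series(["Q", "S", "C"], dtype=object)})

# Build the label -> code mapping from the training data once,
# before either frame is modified
codes = {label: n for n, label in enumerate(df_train["embarked"].dropna().unique())}

for df in [df_train, df_test]:
    # Labels that are not keys of `codes` (including NaN) come out as NaN
    df["embarked"] = df["embarked"].map(codes)

print(codes)
print(df_train["embarked"].tolist())
print(df_test["embarked"].tolist())
```

Because the mapping is fixed up front, both frames share the same codes, and a label that occurs only in the test data shows up as NaN instead of silently keeping its string value.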

Thanks in advance.

0 answers:

No answers