I ran into some odd pandas behavior while trying to encode some labels manually. It would be great if someone could explain why it happens.
Here is my code:
import numpy as np
import pandas as pd
from seaborn import load_dataset as data
titanic = data('titanic')
# Set a random seed so that you get the same sample as I do
np.random.seed(12345)
# Draw 91 observations as test data
df_test = titanic.sample(n=91)
# Take the remaining rows as training data
df_train = titanic.drop(df_test.index)
# Delete the old data to free the memory
del titanic
# Save references to the data in a list so that we can loop over them
frames = [df_train, df_test]
# Loop over the dataframes and the columns and set them as categorical
for df in frames:
    for col in 'sex embarked embark_town alone who'.split():
        # Save the distinct values of the training data
        # HERE IS THE MISTAKE
        uniques = df_train[col].unique()
        # Encode those unique values numerically
        for n, uni in enumerate(uniques):
            df.loc[df[col] == uni, col] = n
My problem is that the labels of the `embarked` variable in the test data are not overwritten.
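I get:
print(df_train.embarked.unique())
[0 1 2 nan]
print(df_test.embarked.unique())
['S' 'Q' 'C']
instead of:
print(df_train.embarked.unique())
[0 1 2 nan]
print(df_test.embarked.unique())
[0 1 2]
After some experimenting, I found that the line
uniques = df_train[col].unique()
is the cause. If I change it to
uniques = df[col].unique()
everything works.
My question is why this happens, since `df_train` and `df_test` are just two parts of the same data and contain the same labels (even if one label might not occur in the test data), so my version should work just fine (but it doesn't).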
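To narrow this down, here is a minimal sketch of the same loop on a toy frame. The frames, the `make_frames` helper and the values are made up for illustration and are not taken from the titanic data. The first part records what `uniques` holds on each pass of the outer loop. The second part is one possible alternative (an assumption on my side, not the code from above): build the label-to-code mapping once from the training data and apply it with `Series.map`.

```python
import pandas as pd

def make_frames():
    # Hypothetical toy stand-ins for df_train / df_test.
    # dtype=object so strings and integers can share one column.
    train = pd.DataFrame({"embarked": ["S", "Q", "C", "S"]}, dtype=object)
    test = pd.DataFrame({"embarked": ["Q", "S", "C"]}, dtype=object)
    return train, test

# Part 1: same pattern as in the question, always reading from df_train
df_train, df_test = make_frames()
seen = []  # what `uniques` holds on each pass of the outer loop
for df in [df_train, df_test]:
    uniques = df_train["embarked"].unique()
    seen.append(list(uniques))
    for n, uni in enumerate(uniques):
        df.loc[df["embarked"] == uni, "embarked"] = n

print(seen[0])                        # first pass, df is df_train
print(seen[1])                        # second pass, df is df_test
print(df_test["embarked"].tolist())   # test labels after the loop

# Part 2: build the mapping once, before any column is overwritten
train2, test2 = make_frames()
mapping = {label: code for code, label in enumerate(train2["embarked"].unique())}
for df in (train2, test2):
    # Labels that are missing from the mapping would become NaN here
    df["embarked"] = df["embarked"].map(mapping)

print(test2["embarked"].tolist())
```

In this toy run, `uniques` on the second pass already contains the integer codes written into `df_train` during the first pass, so none of the string labels in `df_test` match and the test column stays unchanged. With the mapping built up front, both frames receive the codes derived from the training labels.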
Thanks in advance.