EDIT: Still looking for an answer that works when the two datasets have different columns!
I am trying to one-hot encode a specific column identically in two datasets. The column has different values in each dataset, so a naive one-hot encoding would produce different dummy columns. Expected result:
DATASET A
col1 col2 target
a 1 1
b 2 2
c 2 3
d 3 3
DATASET B
col1 col2 target
d 2 2
h 4 3
g 2 2
b 3 3
After encoding col 1:
New dataset A
col2 target a b c d h g
1 1 1 0 0 0 0 0
2 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 3 0 0 0 1 0 0
New dataset B
col2 target a b c d h g
2 2 0 0 0 1 0 0
4 3 0 0 0 0 1 0
2 2 0 0 0 0 0 1
3 3 0 1 0 0 0 0
The following implementation works, but it is very memory-inefficient and frequently crashes my computer with MemoryErrors.
def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True, drop_first=True):
    print("Hot encoding {} for both datasets".format(column_name))
    # Values that appear in one frame but not the other.
    cols_in_df_but_not_in_df2 = set(df[column_name]).difference(set(df2[column_name]))
    cols_in_df2_but_not_in_df = set(df2[column_name]).difference(set(df[column_name]))
    # All-zero frames to pad each side with the other side's missing values.
    dummy_df_to_concat_to_df = pd.DataFrame(0, index=df.index, columns=cols_in_df2_but_not_in_df)
    dummy_df_to_concat_to_df2 = pd.DataFrame(0, index=df2.index, columns=cols_in_df_but_not_in_df2)
    dummy_df_to_concat_to_df = dummy_df_to_concat_to_df.to_sparse()
    dummy_df_to_concat_to_df2 = dummy_df_to_concat_to_df2.to_sparse()
    encoded = pd.get_dummies(df[column_name], sparse=sparse)
    encoded = pd.concat([encoded, dummy_df_to_concat_to_df], axis=1)
    encoded_2 = pd.get_dummies(df2[column_name], sparse=sparse)
    encoded_2 = pd.concat([encoded_2, dummy_df_to_concat_to_df2], axis=1)
    encoded_df = pd.concat([df, encoded], axis=1)
    encoded_df2 = pd.concat([df2, encoded_2], axis=1)
    del encoded_df[column_name]
    del encoded_df2[column_name]
    return encoded_df, encoded_df2
Is there a better way?
Thanks! :)
Answer 0 (score: 1)
Based on your description, this can be done by simply appending the DataFrames before one-hot encoding.
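A minimal sketch of that approach (the frame contents reproduce the example data from the question; variable names are illustrative): stack the two frames, encode once so both shares of the result have identical dummy columns, then split back apart on the original row counts.

```python
import pandas as pd

# Example frames standing in for DATASET A and DATASET B from the question.
dfa = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 2, 3], 'target': [1, 2, 3, 3]})
dfb = pd.DataFrame({'col1': ['d', 'h', 'g', 'b'], 'col2': [2, 4, 2, 3], 'target': [2, 3, 2, 3]})

# Stack, encode once, and split back on the original row counts.
combined = pd.concat([dfa, dfb], ignore_index=True)
encoded = pd.get_dummies(combined, columns=['col1'])
new_dfa = encoded.iloc[:len(dfa)].reset_index(drop=True)
new_dfb = encoded.iloc[len(dfa):].reset_index(drop=True)
```

Both halves now share the full set of dummy columns (`col1_a` through `col1_h`), with zeros for values never observed in that half.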
Answer 1 (score: 1)
You can make the column you want to encode a `Category` dtype and take advantage of the fact that pandas methods (including `get_dummies`) recognize that such a column may have values that are unobserved in any particular DataFrame. This lets you avoid any merging/joining of the two DataFrames, and makes the approach indifferent to whether a value appears in only one DataFrame rather than both. See the documentation on Categorical columns.
I'm using pandas v0.20.1.
import numpy as np
import pandas as pd
import string

dfa = pd.DataFrame.from_dict({
    'col1': np.random.choice([ltr for ltr in string.ascii_lowercase[:4]], 5),
    'col2b': np.random.choice([1, 2, 3], 5),
    'target': np.random.choice([1, 2, 3], 5),
})

dfb = pd.DataFrame.from_dict({
    'col1': np.random.choice([ltr for ltr in string.ascii_lowercase[2:8]], 7),
    'col2b': np.random.choice(['foo', 'bar', 'baz'], 7),
    'target': np.random.choice([1, 2, 3], 7),
})
DFA:
col1 col2b target
0 b 3 1
1 d 3 3
2 b 3 3
3 a 2 3
4 c 1 3
DFB:
col1 col2b target
0 g foo 2
1 c bar 1
2 h baz 3
3 c baz 3
4 d baz 3
5 d bar 2
6 d foo 3
Find the union of `col1` values observed in the two DataFrames:
col1b = set(dfb.col1.unique())
col1a = set(dfa.col1.unique())
combined_cats = list(col1a.union(col1b))
Define the allowed values of `col1` identically on both DataFrames:
# Use these statements if `col1` is a 'Category' dtype.
# dfa['col1'] = dfa.col1.cat.set_categories(combined_cats)
# dfb['col1'] = dfb.col1.cat.set_categories(combined_cats)
# Otherwise, use these statements.
dfa['col1'] = dfa.col1.astype('category', categories=combined_cats)
dfb['col1'] = dfb.col1.astype('category', categories=combined_cats)
newdfa = pd.get_dummies(dfa, columns=['col1'])
newdfb = pd.get_dummies(dfb, columns=['col1'])
newdfa:
col2b target col1_g col1_b col1_c col1_d col1_h col1_a
0 3 1 0 1 0 0 0 0
1 3 3 0 0 0 1 0 0
2 3 3 0 1 0 0 0 0
3 2 3 0 0 0 0 0 1
4 1 3 0 0 1 0 0 0
newdfb:
col2b target col1_g col1_b col1_c col1_d col1_h col1_a
0 foo 2 1 0 0 0 0 0
1 bar 1 0 0 1 0 0 0
2 baz 3 0 0 0 0 1 0
3 baz 3 0 0 1 0 0 0
4 baz 3 0 0 0 1 0 0
5 bar 2 0 0 0 1 0 0
6 foo 3 0 0 0 1 0 0
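Note that the `astype('category', categories=...)` form above was later removed from pandas (in 1.0); in current releases the same technique is expressed with `CategoricalDtype`. A minimal sketch under that assumption (frame contents mirror the question's example; names are illustrative):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

dfa = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'target': [1, 2, 3, 3]})
dfb = pd.DataFrame({'col1': ['d', 'h', 'g', 'b'], 'target': [2, 3, 2, 3]})

# One shared dtype covering every value seen in either frame.
cats = CategoricalDtype(sorted(set(dfa['col1']) | set(dfb['col1'])))
dfa['col1'] = dfa['col1'].astype(cats)
dfb['col1'] = dfb['col1'].astype(cats)

# get_dummies emits a column per category, observed or not;
# sparse=True keeps memory low for high-cardinality columns.
newdfa = pd.get_dummies(dfa, columns=['col1'], sparse=True)
newdfb = pd.get_dummies(dfb, columns=['col1'], sparse=True)
```

Both results carry the same dummy columns in the same order, with all-zero columns for the values a frame never observed.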