Memory-efficient one-hot encoding with pandas

Asked: 2017-05-25 00:18:31

Tags: python pandas

Edit: Still looking for an answer that works when the two datasets have different columns!

I'm trying to one-hot encode a specific column identically across two datasets. The column takes different values in each dataset, so naive one-hot encoding would produce different sets of columns. Expected result:

DATASET A                           
col1    col2    target                  
a        1        1                 
b        2        2                 
c        2        3                 
d        3        3                 

DATASET B                           
col1    col2    target                  
d         2      2                  
h         4      3                  
g         2      2                  
b         3      3                  

After encoding col1:                           

New dataset A                           

col2    target  a   b   c   d   h   g
1          1    1   0   0   0   0   0
2          2    0   1   0   0   0   0
2          3    0   0   1   0   0   0
3          3    0   0   0   1   0   0

New dataset B                           

col2    target  a   b   c   d   h   g
2          2    0   0   0   1   0   0
4          3    0   0   0   0   1   0
2          2    0   0   0   0   0   1
3          3    0   1   0   0   0   0

The following implementation works, but it is very memory-inefficient and frequently crashes my machine with MemoryErrors.

# assumes: import pandas as pd
def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True, drop_first=True):
    print("Hot encoding {} for both datasets".format(column_name))
    # Values present in one dataset's column but missing from the other's
    cols_in_df_but_not_in_df2 = set(df[column_name]).difference(set(df2[column_name]))
    cols_in_df2_but_not_in_df = set(df2[column_name]).difference(set(df[column_name]))

    # All-zero placeholder frames for the values each dataset is missing
    dummy_df_to_concat_to_df = pd.DataFrame(0, index=df.index, columns=list(cols_in_df2_but_not_in_df))
    dummy_df_to_concat_to_df2 = pd.DataFrame(0, index=df2.index, columns=list(cols_in_df_but_not_in_df2))

    dummy_df_to_concat_to_df = dummy_df_to_concat_to_df.to_sparse()
    dummy_df_to_concat_to_df2 = dummy_df_to_concat_to_df2.to_sparse()

    # One-hot encode each dataset, then pad it with the columns it is missing
    encoded = pd.get_dummies(df[column_name], sparse=sparse)
    encoded = pd.concat([encoded, dummy_df_to_concat_to_df], axis=1)
    encoded_2 = pd.get_dummies(df2[column_name], sparse=sparse)
    encoded_2 = pd.concat([encoded_2, dummy_df_to_concat_to_df2], axis=1)

    encoded_df = pd.concat([df, encoded], axis=1)
    encoded_df2 = pd.concat([df2, encoded_2], axis=1)

    # Drop the original (now encoded) column
    del encoded_df[column_name]
    del encoded_df2[column_name]

    return encoded_df, encoded_df2

Is there a better way to do this?

Thanks! :)

2 answers:

Answer 0 (score: 1)

Based on your description, this can be done simply by appending the dataframes before one-hot encoding, for example along the lines of the sketch below.

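A minimal sketch of that idea (the helper name and the temporary _src marker column are mine, not the answerer's): concatenate the two frames, one-hot encode once so both share the same dummy columns, then split them apart again.

import pandas as pd

def encode_together(column_name, df, df2):
    # Mark each row with its source, stack the frames, encode once,
    # then split back so both results end up with identical dummy columns.
    combined = pd.concat([df.assign(_src='a'), df2.assign(_src='b')], ignore_index=True)
    encoded = pd.get_dummies(combined, columns=[column_name])
    out_a = encoded[encoded['_src'] == 'a'].drop('_src', axis=1).reset_index(drop=True)
    out_b = encoded[encoded['_src'] == 'b'].drop('_src', axis=1).reset_index(drop=True)
    return out_a, out_b

Note that this still materialises the concatenated frame, so it trades the question's column bookkeeping for one larger intermediate object.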

Answer 1 (score: 1)

You can give the column you want to encode the Category dtype and take advantage of the fact that pandas methods (including get_dummies) understand that such a column may have values that are not observed in any particular DataFrame. This lets you avoid any merging/joining of the two DataFrames, and makes the approach indifferent to whether a given value appears in only one DataFrame rather than both. See the documentation on Categorical columns.

I'm using pandas v0.20.1.

import numpy as np
import pandas as pd
import string

dfa = pd.DataFrame.from_dict({
    'col1': np.random.choice([ltr for ltr in string.ascii_lowercase[:4]], 5)
    , 'col2b': np.random.choice([1, 2, 3], 5)
    , 'target': np.random.choice([1, 2, 3], 5)
    })

dfb = pd.DataFrame.from_dict({
    'col1': np.random.choice([ltr for ltr in string.ascii_lowercase[2:8]], 7)
    , 'col2b': np.random.choice(['foo', 'bar', 'baz'], 7)
    , 'target': np.random.choice([1, 2, 3], 7)
    })

DFA:

  col1  col2b  target
0    b      3       1
1    d      3       3
2    b      3       3
3    a      2       3
4    c      1       3

DFB:

  col1 col2b  target
0    g   foo       2
1    c   bar       1
2    h   baz       3
3    c   baz       3
4    d   baz       3
5    d   bar       2
6    d   foo       3

Find the union of the col1 values observed across both DataFrames:

col1b = set(dfb.col1.unique())
col1a = set(dfa.col1.unique())
combined_cats = list(col1a.union(col1b))

Define the allowed values of col1 identically on both DataFrames:

# Use these statements if `col1` is a 'Category' dtype.
# dfa['col1'] = dfa.col1.cat.set_categories(combined_cats)
# dfb['col1'] = dfb.col1.cat.set_categories(combined_cats)
# Otherwise, use these statements.
dfa['col1'] = dfa.col1.astype('category', categories=combined_cats)
dfb['col1'] = dfb.col1.astype('category', categories=combined_cats)
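As an aside, the categories= keyword of astype used above is specific to older pandas; on newer versions the same thing can be expressed with an explicit CategoricalDtype. A sketch of that variant (my addition, not part of the original answer):

# Equivalent on newer pandas, where astype('category', categories=...) is no longer accepted:
cat_dtype = pd.api.types.CategoricalDtype(categories=combined_cats)
dfa['col1'] = dfa.col1.astype(cat_dtype)
dfb['col1'] = dfb.col1.astype(cat_dtype)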

newdfa = pd.get_dummies(dfa, columns=['col1'])
newdfb = pd.get_dummies(dfb, columns=['col1'])

newdfa:

   col2b  target  col1_g  col1_b  col1_c  col1_d  col1_h  col1_a
0      3       1       0       1       0       0       0       0
1      3       3       0       0       0       1       0       0
2      3       3       0       1       0       0       0       0
3      2       3       0       0       0       0       0       1
4      1       3       0       0       1       0       0       0

newdfb:

  col2b  target  col1_g  col1_b  col1_c  col1_d  col1_h  col1_a
0   foo       2       1       0       0       0       0       0
1   bar       1       0       0       1       0       0       0
2   baz       3       0       0       0       0       1       0
3   baz       3       0       0       1       0       0       0
4   baz       3       0       0       0       1       0       0
5   bar       2       0       0       0       1       0       0
6   foo       3       0       0       0       1       0       0
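
Since the original concern was memory, it may also help to ask get_dummies for sparse output; this is my addition rather than part of the answer, but get_dummies does accept a sparse flag:

# Sparse dummy columns keep memory usage down when col1 has many distinct values.
newdfa = pd.get_dummies(dfa, columns=['col1'], sparse=True)
newdfb = pd.get_dummies(dfb, columns=['col1'], sparse=True)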