Question

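I'm working with large DataFrames of categorical data, and I've found that when I use pandas.merge on two DataFrames, any categorical columns are automatically upcast to a larger data type. (This can greatly increase RAM consumption.) A simple example to illustrate: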

Edit: made a more appropriate example.

import pandas
import numpy

df1 = pandas.DataFrame(
    {'ID': [5, 3, 6, 7, 0, 4, 8, 2, 9, 1, 6, 5, 4, 9, 7, 2, 1, 8, 3, 0], 
     'value1': pandas.Categorical(numpy.random.randint(0, 2, 20))})

df2 = pandas.DataFrame(
    {'ID': [5, 3, 6, 7, 0, 4, 8, 2, 9, 1],  
     'value2': pandas.Categorical(['c', 'a', 'c', 'a', 'c', 'b', 'b', 'a', 'a', 'b'])})

result = pandas.merge(df1, df2, on="ID")
result.dtypes


Out []:
ID         int32
value1     int64
value2    object
dtype: object

I'd like value1 and value2 to stay categorical in the resulting DataFrame. Converting string labels to object dtype can be especially expensive.

Judging from https://github.com/pydata/pandas/issues/8938, this may be somewhat intended. Is there any way to avoid it?
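
To make the memory concern concrete, here is a minimal sketch that compares the footprint of the same string labels stored as category and as object dtype. The variable names and the label data are made up for illustration, and the exact byte counts will depend on the pandas version and platform.

```python
import pandas as pd

# Hypothetical data: three distinct labels repeated many times.
labels = ["a", "b", "c"] * 10_000

as_category = pd.Series(labels, dtype="category")
as_object = pd.Series(labels, dtype="object")

# deep=True also counts the Python string objects behind an object column.
cat_bytes = as_category.memory_usage(deep=True)
obj_bytes = as_object.memory_usage(deep=True)

print("category:", cat_bytes, "bytes")
print("object:  ", obj_bytes, "bytes")
```

The categorical column stores one small integer code per row plus a single copy of each label, which is why losing the dtype in a merge can be costly.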

Answer 1

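As a workaround, you can convert the categorical columns to integer-valued codes and store the mapping from columns to categories in a dict. For example,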

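def decat(df):
    """
    Convert categorical columns to (integer) codes; return the categories in catmap.
    Note: the categorical columns of df are overwritten in place.
    """
    catmap = dict()
    # Series.items() replaces the Python 2 era iteritems()
    for col, dtype in df.dtypes.items():
        # isinstance check on pd.CategoricalDtype; no pandas.core.common needed
        if isinstance(dtype, pd.CategoricalDtype):
            c = df[col].cat
            catmap[col] = c.categories
            df[col] = c.codes
    return df, catmap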

In [304]: df
Out[304]: 
   ID value2
0   5      c
1   3      a
2   6      c
3   7      a
4   0      c
5   4      b
6   8      b
7   2      a
8   9      a
9   1      b

In [305]: df, catmap = decat(df)

In [306]: df
Out[306]: 
   ID  value2
0   5       2
1   3       0
2   6       2
3   7       0
4   0       2
5   4       1
6   8       1
7   2       0
8   9       0
9   1       1

In [307]: catmap
Out[307]: {'value2': Index([u'a', u'b', u'c'], dtype='object')}

Now you can merge as usual, since there is no problem merging integer-valued columns.

Later, you can rebuild the categorical columns from the data in catmap:

def recat(df, catmap):
    """
    Use catmap to reconstitute columns in df to categorical dtype
    """
    for col, categories in catmap.items():
        # from_codes rebuilds the categorical directly from the integer codes
        df[col] = pd.Categorical.from_codes(df[col], categories=categories)
    return df

import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {'ID': np.array([5, 3, 6, 7, 0, 4, 8, 2, 9, 1, 6, 5, 4, 9, 7, 2, 1, 8, 3, 0],
                dtype='int32'),
     'value1': pd.Categorical(np.random.randint(0, 2, 20))})

df2 = pd.DataFrame(
    {'ID': np.array([5, 3, 6, 7, 0, 4, 8, 2, 9, 1], dtype='int32'),
     'value2': pd.Categorical(['c', 'a', 'c', 'a', 'c', 'b', 'b', 'a', 'a', 'b'])})

def decat(df):
    """
    Convert categorical columns to (integer) codes; return the categories in catmap.
    Note: the categorical columns of df are overwritten in place.
    """
    catmap = dict()
    # Series.items() replaces the Python 2 era iteritems()
    for col, dtype in df.dtypes.items():
        # isinstance check on pd.CategoricalDtype; no pandas.core.common needed
        if isinstance(dtype, pd.CategoricalDtype):
            c = df[col].cat
            catmap[col] = c.categories
            df[col] = c.codes
    return df, catmap

def recat(df, catmap):
    """
    Use catmap to reconstitute columns in df to categorical dtype
    """
    for col, categories in catmap.items():
        # from_codes rebuilds the categorical directly from the integer codes
        df[col] = pd.Categorical.from_codes(df[col], categories=categories)
    return df

def mergecat(left, right, *args, **kwargs):
    left, left_catmap = decat(left)
    right, right_catmap = decat(right)
    left_catmap.update(right_catmap)
    result = pd.merge(left, right, *args, **kwargs)
    return recat(result, left_catmap)

result = mergecat(df1, df2, on='ID')
result.info()

yields

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 3 columns):
ID        20 non-null int32
value1    20 non-null category
value2    20 non-null category
dtypes: category(2), int32(1)
memory usage: 320.0 bytes
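
The workaround above rests on the fact that a categorical is fully described by its integer codes plus its categories. The following is a small self-contained sketch of that round trip on its own, without any merge; the variable names are illustrative only.

```python
import pandas as pd

original = pd.Categorical(["c", "a", "c", "b"])

# Split the categorical into its two parts.
codes = original.codes            # one small integer per element
categories = original.categories  # Index of the distinct labels

# Rebuild an equivalent categorical from the parts.
rebuilt = pd.Categorical.from_codes(
    codes, categories=categories, ordered=original.ordered)

print(list(codes))
print(list(rebuilt))
```

Passing `ordered` along as well keeps ordered categoricals ordered, which the dict of categories by itself would not record.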

Answer 2

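I may be missing your goal, but the intention is for the user to convert to category when needed (or not). I suppose that in this particular case it could be done automatically. To be honest, the categorical conversion is going to be done at the end anyway, so doing it inside merge wouldn't really save you anything.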

In [57]: result = pandas.merge(df1, df2, on="ID")

In [58]: result['value1'] = result['value1'].astype('category')

In [59]: result['value2'] = result['value2'].astype('category')

In [60]: result
Out[60]: 
    ID value1 value2
0    5      0      c
1    5      1      c
2    3      0      a
3    3      1      a
4    6      0      c
5    6      0      c
6    7      0      a
7    7      1      a
8    0      1      c
9    0      1      c
10   4      1      b
11   4      1      b
12   8      0      b
13   8      1      b
14   2      1      a
15   2      1      a
16   9      0      a
17   9      1      a
18   1      0      b
19   1      1      b

In [61]: result.dtypes
Out[61]: 
ID           int64
value1    category
value2    category
dtype: object

In [62]: result.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 3 columns):
ID        20 non-null int64
value1    20 non-null category
value2    20 non-null category
dtypes: category(2), int64(1)
memory usage: 400.0 bytes
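
One caveat with a plain astype('category') is that the categories are inferred from the values present after the merge, so unused categories and any ordering are not carried over. A possible variation, sketched here with made-up data, is to cast back using the dtype objects of the original columns.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 2],
                    "value1": pd.Categorical([0, 1, 0])})
# 'b' and 'c' are declared categories that never occur in the data.
df2 = pd.DataFrame({"ID": [1, 2],
                    "value2": pd.Categorical(["a", "a"],
                                             categories=["a", "b", "c"])})

result = pd.merge(df1, df2, on="ID")

# Reuse the original CategoricalDtype objects instead of the string 'category'.
result = result.astype({"value1": df1["value1"].dtype,
                        "value2": df2["value2"].dtype})

print(result.dtypes)
print(list(result["value2"].cat.categories))
```

If the installed pandas version already preserves the dtype through the merge, the cast is simply a no-op.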

Answer 3

Here is a code snippet that restores the category metadata:

def copy_category_metadata(df_with_categories, df_without_categories):
    import pandas
    # Series.items() replaces the Python 2 era iteritems()
    for col_name, dtype in df_with_categories.dtypes.items():
        if str(dtype)=="category":
            if col_name in df_without_categories.columns:
                if str(df_without_categories[col_name].dtype)=="category":
                    print("{} - Already a category".format(col_name))
                else:
                    print("{} - Making a category".format(col_name))
                    # make the column into a Categorical using the other dataframe's metadata
                    df_without_categories[col_name] = pandas.Categorical(
                        df_without_categories[col_name],
                        categories = df_with_categories[col_name].cat.categories,
                        ordered = df_with_categories[col_name].cat.ordered)

Usage example:

dfA # some data frame with categories
dfB # another data frame
df_merged = dfA.merge(dfB) # merge result, no categories
copy_category_metadata(dfA, df_merged)

Answer 4

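You can split each categorical column into its index of categories (pandas.Series.cat.categories) and its codes (pandas.Series.cat.codes), merge the DataFrames, and then recreate the categorical series with the from_codes function. It's ugly, but it seems to be fast and memory-efficient.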

# categorical indices
indices = [x.cat.categories for x in [df1.value1, df2.value2]]
# in-place setting columns with their categorical codes
for df, col in zip([df1, df2], ['value1', 'value2']):
    df[col] = df[col].cat.codes
# merging updated dataframes
result = pandas.merge(df1, df2, on='ID')
# recreating categorical series
for col, index in zip(['value1', 'value2'], indices):
    result[col] = pandas.Categorical.from_codes(result[col], index)
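
One thing to watch with the codes approach: if the merge can leave rows without a match (for example how='left'), the integer codes column picks up NaN and becomes float, which from_codes will not accept directly. A minimal sketch of one way to handle that, using made-up frames, is to map the missing entries to -1, the code that pandas uses for a missing value.

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3]})
right = pd.DataFrame({"ID": [1, 2],
                      "value2": pd.Categorical(["a", "b"],
                                               categories=["a", "b", "c"])})

# Keep the categories aside and merge on the integer codes only.
categories = right["value2"].cat.categories
right_codes = right.assign(value2=right["value2"].cat.codes)

merged = pd.merge(left, right_codes, on="ID", how="left")

# ID 3 has no match, so its code is NaN; -1 marks a missing value.
codes = merged["value2"].fillna(-1).astype("int64")
merged["value2"] = pd.Categorical.from_codes(codes, categories=categories)

print(merged)
```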

Python pandas: merge loses categorical columns
