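I am working with large DataFrames of categorical data, and I have found that when I use pandas.merge on two dataframes, any columns of categorical data are automatically upcast to a larger datatype. (This can dramatically increase RAM consumption.) A simple example to illustrate:
EDIT: made a more appropriate example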
import pandas
import numpy
df1 = pandas.DataFrame(
{'ID': [5, 3, 6, 7, 0, 4, 8, 2, 9, 1, 6, 5, 4, 9, 7, 2, 1, 8, 3, 0],
'value1': pandas.Categorical(numpy.random.randint(0, 2, 20))})
df2 = pandas.DataFrame(
{'ID': [5, 3, 6, 7, 0, 4, 8, 2, 9, 1],
'value2': pandas.Categorical(['c', 'a', 'c', 'a', 'c', 'b', 'b', 'a', 'a', 'b'])})
result = pandas.merge(df1, df2, on="ID")
result.dtypes
Out []:
ID int32
value1 int64
value2 object
dtype: object
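I would prefer that value1 and value2 remain categorical in the resulting DataFrame. The conversion of string labels to object dtype can be particularly costly.
Judging from https://github.com/pydata/pandas/issues/8938, this may be intended behaviour? Is there any way of avoiding it?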
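To put a number on the cost described above, here is a minimal sketch that compares the memory footprint of the same labels stored as category and as object. The sample data (30,000 labels drawn from three categories) is made up for illustration and is not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 30,000 labels drawn from three categories.
labels = pd.Series(np.random.choice(['a', 'b', 'c'], size=30000))

as_category = labels.astype('category')
as_object = labels.astype(object)

# deep=True also counts the Python string objects, not just the pointers to them.
category_bytes = as_category.memory_usage(deep=True)
object_bytes = as_object.memory_usage(deep=True)

print("category:", category_bytes, "bytes")
print("object:  ", object_bytes, "bytes")
```

Running `pandas.merge(df1, df2, on="ID").dtypes` on your own installation shows whether the upcast still happens there; the behaviour may differ between pandas versions.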
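Answer 0 (score: 1)
As a workaround, you could convert the categorical columns to integer-valued codes and store the mapping from columns to categories in a dict. For example,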
def decat(df):
    """
    Convert categorical columns to (integer) codes; return the categories in catmap.
    Note: df is modified in place.
    """
    catmap = dict()
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for col, dtype in df.dtypes.items():
        if isinstance(dtype, pd.CategoricalDtype):
            c = df[col].cat
            catmap[col] = c.categories
            df[col] = c.codes
    return df, catmap
In [304]: df
Out[304]:
   ID  value2
0   5       c
1   3       a
2   6       c
3   7       a
4   0       c
5   4       b
6   8       b
7   2       a
8   9       a
9   1       b
In [305]: df, catmap = decat(df)
In [306]: df
Out[306]:
   ID  value2
0   5       2
1   3       0
2   6       2
3   7       0
4   0       2
5   4       1
6   8       1
7   2       0
8   9       0
9   1       1
In [307]: catmap
Out[307]: {'value2': Index([u'a', u'b', u'c'], dtype='object')}
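Now you can merge as usual, since there is no problem merging integer-valued columns.
Later, you can reconstitute the categorical columns using the data in catmap: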
def recat(df, catmap):
    """
    Use catmap to reconstitute columns in df to categorical dtype
    """
    for col, categories in catmap.items():
        # from_codes maps each integer code back to its category
        # (a code of -1 becomes NaN)
        df[col] = pd.Categorical.from_codes(df[col], categories=categories)
    return df
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {'ID': np.array([5, 3, 6, 7, 0, 4, 8, 2, 9, 1, 6, 5, 4, 9, 7, 2, 1, 8, 3, 0],
                    dtype='int32'),
     'value1': pd.Categorical(np.random.randint(0, 2, 20))})
df2 = pd.DataFrame(
    {'ID': np.array([5, 3, 6, 7, 0, 4, 8, 2, 9, 1], dtype='int32'),
     'value2': pd.Categorical(['c', 'a', 'c', 'a', 'c', 'b', 'b', 'a', 'a', 'b'])})

def decat(df):
    """
    Convert categorical columns to (integer) codes; return the categories in catmap.
    Note: df is modified in place.
    """
    catmap = dict()
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for col, dtype in df.dtypes.items():
        if isinstance(dtype, pd.CategoricalDtype):
            c = df[col].cat
            catmap[col] = c.categories
            df[col] = c.codes
    return df, catmap

def recat(df, catmap):
    """
    Use catmap to reconstitute columns in df to categorical dtype
    """
    for col, categories in catmap.items():
        # from_codes maps each integer code back to its category
        # (a code of -1 becomes NaN)
        df[col] = pd.Categorical.from_codes(df[col], categories=categories)
    return df

def mergecat(left, right, *args, **kwargs):
    # Merge on integer codes, then restore the categorical dtypes
    left, left_catmap = decat(left)
    right, right_catmap = decat(right)
    left_catmap.update(right_catmap)
    result = pd.merge(left, right, *args, **kwargs)
    return recat(result, left_catmap)
result = mergecat(df1, df2, on='ID')
result.info()
yields
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 3 columns):
ID 20 non-null int32
value1 20 non-null category
value2 20 non-null category
dtypes: category(2), int32(1)
memory usage: 320.0 bytes
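The workaround rests on the fact that a categorical column is fully described by its integer codes plus its categories. A minimal sketch of that round trip follows; the sample values are made up for illustration.

```python
import pandas as pd

original = pd.Series(pd.Categorical(['c', 'a', 'c', 'b']))

codes = original.cat.codes            # small integers, one per row
categories = original.cat.categories  # the distinct labels, stored once

# Rebuild the categorical column from its two parts.
rebuilt = pd.Series(pd.Categorical.from_codes(codes, categories=categories))
print(rebuilt.equals(original))

# Missing values are stored as the code -1, which from_codes turns back into NaN.
with_missing = pd.Categorical(['a', None, 'b'])
print(list(with_missing.codes))
```

Because only the small integer codes travel through the merge, the string labels never have to be materialised as Python objects.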
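Answer 1 (score: 1)
I might be missing your goal, but this is meant to let the user convert to a category when they want to (or not). I suppose in this particular case it could be done automatically. To be honest, the categorical conversion is done at the end anyhow, so doing it inside merge would not really save you anything.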
In [57]: result = pandas.merge(df1, df2, on="ID")
In [58]: result['value1'] = result['value1'].astype('category')
In [59]: result['value2'] = result['value2'].astype('category')
In [60]: result
Out[60]:
    ID  value1  value2
0    5       0       c
1    5       1       c
2    3       0       a
3    3       1       a
4    6       0       c
5    6       0       c
6    7       0       a
7    7       1       a
8    0       1       c
9    0       1       c
10   4       1       b
11   4       1       b
12   8       0       b
13   8       1       b
14   2       1       a
15   2       1       a
16   9       0       a
17   9       1       a
18   1       0       b
19   1       1       b
In [61]: result.dtypes
Out[61]:
ID int64
value1 category
value2 category
dtype: object
In [62]: result.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 3 columns):
ID 20 non-null int64
value1 20 non-null category
value2 20 non-null category
dtypes: category(2), int64(1)
memory usage: 400.0 bytes
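One thing to watch with converting after the merge: `astype('category')` infers the categories from the values that happen to be present, so unused categories and any ordering are lost. A minimal sketch of the difference, assuming you still have the original column's dtype at hand; the `level` column and its values are hypothetical and not from the question.

```python
import pandas as pd

# Hypothetical data: an ordered categorical whose 'mid' category does not
# survive the merge, because ID 3 has no match in df1.
df1 = pd.DataFrame({'ID': [0, 1, 2]})
df2 = pd.DataFrame({
    'ID': [0, 1, 2, 3],
    'level': pd.Categorical(['low', 'high', 'low', 'mid'],
                            categories=['low', 'mid', 'high'],
                            ordered=True)})

result = pd.merge(df1, df2, on='ID')

# Simulate the upcast described in the question.
upcast = result['level'].astype(object)

# Categories are inferred from the data: sorted, unordered, 'mid' is gone.
inferred = upcast.astype('category')

# Reusing the original CategoricalDtype keeps categories and ordering.
restored = upcast.astype(df2['level'].dtype)

print(list(inferred.cat.categories), inferred.cat.ordered)
print(list(restored.cat.categories), restored.cat.ordered)
```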
Answer 2 (score: 0)
Here is a code snippet that restores the category metadata:
def copy_category_metadata(df_with_categories, df_without_categories):
    import pandas
    for col_name, dtype in df_with_categories.dtypes.items():
        if str(dtype) == "category":
            if col_name in df_without_categories.columns:
                if str(df_without_categories[col_name].dtype) == "category":
                    print("{} - Already a category".format(col_name))
                else:
                    print("{} - Making a category".format(col_name))
                    # make the column into a Categorical using the other dataframe's metadata
                    df_without_categories[col_name] = pandas.Categorical(
                        df_without_categories[col_name],
                        categories=df_with_categories[col_name].cat.categories,
                        ordered=df_with_categories[col_name].cat.ordered)
Usage example:
dfA # some data frame with categories
dfB # another data frame
df_merged = dfA.merge(dfB) # merge result, no categories
copy_category_metadata(dfA, df_merged)
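Answer 3 (score: 0)
You can split the categorical columns into their categories index (pandas.Series.cat.categories) and their codes (pandas.Series.cat.codes), merge the dataframes, and then recreate the categorical series with the from_codes function. It is ugly, but it seems to be fast and memory-efficient.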
# categorical indices
indices = [x.cat.categories for x in [df1.value1, df2.value2]]
# in-place setting columns with their categorical codes
for df, col in zip([df1, df2], ['value1', 'value2']):
    df[col] = df[col].cat.codes
# merging updated dataframes
result = pandas.merge(df1, df2, on='ID')
# recreating categorical series
for col, index in zip(['value1', 'value2'], indices):
    result[col] = pandas.Categorical.from_codes(result[col], index)
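The snippet above assumes every row finds a match. With a left or outer merge, unmatched rows get NaN in the code column, which turns it into floats that from_codes will not accept. A minimal sketch of one way to handle that, using made-up frames (`left`, `right`) rather than the question's data:

```python
import pandas as pd

# Hypothetical frames: ID 3 has no partner in `right`.
left = pd.DataFrame({'ID': [1, 2, 3]})
right = pd.DataFrame({'ID': [1, 2],
                      'value2': pd.Categorical(['a', 'b'])})

categories = right['value2'].cat.categories
# assign() returns a new frame, so `right` itself keeps its categorical column.
right_codes = right.assign(value2=right['value2'].cat.codes)

# The unmatched row gets NaN, so the code column is no longer integer-typed.
result = pd.merge(left, right_codes, on='ID', how='left')

# -1 is the code pandas uses for a missing categorical value.
codes = result['value2'].fillna(-1).astype('int64')
result['value2'] = pd.Categorical.from_codes(codes, categories=categories)

print(result)
print(result.dtypes)
```

Using assign here trades a temporary copy of the frame for leaving the inputs untouched; the in-place version above avoids the copy but overwrites df1 and df2.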