Question

Context: consider the following

import pandas as pd
X = pd.DataFrame({"A": [0, 1, 2, 3]})
Y = pd.DataFrame({"A": [5, 6, 7, 8]})

together = pd.concat([X.assign(s='x'), Y.assign(s='y')])

In the last line, I would like the dtype of s to be

cat_type = pd.api.types.CategoricalDtype(categories=['x','y'])

Of course, I can do

together.s = together.s.astype(cat_type)

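However, if X and Y are large enough, the intermediate string column takes up a lot of memory, and every time I perform one of these concatenations the column is converted from category to string and back.

Question: is there a (clean) way to assign a single value from a category to a DataFrame column without paying the penalty of converting to string and back?

Of course, the actual data I care about is very large. The difference between category and string is enough to cause paging to disk.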

Answer 1

I think you can convert to categorical before calling concat:

cat_type = pd.api.types.CategoricalDtype(categories=['x','y'])

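# Convert each frame's label column to the shared categorical dtype first
X = X.assign(s='x')
X.s = X.s.astype(cat_type)

# Y is labelled 'y', not 'x'
Y = Y.assign(s='y')
Y.s = Y.s.astype(cat_type)

# concat keeps the category dtype because both columns share the same categories
together = pd.concat([X, Y])
print(together.dtypes)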

A       int64
s    category
dtype: object
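
To check that this actually helps, here is a minimal sketch that compares the memory of the label column in both variants. The frame size `n` and the names `X_cat`, `Y_cat`, `together_str` and `together_cat` are made up for illustration. Note that the category dtype only survives the concat when every input column has identical categories, which is why both frames use the same `cat_type`.

```python
import pandas as pd

cat_type = pd.api.types.CategoricalDtype(categories=["x", "y"])

n = 100_000  # arbitrary size, chosen only so the difference is visible
X = pd.DataFrame({"A": range(n)})
Y = pd.DataFrame({"A": range(n)})

# Variant from the question: plain string labels all the way through
together_str = pd.concat([X.assign(s="x"), Y.assign(s="y")])

# Variant from this answer: cast each frame before concatenating
X_cat = X.assign(s="x")
X_cat["s"] = X_cat["s"].astype(cat_type)
Y_cat = Y.assign(s="y")
Y_cat["s"] = Y_cat["s"].astype(cat_type)
together_cat = pd.concat([X_cat, Y_cat])

# deep=True also counts the Python string objects behind the column
str_bytes = together_str["s"].memory_usage(index=False, deep=True)
cat_bytes = together_cat["s"].memory_usage(index=False, deep=True)
print(str_bytes, cat_bytes)
```

Each frame still holds a temporary string column between `assign` and `astype`, but the concatenated result never does.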

Another solution is to use:

cat_type = pd.api.types.CategoricalDtype(categories=['x','y'])
together = pd.concat([X.assign(s=pd.Categorical(['x'] * len(X), dtype=cat_type)),
                      Y.assign(s=pd.Categorical(['y'] * len(Y), dtype=cat_type))])

print(together.dtypes)

A       int64
s    category
dtype: object
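
The second solution still builds a Python list of strings (`['x'] * len(X)`) before it becomes a categorical. If even that is too much, one possible variation (not part of the original answer) is to build the integer codes directly with `pd.Categorical.from_codes`. The helper name `constant_categorical` is hypothetical, and the `int8` codes assume fewer than 128 categories.

```python
import numpy as np
import pandas as pd

cat_type = pd.api.types.CategoricalDtype(categories=["x", "y"])


def constant_categorical(value, length, dtype):
    """Hypothetical helper: a Categorical holding `length` copies of `value`."""
    # Position of the value within the dtype's categories (0 for 'x', 1 for 'y')
    code = dtype.categories.get_loc(value)
    # One small integer per row; no string objects are created
    codes = np.full(length, code, dtype=np.int8)
    return pd.Categorical.from_codes(codes, dtype=dtype)


X = pd.DataFrame({"A": [0, 1, 2, 3]})
Y = pd.DataFrame({"A": [5, 6, 7, 8]})

together = pd.concat([
    X.assign(s=constant_categorical("x", len(X), cat_type)),
    Y.assign(s=constant_categorical("y", len(Y), cat_type)),
])
print(together.dtypes)
```

Because both columns are created with the same `cat_type`, the concat should keep the column categorical, as in the two solutions above.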

Assign a categorical value to all rows in a pandas column