Question

I can convert a pandas string column to a Categorical, but when I try to insert it as a new DataFrame column, it seems to get converted right back to a Series of str:

train['LocationNFactor'] = pd.Categorical.from_array(train['LocationNormalized'])

>>> type(pd.Categorical.from_array(train['LocationNormalized']))
<class 'pandas.core.categorical.Categorical'>
# however it got converted back to...
>>> type(train['LocationNFactor'][2])
<type 'str'>
>>> train['LocationNFactor'][2]
'Hampshire'

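My guess is that this happens because Categoricals don't map to any numpy dtype, so I'd have to convert the column to some int type and thus lose the factor labels <-> levels association? What is the most elegant workaround to store the levels <-> labels association and keep the ability to convert back? (Just store it as a dict like here, and convert manually when needed?) I think Categorical is still not a first-class datatype for DataFrame, unlike in R.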

(Using pandas 0.10.1, numpy 1.6.2, python 2.7.3 - the latest macports versions of everything.)

Answer 1

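The only workaround I found for pandas before 0.15 is as follows:

the column must be converted to a Categorical for the classifier, but numpy will immediately coerce the levels back to int, losing the factor information
so store the factor in a global variable outside the dataframe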

train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized']) # default order: alphabetical

train['LocationNFactor'] = train_LocationNFactor.labels # insert in dataframe
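In current pandas, the same "keep the mapping outside the frame" idea can be sketched with pd.factorize instead of Categorical.from_array. This is a swapped-in technique, not the code above; the small train frame and the name location_levels are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` frame.
train = pd.DataFrame(
    {"LocationNormalized": ["London", "Hampshire", "London", "Surrey"]}
)

# sort=True orders the unique labels alphabetically; without it,
# factorize numbers them in order of first appearance.
codes, levels = pd.factorize(train["LocationNormalized"], sort=True)

train["LocationNFactor"] = codes   # plain integer column inside the frame
location_levels = levels           # the label Index, kept outside the frame

# Convert the integer codes back to labels when needed.
decoded = location_levels[train["LocationNFactor"].to_numpy()]
print(list(decoded))
```

The integer column survives any numpy coercion, and location_levels is all that is needed to get the strings back.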

[Update: pandas 0.15+ added decent support for Categorical]
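For pandas 0.15 or later, a minimal sketch of what that support looks like. The sample data is invented; astype("category"), .cat.codes and .cat.categories are the standard accessors, and .to_numpy() assumes a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` frame.
train = pd.DataFrame(
    {"LocationNormalized": ["Hampshire", "London", "Hampshire", "Surrey"]}
)

# The column keeps its categorical dtype inside the DataFrame.
train["LocationNFactor"] = train["LocationNormalized"].astype("category")
print(train["LocationNFactor"].dtype)

# Integer codes and the label Index both live on the column itself.
codes = train["LocationNFactor"].cat.codes
categories = train["LocationNFactor"].cat.categories

# Round trip: codes back to labels.
restored = categories[codes.to_numpy()]
print(list(restored))
```

With this, no global variable is needed: the levels <-> labels association travels with the column.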

Answer 2

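The labels <-> levels association is stored in the index object.

To convert an integer array to a string array: index[integer_array]
To convert a string array to an integer array: index.get_indexer(string_array)

Here are some examples: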

In [56]:

c = pd.Categorical.from_array(['a', 'b', 'c', 'd', 'e'])

idx = c.levels

In [57]:

idx[[1,2,1,2,3]]

Out[57]:

Index([b, c, b, c, d], dtype=object)

In [58]:

idx.get_indexer(["a","c","d","e","a"])

Out[58]:

array([0, 2, 3, 4, 0])
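The session above uses the old API (from_array, levels). A rough equivalent for recent pandas, assuming the levels attribute is now spelled categories, might look like this:

```python
import pandas as pd

c = pd.Categorical(["a", "b", "c", "d", "e"])
idx = c.categories  # the Index holding the labels (formerly `levels`)

# Integer positions -> labels
labels = idx[[1, 2, 1, 2, 3]]
print(list(labels))

# Labels -> integer positions
positions = idx.get_indexer(["a", "c", "d", "e", "a"])
print(list(positions))

# Labels that are not in the index map to -1
missing = idx.get_indexer(["z"])
print(list(missing))
```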

How to generate a categorical pandas DataFrame column from a string column?
