How to convert/rename categories in dask

Time: 2016-10-19 11:46:31

Tags: python dask

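I am trying to rename the categories of a column with dtype 'category' in a dask dataframe to a series of numbers from 1 to len(categories).

In pandas, I do it like this: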

df['name'] = pd.Categorical(df.name).codes

But the equivalent does not work in dask:

Traceback (most recent call last):
  File "example.py", line 47, in <module>
    sys.exit(main(sys.argv))
  File "example.py", line 25, in main
    df['name'] = dd.Categorical(df.name).codes
AttributeError: module 'dask.dataframe' has no attribute 'Categorical'

So I tried to get the categories and set them, as explained in https://github.com/jenkinsci/workflow-cps-plugin/blob/master/README.md.

df['name'] = df['name'].astype('category')
cats = df.name.cat.categories
df.name.cat.categories = range(1, len(cats) + 1)

But this also raises an exception:

Traceback (most recent call last):
  File "example.py", line 50, in <module>
    sys.exit(main(sys.argv))
  File "example.py", line 26, in main
    cats = df.name.cat.categories
  File "[...]/dask/dataframe/core.py", line 3207, in __getattr__
    return self._property_map(key)
  File "[...]/dask/dataframe/core.py", line 3186, in _property_map
    out = self.getattr(self._series._meta_nonempty, key)
  File "[...]/dask/dataframe/core.py", line 258, in _meta_nonempty
    return meta_nonempty(self._meta)
  File "[...]/dask/dataframe/utils.py", line 329, in meta_nonempty
    return _nonempty_series(x, idx)
  File "[...]/dask/dataframe/utils.py", line 308, in _nonempty_series
    entry = s.cat.categories[0]
  File "[...]/pandas-0.19.0-py3.5-linux-x86_64.egg/pandas/indexes/base.py", line 1393, in __getitem__
    return getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0

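How can I rename the categories in a dask dataframe column?

1 answer:

Answer 0: (score: 0)

You may want to look at df.column.cat.codes, which holds the numbers you are looking for. Let's walk through an example:

Create a toy dataset in Pandas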

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': ['a', 'b', 'a']})

In [3]: df['x'] = df.x.astype('category')

In [4]: df
Out[4]: 
   x
0  a
1  b
2  a

Convert to a Dask.dataframe

In [5]: import dask.dataframe as dd

In [6]: ddf = dd.from_pandas(df, npartitions=2)

Inspect the .cat.codes attribute

In [7]: ddf.x.cat.codes
Out[7]: 
dd.Series<getattr..., npartitions=1, divisions=(0, 2)>

Dask Series Structure:
divisions
0    int8
2     ...
dtype: int8

In [8]: ddf.x.cat.codes.compute()
Out[8]: 
0    0
1    1
2    0
dtype: int8

Overwrite the categorical series with the codes series

In [9]: ddf['x'] = ddf.x.cat.codes

In [10]: ddf.compute()
Out[10]: 
   x
0  0
1  1
2  0