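I am trying to rename the categories of a dask dataframe column of dtype 'category' to a series of numbers from 1 to len(categories).
In pandas, I do it like this: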
df['name'] = pd.Categorical(df.name).codes
But in dask the equivalent call does not work:
Traceback (most recent call last):
File "example.py", line 47, in <module>
sys.exit(main(sys.argv))
File "example.py", line 25, in main
df['name'] = dd.Categorical(df.name).codes
AttributeError: module 'dask.dataframe' has no attribute 'Categorical'
So I tried to get the categories and set them, as explained in https://github.com/jenkinsci/workflow-cps-plugin/blob/master/README.md:
df['name'] = df['name'].astype('category')
cats = df.name.cat.categories
df.name.cat.categories = range(1, len(cats))
But this also raises an exception:
Traceback (most recent call last):
File "example.py", line 50, in <module>
sys.exit(main(sys.argv))
File "example.py", line 26, in main
cats = df.name.cat.categories
File "[...]/dask/dataframe/core.py", line 3207, in __getattr__
return self._property_map(key)
File "[...]/dask/dataframe/core.py", line 3186, in _property_map
out = self.getattr(self._series._meta_nonempty, key)
File "[...]/dask/dataframe/core.py", line 258, in _meta_nonempty
return meta_nonempty(self._meta)
File "[...]/dask/dataframe/utils.py", line 329, in meta_nonempty
return _nonempty_series(x, idx)
File "[...]/dask/dataframe/utils.py", line 308, in _nonempty_series
entry = s.cat.categories[0]
File "[...]/pandas-0.19.0-py3.5-linux-x86_64.egg/pandas/indexes/base.py", line 1393, in __getitem__
return getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0
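How can I rename the categories in a dask dataframe column?
Answer 0 (score: 0)
You may want to look at df.column.cat.codes, which contains the numbers you are looking for. Let's work through an example: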
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'b', 'a']})
In [3]: df['x'] = df.x.astype('category')
In [4]: df
Out[4]:
   x
0  a
1  b
2  a
In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=2)
The .cat.codes attribute:
In [7]: ddf.x.cat.codes
Out[7]:
dd.Series<getattr..., npartitions=1, divisions=(0, 2)>
Dask Series Structure:
divisions
0 int8
2 ...
dtype: int8
In [8]: ddf.x.cat.codes.compute()
Out[8]:
0    0
1    1
2    0
dtype: int8
In [9]: ddf['x'] = ddf.x.cat.codes
In [10]: ddf.compute()
Out[10]:
   x
0  0
1  1
2  0