Apply existing categories from one Dask DataFrame to another Dask DataFrame

Time: 2017-11-03 03:33:43

Tags: python dask

import pandas as pd
import dask.dataframe as dd

a = pd.DataFrame({'A':[100,102,101,99],'B':[1789,1890,1700,1980]})
b = pd.DataFrame({'A':[100,102,104,105],'B':[1230,1890,1700,1980]})

da = dd.from_pandas(a, npartitions=2)
db = dd.from_pandas(b, npartitions=2)

da = da.categorize()

My question is how to apply the categories of da to db, so that the db dataframe becomes categorical with the values A: [100, 102, nan, nan] and B: [nan, 1890, 1700, 1980].

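A second question: once the above works, how do I replace the categorical values with their codes?

This is essential for data that has already been split into training and test sets. Please help.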

1 Answer:

Answer 0: (score: 1)

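This is cleanest with pandas 0.21.0 (recently released) and dask master from GitHub, which lets dask use the recently improved CategoricalDtype: once the categories are known for da, their dtype can be passed to astype on db, and any value in db that is not among those categories becomes NaN.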
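To see why reusing the dtype gives the NaN pattern asked for, here is a minimal pandas-only sketch of the same idea. The names train, test and dtype_a are illustrative and not from the question, and building the categories from the sorted unique values is an assumption about how you want them ordered.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical stand-ins for column A of the two frames in the question.
train = pd.DataFrame({'A': [100, 102, 101, 99]})
test = pd.DataFrame({'A': [100, 102, 104, 105]})

# Categories learned from the training column only (sorted unique values).
dtype_a = CategoricalDtype(categories=sorted(train['A'].unique()))

# Casting the test column to that dtype keeps known values
# and turns values missing from the categories into NaN.
test_a = test['A'].astype(dtype_a)
print(test_a)
```

Here 104 and 105 never appear in the training column, so they are the two entries that come out as NaN.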

In [1]: %paste
import pandas as pd
import dask.dataframe as dd

a = pd.DataFrame({'A':[100,102,101,99],'B':[1789,1890,1700,1980]})
b = pd.DataFrame({'A':[100,102,104,105],'B':[1230,1890,1700,1980]})

da = dd.from_pandas(a, npartitions=2)
db = dd.from_pandas(b, npartitions=2)
## -- End pasted text --

In [2]: da2 = da.categorize(columns=['A', 'B'])

In [3]: db2 = db.astype({'A': da2.A.dtype, 'B': da2.B.dtype})

In [4]: db2
Out[4]:
Dask DataFrame Structure:
                             A                B
npartitions=2
0              category[known]  category[known]
2                          ...              ...
3                          ...              ...
Dask Name: astype, 4 tasks

In [5]: db2.compute()
Out[5]:
       A       B
0  100.0     NaN
1  102.0  1890.0
2    NaN  1700.0
3    NaN  1980.0
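
The second part of the question (replacing the categorical values with their codes) is not covered above. A possible approach, sketched here in plain pandas on the computed result, is the .cat.codes accessor, which returns the integer position of each value in the category list and -1 for missing values. The names a_cat, b_cat and b_codes are illustrative, and the sorted category order is what pandas infers here; dask's categorize may order categories differently, so the exact codes should be checked against your own data.

```python
import pandas as pd

a = pd.DataFrame({'A': [100, 102, 101, 99], 'B': [1789, 1890, 1700, 1980]})
b = pd.DataFrame({'A': [100, 102, 104, 105], 'B': [1230, 1890, 1700, 1980]})

# Learn the categories from a (pandas sorts the unique values of each column).
a_cat = a.astype('category')

# Reuse a's per-column dtypes on b; values unseen in a become NaN.
b_cat = b.astype({col: a_cat[col].dtype for col in a_cat.columns})

# Replace each categorical column with its integer codes (-1 marks NaN).
b_codes = b_cat.apply(lambda s: s.cat.codes)
print(b_codes)
```

On a Dask Series with known categories the same accessor should be usable as db2.A.cat.codes, but treat that as an assumption to verify against the dask version you have installed.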