import pandas as pd
import dask.dataframe as dd
a = pd.DataFrame({'A':[100,102,101,99],'B':[1789,1890,1700,1980]})
b = pd.DataFrame({'A':[100,102,104,105],'B':[1230,1890,1700,1980]})
da = dd.from_pandas(a, npartitions=2)
db = dd.from_pandas(b, npartitions=2)
da = da.categorize()
My question is how to apply the categories of da to db, so that the db dataframe is categorical and has the values A: [100, 102, nan, nan] and B: [nan, 1890, 1700, 1980].
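A second question: once the above works, how do I replace the categorical values with their codes?
This is essential for data that has already been split into training and test sets. Please help.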
Answer 0 (score: 1)
This is cleanest with pandas 0.21.0 (recently released) and dask master from GitHub. That combination lets dask use the recently improved CategoricalDtype:
In [1]: %paste
import pandas as pd
import dask.dataframe as dd
a = pd.DataFrame({'A':[100,102,101,99],'B':[1789,1890,1700,1980]})
b = pd.DataFrame({'A':[100,102,104,105],'B':[1230,1890,1700,1980]})
da = dd.from_pandas(a, npartitions=2)
db = dd.from_pandas(b, npartitions=2)
## -- End pasted text --
In [2]: da2 = da.categorize(columns=['A', 'B'])
In [3]: db2 = db.astype({'A': da2.A.dtype, 'B': da2.B.dtype})
In [4]: db2
Out[4]:
Dask DataFrame Structure:
                             A                B
npartitions=2
0              category[known]  category[known]
2                          ...              ...
3                          ...              ...
Dask Name: astype, 4 tasks
In [5]: db2.compute()
Out[5]:
       A       B
0  100.0     NaN
1  102.0  1890.0
2    NaN  1700.0
3    NaN  1980.0
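
The answer above does not cover the second part of the question (replacing the categorical values with their codes). A minimal sketch of one way to do it follows, using plain pandas rather than dask so that it stays self-contained. The names dtypes, a_cat, b_cat, a_codes and b_codes are illustrative and not from the original post, and sorting the categories is an assumption made here so that the codes are deterministic. Values in b that never appear in a become NaN, and NaN maps to the code -1.

```python
import pandas as pd

a = pd.DataFrame({'A': [100, 102, 101, 99], 'B': [1789, 1890, 1700, 1980]})
b = pd.DataFrame({'A': [100, 102, 104, 105], 'B': [1230, 1890, 1700, 1980]})

# One CategoricalDtype per column, built only from the training frame `a`.
# Sorting the categories is an assumption that keeps the codes reproducible.
dtypes = {
    col: pd.CategoricalDtype(categories=sorted(a[col].unique()))
    for col in a.columns
}

# Apply the same dtypes to both frames; values of `b` that are not
# among the categories of `a` become NaN.
a_cat = a.astype(dtypes)
b_cat = b.astype(dtypes)

# Replace each categorical column by its integer codes (NaN becomes -1).
a_codes = a_cat.apply(lambda s: s.cat.codes)
b_codes = b_cat.apply(lambda s: s.cat.codes)

print(b_codes)
```

Because both frames share the same dtype objects, a given code means the same value in the training and the test data. The -1 entries mark test values that were unseen in training and usually need separate handling before modelling.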