I am trying to use get_dummies via dask, but it neither transforms my variable nor raises an error:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_d.head()
uid gender
0 1 M
1 2 NaN
2 3 NaN
3 4 F
4 5 NaN
>>> daskDataCategorical = df_d[['gender']]
>>> daskDataDummies = dd.get_dummies(daskDataCategorical)
>>> daskDataDummies.head()
gender
0 M
1 NaN
2 NaN
3 F
4 NaN
>>> daskDataDummies.compute()
gender
0 M
1 NaN
2 NaN
3 F
4 NaN
5 F
6 M
7 F
8 M
9 F
>>>
The pandas equivalent (run in a new terminal, just in case) is:
>>> import pandas as pd
>>> df_p = pd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_p.head()
uid gender
0 1 M
1 2 NaN
2 3 NaN
3 4 F
4 5 NaN
>>> pandasDataCategorical = df_p[['gender']]
>>> pandasDataDummies = pd.get_dummies(pandasDataCategorical)
>>> pandasDataDummies.head()
gender_F gender_M
0 0.0 1.0
1 0.0 0.0
2 0.0 0.0
3 1.0 0.0
4 0.0 0.0
>>>
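My reading of this resolved issue is that it should work, but does the data need to be brought into pandas first? If so, that defeats my purpose in using dask, since my dataset (~500GB) will not fit in a pandas DataFrame. Am I misreading it? TIA.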
Answer 0 (score: 1)
You need to convert the string column to a Categorical before trying to use get_dummies. This pull request added dask.dataframe.get_dummies, which, unlike pd.get_dummies, raises an error if you pass it an object (string) column.
To get a Categorical, you can call .categorize before dd.get_dummies or, with pandas >= 0.19, pass the dtype keyword to read_csv, like this:
df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv', dtype={"gender": "category"})
Here is a small example:
In [2]: import dask.dataframe as dd
In [3]: bad = dd.from_pandas(pd.DataFrame({"A": ['a', 'b', 'a', 'b', 'c']}), npartitions=2)
In [4]: bad.head()
/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/core.py:3699: UserWarning: Insufficient elements for `head`. 5 elements requested, only 3 elements available. Try passing larger `npartitions` to `head`.
warnings.warn(msg.format(n, len(r)))
Out[4]:
A
0 a
1 b
2 a
In [5]: dd.get_dummies(bad)
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-5-651de6dd308c> in <module>()
----> 1 dd.get_dummies(bad)
/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
68 if columns is None:
69 if (data.dtypes == 'object').any():
---> 70 raise NotImplementedError(not_cat_msg)
71 columns = data._meta.select_dtypes(include=['category']).columns
72 else:
NotImplementedError: `get_dummies` with non-categorical dtypes is not supported. Please use `df.categorize()` beforehand to convert to categorical dtype.
In [7]: dd.get_dummies(bad.categorize()).compute()
Out[7]:
A_a A_b A_c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 1 0
4 0 0 1
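Dask requires categoricals for get_dummies because it needs to know up front every new dummy variable it has to create. pandas does not have to worry about this, because all of your data is already in memory.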
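To make that last point concrete, here is a minimal sketch in plain pandas (no dask involved) of what goes wrong when each chunk is encoded on its own. The two small frames and the names part1, part2 and gender_dtype are made up for illustration; they stand in for partitions of a larger file. It assumes the full set of categories ("F" and "M") is known in advance.

```python
import pandas as pd

# Two hypothetical "partitions" of one column, standing in for chunks of a large file.
part1 = pd.DataFrame({"gender": ["M", None, "M"]})
part2 = pd.DataFrame({"gender": ["F", "M", None]})

# With plain string columns, each chunk only sees its own values,
# so the dummy columns differ from chunk to chunk.
naive_cols = [list(pd.get_dummies(p).columns) for p in (part1, part2)]

# Declaring the full category set up front fixes the output schema:
# every chunk yields one column per category, observed or not.
gender_dtype = pd.CategoricalDtype(categories=["F", "M"])
typed_cols = [
    list(pd.get_dummies(p.astype({"gender": gender_dtype})).columns)
    for p in (part1, part2)
]

print(naive_cols)
print(typed_cols)
```

The first chunk never sees "F", so without a categorical dtype it produces no gender_F column and the chunks cannot be stacked into one consistent result. With the categories declared, both chunks produce the same columns, which is the information dask needs before it can build the output lazily.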