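Dask get_dummies does not transform the variable

Time: 2017-01-25 14:47:15

Tags: python pandas dask dummy-variable

I'm trying to use get_dummies through dask, but it doesn't transform my variable, and it doesn't raise an error either: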

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_d.head()
   uid gender
0    1      M
1    2    NaN
2    3    NaN
3    4      F
4    5    NaN
>>> daskDataCategorical = df_d[['gender']]
>>> daskDataDummies = dd.get_dummies(daskDataCategorical) 
>>> daskDataDummies.head()
  gender
0      M
1    NaN
2    NaN
3      F
4    NaN
>>> daskDataDummies.compute() 
  gender
0      M
1    NaN
2    NaN
3      F
4    NaN
5      F
6      M
7      F
8      M
9      F
>>>

The pandas equivalent (run in a new terminal, just in case) is:

>>> import pandas as pd
>>> df_p = pd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv')
>>> df_p.head()
   uid gender
0    1      M
1    2    NaN
2    3    NaN
3    4      F
4    5    NaN
>>> pandasDataCategorical = df_p[['gender']]
>>> pandasDataDummies = pd.get_dummies(pandasDataCategorical)
>>> pandasDataDummies.head()
   gender_F  gender_M
0       0.0       1.0
1       0.0       0.0
2       0.0       0.0
3       1.0       0.0
4       0.0       0.0
>>> 

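My reading of this resolved issue is that it should work, but does the data need to be brought into pandas first? If so, that defeats my purpose for using it, since my dataset (~500GB) won't fit in a pandas DataFrame. Am I misreading it? TIA.

1 answer:

Answer 0 (score: 1)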

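You'll need to convert the string column to a Categorical before attempting to use get_dummies. This pull request added dask.dataframe.get_dummies, which, unlike pd.get_dummies, raises an error if you try to pass an object (string) column.

To get a Categorical, you can either call .categorize before dd.get_dummies or, with pandas >= 0.19, use the dtype keyword in read_csv, like so: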

df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv', dtype={"gender": "category"})

Here's a small example:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: bad = dd.from_pandas(pd.DataFrame({"A": ['a', 'b', 'a', 'b', 'c']}), npartitions=2)

In [4]: bad.head()
/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/core.py:3699: UserWarning: Insufficient elements for `head`. 5 elements requested, only 3 elements available. Try passing larger `npartitions` to `head`.
  warnings.warn(msg.format(n, len(r)))
Out[4]:
   A
0  a
1  b
2  a

In [5]: dd.get_dummies(bad)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-5-651de6dd308c> in <module>()
----> 1 dd.get_dummies(bad)

/Users/tom.augspurger/Envs/py3/lib/python3.6/site-packages/dask/dask/dataframe/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first)
     68         if columns is None:
     69             if (data.dtypes == 'object').any():
---> 70                 raise NotImplementedError(not_cat_msg)
     71             columns = data._meta.select_dtypes(include=['category']).columns
     72         else:

NotImplementedError: `get_dummies` with non-categorical dtypes is not supported. Please use `df.categorize()` beforehand to convert to categorical dtype.

In [7]: dd.get_dummies(bad.categorize()).compute()
Out[7]:
   A_a  A_b  A_c
0    1    0    0
1    0    1    0
2    1    0    0
3    0    1    0
4    0    0    1

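Dask requires Categoricals for get_dummies because it needs to know up front all of the new dummy variables it has to create. pandas doesn't have to worry about this, since all of your data is already in memory.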
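
To make that concrete, here is a toy illustration in plain Python. It is not how Dask is implemented: dummies_for_partition is a hypothetical helper, and the two lists stand in for partitions. It only shows that partitions encoded independently can disagree on their columns unless the categories are shared ahead of time.

```python
def dummies_for_partition(values, categories=None):
    """One-hot encode one partition (hypothetical helper, not a Dask function).

    If `categories` is None, the columns are inferred from the values seen
    in this partition only, which is all a single partition can know.
    """
    if categories is None:
        categories = sorted({v for v in values if v is not None})
    return [{f"gender_{c}": int(v == c) for c in categories} for v in values]


# Two partitions of the same column; None stands in for NaN.
partitions = [["M", None, None], ["F", None, "F"]]

# Without shared categories, each partition produces different columns.
local = [dummies_for_partition(p) for p in partitions]
local_columns = [sorted(rows[0]) for rows in local]
print(local_columns)

# With categories known up front, every partition agrees on the schema.
known = ["F", "M"]
shared = [dummies_for_partition(p, known) for p in partitions]
shared_columns = [sorted(rows[0]) for rows in shared]
print(shared_columns)
```

The first print shows one column per partition (gender_M for the first, gender_F for the second), while the second shows both columns in both partitions. That consistent schema is what a Categorical with known categories provides.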