Overwriting a dask dataframe column with an sklearn scaler

Asked: 2019-07-15 22:32:26

Tags: python arrays scikit-learn dask

I have a dask dataframe that includes a LotArea column:

[image of the dataframe not available]

I want to apply an sklearn scaler, e.g. StandardScaler, to the LotArea column:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit_transform(df[['LotArea']])

This returns a numpy array:

array([[ 0.82160041],
       [ 1.59216945],
       [ 1.46485804],
       [-0.11648362],
       [-1.10613315],
       [ 0.34906243],
       [-0.23942507],
       [-0.11648362],
       [ 0.40033659],
       [-0.11706628],
       [-0.85762828],
       [-2.07480689]])

But I cannot write the result back into the dataframe with:

df[column] = (scaler.fit_transform(df[[column]]))

It raises the following error:

TypeError: Column assignment doesn't support type numpy.ndarray

I tried converting it to a dask array first, but the result is the same:

df['LotArea'] = da.from_array(scaler.fit_transform(df[[column]]))

TypeError: Column assignment doesn't support type dask.array.core.Array

How can I update the dataframe with the scaler's output?

1 Answer:

Answer 0 (score: 1)

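This boils down to "how do I add a column to a Dask DataFrame". Wrap the dask array in a Dask Series that shares the dataframe's index, then assign that Series: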

In [22]: df = pd.DataFrame({"A": [1, 2, 3, 4]})

In [23]: ddf = dd.from_pandas(df, 2)

In [24]: b = da.from_array(np.array([1, 2, 3, 4]), chunks=2)

In [25]: ddf['B'] = dd.from_dask_array(b, index=ddf.index)

In [26]: ddf.head()
/Users/taugspurger/sandbox/dask/dask/dataframe/core.py:5724: UserWarning: Insufficient elements for `head`. 5 elements requested, only
2 elements available. Try passing larger `npartitions` to `head`.
  warnings.warn(msg.format(n, len(r)))
Out[26]:
   A  B
0  1  1
1  2  2

This may become easier in Dask in the future. See https://github.com/dask/dask/issues/5118