Expanding an array into columns in a dask dataframe

Asked: 2019-10-14 21:32:50

Tags: python dask

I have avro data with the keys 'id', 'label', and 'features'. id and label are strings, while features is a buffer of float64 values.

import dask.bag as db
import numpy as np
from functools import partial

avros = db.read_avro('data.avro')
df = avros.to_dataframe()
# features arrives as a raw byte buffer; decode it into a float64 array
convert = partial(np.frombuffer, dtype='float64')
X = df.assign(features=lambda x: x.features.apply(convert, meta='float64'))

I end up with this MCVE:

  label id         features
0  good  a  [1.0, 0.0, 0.0]
1   bad  b  [1.0, 0.0, 0.0]
2  good  c  [0.0, 0.0, 0.0]
3   bad  d  [1.0, 0.0, 1.0]
4  good  e  [0.0, 0.0, 0.0]

The output I want would be:

  label id   f1   f2   f3
0  good  a  1.0  0.0  0.0
1   bad  b  1.0  0.0  0.0
2  good  c  0.0  0.0  0.0
3   bad  d  1.0  0.0  1.0
4  good  e  0.0  0.0  0.0

I tried a few pandas-style approaches, e.g. df[['f1','f2','f3']] = df.features.apply(pd.Series), which works in pandas but not in dask.
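For reference, a minimal in-memory sketch (my own illustration, not from the original post) of what the apply(pd.Series) idiom does in plain pandas:

```python
import pandas as pd

# Small pandas frame mimicking the MCVE above
df = pd.DataFrame({
    'label': ['good', 'bad'],
    'id': ['a', 'b'],
    'features': [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0]],
})

# Each list becomes a row of a new DataFrame with columns 0, 1, 2 ...
expanded = df.features.apply(pd.Series)
expanded.columns = [f'f{i+1}' for i in range(expanded.shape[1])]

# Join the new columns back, dropping the original features column
result = pd.concat([df.drop(columns='features'), expanded], axis=1)
```

Note that apply(pd.Series) is known to be slow for wide data even in pandas, which is part of why it does not translate well to dask.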

I could loop over it like this:

for i in range(len(features)):
    # bind i at definition time; dask evaluates the map lazily
    df[f'f{i}'] = df.features.map(lambda x, i=i: x[i])

But in the real use case I have thousands of features, so this would iterate over the dataset thousands of times.

What is the best way to achieve the desired result?

1 Answer:

Answer 0 (score: 0)

In [68]: import string
    ...: import numpy as np
    ...: import pandas as pd

In [69]: M, N = 100, 100
    ...: labels = np.random.choice(['good', 'bad'], size=M)
    ...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
    ...: features = np.empty((M,), dtype=object)
    ...: features[:] = list(map(list, np.random.randn(M, N)))
    ...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
    ...: df1 = df.copy()

In [70]: %%time
    ...: columns = [f"f{i:04d}" for i in range(N)]
    ...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 13.9 ms

In [71]: M, N = 1000, 1000
    ...: labels = np.random.choice(['good', 'bad'], size=M)
    ...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
    ...: features = np.empty((M,), dtype=object)
    ...: features[:] = list(map(list, np.random.randn(M, N)))
    ...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
    ...: df1 = df.copy()

In [72]: %%time
    ...: columns = [f"f{i:04d}" for i in range(N)]
    ...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 627 ms

In [73]: df1.shape
Out[73]: (1000, 1002)

Edit: the construction above is about 2x faster than the original loop, timed below for comparison:

In [79]: df2 = df.copy()

In [80]: %%time
    ...: features = df2.pop('features')
    ...: for i in range(N):
    ...:     df2[f'f{i:04d}'] = features.map(lambda x: x[i])
    ...:     
Wall time: 1.46 s

In [81]: df1.equals(df2)
Out[81]: True

Edit 2: an even faster way to construct the DataFrame, about an 8x improvement over the original loop:

In [22]: df1 = df.copy()

In [23]: %%time
    ...: features = pd.DataFrame({f"f{i:04d}": np.asarray(row) for i, row in enumerate(df1.pop('features').to_numpy())})
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 165 ms
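To bring this back to the dask dataframe in the question, the same column construction can be applied to every partition at once with map_partitions. A sketch under the assumption that every features array has the same known length N (the names here are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

N = 3  # hypothetical known feature length

def expand_features(pdf, n=N):
    """Expand the object-dtype 'features' column of one partition
    (a plain pandas DataFrame) into n float columns."""
    feats = pd.DataFrame(
        np.vstack(pdf.pop('features').to_numpy()),
        index=pdf.index,
        columns=[f'f{i:04d}' for i in range(n)],
    )
    return pd.concat([pdf, feats], axis=1)

# With dask, this would run once per partition (X as in the question):
# X2 = X.map_partitions(expand_features)

# Demonstrated here on a small pandas frame:
pdf = pd.DataFrame({
    'label': ['good', 'bad'],
    'id': ['a', 'b'],
    'features': [np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 1.0])],
})
out = expand_features(pdf)
```

Each partition is expanded in a single vectorized vstack/concat rather than one pass per feature, which avoids the thousands of dataset traversals the question worries about.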