我具有使用以下键的avro数据:“ id,标签,功能”。 id和label是字符串,而features是浮点数的缓冲区。
import dask.bag as db
avros = db.read_avro('data.avro')
df = avros.to_dataframe()
convert = partial(np.frombuffer, dtype='float64')
X = df.assign(features=lambda x: x.features.apply(convert, meta='float64'))
我最终得到了这个MCVE
label id features
0 good a [1.0, 0.0, 0.0]
1 bad b [1.0, 0.0, 0.0]
2 good c [0.0, 0.0, 0.0]
3 bad d [1.0, 0.0, 1.0]
4 good e [0.0, 0.0, 0.0]
我想要的输出将是:
label id f1 f2 f3
0 good a 1.0 0.0 0.0
1 bad b 1.0 0.0 0.0
2 good c 0.0 0.0 0.0
3 bad d 1.0 0.0 1.0
4 good e 0.0 0.0 0.0
我尝试了一些类似大熊猫的方式,即df[['f1','f2','f3']] = df.features.apply(pd.Series)
在大熊猫中不起作用。
我可以像
这样的循环遍历for i in range(len(features)):
df[f'f{i}'] = df.features.map(lambda x: x[i])
但是在实际的用例中,我具有成千上万的功能,因此遍历了数据集数千次。
达到预期结果的最佳方法是什么?
答案 0 :(得分:0)
In [68]: import string
...: import numpy as np
...: import pandas as pd
In [69]: M, N = 100, 100
...: labels = np.random.choice(['good', 'bad'], size=M)
...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
...: features = np.empty((M,), dtype=object)
...: features[:] = list(map(list, np.random.randn(M, N)))
...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
...: df1 = df.copy()
In [70]: %%time
...: columns = [f"f{i:04d}" for i in range(N)]
...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
...: df1 = pd.concat([df1, features], axis=1)
Wall time: 13.9 ms
In [71]: M, N = 1000, 1000
...: labels = np.random.choice(['good', 'bad'], size=M)
...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
...: features = np.empty((M,), dtype=object)
...: features[:] = list(map(list, np.random.randn(M, N)))
...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
...: df1 = df.copy()
In [72]: %%time
...: columns = [f"f{i:04d}" for i in range(N)]
...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
...: df1 = pd.concat([df1, features], axis=1)
Wall time: 627 ms
In [73]: df1.shape
Out[73]: (1000, 1002)
编辑: 比原始速度快2倍
In [79]: df2 = df.copy()
In [80]: %%time
...: features = df2.pop('features')
...: for i in range(N):
...: df2[f'f{i:04d}'] = features.map(lambda x: x[i])
...:
Wall time: 1.46 s
In [81]: df1.equals(df2)
Out[81]: True
编辑:编辑: 一种更快的构造DataFrame的方法比原始方法提高了8倍:
In [22]: df1 = df.copy()
In [23]: %%time
...: features = pd.DataFrame({f"f{i:04d}": np.asarray(row) for i, row in enumerate(df1.pop('features').to_numpy())})
...: df1 = pd.concat([df1, features], axis=1)
Wall time: 165 ms