Question

I have a numpy structured array containing both integers and floats, and I use it to initialize a pandas DataFrame:

In [497]: x = np.ones(100000000, dtype=[('f0', '<i8'), ('f1', '<f8'),('f2','<i8'),('f3', '<f8'),('f4', '<f8'),('f5', '<f8'),('f6', '<f8'),('f7', '<f8')])

In [498]: %timeit pd.DataFrame(x)
The slowest run took 4.07 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 2min 26s per loop


In [499]: xx=x.view(np.float64).reshape(x.shape + (-1,))

In [500]: %timeit pd.DataFrame(xx)
1 loops, best of 3: 256 ms per loop

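As the code above shows, initializing a DataFrame from the structured array is very slow. If I turn the data into a contiguous float numpy array instead, it is fast. However, I still need the DataFrame to hold a mix of floats and integers.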
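To make the limitation of the float view concrete, here is a minimal sketch on a 5-row stand-in for the array above (the small size is an assumption chosen only to keep it quick). It shows that `view(np.float64)` reinterprets the bytes of the integer fields rather than converting them, which is why the fast path cannot simply replace the mixed-dtype DataFrame.

```python
import numpy as np

# Small stand-in for the 100-million-row array in the question.
dt = [('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8'), ('f3', '<f8'),
      ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8')]
x = np.ones(5, dtype=dt)

# Same view as in the question: every 8-byte field is read as a float64.
xx = x.view(np.float64).reshape(x.shape + (-1,))

print(xx.shape)   # one row per record, one column per field
print(xx[0, 1])   # a float field keeps its value
print(xx[0, 0])   # an integer field is reinterpreted bit-for-bit, not converted

# Reading the first column back as int64 recovers the original integers,
# which confirms that the bytes were only reinterpreted.
f0_back = xx[:, 0].copy().view(np.int64)
print(f0_back)
```

The view shares memory with `x`, so it is cheap to create, but the columns for `f0` and `f2` do not contain usable float values.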

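After some testing, I realized that DataFrame is actually copying the entire structured array (this does not happen when it is initialized from the float view of the structured array). I found more information here: https://github.com/pydata/pandas/issues/9216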

Is there any way to speed up the initialization and avoid the copy? I am open to other approaches, but the data comes from a structured array.
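One alternative that could be timed against the direct constructor is to hand pandas one 1-D array per field. The sketch below is only an assumption about what might help: it has not been benchmarked here, it uses a 1000-row stand-in, and it makes no claim about whether pandas still copies each column internally, which may depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Small stand-in for the array in the question.
dt = [('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8'), ('f3', '<f8'),
      ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8')]
x = np.ones(1000, dtype=dt)

# Build the frame column by column; x[name] is a strided view of one field.
df = pd.DataFrame({name: x[name] for name in x.dtype.names})

# Integer and float fields keep their own dtypes.
print(df.dtypes)
```

Running `%timeit` on both this construction and `pd.DataFrame(x)` at the real array size would show whether it makes any difference in a given environment.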

pandas DataFrame is very slow when initialized with a numpy structured array

0 answers: