Question

I have the following code:

import numpy as np
import pandas as pd

test_array = np.array([(1, 2, 3), (4, 5, 6)], 
                      dtype={'names': ('a', 'b', 'c'), 'formats': ('f8', 'f8', 'f8')})
test_df = pd.DataFrame.from_records(test_array)
test_df.to_records().view(np.float64).reshape(test_array.shape + (-1, ))

I expected this to return a view of the original test_array with shape (2, 3), but instead I get this (2, 4) array:

rec.array([[0.e+000, 1.e+000, 2.e+000, 3.e+000],
           [5.e-324, 4.e+000, 5.e+000, 6.e+000]],
          dtype=float64)

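Where does the extra column (column 0) come from?

Edit: I have just learned that I can use DataFrame.values to do the same thing, but I am still curious why this behavior exists.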

Answer 1

If you need a record array, use np.rec.fromrecords:

np.rec.fromrecords(test_df, names=[*test_df])
# rec.array([(1., 2., 3.), (4., 5., 6.)],
#          dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

In my tests, this was somewhat faster than df.to_records.
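A rough way to check the timing claim yourself is sketched below. It rebuilds test_df from the question, and the helper names via_fromrecords and via_to_records are made up for this illustration. Results will vary with the machine, the library versions, and the size of the frame, so treat it as a starting point rather than a benchmark.

```python
import timeit

import numpy as np
import pandas as pd

test_array = np.array([(1, 2, 3), (4, 5, 6)],
                      dtype={'names': ('a', 'b', 'c'), 'formats': ('f8', 'f8', 'f8')})
test_df = pd.DataFrame.from_records(test_array)


def via_fromrecords():
    # Hypothetical helper: the approach suggested in this answer.
    return np.rec.fromrecords(test_df, names=[*test_df])


def via_to_records():
    # Hypothetical helper: the pandas method, with the index left out.
    return test_df.to_records(index=False)


# Both routes should produce the same field names.
same_names = via_fromrecords().dtype.names == via_to_records().dtype.names
print(same_names)

# No expected numbers are shown: timings depend on the environment.
n = 1000
print("fromrecords:", timeit.timeit(via_fromrecords, number=n))
print("to_records: ", timeit.timeit(via_to_records, number=n))
```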

Answer 2

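to_records is also capturing the index. Note that this is stated in the docs:

Index will be included as the first field of the record array if requested

If you want to exclude it, just set index=False.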
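To make the extra field visible, here is a minimal sketch that rebuilds test_df from the question and compares the field names with and without the index. It also hints at where the odd 5.e-324 in the second row comes from: the index values 0 and 1 are stored as integers, and reinterpreting the bytes of the integer 1 as a float64 gives the smallest subnormal double. This assumes the default index is stored as a 64-bit integer, which matches the four 8-byte columns seen in the question.

```python
import numpy as np
import pandas as pd

test_array = np.array([(1, 2, 3), (4, 5, 6)],
                      dtype={'names': ('a', 'b', 'c'), 'formats': ('f8', 'f8', 'f8')})
test_df = pd.DataFrame.from_records(test_array)

with_index = test_df.to_records()                # default: index=True
without_index = test_df.to_records(index=False)

print(with_index.dtype.names)     # ('index', 'a', 'b', 'c')
print(without_index.dtype.names)  # ('a', 'b', 'c')

# Reinterpret the bytes of the int64 value 1 as a float64 (assumption: the
# index field is int64). The result is the tiny subnormal from the question.
index_one_as_float = np.array([1], dtype=np.int64).view(np.float64)[0]
print(index_one_as_float)         # 5e-324
```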

In your case, though, you could simply use to_numpy (or values):

test_df.to_numpy().view(np.float64).reshape(test_array.shape + (-1, ))

array([[1., 2., 3.],
       [4., 5., 6.]])
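Since all three columns already hold float64, the view and reshape in the snippet above are probably redundant: to_numpy() on its own should already return a plain two-dimensional array. A small check, rebuilding test_df from the question (the name plain is just for this illustration):

```python
import numpy as np
import pandas as pd

test_array = np.array([(1, 2, 3), (4, 5, 6)],
                      dtype={'names': ('a', 'b', 'c'), 'formats': ('f8', 'f8', 'f8')})
test_df = pd.DataFrame.from_records(test_array)

# No view or reshape: the frame is homogeneous, so this is a 2-D float array.
plain = test_df.to_numpy()
print(plain.shape)  # (2, 3)
print(plain.dtype)  # float64
```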

Answer 3

Set index=False in to_records:

test_df.to_records(index=False).view(np.float64).reshape(test_array.shape + (-1, ))
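As a sanity check, the sketch below compares that result with the original test_array reinterpreted the same way. The names result and expected are only for this illustration. Note that the values match, but the result is most likely a fresh copy rather than a view onto test_array's memory, which np.shares_memory can confirm on your setup.

```python
import numpy as np
import pandas as pd

test_array = np.array([(1, 2, 3), (4, 5, 6)],
                      dtype={'names': ('a', 'b', 'c'), 'formats': ('f8', 'f8', 'f8')})
test_df = pd.DataFrame.from_records(test_array)

# Round trip through pandas, leaving the index out of the records.
result = test_df.to_records(index=False).view(np.float64).reshape(test_array.shape + (-1,))

# The same reinterpretation applied directly to the original structured array.
expected = test_array.view(np.float64).reshape(test_array.shape + (-1,))

print(result.shape)                       # (2, 3)
print(np.array_equal(result, expected))   # True

# Whether the round trip shares memory with the original array.
print(np.shares_memory(result, test_array))
```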

Convert a pandas DataFrame to a record array without the extra column
