Question

What is the difference between a NumPy "structured array", a "record array", and a "recarray"?

The NumPy docs imply that the first two are the same: if so, which is the preferred term for this object?

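The same documentation says (at the bottom of the page): "You can find some more information on recarrays and structured arrays (including the difference between the two) here." Is there a simple explanation of this difference?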

Answer 1

Records/recarrays are implemented in

https://github.com/numpy/numpy/blob/master/numpy/core/records.py

Some relevant quotes from this file:

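Record Arrays: Record arrays expose the fields of structured arrays as attributes. The recarray is almost identical to a standard array (which already supports named fields). The biggest difference is that it can use attribute lookup to find the fields, and that it is constructed using a record.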

recarray is a subclass of ndarray (in the same way that matrix and masked arrays are). But note that its constructor is different from np.array. It is more like np.empty(size, dtype).
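As a minimal sketch of that point (the field names `x` and `y` are made up for illustration, not taken from the NumPy source), the constructor takes a shape and a dtype, much like np.empty, and the result is still an ndarray:

```python
import numpy as np

# Hypothetical field names, chosen only for illustration.
dt = [("x", "f8"), ("y", "i4")]

# Like np.empty: takes a shape and a dtype; the contents start uninitialized.
rec = np.recarray((3,), dtype=dt)

# Fill the fields with the usual structured-array indexing.
rec["x"] = np.array([1.0, 2.0, 3.0])
rec["y"] = np.array([10, 20, 30])

is_subclass = issubclass(np.recarray, np.ndarray)   # recarray derives from ndarray
same_data = np.array_equal(rec.x, rec["x"])         # attribute and index access agree
```

Because the array is not filled on construction, every field should be assigned before it is read.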

class recarray(ndarray):
    """Construct an ndarray that allows field access using attributes.
    This constructor can be compared to ``empty``: it creates a new record
    array but does not fill it with data.
    """

The key function for implementing the distinctive field-as-attribute behavior is __getattribute__ (__getitem__ implements indexing):

def __getattribute__(self, attr):
    # See if ndarray has this attr, and return it if so. (note that this
    # means a field with the same name as an ndarray attr cannot be
    # accessed by attribute).
    try:
        return object.__getattribute__(self, attr)
    except AttributeError:  # attr must be a fieldname
        pass

    # look for a field with this name
    fielddict = ndarray.__getattribute__(self, 'dtype').fields
    try:
        res = fielddict[attr][:2]
    except (TypeError, KeyError):
        raise AttributeError("recarray has no attribute %s" % attr)
    obj = self.getfield(*res)

    # At this point obj will always be a recarray, since (see
    # PyArray_GetField) the type of obj is inherited. Next, if obj.dtype is
    # non-structured, convert it to an ndarray. If obj is structured leave
    # it as a recarray, but make sure to convert to the same dtype.type (eg
    # to preserve numpy.record type if present), since nested structured
    # fields do not inherit type.
    if obj.dtype.fields:
        return obj.view(dtype=(self.dtype.type, obj.dtype.fields))
    else:
        return obj.view(ndarray)

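It first tries to get a regular attribute - things like .shape, .strides, .data, as well as all the methods (.sum, .reshape, etc.). Failing that, it looks up the name in the dtype field names. So it is really just a structured array with some redefined access methods.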
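A small sketch of that lookup order, assuming a made-up dtype in which one field is deliberately named `shape`: the regular ndarray attribute wins, and the field stays reachable only through indexing.

```python
import numpy as np

# Hypothetical dtype with a field deliberately named like an ndarray attribute.
arr = np.zeros(2, dtype=[("shape", "i4"), ("x", "f8")])
rec = arr.view(np.recarray)

attr_result = rec.shape        # the regular ndarray attribute wins here
field_result = rec["shape"]    # the field itself is still reachable by indexing
x_type = type(rec.x)           # a non-structured field comes back as a plain ndarray
```

This is the case the source comment warns about: a field with the same name as an ndarray attribute cannot be accessed by attribute.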

As best I can tell, record array and recarray are the same.

Another file shows something of the history:

https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py

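Collection of utilities to manipulate structured arrays. Most of these functions were initially implemented by John Hunter for matplotlib. They have been rewritten and extended for convenience.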

Many of the functions in this file end with:

if asrecarray:
    output = output.view(recarray)

The fact that you can return an array as a recarray view shows how "thin" this layer is.
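To illustrate how thin the layer is, here is a hedged sketch (the field names and values are invented for the example): viewing a structured array as a recarray makes no copy, so a write through one is visible through the other.

```python
import numpy as np

# Hypothetical data, for illustration only.
struct = np.array([(1.0, 2), (3.0, 4)], dtype=[("x", "f8"), ("y", "i4")])

# No copy is made: this is a different class over the same buffer.
rec = struct.view(np.recarray)

rec.y[0] = 99                              # write through the recarray view
shares = np.shares_memory(struct, rec)     # both objects use the same memory
back = rec.view(np.ndarray)                # and back to a plain ndarray view
```

After this runs, struct["y"][0] holds the value written through rec, since both names refer to the same underlying data.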

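numpy has a long history and merges several independent projects. My impression is that recarray is the older idea, and that structured arrays are the current implementation, built on a generalized dtype. recarrays seem to be kept for convenience and backward compatibility rather than for any new development. But I would have to study the GitHub file history, and any recent issues/pull requests, to be sure.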

Answer 2

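The short answer is that you should generally use structured arrays rather than recarrays, because structured arrays are faster and the only advantage of recarrays is that they let you write arr.x instead of arr['x']. That can be a convenient shortcut, but it is also error-prone if your column names conflict with numpy methods/attributes.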

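See this excerpt from @jakevdp's book for a more detailed explanation. In particular, he notes that simply accessing columns of structured arrays can be around 20x to 30x faster than accessing columns of recarrays. However, his example uses a very small dataframe with just 4 rows and does not perform any standard operations.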

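For simple operations on larger dataframes, the difference is likely to be much smaller, although structured arrays are still faster. For example, here are a structured array and a record array, each with 10,000 rows (the code to create the arrays from a dataframe is borrowed from @jpp's answer here).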

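import numpy as np
import pandas as pd

n = 10_000
df = pd.DataFrame({ 'x':np.random.randn(n) })
df['y'] = df.x.astype(int)

# Record array built directly by pandas
rec_array = df.to_records(index=False)

# Structured array with the same column names and dtypes
s = df.dtypes
struct_array = np.array([tuple(x) for x in df.values], dtype=list(zip(s.index, s)))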

If we do a standard operation such as multiplying a column by 2, it is about 50% faster for the structured array:

%timeit struct_array['x'] * 2
9.18 µs ± 88.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit rec_array.x * 2
14.2 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
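For readers without IPython, a rough equivalent of this comparison can be sketched with the standard timeit module. This is an assumption-laden variant rather than the benchmark above: it builds the arrays without pandas, uses a fixed seed, and picks an arbitrary loop count, so the absolute numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

n = 10_000
loops = 2_000   # arbitrary loop count, chosen only to keep the run short

# Build a structured array directly (no pandas), then view it as a recarray.
rng = np.random.default_rng(0)
struct_array = np.zeros(n, dtype=[("x", "f8"), ("y", "i8")])
struct_array["x"] = rng.standard_normal(n)
struct_array["y"] = struct_array["x"].astype("i8")
rec_array = struct_array.view(np.recarray)

# Time the same operation through index access and through attribute access.
t_struct = timeit.timeit(lambda: struct_array["x"] * 2, number=loops)
t_rec = timeit.timeit(lambda: rec_array.x * 2, number=loops)

print(f"structured: {t_struct / loops * 1e6:.2f} µs per loop")
print(f"recarray:   {t_rec / loops * 1e6:.2f} µs per loop")
```

Both expressions compute the same values; only the lookup path for the column differs.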
