Question

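I have a script that accumulates (counts) the bytes contained in two files. The bytes are C-like unsigned char integer values between 0 and 255.

The goal of this accumulator script is to compute the joint counts, or frequencies, of bytes across the two files. It may later be extended to multiple files/dimensions.

The two files are the same size, but they are very large, on the order of 6 TB.

I am using numpy.uint64 values because I ran into overflow problems with Python's int type.

I have a 1D accumulator array of length 256**2 (buckets_per_signal_type**2 in the code below) to store the joint counts.

I compute the offset with a row-by-column-to-array offset calculation, so that the joint frequency is incremented at the right index. I walk through both files in chunks of bytes (n_bytes), unpack them, and increment the frequency counter.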
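As a small worked example of the offset calculation described above, here is a minimal sketch. It assumes 256 buckets per file (one per byte value, which is what the code sketch computes); the helper name flat_offset is hypothetical and used only for illustration.

```python
import numpy as np

N = 256  # number of distinct byte values (0..255)

def flat_offset(first_byte, second_byte, n=N):
    """Row-major offset of the pair (first_byte, second_byte)."""
    return first_byte * n + second_byte

# The same mapping, computed by NumPy for comparison
first, second = 3, 7
offset = flat_offset(first, second)
assert offset == np.ravel_multi_index((first, second), (N, N))

# divmod inverts the mapping and recovers the original pair
assert divmod(offset, N) == (first, second)

# The largest pair must still fit in the accumulator array
max_offset = flat_offset(255, 255)
assert max_offset == N * N - 1
```

Because the pair (255, 255) maps to offset 256**2 - 1, the accumulator needs 256**2 slots; an array of length 255**2 would be too short for the highest byte values.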

Here is a rough sketch of the code:

import numpy
import ctypes
import struct

buckets_per_signal_type = 2**(ctypes.c_ubyte(1).value * 8)
total_buckets = buckets_per_signal_type**2
buckets = numpy.zeros((total_buckets,), dtype=numpy.uint64)

# open file handles to two files (omitted for brevity...)

# buffer size that is known ahead of time to be a divisible 
# unit of the original files 
# (for example, here, reading in 2.4e6 bytes per loop iteration)
n_bytes = 2400000

total_bytes = 0

# format used to unpack bytes
struct_format = "=%dB" % (n_bytes)

while True:    
    # read in n_bytes chunk from each file
    first_file_bytes = first_file_handle.read(n_bytes)
    second_file_bytes = second_file_handle.read(n_bytes)

    # break if both file handles have nothing left to read
    if len(first_file_bytes) == 0 and len(second_file_bytes) == 0:
        break

    # unpack actual bytes
    first_bytes_unpacked = struct.unpack(struct_format, first_file_bytes)
    second_bytes_unpacked = struct.unpack(struct_format, second_file_bytes)

    for index in range(0, n_bytes):
        first_byte = first_bytes_unpacked[index]
        second_byte = second_bytes_unpacked[index]
        offset = first_byte * buckets_per_signal_type + second_byte
        buckets[offset] += 1

    total_bytes += n_bytes
    # repeat until both file handles are both EOF

# print out joint frequency (omitted)

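This runs very slowly compared with a version where I used int: it is an order of magnitude slower. The original job completed in about 8 hours (incorrectly, due to overflow), while this numpy-based version had to be stopped early because it looked like it would take about 12-14 days to finish.

Either numpy is incredibly slow at this basic task, or I am not using numpy for the accumulator in an idiomatic way. I suspect the latter, which is why I am asking for help.

I read about numpy.add.at, but the unpacked byte arrays I would add to the buckets array do not have offset values that translate naturally to the "shape" of the buckets array.

Is there a way to store and increment an array of (long) integers that does not overflow and that performs reasonably well?

I could rewrite this in C, I guess, but I am hoping there is something in numpy I am overlooking that will solve this quickly. Thanks for your advice.

Update

I had older versions of numpy and scipy that did not support numpy.add.at. So that was another issue to look into.

I'll try the following and see how it goes: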

# assumes numpy is imported as np
first_byte_arr = np.array(first_bytes_unpacked)
second_byte_arr = np.array(second_bytes_unpacked)
offsets = first_byte_arr * buckets_per_signal_type + second_byte_arr
np.add.at(buckets, offsets, 1)

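Hopefully it runs a bit faster!

Update II

Using np.add.at and np.array, this job would take roughly 12 days to complete. I'm going to give up on numpy for now and go back to reading raw bytes with C, where the runtimes are a bit more reasonable. Thanks all for your advice!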

Answer 1

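Without trying to follow all of your file-reading and struct code, it looks like you are adding 1 to assorted slots in the buckets array. That part shouldn't take days.

But to get an idea of how the dtype of buckets affects that step, I'll test adding 1 to a random assortment of indices.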

In [57]: idx = np.random.randint(0,255**2,10000)
In [58]: %%timeit buckets = np.zeros(255**2, dtype=np.int64)
    ...: for i in idx:
    ...:    buckets[i] += 1
    ...: 
9.38 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [59]: %%timeit buckets = np.zeros(255**2, dtype=np.uint64)
    ...: for i in idx:
    ...:    buckets[i] += 1
    ...: 
71.7 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

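The uint64 version is about 8 times slower.

If there were no duplicate indices we could just use buckets[idx] += 1. But since duplicates are allowed, we have to use add.at: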

In [60]: %%timeit buckets = np.zeros(255**2, dtype=np.int64)
    ...: np.add.at(buckets, idx, 1)
    ...: 
1.6 ms ± 348 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [61]: %%timeit buckets = np.zeros(255**2, dtype=np.uint64)
    ...: np.add.at(buckets, idx, 1)
    ...: 
1.62 ms ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

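Interestingly, the uint64 dtype doesn't affect the timing in this case.

You mentioned in the comments that you tried a list accumulator. I assume it looked like this: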

In [62]: %%timeit buckets = [0]*(255**2)
    ...: for i in idx:
    ...:    buckets[i] += 1
    ...: 
3.59 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

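That is faster than the iterative version on arrays. In general, iterating over arrays is slower than iterating over lists. It is the "whole-array" operations, such as add.at, that are faster.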
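Another whole-array operation that may be worth trying for pure counting is np.bincount. This is a swapped-in technique rather than something timed above; the sketch below only checks that it produces the same counts as add.at, and whether it is actually faster on your data would need to be measured. The bucket count of 256**2 and the random indices are assumptions for illustration.

```python
import numpy as np

n_buckets = 256 ** 2
rng = np.random.default_rng(0)
idx = rng.integers(0, n_buckets, size=10_000)  # indices, duplicates allowed

# Reference result using the unbuffered ufunc method
expected = np.zeros(n_buckets, dtype=np.int64)
np.add.at(expected, idx, 1)

# Whole-array alternative: count every index of the chunk in one call,
# then add the per-chunk counts to the running uint64 total
buckets = np.zeros(n_buckets, dtype=np.uint64)
chunk_counts = np.bincount(idx, minlength=n_buckets)
buckets += chunk_counts.astype(np.uint64)  # explicit cast from signed counts

assert np.array_equal(buckets, expected.astype(np.uint64))
```

The minlength argument keeps the result the same length as buckets even when the highest indices do not occur in a chunk, so the per-chunk counts can be added directly.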

To verify that add.at is a correct substitute for the iteration, compare

In [63]: buckets0 = np.zeros(255**2, dtype=np.int64)
In [64]: for i in idx: buckets0[i] += 1

In [66]: buckets01 = np.zeros(255**2, dtype=np.int64)
In [67]: np.add.at(buckets01, idx, 1)
In [68]: np.allclose(buckets0, buckets01)
Out[68]: True

In [69]: buckets02 = np.zeros(255**2, dtype=np.int64)
In [70]: buckets02[idx] += 1
In [71]: np.allclose(buckets0, buckets02)
Out[71]: False

In [75]: bucketslist = [0]*(255**2)
In [76]: for i in idx: bucketslist[i] += 1
In [77]: np.allclose(buckets0, bucketslist)
Out[77]: True
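The comparison above uses random indices, so the reason buckets[idx] += 1 fails is easier to see on a tiny input. A minimal sketch, with a hand-picked index array chosen only for illustration:

```python
import numpy as np

idx = np.array([0, 0, 1])  # index 0 appears twice

buffered = np.zeros(3, dtype=np.int64)
buffered[idx] += 1  # each distinct index is incremented only once

unbuffered = np.zeros(3, dtype=np.int64)
np.add.at(unbuffered, idx, 1)  # every occurrence is counted

print(buffered.tolist())    # [1, 1, 0]
print(unbuffered.tolist())  # [2, 1, 0]
```

The fancy-indexed += reads the selected values, adds 1, and writes them back, so the repeated index 0 is written twice with the same value instead of being incremented twice.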

Answer 2

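numpy has its own file I/O method in fromfile, which is the better choice if you want the output in numpy arrays. (See this question.)

It is probably better to use the structure that numpy arrays provide and make buckets a 2D array: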

buckets_per_signal_type = 2**(ctypes.c_ubyte(1).value * 8)
buckets = numpy.zeros((buckets_per_signal_type, buckets_per_signal_type), dtype=numpy.uint64)

Then just use np.add.at to increment the bins:

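# assumes numpy is imported as np
# define record_dtype to match your data (single unsigned bytes in the question)
record_dtype = np.uint8
while True:
    data_1 = np.fromfile(first_file_handle, dtype=record_dtype, count=n_bytes)
    data_2 = np.fromfile(second_file_handle, dtype=record_dtype, count=n_bytes)
    s = min(data_1.size, data_2.size)
    if s == 0:
        break
    # index with a tuple so the two arrays address (row, column) pairs
    np.add.at(buckets, (data_1[:s], data_2[:s]), 1)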
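To check the loop above end to end, here is a minimal self-contained sketch that runs it on two tiny temporary files. The helper name joint_byte_counts, its signature, the chunk_size default, and the sample bytes are all hypothetical and chosen only for illustration; it assumes single unsigned bytes as described in the question.

```python
import os
import tempfile

import numpy as np

def joint_byte_counts(path_1, path_2, chunk_size=1 << 20):
    """Return a 256x256 array of joint byte counts for two files.

    Hypothetical helper: name, signature and chunk_size default
    are illustrative only.
    """
    buckets = np.zeros((256, 256), dtype=np.uint64)
    with open(path_1, "rb") as f1, open(path_2, "rb") as f2:
        while True:
            data_1 = np.fromfile(f1, dtype=np.uint8, count=chunk_size)
            data_2 = np.fromfile(f2, dtype=np.uint8, count=chunk_size)
            s = min(data_1.size, data_2.size)
            if s == 0:
                break
            # A tuple of index arrays addresses (row, column) pairs
            np.add.at(buckets, (data_1[:s], data_2[:s]), 1)
    return buckets

# Tiny demonstration on two temporary files
with tempfile.TemporaryDirectory() as tmp:
    p1 = os.path.join(tmp, "a.bin")
    p2 = os.path.join(tmp, "b.bin")
    np.array([0, 0, 255, 7], dtype=np.uint8).tofile(p1)
    np.array([1, 1, 255, 9], dtype=np.uint8).tofile(p2)
    # chunk_size=3 forces more than one pass through the loop
    counts = joint_byte_counts(p1, p2, chunk_size=3)

print(counts[0, 1], counts[255, 255], counts[7, 9], counts.sum())
```

The pair (0, 1) occurs twice in the sample data, so it also exercises the duplicate handling that makes add.at necessary here.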
