Question

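Simple problem statement:

Is it possible to have arrays of custom-size data types (3/5/6/7 bytes) in C or Cython?

Background:

While trying to write a complex algorithm, I ran into a major memory inefficiency. The algorithm needs to store a mind-blowing amount of data. All the data is laid out in one contiguous block of memory (like an array). The data is simply a very long list of [usually] very large numbers. For a given set of numbers, the type of the numbers in this list/array is constant (they behave almost like a regular C array, in which all numbers are of the same type).

Problem:

Sometimes it is inefficient to store each number in a standard data size. The usual data types are char, short, int, long and so on. However, if I use an int array to store values that only need 3 bytes, I lose 1 byte of space on every number. That is extremely inefficient, and when you are storing millions of numbers the effect is memory-breaking. Unfortunately there is no other way to implement the solution to the algorithm, and I believe a rough implementation of custom data sizes is the only way.

What I have tried:

I have tried using char arrays for this task, but in most cases converting between the individual 0-255 value bytes and the larger data type they form is inefficient. Typically there is a mathematical way to take chars and pack them into a larger number, or to take a larger number and split out its individual chars. Here is an extremely inefficient algorithm of mine, written in Cython: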

def to_bytes(long long number, int length):
    cdef:
        list chars = []
        long long m
        long long d

    for _ in range(length):
        m = number % 256
        d = number // 256
        chars.append(m)
        number = d

    cdef bytearray binary = bytearray(chars)
    binary = binary[::-1]
    return binary

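def from_bytes(string):
    # interpret the bytes as a big-endian unsigned integer,
    # matching the most-significant-first order produced by to_bytes
    cdef long long d = int.from_bytes(bytes(string), 'big')
    return d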

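Bear in mind that I am not really looking for improvements to this algorithm, but for a fundamental way of declaring an array of a certain data type so that I do not have to do this conversion.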

Answer 1

In C, you can define a custom data type to handle the complexity of having an arbitrary byte size:

typedef struct byte3 { char x[3]; } byte3;  /* a C identifier cannot start with a digit */

You can then do all the nice things, such as passing it by value, getting the correct size from sizeof, and creating arrays of this type.
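To check the size claim without compiling anything, here is a minimal sketch that mirrors such a struct with Python's ctypes. The names Byte3, store and load are made up for illustration, and the little-endian byte order inside the struct is my assumption, not part of the answer.

```python
import ctypes

class Byte3(ctypes.Structure):
    """Mirror of a C `struct { unsigned char x[3]; }` (hypothetical name)."""
    _fields_ = [("x", ctypes.c_ubyte * 3)]

def store(value: int) -> Byte3:
    """Pack an unsigned value below 2**24 into three bytes (little-endian, an assumption)."""
    if not 0 <= value < 1 << 24:
        raise ValueError("value does not fit in 3 bytes")
    return Byte3.from_buffer_copy(value.to_bytes(3, "little"))

def load(item: Byte3) -> int:
    """Reassemble the three stored bytes into an integer."""
    return int.from_bytes(bytes(item.x), "little")

# One million elements occupy exactly 3,000,000 bytes: no per-element padding.
items = (Byte3 * 1_000_000)()
items[42] = store(0xABCDEF)
print(ctypes.sizeof(Byte3), ctypes.sizeof(items), hex(load(items[42])))
```

The element type has alignment 1, which is why the array stays dense; the price is that every access goes through a pack or unpack step.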

Answer 2

You can use a packed bit-field. On GCC, this looks like

typedef struct __attribute__((__packed__)) {
    int x : 24;
} int24;

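For an int24 x, x.x behaves much like a 24-bit int. You can make an array of these, and it will not have any unnecessary padding. Note that this will be slower than using ordinary ints; the data will not be aligned, and I do not think there are any 24-bit read instructions. The compiler will need to generate extra code for every read and store.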
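What "behaves much like a 24-bit int" means in practice is that values are cut down to 24 bits on store and sign-extended on load. Here is a rough Python model of that behaviour, assuming a two's-complement target where out-of-range stores wrap (as GCC does); the helper names are hypothetical.

```python
def store_int24(value: int) -> bytes:
    """Keep only the low 24 bits, as a store to a 24-bit field would on typical targets."""
    return (value & 0xFFFFFF).to_bytes(3, "little")

def load_int24(raw: bytes) -> int:
    """Read three bytes back as a signed two's-complement number."""
    return int.from_bytes(raw, "little", signed=True)

# In-range values survive the round trip; 2**23 is one past the maximum and wraps.
for v in (0, 1, -1, 2**23 - 1, -2**23, 2**23):
    print(v, "->", load_int24(store_int24(v)))
```

The usable range is therefore -8388608 to 8388607, and anything outside it is silently truncated.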

Answer 3

MrAlias and user both make good points, so why not combine them?

typedef union __attribute__((__packed__)) {
  int x : 24;
  char s[3];
} u3b;

typedef union __attribute__((__packed__)) {
  long long x : 56;
  char s[7];
} u7b;

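With a lot of data you may save some memory this way, but the code will almost certainly be slower because of the unaligned accesses it generates. For maximum efficiency you should expand the values to align with a standard integer length and operate on those (reading the array in multiples of 4 or 8).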

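You will then still have endianness issues, so if you need compatibility with both big- and little-endian machines, you will need to use the char part of the union on platforms where the data does not fit (the union only works for one type of endianness). For the other endianness you need something like: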

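/* the casts stop a signed char from sign-extending into the upper bits;
   the result is the unsigned 24-bit value, 0 .. 0xFFFFFF */
int x = (unsigned char)myu3b.s[0]|((unsigned char)myu3b.s[1]<<8)|((unsigned char)myu3b.s[2]<<16);
//or
int x = (unsigned char)myu3b.s[2]|((unsigned char)myu3b.s[1]<<8)|((unsigned char)myu3b.s[0]<<16);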

This approach may be just as fast after optimization (compiler dependent); if so, you can just use a char array and skip the union entirely.
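To make the two byte orders concrete, here is a small sketch of the same shift-and-or assembly in Python, checked against int.from_bytes. The function names are mine and only illustrate the idea.

```python
def assemble_le(s: bytes) -> int:
    """s[0] is the least significant byte (little-endian layout)."""
    return s[0] | (s[1] << 8) | (s[2] << 16)

def assemble_be(s: bytes) -> int:
    """s[0] is the most significant byte (big-endian layout)."""
    return s[2] | (s[1] << 8) | (s[0] << 16)

raw = bytes([0x12, 0x34, 0x56])
print(hex(assemble_le(raw)))  # 0x563412
print(hex(assemble_be(raw)))  # 0x123456
```

Indexing a Python bytes object always yields values from 0 to 255. In C the same expression only behaves this way if the bytes are treated as unsigned char.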

Answer 4

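I completely endorse the bit-field approach, just watch out for alignment issues. If you do a lot of random access, you may want to make sure you align with your cache and CPU architecture.

In addition, I would suggest looking into another approach:

You could decompress the data you need on the fly using, for example, zlib. If you expect many repeated values in the stream, this can significantly reduce I/O traffic as well as the memory footprint. (Assuming the need for random access is not too great.) See here for a quick tutorial on zlib.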
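A minimal sketch of the on-the-fly idea, using Python's zlib module: values are packed into 3-byte records, compressed in fixed-size blocks, and only the block holding the requested index is decompressed. The block size, the function names and the sample data are arbitrary choices of mine.

```python
import zlib

BLOCK = 4096  # values per compressed block (arbitrary for this sketch)

def pack_blocks(values, width=3):
    """Pack unsigned ints into `width`-byte little-endian records, compressed per block."""
    blocks = []
    for start in range(0, len(values), BLOCK):
        raw = b"".join(v.to_bytes(width, "little") for v in values[start:start + BLOCK])
        blocks.append(zlib.compress(raw))
    return blocks

def get(blocks, index, width=3):
    """Decompress only the block that holds `index` and extract one value."""
    raw = zlib.decompress(blocks[index // BLOCK])
    offset = (index % BLOCK) * width
    return int.from_bytes(raw[offset:offset + width], "little")

values = [i % 17 for i in range(20_000)]  # highly repetitive sample data
blocks = pack_blocks(values)
compressed_size = sum(len(b) for b in blocks)
print(compressed_size, "bytes compressed vs", 3 * len(values), "bytes packed")
print(get(blocks, 12345))
```

Every lookup pays for decompressing a whole block, so this only makes sense when accesses are mostly sequential or clustered, which is the caveat about random access above.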

Answer 5

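I think the important question is whether you need to access all of the data at the same time.

If you only need to access one chunk of data at a time

If you only need access to one array at a time, then one Pythonic possibility is to use NumPy arrays with data type uint8 and whatever width is required. When you need to operate on the data, you expand the compressed data (here, 3-octet numbers into uint32):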

import numpy as np

# in this example `compressed` is a Nx3 array of octets (`uint8`)
# the buffer must be uint8 as well, so that each row is exactly four octets
expanded = np.empty((compressed.shape[0], 4), dtype=np.uint8)
expanded[:,:3] = compressed
expanded[:, 3] = 0
expanded = expanded.view('uint32').reshape(-1)

Operations are then performed on expanded, which is a 1-d vector of N uint32 values.

When done, the data can be saved back:

# recompress
compressed[:] = expanded.view('uint8').reshape(-1,4)[:,:3]

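For the example above, the time taken in either direction is (on my machine, in Python) approximately 8 ns per element. Using Cython may not give much of a performance advantage here, because almost all of the time is spent copying data between buffers somewhere in the dark depths of NumPy.

This is a high one-time cost, but if you plan to access each element at least once, paying the one-time cost is probably cheaper than paying a similar cost on every operation.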
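If NumPy is not at hand, the same expand and recompress round trip can be sketched with the standard library alone, using strided slice copies on a bytearray. The function names are mine, and like the NumPy version this assumes a little-endian host where the native unsigned int is 4 bytes wide.

```python
import sys

def expand(compressed: bytes) -> memoryview:
    """Widen packed 3-octet values to 4 octets each and view them as native uint32."""
    if len(compressed) % 3:
        raise ValueError("length must be a multiple of 3")
    wide = bytearray(4 * (len(compressed) // 3))  # zero-filled: 4th octet is already 0
    for k in range(3):
        wide[k::4] = compressed[k::3]             # strided copy of octet k of every value
    return memoryview(wide).cast("I")

def recompress(expanded: memoryview) -> bytes:
    """Drop the padding octet of every value again."""
    wide = expanded.tobytes()
    out = bytearray(3 * (len(wide) // 4))
    for k in range(3):
        out[k::3] = wide[k::4]
    return bytes(out)

values = [1, 70_000, 0xFFFFFF]
compressed = b"".join(v.to_bytes(3, "little") for v in values)
expanded = expand(compressed)
if sys.byteorder == "little":
    print(list(expanded))      # the original values as ordinary 32-bit integers
expanded[1] += 1               # operate on the widened data in place
print(recompress(expanded).hex())
```

The three slice assignments do the same job as the column copy in the NumPy version: each one moves one octet position of every element in a single pass.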

Of course, the same approach can be taken in C:

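#include <stdint.h>     /* uint8_t, uint32_t */
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>   /* struct timeval, used inside struct rusage */
#include <sys/resource.h>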

#define NUMITEMS 10000000

int main(void)
    {
    uint32_t *expanded;
    uint8_t * cmpressed, *exp_as_octets;
    struct rusage ru0, ru1;
    uint8_t *ep, *cp, *end;
    double time_delta;

    // create some compressed data
    cmpressed = (uint8_t *)malloc(NUMITEMS * 3);

    getrusage(RUSAGE_SELF, &ru0);

    // allocate the buffer and copy the data
    exp_as_octets = (uint8_t *)malloc(NUMITEMS * 4);
    end = exp_as_octets + NUMITEMS * 4;
    ep = exp_as_octets;
    cp = cmpressed;
    while (ep < end)
        {
        // copy three octets out of four
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = 0;
        }
    expanded = (uint32_t *)exp_as_octets;

    getrusage(RUSAGE_SELF, &ru1);
    printf("Uncompress\n");
    time_delta = ru1.ru_utime.tv_sec + ru1.ru_utime.tv_usec * 1e-6 
               - ru0.ru_utime.tv_sec - ru0.ru_utime.tv_usec * 1e-6;
    printf("User: %.6lf seconds, %.2lf nanoseconds per element\n", time_delta, 1e9 * time_delta / NUMITEMS);
    time_delta = ru1.ru_stime.tv_sec + ru1.ru_stime.tv_usec * 1e-6 
               - ru0.ru_stime.tv_sec - ru0.ru_stime.tv_usec * 1e-6;
    printf("System: %.6lf seconds, %.2lf nanoseconds per element\n", time_delta, 1e9 * time_delta / NUMITEMS);

    getrusage(RUSAGE_SELF, &ru0);
    // compress back
    ep = exp_as_octets;
    cp = cmpressed;
    while (ep < end)
       {
       *cp++ = *ep++;
       *cp++ = *ep++;
       *cp++ = *ep++;
       ep++;
       }
    getrusage(RUSAGE_SELF, &ru1);
    printf("Compress\n");
    time_delta = ru1.ru_utime.tv_sec + ru1.ru_utime.tv_usec * 1e-6 
               - ru0.ru_utime.tv_sec - ru0.ru_utime.tv_usec * 1e-6;
    printf("User: %.6lf seconds, %.2lf nanoseconds per element\n", time_delta, 1e9 * time_delta / NUMITEMS);
    time_delta = ru1.ru_stime.tv_sec + ru1.ru_stime.tv_usec * 1e-6 
               - ru0.ru_stime.tv_sec - ru0.ru_stime.tv_usec * 1e-6;
    printf("System: %.6lf seconds, %.2lf nanoseconds per element\n", time_delta, 1e9 * time_delta / NUMITEMS);
    }

This reports:

Uncompress
 User: 0.022650 seconds, 2.27 nanoseconds per element
 System: 0.016171 seconds, 1.62 nanoseconds per element
Compress
 User: 0.011698 seconds, 1.17 nanoseconds per element
 System: 0.000018 seconds, 0.00 nanoseconds per element

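The code was compiled with gcc -Ofast and is probably relatively close to the maximum speed. The system time is spent in malloc. To my eye this looks pretty fast, as we are doing memory reads at 2-3 GB/s. (Which also means that while making the code multi-threaded would be easy, there may not be much of a speed benefit.)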

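If you want the best performance, you will need to code the compression/decompression routines for each data width separately. (I do not promise that the C code above is the absolute fastest on any machine; I have not looked at the machine code.)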

If you need random access to separate values

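If you instead need to access only one value here and another there, Python will not offer any reasonably fast method, because the array lookup overhead is huge.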

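In this case I suggest you create C routines to fetch the data and put it back. See technosaurus's answer. There are a lot of tricks, but the alignment problems cannot be avoided.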

One useful trick when reading an odd-sized array might be (here reading 3 octets from an octet array compressed into a uint32_t value):

value = *(uint32_t *)&compressed[3 * n] & 0x00ffffff;

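The possible misalignment is then left for the compiler and hardware to deal with, and the fourth octet read is garbage, which the mask discards. Unfortunately this cannot be used when writing values. And, again, this may or may not be faster or slower than any of the other alternatives.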
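The read-four-and-mask trick can be tried out safely in Python with struct.unpack_from, which also makes its one catch visible: the read for the last element reaches one octet past the packed data, so the buffer needs a spare octet at the end. The name read3 and the little-endian layout are assumptions of this sketch.

```python
import struct

def read3(buf: bytes, n: int) -> int:
    """Read value n from a packed array of 3-octet little-endian values.

    Four octets are fetched and the top one (the first octet of the next
    value) is masked off, so `buf` needs one spare octet after the last value.
    """
    (word,) = struct.unpack_from("<I", buf, 3 * n)
    return word & 0x00FFFFFF

values = [0x010203, 0xAABBCC, 0xFFFFFF]
buf = b"".join(v.to_bytes(3, "little") for v in values) + b"\x00"  # one octet of padding
print([hex(read3(buf, i)) for i in range(len(values))])
```

Without the padding octet, struct.unpack_from raises struct.error for the last element. The equivalent C read would instead touch memory past the end of the array.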

Answer 6

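Given how fast processors can chew through instructions, I was interested in how one might do this in general and still run in reasonable time.

The problem with packed bit-fields is that they are not standard, and they do not work for reading/writing across machines of different endianness. It seems to me that little-endian is just the thing for this... so, to get the benefit of solving the endianness problem, the trick seems to be to store things little-endian. Take, say, 5-byte integers: storing a little-endian value is straightforward, you just copy the first 5 bytes; loading is not quite so simple, because you have to sign-extend.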
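The asymmetry between storing and loading described above can be sketched in a few lines of Python. This is only an illustration of the idea with names of my own choosing, not a translation of the C code in this answer.

```python
def put_5b(value: int) -> bytes:
    """Store: keep the low five bytes of the two's-complement value, little-endian."""
    return (value & 0xFFFFFFFFFF).to_bytes(5, "little")

def get_5b(raw: bytes) -> int:
    """Load: assemble the five bytes, then sign-extend from bit 39 by hand."""
    value = int.from_bytes(raw, "little")
    if value & (1 << 39):      # top bit set means the stored number was negative
        value -= 1 << 40
    return value

for v in (0, 1, -1, 2**39 - 1, -2**39, -123_456_789_012):
    assert get_5b(put_5b(v)) == v
print(put_5b(-2).hex())
```

Storing is a plain truncation, while loading needs the extra test on bit 39, which is the sign-extension step the paragraph refers to.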

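The C code at the end of this answer handles arrays of 2-, 3-, 4- and 5-byte signed integers: (a) forcing little-endian, and (b) using packed bit-fields for comparison (see BIT_FIELD). As posted, it compiles under gcc on Linux (64-bit).

The code makes two bold assumptions:

Negative numbers are two's or ones' complement (not sign-and-magnitude)!
A structure with alignment == 1 can be read/written at any address, whatever the size of the structure.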

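main runs some tests and timings. It runs the same tests on large arrays: (a) 'flex' arrays, with integer lengths 2, 3, 4 and 5; and (b) simple arrays, with integer lengths 2, 4, 4 and 8. On my machine I got (compiled -O3, maximum optimisation):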

Arrays of 800 million entries -- not using bit-field
With 'flex' arrays of 10.4G bytes: took 20.160 secs: user 16.600 system 3.500
With simple arrays of 13.4G bytes: took 32.580 secs: user 14.680 system 4.910

Arrays of 800 million entries -- using bit-field
With 'flex' arrays of 10.4G bytes: took 22.280 secs: user 18.820 system 3.380
With simple arrays of 13.4G bytes: took 20.450 secs: user 14.450 system 4.620

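So, with reasonably general code the special-length integers do take longer, but perhaps not as badly as one might expect! The bit-field version turns out slower... I have not had time to dig into why.

So... it looks doable to me.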

/*==============================================================================
 * 2/3/4/5/... byte "integers" and arrays thereof.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>
#include <stddef.h>
#include <unistd.h>
#include <memory.h>
#include <stdio.h>
#include <sys/times.h>
#include <assert.h>

/*==============================================================================
 * General options
 */
#define BIT_FIELD 0             /* use bit-fields (or not)  */

#include <endian.h>
#include <byteswap.h>

#if __BYTE_ORDER == __LITTLE_ENDIAN
# define htole16(x) (x)
# define le16toh(x) (x)

# define htole32(x) (x)
# define le32toh(x) (x)

# define htole64(x) (x)
# define le64toh(x) (x)

#else
# define htole16(x) __bswap_16 (x)
# define le16toh(x) __bswap_16 (x)

# define htole32(x) __bswap_32 (x)
# define le32toh(x) __bswap_32 (x)

# define htole64(x) __bswap_64 (x)
# define le64toh(x) __bswap_64 (x)
#endif

typedef int64_t imax_t ;

/*------------------------------------------------------------------------------
 * 2 byte integer
 */
#if BIT_FIELD
typedef struct __attribute__((packed)) { int16_t  i : 2 * 8 ; } iflex_2b_t ;
#else
typedef struct { int8_t b[2] ; } iflex_2b_t ;
#endif

inline static int16_t
iflex_get_2b(iflex_2b_t item)
{
#if BIT_FIELD
  return item.i ;
#else
  union
  {
    int16_t     i ;
    iflex_2b_t  f ;
  } x ;

  x.f = item ;
  return le16toh(x.i) ;
#endif
} ;

inline static iflex_2b_t
iflex_put_2b(int16_t val)
{
#if BIT_FIELD
  iflex_2b_t x ;
  x.i = val ;
  return x ;
#else
  union
  {
    int16_t     i ;
    iflex_2b_t  f ;
  } x ;

  x.i = htole16(val) ;
  return x.f ;
#endif
} ;

/*------------------------------------------------------------------------------
 * 3 byte integer
 */
#if BIT_FIELD
typedef struct __attribute__((packed)) { int32_t  i : 3 * 8 ; } iflex_3b_t ;
#else
typedef struct { int8_t b[3] ; } iflex_3b_t ;
#endif

inline static int32_t
iflex_get_3b(iflex_3b_t item)
{
#if BIT_FIELD
  return item.i ;
#else
  union
  {
    int32_t     i ;
    int16_t     s[2] ;
    iflex_2b_t  t[2] ;
  } x ;

  x.t[0] = *((iflex_2b_t*)&item) ;
  x.s[1] = htole16(item.b[2]) ;

  return le32toh(x.i) ;
#endif
} ;

inline static iflex_3b_t
iflex_put_3b(int32_t val)
{
#if BIT_FIELD
  iflex_3b_t x ;
  x.i = val ;
  return x ;
#else
  union
  {
    int32_t     i ;
    iflex_3b_t  f ;
  } x ;

  x.i = htole32(val) ;
  return x.f ;
#endif
} ;

/*------------------------------------------------------------------------------
 * 4 byte integer
 */
#if BIT_FIELD
typedef struct __attribute__((packed)) { int32_t  i : 4 * 8 ; } iflex_4b_t ;
#else
typedef struct { int8_t b[4] ; } iflex_4b_t ;
#endif

inline static int32_t
iflex_get_4b(iflex_4b_t item)
{
#if BIT_FIELD
  return item.i ;
#else
  union
  {
    int32_t     i ;
    iflex_4b_t  f ;
  } x ;

  x.f = item ;
  return le32toh(x.i) ;
#endif
} ;

inline static iflex_4b_t
iflex_put_4b(int32_t val)
{
#if BIT_FIELD
  iflex_4b_t x ;
  x.i = val ;
  return x ;
#else
  union
  {
    int32_t     i ;
    iflex_4b_t  f ;
  } x ;

  x.i = htole32((int32_t)val) ;
  return x.f ;
#endif
} ;

/*------------------------------------------------------------------------------
 * 5 byte integer
 */
#if BIT_FIELD
typedef struct __attribute__((packed)) { int64_t  i : 5 * 8 ; } iflex_5b_t ;
#else
typedef struct { int8_t b[5] ; } iflex_5b_t ;
#endif

inline static int64_t
iflex_get_5b(iflex_5b_t item)
{
#if BIT_FIELD
  return item.i ;
#else
  union
  {
    int64_t     i ;
    int32_t     s[2] ;
    iflex_4b_t  t[2] ;
  } x ;

  x.t[0] = *((iflex_4b_t*)&item) ;
  x.s[1] = htole32(item.b[4]) ;

  return le64toh(x.i) ;
#endif
} ;

inline static iflex_5b_t
iflex_put_5b(int64_t val)
{
#if BIT_FIELD
  iflex_5b_t x ;
  x.i = val ;
  return x ;
#else
  union
  {
    int64_t     i ;
    iflex_5b_t  f ;
  } x ;

  x.i = htole64(val) ;
  return x.f ;
#endif
} ;

/*------------------------------------------------------------------------------
 *
 */
#define alignof(t) __alignof__(t)

/*==============================================================================
 * To begin at the beginning...
 */
int
main(int argc, char* argv[])
{
  int count = 800 ;

  assert(sizeof(iflex_2b_t)  == 2) ;
  assert(alignof(iflex_2b_t) == 1) ;
  assert(sizeof(iflex_3b_t)  == 3) ;
  assert(alignof(iflex_3b_t) == 1) ;
  assert(sizeof(iflex_4b_t)  == 4) ;
  assert(alignof(iflex_4b_t) == 1) ;
  assert(sizeof(iflex_5b_t)  == 5) ;
  assert(alignof(iflex_5b_t) == 1) ;

  clock_t at_start_clock, at_end_clock ;
  struct tms at_start_tms, at_end_tms ;
  clock_t ticks ;

  printf("Arrays of %d million entries -- %susing bit-field\n", count,
                                                      BIT_FIELD ? "" : "not ") ;
  count *= 1000000 ;

  iflex_2b_t* arr2 = malloc(count * sizeof(iflex_2b_t)) ;
  iflex_3b_t* arr3 = malloc(count * sizeof(iflex_3b_t)) ;
  iflex_4b_t* arr4 = malloc(count * sizeof(iflex_4b_t)) ;
  iflex_5b_t* arr5 = malloc(count * sizeof(iflex_5b_t)) ;

  size_t bytes = ((size_t)count * (2 + 3 + 4 + 5)) ;

  srand(314159) ;

  at_start_clock = times(&at_start_tms) ;

  for (int i = 0 ; i < count ; i++)
    {
      imax_t v5, v4, v3, v2, r ;

      v2 = (int16_t)(rand() % 0x10000) ;
      arr2[i] = iflex_put_2b(v2) ;

      v3 = (v2 * 0x100) | ((i & 0xFF) ^ 0x33) ;
      arr3[i] = iflex_put_3b(v3) ;

      v4 = (v3 * 0x100) | ((i & 0xFF) ^ 0x44) ;
      arr4[i] = iflex_put_4b(v4) ;

      v5 = (v4 * 0x100) | ((i & 0xFF) ^ 0x55) ;
      arr5[i] = iflex_put_5b(v5) ;

      r = iflex_get_2b(arr2[i]) ;
      assert(r == v2) ;

      r = iflex_get_3b(arr3[i]) ;
      assert(r == v3) ;

      r = iflex_get_4b(arr4[i]) ;
      assert(r == v4) ;

      r = iflex_get_5b(arr5[i]) ;
      assert(r == v5) ;
    } ;

  for (int i = count - 1 ; i >= 0 ; i--)
    {
      imax_t v5, v4, v3, v2, r, b ;

      v5 = iflex_get_5b(arr5[i]) ;
      b  = (i & 0xFF) ^ 0x55 ;
      assert((v5 & 0xFF) == b) ;
      r  = (v5 ^ b) / 0x100 ;

      v4 = iflex_get_4b(arr4[i]) ;
      assert(v4 == r) ;
      b  = (i & 0xFF) ^ 0x44 ;
      assert((v4 & 0xFF) == b) ;
      r  = (v4 ^ b) / 0x100 ;

      v3 = iflex_get_3b(arr3[i]) ;
      assert(v3 == r) ;
      b  = (i & 0xFF) ^ 0x33 ;
      assert((v3 & 0xFF) == b) ;
      r  = (v3 ^ b) / 0x100 ;

      v2 = iflex_get_2b(arr2[i]) ;
      assert(v2 == r) ;
    } ;

  at_end_clock  = times(&at_end_tms) ;

  ticks = sysconf(_SC_CLK_TCK) ;

  printf("With 'flex' arrays of %4.1fG bytes: "
                                  "took %5.3f secs: user %5.3f system %5.3f\n",
      (double)bytes / (double)(1024 *1024 *1024),
      (double)(at_end_clock - at_start_clock)                 / (double)ticks,
      (double)(at_end_tms.tms_utime - at_start_tms.tms_utime) / (double)ticks,
      (double)(at_end_tms.tms_stime - at_start_tms.tms_stime) / (double)ticks) ;

  free(arr2) ;
  free(arr3) ;
  free(arr4) ;
  free(arr5) ;

  int16_t* brr2 = malloc(count * sizeof(int16_t)) ;
  int32_t* brr3 = malloc(count * sizeof(int32_t)) ;
  int32_t* brr4 = malloc(count * sizeof(int32_t)) ;
  int64_t* brr5 = malloc(count * sizeof(int64_t)) ;

  bytes = ((size_t)count * (2 + 4 + 4 + 8)) ;

  srand(314159) ;

  at_start_clock = times(&at_start_tms) ;

  for (int i = 0 ; i < count ; i++)
    {
      imax_t v5, v4, v3, v2, r ;

      v2 = (int16_t)(rand() % 0x10000) ;
      brr2[i] = v2 ;

      v3 = (v2 * 0x100) | ((i & 0xFF) ^ 0x33) ;
      brr3[i] = v3 ;

      v4 = (v3 * 0x100) | ((i & 0xFF) ^ 0x44) ;
      brr4[i] = v4 ;

      v5 = (v4 * 0x100) | ((i & 0xFF) ^ 0x55) ;
      brr5[i] = v5 ;

      r = brr2[i] ;
      assert(r == v2) ;

      r = brr3[i] ;
      assert(r == v3) ;

      r = brr4[i] ;
      assert(r == v4) ;

      r = brr5[i] ;
      assert(r == v5) ;
    } ;

  for (int i = count - 1 ; i >= 0 ; i--)
    {
      imax_t v5, v4, v3, v2, r, b ;

      v5 = brr5[i] ;
      b  = (i & 0xFF) ^ 0x55 ;
      assert((v5 & 0xFF) == b) ;
      r  = (v5 ^ b) / 0x100 ;

      v4 = brr4[i] ;
      assert(v4 == r) ;
      b  = (i & 0xFF) ^ 0x44 ;
      assert((v4 & 0xFF) == b) ;
      r  = (v4 ^ b) / 0x100 ;

      v3 = brr3[i] ;
      assert(v3 == r) ;
      b  = (i & 0xFF) ^ 0x33 ;
      assert((v3 & 0xFF) == b) ;
      r  = (v3 ^ b) / 0x100 ;

      v2 = brr2[i] ;
      assert(v2 == r) ;
    } ;

  at_end_clock  = times(&at_end_tms) ;

  printf("With simple arrays of %4.1fG bytes: "
                                  "took %5.3f secs: user %5.3f system %5.3f\n",
      (double)bytes / (double)(1024 *1024 *1024),
      (double)(at_end_clock - at_start_clock)                 / (double)ticks,
      (double)(at_end_tms.tms_utime - at_start_tms.tms_utime) / (double)ticks,
      (double)(at_end_tms.tms_stime - at_start_tms.tms_stime) / (double)ticks) ;

  free(brr2) ;
  free(brr3) ;
  free(brr4) ;
  free(brr5) ;

  return 0 ;
} ;
