Array of arrays and filtering performance

Time: 2016-02-12 19:04:47

Tags: c++ arrays numpy

What is the most efficient way to filter an array of arrays? Let's take an example:

[
   [key,timestamp,value],
   [2,12211212440,3.98],
   [2,12211212897,3.78],
   ...,
   [3,12211212440,3.28],
   [3,12211212897,3.58],
   ...,
   [4,12211212440,4.98],
   ... about 1 million rows :-)
]

So let's imagine we have more than a million rows. I want to filter by key and timestamp: keep the rows whose key is in an array like [3,7,9,...] and whose timestamp lies between time1 and time2.

Of course, the first idea is to use a database, but for various reasons I want to do this directly inside the application rather than in third-party software.

So I'm wondering what the most efficient way is to perform this filtering on a large amount of data. C++ STL containers? Python NumPy? ...? But NumPy is implemented in C/C++, so I assume the most efficient approach would be C/C++.

Can you give me some good ideas?

2 answers:

Answer 0: (score: 1)

Well, if you really want to squeeze out every last bit of performance here, the answer is pretty much always to hand-code it in C/C++. But optimizing run-time performance usually means spending a lot longer writing the code, so you need to weigh that tradeoff for yourself.

But you're right: numpy is indeed written in C, and most of its processing happens at that level. As a result, the algorithms it implements are typically very close to the speed you could get by coding them yourself in C (or faster, because people have been working on numpy for a while). So you probably shouldn't bother re-coding anything you can find in numpy.

Now the question is whether or not numpy has a built-in way of doing what you want. Here's one way to do it:

import numpy as np
keys = np.array([2, 4])
t1 = 12211212440
t2 = 12211212897
d = np.array([[key,timestamp,value]
              for key in range(5)
              for timestamp in range(12211172430, 12211212997+1)
              for value in [3.98, 3.78, 3.28, 3.58, 4.98]])
filtered = d[(d[:, 1] >= t1) & (d[:, 1] <= t2)]  # Select rows in time span
filtered = filtered[np.in1d(filtered[:, 0], keys)]  # Select rows where key is in keys

It's not quite built-in, but two lines isn't bad. Note that this d I made up has a little over a million entries. On my laptop, this runs in around 4 milliseconds, which is around 4 nanoseconds per entry. That's almost as fast as I would expect to be able to do it in C/C++.
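The two masks can also be fused into a single boolean expression, which avoids materializing the intermediate array between the two selection steps. A minimal sketch on a small sample in the same `[key, timestamp, value]` layout (the sample rows are made up for illustration):

```python
import numpy as np

# Small sample in the same [key, timestamp, value] layout as above.
d = np.array([
    [2, 12211212440, 3.98],
    [2, 12211212897, 3.78],
    [3, 12211212440, 3.28],
    [5, 12211212500, 9.99],   # dropped: key not in `keys`
    [3, 12211172430, 3.58],   # dropped: timestamp before t1
])
keys = np.array([2, 3])
t1, t2 = 12211212440, 12211212897

# One combined mask: timestamp window AND key membership.
mask = (d[:, 1] >= t1) & (d[:, 1] <= t2) & np.in1d(d[:, 0], keys)
filtered = d[mask]
```

Whether the fused mask is actually faster than two passes depends on the selectivity of each condition, so it's worth timing both on your data.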

Of course, there are other ways to do it. Also, you might want to make a custom numpy dtype to handle the fact that you evidently have int, int, float, whereas numpy assumes float, float, float. If I use a numpy recarray to do this, I get about a 50% speedup. There's also pandas, which can handle that automatically, and has lots of selection functions. In particular, translating the above straight to pandas runs in a little under 4 milliseconds -- maybe pandas is more clever about something:

import pandas as pd
keys = [2, 4]
t1 = 12211212440
t2 = 12211212897
d_pandas = pd.DataFrame([(key,timestamp,value)
                         for key in range(5)
                         for timestamp in range(12211172430, 12211212997+1)
                         for value in [3.98, 3.78, 3.28, 3.58, 4.98]],
                        columns=['key', 'timestamp', 'value'])
filtered = d_pandas[(d_pandas.timestamp >= t1) & (d_pandas.timestamp <= t2)]
filtered = filtered[filtered.key.isin(keys)]  # Select rows where key is in keys
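For reference, the structured-dtype variant mentioned above, which keeps `key` and `timestamp` as integers instead of promoting everything to float64, might look like this (the field names are my own choice; the sample rows are made up):

```python
import numpy as np

# Structured dtype: int key, int timestamp, float value.
dtype = [('key', 'i8'), ('timestamp', 'i8'), ('value', 'f8')]
d = np.array([
    (2, 12211212440, 3.98),
    (2, 12211212897, 3.78),
    (3, 12211212440, 3.28),
    (5, 12211212500, 9.99),   # dropped: key not requested
    (3, 12211172430, 3.58),   # dropped: timestamp before t1
], dtype=dtype)
keys = np.array([2, 3])
t1, t2 = 12211212440, 12211212897

# Same filtering logic, addressing columns by field name.
mask = (d['timestamp'] >= t1) & (d['timestamp'] <= t2) & np.in1d(d['key'], keys)
filtered = d[mask]
```

Field access by name (`d['timestamp']`) reads one contiguous-stride column, which is where the speedup over the plain 2-D float array can come from.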

Of course, this isn't how I would code it in C++, so you could do better in principle. In particular, I would expect a slowdown from looping over the array two or three separate times; in C++ I would loop just once, check one condition, and if that's true check the second condition, and if that's also true keep the row. If you're already using python, you could also look into numba, which lets you write python but compiles it on the fly, so it's about as fast as C/C++. Even code as naive as this runs very quickly:

import numba as nb
import numpy as np

keys = np.array([2, 4], dtype='i8')
t1 = 12211212440
t2 = 12211212897
d_recarray = np.rec.array([(key,timestamp,value)
                           for key in range(5)
                           for timestamp in range(12211172430, 12211212997+1)
                           for value in [3.98, 3.78, 3.28, 3.58, 4.98]],
                          dtype=[('key', 'i8'), ('timestamp', 'i8'), ('value', 'f8')])

@nb.njit
def select_elements_recarray(d_in, keys, t1, t2):
    d_out = np.empty_like(d_in)
    k = 0
    for i in range(len(d_in)):
        if d_in[i].timestamp >= t1 and d_in[i].timestamp <= t2:
            matched_key = False
            for j in range(len(keys)):
                if d_in[i].key == keys[j]:
                    matched_key = True
                    break
            if matched_key:
                d_out[k] = d_in[i]
                k += 1
    return d_out[:k]

filtered = select_elements_recarray(d_recarray, keys, t1, t2)

The jit compilation takes a little time, though much less time than compiling a typical C code, and it also only has to happen once. And then the filtering runs in just over one millisecond on my laptop -- almost four times faster than the numpy code, and about one nanosecond per input array element. This is roughly as fast as I would expect to be able to do anything, even in C/C++. So I don't expect it would be worth your time optimizing this.

I suggest you try something like this because it's so short and fast (and done for you). Then, only if it's too slow, put in the work to do something better. I don't know what you're doing, but chances are this will not make up a large fraction of the time your code takes to run.
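If you want to check timings like these on your own machine before deciding whether more optimization is worth it, a minimal harness might look like the following. The data here is randomly generated in the same shape as the question's (the value distribution is my own assumption):

```python
import time
import numpy as np

# Hypothetical data set: ~1 million [key, timestamp, value] rows.
rng = np.random.default_rng(0)
n = 1_000_000
d = np.column_stack([
    rng.integers(0, 5, n),                      # key
    rng.integers(12211172430, 12211212998, n),  # timestamp
    rng.random(n),                              # value
]).astype('f8')
keys = np.array([2.0, 4.0])
t1, t2 = 12211212440, 12211212897

start = time.perf_counter()
mask = (d[:, 1] >= t1) & (d[:, 1] <= t2) & np.in1d(d[:, 0], keys)
filtered = d[mask]
elapsed = time.perf_counter() - start
print(f"{elapsed * 1e3:.1f} ms for {n} rows, {len(filtered)} kept")
```

For a meaningful number, run the filter a few times and take the best, since the first run may be dominated by cache warm-up.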

Answer 1: (score: 1)

Quick'n'dirty filtering in C, single-threaded, relying only on the operating system for buffering etc.:

#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <assert.h>
#include <time.h>

double const fromTime = 3.45;
double const toTime = 11.5;
unsigned const someKeys[] = {21, 42, 63, 84};

bool matchingKey(unsigned const key) {
  size_t i;
  for (i = 0; i < sizeof(someKeys) / sizeof(someKeys[0]); ++i) {
    if (someKeys[i] == key) {
      return true;
    }
  }
  return false;
}

bool matchingTime(double const time) {
  return (time >= fromTime) && (time < toTime);
}

int main() {
  char buffer[100];
  time_t start_time = time(NULL);
  clock_t start_clock = clock();
  while (fgets(buffer, 100, stdin) != NULL) {
    size_t const lineLength = strlen(buffer);
    assert(lineLength > 0);
    if (buffer[lineLength - 1] != '\n') {
      fprintf(stderr, "Line too long:\n%s\nBye!\n", buffer);
      return 1;
    }
    unsigned key;
    double timestamp;
    if (sscanf(buffer, "%u %lf", &key, &timestamp) != 2) {
      fprintf(stderr, "Failed to parse line:\n%sBye!\n", buffer);
      return 1; /* bail out: key/timestamp would be uninitialized below */
    }
    if (matchingTime(timestamp) && matchingKey(key)) {
      printf("%s", buffer);
    }
  }
  time_t end_time = time(NULL);
  clock_t end_clock = clock();
  fprintf(stderr, "time: %lf clock: %lf\n",
      difftime(end_time, start_time),
      (double) (end_clock - start_clock) / CLOCKS_PER_SEC);
  if (!feof(stdin)) {
    fprintf(stderr, "Something bad happened, bye!\n");
    return 1;
  }
  return 0;
}

There is plenty that could be optimized here, but first let's see how it runs:

$ clang -Wall -Wextra -pedantic -O2 generate.c -o generate
$ clang -Wall -Wextra -pedantic -O2 filter.c -o filter
$ ./generate | time ./filter > /dev/null 
time: 67.000000 clock: 66.535009
66.08user 0.45system 1:07.16elapsed 99%CPU (0avgtext+0avgdata 596maxresident)k
0inputs+0outputs (0major+178minor)pagefaults 0swaps

This is the result of filtering 100 million random rows. I ran it on a laptop with an Intel P6100, so nothing high-end.

That's roughly 670 nanoseconds per row. Note that you can't compare this directly with Mike's results: for one thing it (of course) depends on the performance available on the system (mainly the CPU), and for another my measurement includes the time the operating system (Linux x86-64) spends forwarding the data through the pipe, as well as the time spent parsing the data and printing it again.

For reference, here is the code of generate:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

unsigned randomKey(void) {
  return rand() % 128;
}

double randomTime(void) {
  double a = rand(), b = rand();
  return (a > b) ? (a / b) : (b / a);
}


int main() {
  unsigned long const rows = 100000000UL;
  srand(time(NULL));
  unsigned long r;
  for (r = 0; r < rows; ++r) {
    printf("%u %lf\n", randomKey(), randomTime());
  }
  return 0;
}