I need to run millions of queries of the following kind.
Each input consists of a small set (< 100) of boolean vectors of various sizes (< 20000 elements each), each with a few 1s and many 0s:
A = [ 0 0 0 1 0 0 0 0 0 0 0 ... ]
B = [ 0 0 0 0 1 0 ... ]
...
I also have many (> 20000) boolean AND expressions. These expressions are constant across all queries.
S[1] = A[10] AND B[52] AND F[15] AND U[2]
S[2] = I[8] AND Z[4]
...
Each expression may reference zero or one element from each vector. Variables rarely appear in more than one expression. For each query, the output is the set of expressions that evaluate to true.
What is a good algorithm for computing the queries quickly, ideally faster than evaluating each expression in turn? The algorithm needs to run once per input, and there are millions of inputs to run, so speed matters. Since the expressions are constant, I can optimize them ahead of time. I'm working in C.
Answer 0 (score: 5)
Return early. As soon as you find a false boolean, you know the AND expression will return false, so don't check the rest.
In C, you get this behavior by default in hard-coded boolean expressions:
(A[10] && B[52] && F[15] && U[2])
Depending on how predictable your inputs are, you may gain a lot of performance by sorting the variables within each expression by their probability of being false (most likely false first), and by similarly reordering the expressions themselves.
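Since the expressions are constant but defined at run time, they can't be hard-coded; a minimal sketch of the same short-circuit behavior with a preprocessed representation (the `Expr` struct and NULL-terminated term list are assumptions, not part of the question) might look like this:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout: each preprocessed expression is a
 * NULL-terminated array of pointers into the current query's input
 * vectors.  The preprocessing step would sort each term list so the
 * terms most likely to be false come first, maximizing the benefit
 * of the early exit. */
typedef struct {
    const bool **terms;  /* pointers into input vectors, NULL-terminated */
} Expr;

static bool eval_expr(const Expr *e)
{
    for (const bool **t = e->terms; *t != NULL; ++t) {
        if (!**t)
            return false;  /* early out: one false term decides the AND */
    }
    return true;
}
```

The pointer indirection stands in for the `A[10]`, `B[52]`-style references; a real implementation might instead store (vector id, index) pairs resolved against each query's input.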
Answer 1 (score: 4)
You seem to be using lots of data. It's a guess, but I'd say you'll get optimal behavior by preprocessing your expressions into cache optimal passes. Consider the two expressions given:
S[1] = A[10] AND B[52] AND F[15] AND U[2]
S[2] = I[8] AND Z[4]
rewrite these as:
S[1] = 1;
S[1] &= A[10];
S[1] &= B[52];
S[1] &= F[15];
S[1] &= U[2];
S[2] = 1;
S[2] &= I[8];
S[2] &= Z[4];
Then sort all of the expressions together to create one long list of operations:
S[1] = 1;
S[2] = 1;
S[1] &= A[10];
S[1] &= B[52];
S[1] &= F[15];
S[2] &= I[8];
S[1] &= U[2];
S[2] &= Z[4];
Consider the size of the machine cache on hand. We want all of the input vectors in cache. That probably can't happen so we know we will be pulling the input vectors and the result vectors in and out of memory multiple times. We want to partition the available machine cache into three parts: input vector chunk, result vector chunk, and some working space (where our current list of operations will be pulled from).
Now, walk the list of expressions, pulling out the expressions that fall into the A-I and S[1]-S[400] range. Then walk it again pulling J-T (or whatever fits in cache) and schedule those operations next; once you reach the end of the operations list, repeat for S[401]-S[800]. This is the final order of execution for the operations. Note that this can be parallelized without contention across the S bands.
The down side is that you do not get the early out behavior. The upside is you only have cache failures as you transition blocks of computation. For such a large data set I suspect this (and the elimination of all branching) will overwhelm the early out advantage.
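The flattened, branch-free operation list described above can be sketched as follows; the `Op` struct and `run_ops` function are assumptions about one possible representation, not the answer's exact design:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened form of the preprocessed expression list:
 * every AND term becomes one (result index, source pointer) pair,
 * and the whole list is executed straight through with no branches. */
typedef struct {
    int s;            /* index into the result vector S      */
    const bool *src;  /* pointer to one input vector element */
} Op;

static void run_ops(bool *S, size_t nS, const Op *ops, size_t nOps)
{
    for (size_t i = 0; i < nS; ++i)
        S[i] = true;                 /* the S[x] = 1 initialisation pass */
    for (size_t i = 0; i < nOps; ++i)
        S[ops[i].s] &= *ops[i].src;  /* branch-free accumulation         */
}
```

Sorting the `ops` array by source-vector region (the cache-blocking pass the answer describes) changes only the order of iterations, not the result, since `&=` is commutative and associative.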
If you still want to try to use the early out optimization, you can; it is just harder to implement. Consider: once you have your cache bracket A-I & S[1]-S[400], and you have created a list of operations across that bracket:
S[1] &= A[10];
S[1] &= B[52];
S[1] &= F[15];
S[2] &= I[8];
You can then reorder the operations to group them by S[x] (which this example already was). Now if you find A[10] is false you can "early out" to the S[2] block. As far as how to implement this? Well, your operations now need to know how many to skip forward from the current operation:
Operation[x ] => (S[1] &= A[10], on false, x+=3)
Operation[x+1] => (S[1] &= B[52], on false, x+=2)
Operation[x+2] => (S[1] &= F[15], on false, x+=1)
Operation[x+3] => (S[2] &= I[8]...
Again, I suspect simply adding the branching will be slower than just performing all of the other work. This is not a full early-out process, since when you move to the next input block chunk you'll have to reinspect each S[x] value accessed to determine whether it has already failed and should be skipped.
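The skip-forward scheme above can be sketched like this, with the skip count stored per operation as the answer suggests; the `SkipOp` struct and explicit `S[x] = false` write are assumptions about one way to realize it:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation record carrying a skip count: on a false
 * term, jump past the remaining operations of the same S[x] group
 * (the "on false, x += n" annotation from the operation list). */
typedef struct {
    int s;            /* result index                          */
    const bool *src;  /* input element                         */
    size_t skip;      /* increment to apply on a false term    */
} SkipOp;

/* S must be initialised to all-true before the first block, since a
 * group whose terms are all true never writes its S[x] entry. */
static void run_skip_ops(bool *S, const SkipOp *ops, size_t nOps)
{
    size_t i = 0;
    while (i < nOps) {
        if (*ops[i].src) {
            ++i;                  /* term true: fall through to next op */
        } else {
            S[ops[i].s] = false;  /* term false: whole AND is false     */
            i += ops[i].skip;     /* jump to the next S[x] group        */
        }
    }
}
```

As the answer notes, the data-dependent branch inside this loop is exactly what the branch-free version avoids, so whether it wins depends on how often terms are false and how well the branch predicts.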
Answer 2 (score: 2)
Answer 3 (score: 1)
I suggest you preprocess the expressions to generate:
Then, for each input: