I need to run millions of queries of the following kind.
Each input consists of a small set (< 100) of boolean vectors of various sizes (< 20000 elements each), each with a few 1s and many 0s:
A = [ 0 0 0 1 0 0 0 0 0 0 0 ... ]
B = [ 0 0 0 0 1 0 ... ]
...
I also have many (> 20000) boolean AND expressions. These expressions are constant across all queries.
S[1] = A[10] AND B[52] AND F[15] AND U[2]
S[2] = I[8] AND Z[4]
...
Each expression may reference zero or one element from each vector. Variables rarely appear in more than one expression. For each query, the output is the set of expressions that evaluate to true.
What is a good algorithm for computing the queries quickly, ideally faster than evaluating each expression in turn? The algorithm needs to run once per input, and there are millions of inputs to run, so speed matters. Since the expressions are constant, I can optimize them ahead of time. I'm working in C.
Answer 0 (score: 5)
Return early. As soon as you find a false boolean, you know the AND expression will return false, so don't check the rest.
In C, you get this behavior by default in hard-coded boolean expressions:
(A[10] && B[52] && F[15] && U[2])
Depending on how predictable your inputs are, you may gain a lot of performance by sorting the variables within each expression by their probability of being false (most likely false first), and by similarly reordering the expressions themselves.
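Since the expressions are constant but defined at run time, they can't be hard-coded; a minimal sketch of the same short-circuit behavior with a preprocessed representation (the `Expr` struct and NULL-terminated term list are assumptions, not part of the question) might look like this:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout: each preprocessed expression is a
 * NULL-terminated array of pointers into the current query's input
 * vectors.  The preprocessing step would sort each term list so the
 * terms most likely to be false come first, maximizing the benefit
 * of the early exit. */
typedef struct {
    const bool **terms;  /* pointers into input vectors, NULL-terminated */
} Expr;

static bool eval_expr(const Expr *e)
{
    for (const bool **t = e->terms; *t != NULL; ++t) {
        if (!**t)
            return false;  /* early out: one false term decides the AND */
    }
    return true;
}
```

The pointer indirection stands in for the `A[10]`, `B[52]`-style references; a real implementation might instead store (vector id, index) pairs resolved against each query's input.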
Answer 1 (score: 4)
You seem to be using lots of data. It's a guess, but I'd say you'll get optimal behavior by preprocessing your expressions into cache optimal passes. Consider the two expressions given:
S[1] = A[10] AND B[52] AND F[15] AND U[2]
S[2] = I[8] AND Z[4]
rewrite these as:
S[1] = 1;
S[1] &= A[10];
S[1] &= B[52];
S[1] &= F[15];
S[1] &= U[2];
S[2] = 1;
S[2] &= I[8];
S[2] &= Z[4];
Then sort all of the expressions together to create one long list of operations:
S[1] = 1;
S[2] = 1;
S[1] &= A[10];
S[1] &= B[52];
S[1] &= F[15];
S[2] &= I[8];
S[1] &= U[2];
S[2] &= Z[4];
Consider the size of the machine cache on hand. We want all of the input vectors in cache. That probably can't happen so we know we will be pulling the input vectors and the result vectors in and out of memory multiple times. We want to partition the available machine cache into three parts: input vector chunk, result vector chunk, and some working space (where our current list of operations will be pulled from).
Now, walk the list of expressions, pulling out the expressions that fall into the A-I and S[1]-S[400] range. Then walk it again pulling J-T (or whatever fits in cache) and schedule those operations next; once you reach the end of the operations list, repeat for S[401]-S[800]. This is the final order of execution for the operations. Note that this can be parallelized without contention across the S bands.
The down side is that you do not get the early out behavior. The upside is you only have cache failures as you transition blocks of computation. For such a large data set I suspect this (and the elimination of all branching) will overwhelm the early out advantage.
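The flattened, branch-free operation list described above can be sketched as follows; the `Op` struct and `run_ops` function are assumptions about one possible representation, not the answer's exact design:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened form of the preprocessed expression list:
 * every AND term becomes one (result index, source pointer) pair,
 * and the whole list is executed straight through with no branches. */
typedef struct {
    int s;            /* index into the result vector S      */
    const bool *src;  /* pointer to one input vector element */
} Op;

static void run_ops(bool *S, size_t nS, const Op *ops, size_t nOps)
{
    for (size_t i = 0; i < nS; ++i)
        S[i] = true;                 /* the S[x] = 1 initialisation pass */
    for (size_t i = 0; i < nOps; ++i)
        S[ops[i].s] &= *ops[i].src;  /* branch-free accumulation         */
}
```

Sorting the `ops` array by source-vector region (the cache-blocking pass the answer describes) changes only the order of iterations, not the result, since `&=` is commutative and associative.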
If you still want to try to use the early out optimization, you can; it is just harder to implement. Consider: once you have your cache bracket A-I & S[1]-S[400], and you have created a list of operations across that bracket:
S[1] &= A[10];
S[1] &= B[52];
S[1] &= F[15];
S[2] &= I[8];
You can then reorder the operations to group them by S[x] (which this example already was). Now if you find A[10] is false you can "early out" to the S[2] block. As far as how to implement this? Well, your operations now need to know how many to skip forward from the current operation:
Operation[x ] => (S[1] &= A[10], on false, x+=3)
Operation[x+1] => (S[1] &= B[52], on false, x+=2)
Operation[x+2] => (S[1] &= F[15], on false, x+=1)
Operation[x+3] => (S[2] &= I[8]...
Again, I suspect simply adding the branching will be slower than just performing all of the other work. This is not a full early-out process, since when you move to the next input block chunk you'll have to reinspect each S[x] value accessed to determine whether it has already failed and should be skipped.
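The skip-forward scheme above can be sketched like this, with the skip count stored per operation as the answer suggests; the `SkipOp` struct and explicit `S[x] = false` write are assumptions about one way to realize it:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation record carrying a skip count: on a false
 * term, jump past the remaining operations of the same S[x] group
 * (the "on false, x += n" annotation from the operation list). */
typedef struct {
    int s;            /* result index                          */
    const bool *src;  /* input element                         */
    size_t skip;      /* increment to apply on a false term    */
} SkipOp;

/* S must be initialised to all-true before the first block, since a
 * group whose terms are all true never writes its S[x] entry. */
static void run_skip_ops(bool *S, const SkipOp *ops, size_t nOps)
{
    size_t i = 0;
    while (i < nOps) {
        if (*ops[i].src) {
            ++i;                  /* term true: fall through to next op */
        } else {
            S[ops[i].s] = false;  /* term false: whole AND is false     */
            i += ops[i].skip;     /* jump to the next S[x] group        */
        }
    }
}
```

As the answer notes, the data-dependent branch inside this loop is exactly what the branch-free version avoids, so whether it wins depends on how often terms are false and how well the branch predicts.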
Answer 2 (score: 2)
Answer 3 (score: 1)
I suggest you preprocess the expressions to generate:
Then, for each input: