Question

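I have millions of unstructured 3D vectors associated with arbitrary values, making a set of 4D vectors. To make it easier to understand: I have unixtime stamps associated with hundreds of thousands of 3D vectors, and I have many timestamps, making a very large dataset of upwards of 30 million vectors.

I need to search particular datasets for specific timestamps.

So let's say I have the following data: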

Timestamp 1407633943:

(0,24,58,1407633943)
(9,2,59,1407633943)

...

Timestamp 1407729456:

(40,1,31,1407729456)
(3,5,7,1407729456)

...

etc.

I wish to make very fast queries along the lines of:

Query Example 1:

Give me vectors between:

X > 4 && X < 9 && Y > -29 && Y < 100 && Z > 0.58 && Z < 0.99

Give me a list of those vectors, so I can find the timestamps.

Query Example 2:

Give me vectors between:

X > 4 && X < 9 && Y > -29 && Y < 100 && Z > 0.58 && Z < 0.99 && W (timestamp) = 1407729456

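So far I've used SQLite for the task, but even after indexing the columns, each query takes between 500 ms and 7 s. I'm looking for a solution that takes somewhere between 50 ms and 200 ms per query.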
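For reference, the SQLite baseline described above might look roughly like the following sketch. The table name `vectors`, the column names, and the sample rows are assumptions made for illustration only; they are not the actual schema or data.

```python
import sqlite3

# Hypothetical schema: one row per 4D vector, w holds the unix timestamp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vectors (x REAL, y REAL, z REAL, w INTEGER)")
for col in ("x", "y", "z", "w"):
    # One plain index per column ("column indexing").
    conn.execute(f"CREATE INDEX idx_{col} ON vectors ({col})")

# Made-up sample rows, chosen so that the example box matches some of them.
rows = [
    (5.0, 24.0, 0.60, 1407633943),
    (9.0, 2.0, 0.59, 1407633943),
    (6.0, 1.0, 0.70, 1407729456),
    (3.0, 5.0, 0.80, 1407729456),
]
conn.executemany("INSERT INTO vectors VALUES (?, ?, ?, ?)", rows)

BOX = "x > 4 AND x < 9 AND y > -29 AND y < 100 AND z > 0.58 AND z < 0.99"

# Query example 1: every vector inside the box, for any timestamp.
query1 = conn.execute(
    f"SELECT x, y, z, w FROM vectors WHERE {BOX} ORDER BY w"
).fetchall()

# Query example 2: the same box restricted to a single timestamp.
query2 = conn.execute(
    f"SELECT x, y, z, w FROM vectors WHERE {BOX} AND w = ?", (1407729456,)
).fetchall()
```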

What sort of structures or techniques can I use to speed up the queries?

Thank you.

Answer 1

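kd-trees can be helpful here. Range search in a kd-tree is a well-known problem. The time complexity of one query depends on the output size, of course (in the worst case, if all vectors fit the query, the whole tree is traversed). But on average it can work pretty fast.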
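A minimal sketch of such a range search, assuming the tree is built over X, Y and Z only and the timestamp is carried along as a payload. The names `Node`, `build` and `range_search` are made up for this illustration, and a production version would avoid re-sorting at every level.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float, int]  # (x, y, z, timestamp)


class Node:
    def __init__(self, point, axis, left, right):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a 3-d tree, splitting on x, y, z in turn at the median."""
    if not points:
        return None
    axis = depth % 3
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))


def range_search(node, lo, hi, out):
    """Append to `out` every point with lo[i] < p[i] < hi[i] for i in 0..2."""
    if node is None:
        return
    p, a = node.point, node.axis
    if all(lo[i] < p[i] < hi[i] for i in range(3)):
        out.append(p)
    # The left subtree only holds values <= p[a] on this axis.
    if lo[a] < p[a]:
        range_search(node.left, lo, hi, out)
    # The right subtree only holds values >= p[a] on this axis.
    if hi[a] > p[a]:
        range_search(node.right, lo, hi, out)


# Made-up sample data for illustration.
data = [(5.0, 24.0, 0.60, 1407633943), (9.0, 2.0, 0.59, 1407633943),
        (6.0, 1.0, 0.70, 1407729456), (3.0, 5.0, 0.80, 1407729456)]
tree = build(data)
hits = []
range_search(tree, (4, -29, 0.58), (9, 100, 0.99), hits)   # query example 1
at_time = [p for p in hits if p[3] == 1407729456]          # query example 2
```

Filtering by timestamp after the spatial search is only one option; building one tree per timestamp, or treating the timestamp as a fourth axis, are alternatives with different trade-offs.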

Answer 2

I would use an octree. In each node I would store arrays of vectors in a hashtable, using the timestamp as the key.
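One way this could look, as a rough sketch: the class name `Octree`, the leaf capacity, the depth limit and the root bounding box are all assumptions for illustration, and every inserted point is assumed to lie inside the root box.

```python
from collections import defaultdict


class Octree:
    """Octree whose leaves keep {timestamp: [(x, y, z), ...]} hash tables."""

    def __init__(self, lo, hi, capacity=8, depth=0, max_depth=12):
        self.lo, self.hi = lo, hi              # corners of this node's box
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.buckets = defaultdict(list)       # timestamp -> points (leaves only)
        self.count = 0
        self.children = None                   # list of 8 sub-boxes once split

    def _child_index(self, p):
        mid = [(a + b) / 2 for a, b in zip(self.lo, self.hi)]
        return sum(1 << i for i in range(3) if p[i] >= mid[i])

    def _split(self):
        mid = [(a + b) / 2 for a, b in zip(self.lo, self.hi)]
        self.children = []
        for k in range(8):
            lo = tuple(mid[i] if (k >> i) & 1 else self.lo[i] for i in range(3))
            hi = tuple(self.hi[i] if (k >> i) & 1 else mid[i] for i in range(3))
            self.children.append(
                Octree(lo, hi, self.capacity, self.depth + 1, self.max_depth))
        old, self.buckets = self.buckets, defaultdict(list)
        for t, pts in old.items():             # push stored points down one level
            for p in pts:
                self.children[self._child_index(p)].insert(p, t)

    def insert(self, p, t):
        if self.children is not None:
            self.children[self._child_index(p)].insert(p, t)
            return
        self.buckets[t].append(p)
        self.count += 1
        if self.count > self.capacity and self.depth < self.max_depth:
            self._split()

    def query(self, lo, hi, t=None):
        """Return (x, y, z, timestamp) with lo < p < hi, optionally for one timestamp."""
        # Skip boxes that cannot contain a point strictly inside the query box.
        if any(self.hi[i] <= lo[i] or self.lo[i] >= hi[i] for i in range(3)):
            return []
        if self.children is not None:
            return [r for c in self.children for r in c.query(lo, hi, t)]
        items = self.buckets.items() if t is None else [(t, self.buckets.get(t, []))]
        return [(*p, ts) for ts, pts in items for p in pts
                if all(lo[i] < p[i] < hi[i] for i in range(3))]
```

With this layout, query example 1 is `tree.query(lo, hi)` and query example 2 is `tree.query(lo, hi, t=1407729456)`, where the hash lookup skips all points of other timestamps inside each visited leaf.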

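To further increase performance, you can use CUDA, OpenCL, OpenACC or OpenMP and implement the algorithms so that they execute in parallel on the GPU or on a multi-core CPU.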

Answer 3

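BKaun: please accept my attempt at giving you some insight into the problem at hand. I suppose you have thought of every one of my points, but maybe seeing them here will help.

Regardless of how the ingest data is presented, consider that, using the C programming language, you can reduce the storage size of the data to minimize both space and search time. You will be searching for, loading, and parsing single bits of a vector instead of, say, a SHORT INT, which is 2 bytes for every entry, or a FLOAT, which is much more. The object, as I understand it, is to search the given data for given values of X, Y, and Z and then find the timestamp associated with these 3, while optimizing the search. My solution does not go into the search itself, but merely into the data that is used in a search.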

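To illustrate my hints simply, I'm considering that the data consists of 4 vectors:

X is between -2 and 7,
Y is between 0.17 and 3.08,
Z is between 0 and 50,
timestamp (many, all of the same size: 10 digits)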

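To optimize, consider how many different numbers each vector can contain:

1. X can be only 10 numbers (including 0)
2. Y can be 3.08 minus 0.17 = 2.91 x 100 = 291 numbers (292 if both endpoints are counted)
3. Z can be 51 numbers
4. timestamp can be many (but in this case, you are not searching for a certain one)

Consider how each variable could be stored in binary:

1. Each entry in Vector X COULD be stored in 4 bits, using the first bit = 1 for the negative sign:

             7="0111"
             6="0110"
             5="0101"
             4="0100"
             3="0011"
             2="0010"
             1="0001"
             0="0000"
            -1="1001"
            -2="1010"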

     However, the original data that you are searching through may range 
        from -10 to 20!
     Therefore, adding another 2 bits gives you a table like this:  
            -10="101010"
             -9="101001" ...
             ...
             -2="100010"
             -1="100001" ...
             ...
              8="001000"
              9="001001" ...
             ...
             19="010011"
             20="010100"

    And that's only 6 bits to store each X vector entry for integers from -10 to 20
    For search purposes on a range of -10 to 20, there are 31 different X Vector entries
        possible to search through.
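The sign-and-magnitude table above can be reproduced with a few lines of code. This is only a sketch; the function names are made up for illustration and the 6-bit width is the one used in the table.

```python
def encode_sign_magnitude(value: int, bits: int = 6) -> str:
    """Return `value` as one sign bit followed by (bits - 1) magnitude bits."""
    magnitude_bits = bits - 1
    if abs(value) >= 1 << magnitude_bits:
        raise ValueError(f"{value} does not fit in {bits} bits")
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{magnitude_bits}b")


def decode_sign_magnitude(code: str) -> int:
    """Inverse of encode_sign_magnitude()."""
    magnitude = int(code[1:], 2)
    return -magnitude if code[0] == "1" else magnitude


for v in (-10, -1, 8, 20):
    print(v, encode_sign_magnitude(v))
```

Note that with one sign bit and five magnitude bits, the format could represent any integer from -31 to 31, so the range -10 to 20 leaves some codes unused.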

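2. Each entry in Vector Y COULD be stored in 9 bits (no extra sign bit is needed). The 1's and 0's COULD be stored (accessed, really) in 2 parts: the integer part and the 2 decimal digits.

   Part 1 can be 0, 1, 2, or 3 (four 2-bit numbers from "00" to "11"). However, if the range of the entire Y dataset is 0 to 10, Part 1 can be 0, 1, ... 9, 10 (eleven 4-bit numbers from "0000" to "1010").

   Part 2 can be 00, 01, ... 98, 99 (one hundred 7-bit numbers from "0000000" to "1100011").

   Total storage for Vector Y entries is 4 + 7 = 11 bits in the range 00.00 to 10.99. For search purposes on a range of 00.00 to 10.99, there are 1100 different Y Vector entries possible to search through (11 x 100).

3. Each entry in Vector Z in the range of 0 to 50 COULD be stored in 6 bits ("000000" to "110010"). Again, the actual data range may be 7 bits long (for simplicity's sake): 0 to 64 ("0000000" to "1000000").

   For search purposes on a range of 0 to 64, there are 65 different Z Vector entries possible to search through.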

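Consider that you will store the data in this optimized format, as a succession of bits:

X = 4 bits + 2 range bits = 6 bits
+ Y = 4 bits Part 1 and 7 bits Part 2 = 11 bits
+ Z = 7 bits
+ timestamp (10 digits, each from 0 to 9 ("0000" to "1001"), 4 bits each = 40 bits)

= TOTAL BITS: 6 + 11 + 7 + 40 = 64 stored bits for each 4D vector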
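To make the 64-bit layout concrete, here is a hedged sketch of packing and unpacking one record. The field order (X, Y Part 1, Y Part 2, Z, timestamp from high bits to low bits) and the names `pack` and `unpack` are assumptions for illustration. Python is used here only for brevity; the same shifts and masks carry over directly to a `uint64_t` in C. Inputs are assumed to be in range already (integer X within -31..31, Y within 0.00..15.99, integer Z within 0..127, a 10-digit timestamp).

```python
X_BITS, Y1_BITS, Y2_BITS, Z_BITS, DIGITS = 6, 4, 7, 7, 10


def pack(x: int, y: float, z: int, timestamp: int) -> int:
    """Pack one 4D vector into 64 bits: X | Y part 1 | Y part 2 | Z | BCD timestamp."""
    # X: sign bit followed by 5 magnitude bits.
    x_code = ((1 << (X_BITS - 1)) if x < 0 else 0) | abs(x)
    # Y: integer part and the two decimal digits, stored separately.
    y1, y2 = divmod(round(y * 100), 100)
    # Timestamp: binary-coded decimal, 4 bits per digit.
    bcd = 0
    for digit in f"{timestamp:010d}":
        bcd = (bcd << 4) | int(digit)
    word = x_code
    for value, width in ((y1, Y1_BITS), (y2, Y2_BITS), (z, Z_BITS), (bcd, 4 * DIGITS)):
        word = (word << width) | value
    return word


def unpack(word: int):
    """Inverse of pack(): peel the fields off from the low bits upward."""
    bcd, word = word & ((1 << 40) - 1), word >> 40
    z, word = word & 0x7F, word >> 7
    y2, word = word & 0x7F, word >> 7
    y1, word = word & 0x0F, word >> 4
    x = -(word & 0x1F) if word & 0x20 else word & 0x1F
    timestamp = int(f"{bcd:010x}")   # BCD nibbles read back as decimal digits
    return x, round(y1 + y2 / 100, 2), z, timestamp


record = pack(-2, 3.08, 50, 1407729456)
print(f"{record:064b}")
print(unpack(record))
```

Storing the timestamp as plain binary would need only 31 bits for current unix times instead of 40; the BCD form is kept here because it is what the layout above describes.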

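THE SEARCH:

Input xx, yy, zz to search for in arrays X, Y, and Z (which are stored in binary). Change xx, yy, and zz to binary bit strings per the optimized format above.

function(xx, yy, zz)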

    Search for X first, since it has 31 possible outcomes (range is -10 to 20) 
       - the lowest number of any array
    First search for positive targets (there are 8 of them and better chance 
        of finding one)
         These all start with "000" 
             7="000111"
             6="000110"
             5="000101"
             4="000100"
             3="000011"
             2="000010"
             1="000001"
             0="000000"
          So you can check if the first 3 bits = "000".  If so, you have a number
          between 0 and 7.
             Found: search for Z
                Else search for xx=-2 or -1: does X = -2="100010" or -1="100001" ? 
                  (do second because there are only 2 of them)
                     Found: Search for Z
          NotFound: next X

    Search for Z after X is Found: (Z second, since it has 65 possible outcomes 
     - range is 0 to 64)
          You are searching for 6 bits of a 7 bit binary number
               ("0000000" to "1000000")  If bits 1,2,3,4,5,6 are all "0", analyze bit 0.  
                 If it is "1" (it's 64), next Z
                     Else begin searching 6 bits ("000000" to "110010") with LSB first
                        Found: Search for Y 
                        NotFound: Next X

    Search for Y (Y last, since it has 1100 possible outcomes - range is 0.00 to 10.99)
           Search for Part 1 (integer part) bits (you are searching for 
            "0000", "0001", "0010" or "0011" only, so use yyPt1=YPt1)
                Found: Search for Part 2 ("0000000" to "1100100") using yyPt2=YPt2 
                (direct comparison)
                    Found:  Print out X, Y, Z, and timestamp
                NotFound: Search criteria for X, Y, and Z not found in data.  
                    Print X,Y,Z,"timestamp not found". Ask for new X, Y, Z. New search.
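The search order above (X first, then Z, then Y) can be sketched as a linear scan over packed 64-bit records. The bit positions assume the fields are laid out from high bits to low bits as X, Y Part 1, Y Part 2, Z, timestamp; that order, the helper `make_record` and the function `search` are assumptions for illustration. Like the outline above, this is an exact-match search rather than a range search, and it collapses the per-bit checks into one masked comparison per field.

```python
X_SHIFT, Y1_SHIFT, Y2_SHIFT, Z_SHIFT = 58, 54, 47, 40
TIMESTAMP_MASK = (1 << 40) - 1


def make_record(x_code, y1, y2, z, bcd_timestamp):
    """Assemble a 64-bit record from already-encoded fields (hypothetical helper)."""
    return ((x_code << X_SHIFT) | (y1 << Y1_SHIFT) | (y2 << Y2_SHIFT)
            | (z << Z_SHIFT) | bcd_timestamp)


def search(records, xx, yy1, yy2, zz):
    """Return the BCD timestamps of records whose X, Z and Y fields all match."""
    hits = []
    for rec in records:
        if (rec >> X_SHIFT) & 0x3F != xx:      # X first: fewest distinct values
            continue
        if (rec >> Z_SHIFT) & 0x7F != zz:      # then Z
            continue
        if (rec >> Y1_SHIFT) & 0x0F != yy1:    # then Y part 1 (integer part)
            continue
        if (rec >> Y2_SHIFT) & 0x7F != yy2:    # finally Y part 2 (two decimals)
            continue
        hits.append(rec & TIMESTAMP_MASK)
    return hits


# A hex literal whose digits are 0-9 is exactly the BCD form of that decimal number.
records = [
    make_record(0b000111, 3, 8, 50, 0x1407729456),    # X=7,  Y=3.08, Z=50
    make_record(0b100010, 0, 17, 12, 0x1407633943),   # X=-2, Y=0.17, Z=12
]
found = search(records, 0b000111, 3, 8, 50)
print([f"{t:010x}" for t in found] or "timestamp not found")
```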

Data structure for fast range searches in a dense dataset of 4D vectors
