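Fast algorithm to create the union of multiple intervals

Time: 2018-11-19 09:33:15

Tags: python intervals

I have a very simple problem and data structure, but the volume of data is so large that I need to find an efficient approach.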

Suppose I have objects that each have an interval as attributes. For example:

        start      stop
obj1      5          10
obj2      8          12
obj3      11         14
obj4      13         20
obj5      22         25
obj6      24         30
obj7      33         37
obj8      36         40

I want to merge them so that overlapping intervals become a single object. For the example above, the merged intervals would be 5 to 20, 22 to 30, and 33 to 40.

I am using Python for this. Note that I have many thousands of such records.

3 answers:

Answer 0 (score: 1)

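import pandas as pd

# Example data from the question. Assumes rows are sorted by 'start'
# and that 'stop' is also non-decreasing (no interval nested inside another).
df = pd.DataFrame(
    {'start': [5, 8, 11, 13, 22, 24, 33, 36],
     'stop': [10, 12, 14, 20, 25, 30, 37, 40]},
    index=[f'obj{i}' for i in range(1, 9)])

df['Startpoint'] = df['stop'].shift() < df['start']  # row begins a merged interval
df.loc[df.index[0], 'Startpoint'] = True  # the first row is always a start point
# A row ends a merged interval when the next row begins one; the last row always does.
df['Endpoint'] = df['Startpoint'].shift(-1, fill_value=True)

df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)].copy()
# An end row takes its start from the preceding kept row, unless it is itself a start row.
df2['start'] = df2['start'].shift().where(~df2['Startpoint'], df2['start'])
df2.loc[df2['Endpoint'], ['start', 'stop']]

  #        start  stop
  #  obj4    5.0    20
  #  obj6   22.0    30
  #  obj8   33.0    40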

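Find all the start and end rows of the merged intervals, keep only those rows, then shift the start values down by one row so that both values of each merged interval end up in the same row.

This is all pandas, so I think it should be fast.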
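As a variation on the same idea, here is a minimal sketch that groups rows instead of shifting values. It is not the code from this answer: it swaps in a running maximum of `stop` (`cummax`) so that an interval nested inside an earlier, longer one is still merged. The DataFrame below just mirrors the question's table, and the names `running_stop`, `group_id` and `merged_intervals` are made up for illustration.

```python
import pandas as pd

# Example data mirroring the question's table (assumed layout: one row per object).
df = pd.DataFrame(
    {"start": [5, 8, 11, 13, 22, 24, 33, 36],
     "stop": [10, 12, 14, 20, 25, 30, 37, 40]},
    index=[f"obj{i}" for i in range(1, 9)],
)

df = df.sort_values("start")

# Largest stop seen in all earlier rows (NaN for the first row).
running_stop = df["stop"].cummax().shift()

# A row opens a new merged interval when it starts after everything seen so far.
# The first row compares against NaN, which is False, so it lands in group 0.
new_group = df["start"] > running_stop
group_id = new_group.cumsum().rename("group")

# One output row per group: earliest start and latest stop.
merged = df.groupby(group_id).agg(start=("start", "min"), stop=("stop", "max"))
merged_intervals = [(int(a), int(b)) for a, b in zip(merged["start"], merged["stop"])]
```

For the example data this should give the three merged intervals (5, 20), (22, 30) and (33, 40). Intervals that merely touch (start equal to the previous stop) are merged too; change `>` to `>=` if they should stay separate.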

Answer 1 (score: 0)

When the intervals are sorted by start, this simple function should work in linear time:

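def merge_intervals(intervals):
    """Merge overlapping intervals; `intervals` must be sorted by start."""
    if not intervals:
        return []
    result = []
    (start_candidate, stop_candidate) = intervals[0]
    for (start, stop) in intervals[1:]:
        if start <= stop_candidate:
            # Overlaps the current candidate: extend it.
            stop_candidate = max(stop, stop_candidate)
        else:
            # Gap found: emit the candidate and start a new one.
            result.append((start_candidate, stop_candidate))
            (start_candidate, stop_candidate) = (start, stop)
    result.append((start_candidate, stop_candidate))
    return result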

intervals = [
    ( 5, 10),
    ( 8, 12),
    (11, 14),
    (13, 20),
    (22, 25),
    (24, 30),
    (33, 37),
    (36, 40),
]

assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]
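If the input is not already sorted, or if you need to know which objects ended up in each merged interval, the same sweep can be adapted. The sketch below is only an illustration under assumptions: the objects are given as a dict of name to (start, stop), and the function name `merge_named_intervals` is made up here. Sorting first makes the overall cost O(n log n) rather than linear.

```python
def merge_named_intervals(named_intervals):
    """Merge overlapping intervals given as {name: (start, stop)}.

    Returns a list of (start, stop, names) tuples, ordered by start.
    """
    merged = []  # each entry is a mutable [start, stop, names] list
    # Sort by (start, stop) so a single left-to-right pass is enough.
    for name, (start, stop) in sorted(named_intervals.items(), key=lambda item: item[1]):
        if merged and start <= merged[-1][1]:
            # Overlaps the last merged interval: extend it and record the name.
            merged[-1][1] = max(merged[-1][1], stop)
            merged[-1][2].append(name)
        else:
            merged.append([start, stop, [name]])
    return [(start, stop, names) for start, stop, names in merged]


# Deliberately shuffled version of the question's data.
objects = {
    "obj3": (11, 14), "obj1": (5, 10), "obj6": (24, 30), "obj2": (8, 12),
    "obj5": (22, 25), "obj4": (13, 20), "obj8": (36, 40), "obj7": (33, 37),
}
result = merge_named_intervals(objects)
```

Here `result` holds three entries, one per merged interval, each with the list of object names that were folded into it.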

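Answer 2 (score: 0)

The fastest way to handle this kind of data is to use a union-find (disjoint-set) data structure, which keeps track of a set of elements partitioned into a number of disjoint subsets. I will leave the design and implementation of the data structure to you, since there are efficient ways to implement a disjoint-set structure that runs in linear time.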
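Since the answer leaves the implementation open, here is one possible sketch of the union-find idea, not a definitive version. The class `DisjointSet` and the function `merge_with_union_find` are hypothetical names chosen for this example. It sorts the intervals by start, unions each interval with the one that currently reaches furthest whenever they overlap, then reads off the bounds of each set. Note that the sort makes the whole thing O(n log n); the union-find operations themselves are close to constant time each.

```python
class DisjointSet:
    """Union-find with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger one
        self.size[ra] += self.size[rb]


def merge_with_union_find(intervals):
    """Return the merged (start, stop) intervals, sorted by start."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    sets = DisjointSet(len(intervals))
    reach = None  # index of the interval with the largest stop seen so far
    for i in order:
        if reach is not None and intervals[i][0] <= intervals[reach][1]:
            sets.union(i, reach)
            if intervals[i][1] > intervals[reach][1]:
                reach = i
        else:
            reach = i  # no overlap: this interval starts a new set
    # Collect the bounding interval of every set.
    bounds = {}
    for i, (start, stop) in enumerate(intervals):
        root = sets.find(i)
        low, high = bounds.get(root, (start, stop))
        bounds[root] = (min(low, start), max(high, stop))
    return sorted(bounds.values())


merged = merge_with_union_find(
    [(5, 10), (8, 12), (11, 14), (13, 20), (22, 25), (24, 30), (33, 37), (36, 40)]
)
```

For plain one-dimensional intervals the sorted sweep already determines the groups, so the disjoint-set structure mainly pays off if you later need to ask which merged group a given object belongs to (via `find`).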