Question

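I have two dataframes, each holding information about events that have a start and an end time. The problem is that the two dataframes have different start and end times, because they measure different things. What I want to do is create new events that contain the information from both. These have to be split wherever either dataframe has a boundary. For example: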

Dataframe A:

Start                End
2016-12-30 18:51:00  2016-12-30 19:37:00
2016-12-30 20:03:00  2016-12-30 20:11:00
2016-12-30 20:12:00  2016-12-30 21:02:00
2016-12-30 21:02:00  2016-12-30 21:04:00
2016-12-30 21:10:00  2016-12-30 21:12:00
2016-12-30 21:12:00  2016-12-30 21:32:00

Dataframe B:

Start                End
2016-12-30 18:33:45  2016-12-30 19:18:00
2016-12-30 19:18:00  2016-12-30 19:38:00
2016-12-30 19:38:00  2016-12-30 19:48:00
2016-12-30 19:48:00  2016-12-30 20:15:45
2016-12-30 20:15:45  2016-12-30 20:35:45
2016-12-30 20:35:45  2016-12-30 20:45:45
2016-12-30 20:45:45  2016-12-30 21:14:30
2016-12-30 21:14:30  2016-12-30 21:35:00

For these, the ideal output would be:

Start                End
2016-12-30 18:51:00  2016-12-30 19:18:00
2016-12-30 19:18:00  2016-12-30 19:37:00
2016-12-30 20:03:00  2016-12-30 20:11:00
2016-12-30 20:12:00  2016-12-30 20:15:45
2016-12-30 20:15:45  2016-12-30 20:35:45
2016-12-30 20:35:45  2016-12-30 20:45:45
2016-12-30 20:45:45  2016-12-30 21:12:00
2016-12-30 21:12:00  2016-12-30 21:14:30
2016-12-30 21:14:30  2016-12-30 21:32:00

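I know a few ways to do this. I could explode the dataframes to minute level and merge on the minute, but each dataframe has 2 million+ rows, so that would be a very slow process.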

I also have SQL that does this, but when I tried to run it, it took so long that the DBA killed the process.

The SQL does the following:

select
a.UNIQUE_ID,
a,
b,
c,
d,
CASE WHEN b.START < a.START THEN a.START
ELSE b.START END AS START,
CASE WHEN b.END > a.END THEN a.END
ELSE b.END END AS END
from
(select
UNIQUE_ID,
START,
END,
a,
b
from table_1
) a
    join
(select
UNIQUE_ID,
START,
END,
c,
d
from table_2
) b
on 1=1
AND a.UNIQUE_ID = b.UNIQUE_ID
AND ((b.START between a.START and a.END)
or (b.END between a.START and a.END)
or (b.START < a.START and b.END > a.END)
or (a.START < b.START and a.END > b.END)
)

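For each unique_id, this creates one row for every pairwise combination of start and end times that share at least one minute. It then uses the CASE statements to trim each row down to the shared minutes.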
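For illustration only, the pair-then-trim logic described above could be sketched in pandas on a tiny input. This assumes pandas 1.2 or later (for `how='cross'`), uses a hypothetical miniature of the example data, and leaves out the UNIQUE_ID match. It builds every pairwise combination, so it is not a solution for millions of rows. Unlike the inclusive `between` in the SQL, it uses a strict comparison so that intervals that merely touch do not produce zero-length rows.

```python
import pandas as pd

# Hypothetical miniature inputs: the first rows of the example frames
df_a = pd.DataFrame({
    'Start': pd.to_datetime(['2016-12-30 18:51:00', '2016-12-30 20:03:00']),
    'End': pd.to_datetime(['2016-12-30 19:37:00', '2016-12-30 20:11:00']),
})
df_b = pd.DataFrame({
    'Start': pd.to_datetime(['2016-12-30 18:33:45', '2016-12-30 19:18:00',
                             '2016-12-30 19:38:00', '2016-12-30 19:48:00']),
    'End': pd.to_datetime(['2016-12-30 19:18:00', '2016-12-30 19:38:00',
                           '2016-12-30 19:48:00', '2016-12-30 20:15:45']),
})

# Every row of A paired with every row of B (the "on 1=1" part of the SQL)
pairs = df_a.merge(df_b, how='cross', suffixes=('_a', '_b'))

# The two CASE expressions: later of the two starts, earlier of the two ends
pairs['Start'] = pairs[['Start_a', 'Start_b']].max(axis=1)
pairs['End'] = pairs[['End_a', 'End_b']].min(axis=1)

# A pair overlaps only if the trimmed interval has positive length
overlaps = (pairs.loc[pairs['Start'] < pairs['End'], ['Start', 'End']]
            .reset_index(drop=True))
print(overlaps)
```

The memory cost grows with the product of the two row counts, which is the same reason the SQL version is slow.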

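I can't think of an efficient way to replicate this SQL in Python with pandas. The only merge functions I know of in pandas need equal values in the key columns; they can't express a join like the one I'm using.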

Is there a type of merge in pandas that I could use to do something like:

AND ((b.START between a.START and a.END)
or (b.end between a.START and a.END)
or (b.START < a.START and b.end > a.end)
or (a.START < b.START and a.end > b.end)
)

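The only other option I can think of is to loop over every row in DF A, slice DF B down to just the rows that share minutes with that row, merge the two slices, and concatenate all of those merges into a new DF, but that would take a very long time.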

Any help is appreciated.

Answer 1

I'll use an implementation I wrote for another question that is similar to yours:

import pandas as pd

df_a = pd.DataFrame({'Start': ['2016-12-30 18:51:00',
                               '2016-12-30 20:03:00',
                               '2016-12-30 20:12:00',
                               '2016-12-30 21:02:00',
                               '2016-12-30 21:10:00',
                               '2016-12-30 21:12:00'],
                     'End': ['2016-12-30 19:37:00',
                             '2016-12-30 20:11:00',
                             '2016-12-30 21:02:00',
                             '2016-12-30 21:04:00',
                             '2016-12-30 21:12:00',
                             '2016-12-30 21:32:00']})
df_b = pd.DataFrame({'Start': ['2016-12-30 18:33:45',
                               '2016-12-30 19:18:00',
                               '2016-12-30 19:38:00',
                               '2016-12-30 19:48:00',
                               '2016-12-30 20:15:45',
                               '2016-12-30 20:35:45',
                               '2016-12-30 20:45:45',
                               '2016-12-30 21:14:30'],
                     'End': ['2016-12-30 19:18:00',
                             '2016-12-30 19:38:00',
                             '2016-12-30 19:48:00',
                             '2016-12-30 20:15:45',
                             '2016-12-30 20:35:45',
                             '2016-12-30 20:45:45',
                             '2016-12-30 21:14:30',
                             '2016-12-30 21:35:00']})

# Convert the strings to datetime
df_a['Start'] = pd.to_datetime(df_a['Start'], format='%Y-%m-%d %H:%M:%S')
df_a['End'] = pd.to_datetime(df_a['End'], format='%Y-%m-%d %H:%M:%S')
df_b['Start'] = pd.to_datetime(df_b['Start'], format='%Y-%m-%d %H:%M:%S')
df_b['End'] = pd.to_datetime(df_b['End'], format='%Y-%m-%d %H:%M:%S')

# Create labels for the two datasets
# These labels will help determine the overlaps downstream
df_a['Label'] = 'a'
df_b['Label'] = 'b'

# With the labels created, I can concatenate the dataframes now
df_concat = pd.concat([df_a, df_b])
df_concat = df_concat[['Label', 'Start', 'End']]  # Ordering the columns

# Convert the dataframe to a NumPy record array (it iterates like a list of tuples)
df_concat_rec = df_concat.to_records(index=False)

# Here's where I'm using my answer that I had used in the other question
timelist_new = []
for time in df_concat_rec:
    timelist_new.append((time[0], time[1], 'begin'))
    timelist_new.append((time[0], time[2], 'end'))

timelist_new = sorted(timelist_new, key=lambda x: x[1])

key = None
keylist = set()
aggregator = []

for idx in range(len(timelist_new[:-1])):
    t1 = timelist_new[idx]
    t2 = timelist_new[idx + 1]
    t1_key = str(t1[0])
    t2_key = str(t2[0])
    t1_dt = t1[1]
    t2_dt = t2[1]
    t1_pointer = t1[2]
    t2_pointer = t2[2]

    if t1_dt == t2_dt:
        keylist.add(t1_key)
        keylist.add(t2_key)
    elif t1_dt < t2_dt:
        if t1_pointer == 'begin':
            keylist.add(t1_key)
        if t1_pointer == 'end':
            keylist.discard(t1_key)

    key = ','.join(sorted(keylist))
    aggregator.append((key, t1_dt, t2_dt))

# Keep only the rows where both labels are active (the key is 'a,b', so it is
# longer than one character) and where the start and end times differ
filtered = [x for x in aggregator if len(x[0]) > 1 and x[1] != x[2]]

# Convert the list of tuples back to dataframe
final_df = pd.DataFrame.from_records(filtered, columns=['Label', 'Start', 'End'])

# Print final dataframe
print(final_df)

Output:

   Label               Start                 End
0    a,b 2016-12-30 18:51:00 2016-12-30 19:18:00
1    a,b 2016-12-30 19:18:00 2016-12-30 19:37:00
2    a,b 2016-12-30 20:03:00 2016-12-30 20:11:00
3    a,b 2016-12-30 20:12:00 2016-12-30 20:15:45
4    a,b 2016-12-30 20:15:45 2016-12-30 20:35:45
5    a,b 2016-12-30 20:35:45 2016-12-30 20:45:45
6    a,b 2016-12-30 20:45:45 2016-12-30 21:02:00
7    a,b 2016-12-30 21:02:00 2016-12-30 21:04:00
8    a,b 2016-12-30 21:10:00 2016-12-30 21:12:00
9    a,b 2016-12-30 21:12:00 2016-12-30 21:14:30
10   a,b 2016-12-30 21:14:30 2016-12-30 21:32:00
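
An alternative sketch, not the method used in the answer above: if the intervals within each frame do not overlap one another (true for the example data, an assumption otherwise), `numpy.searchsorted` can find, for every row of A, the contiguous run of B rows that overlap it, without building all pairwise combinations. The helper name `intersect_intervals` is hypothetical, and the sketch ignores the UNIQUE_ID grouping from the question's SQL.

```python
import numpy as np
import pandas as pd

def intersect_intervals(a, b):
    """Intersect two interval frames with 'Start' and 'End' columns.

    Assumes the intervals inside each frame do not overlap each other,
    so that sorting by Start also sorts End.
    """
    a = a.sort_values('Start').reset_index(drop=True)
    b = b.sort_values('Start').reset_index(drop=True)
    a_start, a_end = a['Start'].to_numpy(), a['End'].to_numpy()
    b_start, b_end = b['Start'].to_numpy(), b['End'].to_numpy()

    # Index of the first B row that ends after the A row starts
    lo = np.searchsorted(b_end, a_start, side='right')
    # Index of the first B row that starts at or after the A row ends
    hi = np.searchsorted(b_start, a_end, side='left')
    counts = np.maximum(hi - lo, 0)

    # Repeat each A index once per overlapping B row, and enumerate lo..hi-1
    a_idx = np.repeat(np.arange(len(a)), counts)
    group_start = np.repeat(np.cumsum(counts) - counts, counts)
    b_idx = np.repeat(lo, counts) + (np.arange(counts.sum()) - group_start)

    # Trim each pair to the shared span
    return pd.DataFrame({
        'Start': np.maximum(a_start[a_idx], b_start[b_idx]),
        'End': np.minimum(a_end[a_idx], b_end[b_idx]),
    })

df_a = pd.DataFrame({
    'Start': pd.to_datetime(['2016-12-30 18:51:00', '2016-12-30 20:03:00',
                             '2016-12-30 20:12:00', '2016-12-30 21:02:00',
                             '2016-12-30 21:10:00', '2016-12-30 21:12:00']),
    'End': pd.to_datetime(['2016-12-30 19:37:00', '2016-12-30 20:11:00',
                           '2016-12-30 21:02:00', '2016-12-30 21:04:00',
                           '2016-12-30 21:12:00', '2016-12-30 21:32:00']),
})
df_b = pd.DataFrame({
    'Start': pd.to_datetime(['2016-12-30 18:33:45', '2016-12-30 19:18:00',
                             '2016-12-30 19:38:00', '2016-12-30 19:48:00',
                             '2016-12-30 20:15:45', '2016-12-30 20:35:45',
                             '2016-12-30 20:45:45', '2016-12-30 21:14:30']),
    'End': pd.to_datetime(['2016-12-30 19:18:00', '2016-12-30 19:38:00',
                           '2016-12-30 19:48:00', '2016-12-30 20:15:45',
                           '2016-12-30 20:35:45', '2016-12-30 20:45:45',
                           '2016-12-30 21:14:30', '2016-12-30 21:35:00']),
})

result = intersect_intervals(df_a, df_b)
print(result)
```

To honor a per-ID match, one option would be to apply the helper to each UNIQUE_ID group separately and concatenate the pieces.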

Pandas: joining two dataframes on unequal start and end times