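I have a problem I'm struggling to solve. I have millions of rows, and I need to flag overlapping dates between the current row and the previous row. Rows are grouped by 'KEY', and within each group I need to flag rows whose 'Date1' overlaps the previous row's 'Date2'.
A row overlaps when the second row's Date1 is less than the previous row's Date2 and greater than or equal to the previous row's Date1.
Put simply: if the second row's Date1 falls between the previous row's Date1 and Date2, flag both rows as overlapping. FYI, on any given row Date1 is never greater than Date2.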
prev row Date1 <= second row Date1 < prev row Date2
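For reference, the pairwise rule above can be written as a small predicate. This is only a sketch: the function name `overlaps_previous` is made up for illustration, and the dates are taken from the sample data in this question.

```python
from datetime import date


def overlaps_previous(prev_date1: date, prev_date2: date, date1: date) -> bool:
    """Return True when date1 lies in the half-open range [prev_date1, prev_date2)."""
    return prev_date1 <= date1 < prev_date2


# Second row starts one day after the previous row starts: overlap
print(overlaps_previous(date(2012, 5, 6), date(2012, 6, 10), date(2012, 5, 7)))   # True
# Second row starts exactly on the previous row's Date2: no overlap (strict <)
print(overlaps_previous(date(2012, 5, 6), date(2012, 5, 28), date(2012, 5, 28)))  # False
```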
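The hard part I can't figure out is that this step has to be done sequentially. That is, if the second row in the group is flagged, the next row in the group (row 3) is then compared with the first row (in which case the first row is also flagged as overlapping row 2).
Here is a data set: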
import pandas as pd
import numpy as np

df = pd.DataFrame({'KEY': ['100000003', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'],
'Date1': [20120506, 20120506, 20120507,20120608,20120620,20120206,20120304,20120405],
'Date2': [20120528, 20120610, 20120615,20120629,20120621,20120305,20120506,20120506]})
df['Date1'] = pd.to_datetime(df["Date1"], format='%Y%m%d')
df['Date2'] = pd.to_datetime(df["Date2"], format='%Y%m%d')
df.sort_values(by=['KEY','Date1','Date2'], inplace=True)
df[['KEY','Date1','Date2']]
         KEY      Date1      Date2
0  100000003 2012-05-06 2012-05-28
1  100000009 2012-05-06 2012-06-10
2  100000009 2012-05-07 2012-06-15
3  100000009 2012-06-08 2012-06-29
4  100000009 2012-06-20 2012-06-21
5  100000034 2012-02-06 2012-03-05
6  100000034 2012-03-04 2012-05-06
7  100000034 2012-04-05 2012-05-06
Because there are millions of rows and the groups vary in size, I wrote a for loop that iterates only as many times as there are rows in the largest groupby group.
for item in range(df.groupby('KEY')['KEY'].count().max()):
    df['PrevDate1'] = df.groupby('KEY')['Date1'].shift(1)
    df['PrevDate2'] = df.groupby('KEY')['Date2'].shift(1)
    # note: between() includes both endpoints by default
    df['Overlapping_Hospitalizations'] = np.where(df['Date1'].between(df['PrevDate1'], df['PrevDate2']), 'Y', 'N')
print("DONE")
df
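This works against the previous row for each KEY, but I also need each row compared with the initial row that caused the overlap in that group.
Expected result: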
         KEY      Date1      Date2 OverlappingFlag
0  100000003 2012-05-06 2012-05-28               N
1  100000009 2012-05-06 2012-06-10               Y
2  100000009 2012-05-07 2012-06-15               Y
3  100000009 2012-06-08 2012-06-29               Y
4  100000009 2012-06-20 2012-06-21               Y
5  100000034 2012-02-06 2012-03-05               Y
6  100000034 2012-03-04 2012-05-06               Y
7  100000034 2012-04-05 2012-05-06               Y
Edit: both overlapping rows need to be flagged. See the expected result.
Final answer:
# Every pass computes the same result, so a single pass would be enough;
# the loop is kept as in the original post.
for item in range(df.groupby('KEY')['KEY'].count().max()):
    df['overlap'] = (((df['KEY'] == df['KEY'].shift()) &
                      (df['Date1'] >= df['Date1'].shift(1)) &
                      (df['Date1'] < df['Date2'].shift(1))) |
                     ((df['KEY'] == df['KEY'].shift(-1)) &
                      (df['Date1'].shift(-1) >= df['Date1']) &
                      (df['Date1'].shift(-1) < df['Date2'])))
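Note that the expression above only compares adjacent rows. If a later row also has to be checked against an earlier, longer range in the same KEY (the "sequential" requirement described in the question), one possible sketch is to compare each Date1 with the latest Date2 seen so far in the group and flag every chain with more than one row. This is a different technique from the code above, not the original poster's method; the function name `flag_overlap_chains` and the small sample frame are assumptions made for illustration.

```python
import pandas as pd


def flag_overlap_chains(df: pd.DataFrame) -> pd.Series:
    """Flag rows that belong to a chain of overlapping date ranges within a KEY."""
    df = df.sort_values(["KEY", "Date1", "Date2"])
    # Latest Date2 among all earlier rows of the same KEY (NaT for the first row)
    prev_max_end = df.groupby("KEY")["Date2"].transform(lambda s: s.cummax().shift())
    # A new chain starts on the first row of a KEY, or when Date1 is not before prev_max_end
    starts_chain = ~(df["Date1"] < prev_max_end)
    chain_id = starts_chain.cumsum()
    # Rows in a chain of size > 1 overlap at least one other row
    return chain_id.map(chain_id.value_counts()) > 1


sample = pd.DataFrame({
    "KEY": ["A", "A", "A", "B"],
    "Date1": pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-05", "2012-01-01"]),
    "Date2": pd.to_datetime(["2012-01-10", "2012-01-03", "2012-01-06", "2012-01-02"]),
})
sample["overlap"] = flag_overlap_chains(sample)
print(sample["overlap"].tolist())  # [True, True, True, False]
```

The third row of KEY "A" starts after the second row ends, so an adjacent-row comparison would not flag it, but it still falls inside the first row's range and is flagged here.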
Answer 0 (score: 1)
I'm not clear on your logic or on the significance of KEY; how about this?
import pandas as pd
import numpy as np
df = pd.DataFrame({'KEY': ['100000003', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'],
'Date1': [20120506, 20120506, 20120507,20120608,20120620,20120206,20120304,20120405],
'Date2': [20120528, 20120610, 20120615,20120629,20120621,20120305,20120506,20120506]})
df['Date1'] = pd.to_datetime(df["Date1"], format='%Y%m%d')
df['Date2'] = pd.to_datetime(df["Date2"], format='%Y%m%d')
df.sort_values(by=['KEY','Date1','Date2'], inplace=True)
df.set_index('KEY', inplace = True)
df['overlap'] = ((df.Date1 > df.Date1.shift()) & \
(df.Date1 < df.Date2.shift())) | \
((df.Date1 < df.Date1.shift(-1)) & \
(df.Date2 < df.Date2.shift(-1)))
Output:
               Date1      Date2  overlap
KEY
100000003 2012-05-06 2012-05-28    False
100000009 2012-05-06 2012-06-10     True
100000009 2012-05-07 2012-06-15     True
100000009 2012-06-08 2012-06-29     True
100000009 2012-06-20 2012-06-21     True
100000034 2012-02-06 2012-03-05     True
100000034 2012-03-04 2012-05-06     True
100000034 2012-04-05 2012-05-06     True
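One thing to watch for with a plain shift() is that it also compares the first row of one KEY with the last row of the previous KEY. The sketch below uses a made-up two-row frame (not data from the question) and only the "previous row" half of the expression above, to contrast a plain shift with a group-aware one.

```python
import pandas as pd

df = pd.DataFrame({
    "KEY": ["A", "B"],
    "Date1": pd.to_datetime(["2012-05-06", "2012-05-07"]),
    "Date2": pd.to_datetime(["2012-05-28", "2012-06-15"]),
})

# Plain shift(): row "B" is compared with row "A" even though the keys differ
plain = (df["Date1"] > df["Date1"].shift()) & (df["Date1"] < df["Date2"].shift())

# Group-aware shift(): the first row of every KEY has no previous row (NaT),
# and comparisons against NaT evaluate to False
g = df.groupby("KEY")
grouped = (df["Date1"] > g["Date1"].shift()) & (df["Date1"] < g["Date2"].shift())

print(plain.tolist())    # [False, True]
print(grouped.tolist())  # [False, False]
```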
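Answer 1 (score: 1)
It looks like the expected result in the question doesn't match the definition:
"Rows are grouped by 'KEY', and within each group I need to flag rows whose 'Date1' overlaps the previous row's 'Date2'."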
         KEY      Date1      Date2 OverlappingFlag
0  100000003 2012-05-06 2012-05-28               N
1  100000009 2012-05-06 2012-06-10               Y  # probably not
2  100000009 2012-05-07 2012-06-15               Y
3  100000009 2012-06-08 2012-06-29               Y
4  100000009 2012-06-20 2012-06-21               Y
5  100000034 2012-02-06 2012-03-05               Y  # probably not
6  100000034 2012-03-04 2012-05-06               Y
7  100000034 2012-04-05 2012-05-06               Y
An extension of @Evan's code for this case:
import pandas as pd
import numpy as np
df = pd.DataFrame({'KEY': ['100000003', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'],
'Date1': [20120506, 20120506, 20120507,20120608,20120620,20120206,20120304,20120405],
'Date2': [20120528, 20120610, 20120615,20120629,20120621,20120305,20120506,20120506]})
df['Date1'] = pd.to_datetime(df["Date1"], format='%Y%m%d')
df['Date2'] = pd.to_datetime(df["Date2"], format='%Y%m%d')
df.sort_values(by=['KEY','Date1','Date2'], inplace=True)
# if KEY is already an index, df = df.reset_index()
# df.set_index('KEY', inplace = True)
# this is really the only part changed
df['overlap'] = (((df.KEY == df.KEY.shift()) &
                  (df.Date1 < df.Date2.shift())) |       # starts before the previous row ends
                 ((df.KEY == df.KEY.shift(-1)) &
                  (df.Date1.shift(-1) < df.Date2)))      # next row starts before this row ends
df.set_index('KEY', inplace = True)
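To make the two readings concrete, here is a sketch that computes both on the question's data: a strict flag (only the later row of an overlapping pair, as in the quoted definition) and a flag on both rows (as in the question's edit). It uses a group-aware shift instead of comparing KEY with its shifted self; the variable names `later`, `earlier`, and `both` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "KEY": ["100000003", "100000009", "100000009", "100000009", "100000009",
            "100000034", "100000034", "100000034"],
    "Date1": pd.to_datetime(["20120506", "20120506", "20120507", "20120608", "20120620",
                             "20120206", "20120304", "20120405"], format="%Y%m%d"),
    "Date2": pd.to_datetime(["20120528", "20120610", "20120615", "20120629", "20120621",
                             "20120305", "20120506", "20120506"], format="%Y%m%d"),
}).sort_values(["KEY", "Date1", "Date2"])

g = df.groupby("KEY")

# Strict reading: this row's Date1 lies in [previous Date1, previous Date2)
later = (df["Date1"] >= g["Date1"].shift()) & (df["Date1"] < g["Date2"].shift())

# The mirror condition: the next row's Date1 lies in [this Date1, this Date2)
next_date1 = g["Date1"].shift(-1)
earlier = (next_date1 >= df["Date1"]) & (next_date1 < df["Date2"])

# Reading from the question's edit: flag both rows of every overlapping pair
both = later | earlier

print(later.tolist())  # [False, False, True, True, True, False, True, True]
print(both.tolist())   # [False, True, True, True, True, True, True, True]
```

Under the strict reading, rows 1 and 5 are not flagged because they are the first rows of their groups, which is what the "probably not" comments above point at.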