迭代两个pandas数据帧之间的日期范围以进行类别计数

时间:2015-12-23 10:45:30

标签: python python-2.7 pandas

我有两个pandas数据帧(df1和df2):

df1有12列,其中a1,a2,...,a9是空列。以下是df1的示例:

Stock Start_Date          End_Date        a1 a2 a3 a4 .... a9
A   09-12-2015 20:04    10-12-2015 23:04                
B   09-12-2015 10:04    09-12-2015 20:14                
A   11-12-2015 00:22    11-12-2015 08:04                
C   08-12-2015 06:56    10-12-2015 20:54                

df2有4列。以下是一个示例:

Stock   date_time     Opening   closing
A   09-12-2015 21:24    144.3   10
A   09-12-2015 21:27    225.51  24
B   09-12-2015 10:20    134.42  11
A   09-12-2015 20:04    231.22  17
B   09-12-2015 10:24    399.55  32
A   09-12-2015 20:04    246.77  21
B   09-12-2015 14:22    76.23   8
C   08-12-2015 09:44    232.22  15
C   09-12-2015 20:04    222.91  12
A   11-12-2015 02:06    93.21   7
B   09-12-2015 20:04    211.36  26
C   09-12-2015 20:04    111.21  8

现在,我希望输出像这样,df1:

Stock   Start_Date       End_Date          a1   a2  a3  a4 ....a9
A   09-12-2015 20:04    10-12-2015 23:04    0   2   2   0      0
B   09-12-2015 10:04    09-12-2015 20:14    1   1   2   0      0
A   11-12-2015 00:22    11-12-2015 08:04    1   0   0   0      0
C   08-12-2015 06:56    10-12-2015 20:54    0   0   0   1      0

即每个股票,Start_Date& End_Date df1的组合,结果应该具有df2的日期时间范围内每个类别的计数。

这里在最终输出中,a1 =计数[打开(0-100)和关闭(0-10)],a2 =计数[打开(101-200)和关闭(11-20)],a3 =计数[打开(201-400)和关闭(21-50)],a4 =计数[打开(0-100)和关闭(11-20)]等等,所有9种组合。

我有R代码,但对于更大的数据集效果不佳。任何人都可以帮我解决如何在python / pandas中执行此操作。任何帮助表示赞赏!!

1 个答案:

答案 0 :(得分:1)

您可以尝试使用此解决方案,我会删除df1的空列,但它也适用于它们:

#merge dataframes by Stock, select datetimes between start and end
df = df1.merge(df2,on='Stock', how='left')
df = df[(df.date_time >= df.Start_Date) & (df.date_time <= df.End_Date)]
#remove column date_time
df = df.drop(['date_time'], axis=1)
print df
#   Stock          Start_Date            End_Date  Opening  closing
#0      A 2015-09-12 20:04:00 2015-10-12 23:04:00   144.30       10
#1      A 2015-09-12 20:04:00 2015-10-12 23:04:00   225.51       24
#2      A 2015-09-12 20:04:00 2015-10-12 23:04:00   231.22       17
#3      A 2015-09-12 20:04:00 2015-10-12 23:04:00   246.77       21
#5      B 2015-09-12 10:04:00 2015-09-12 20:14:00   134.42       11
#6      B 2015-09-12 10:04:00 2015-09-12 20:14:00   399.55       32
#7      B 2015-09-12 10:04:00 2015-09-12 20:14:00    76.23        8
#8      B 2015-09-12 10:04:00 2015-09-12 20:14:00   211.36       26
#13     A 2015-11-12 00:22:00 2015-11-12 08:04:00    93.21        7
#14     C 2015-08-12 06:56:00 2015-10-12 20:54:00   232.22       15
#15     C 2015-08-12 06:56:00 2015-10-12 20:54:00   222.91       12
#16     C 2015-08-12 06:56:00 2015-10-12 20:54:00   111.21        8

#values to new columns by conditions - cast boolean to integers
df['a1'] = ((df.Opening.between(0,100)) & (df.closing.between(0,10))).astype(int)
df['a2'] = ((df.Opening.between(100,200)) & (df.closing.between(11,20))).astype(int)
#add other columns like a1 and a2
print df
#   Stock          Start_Date            End_Date  Opening  closing  a1  a2
#0      A 2015-09-12 20:04:00 2015-10-12 23:04:00   144.30       10   0   0
#1      A 2015-09-12 20:04:00 2015-10-12 23:04:00   225.51       24   0   0
#2      A 2015-09-12 20:04:00 2015-10-12 23:04:00   231.22       17   0   0
#3      A 2015-09-12 20:04:00 2015-10-12 23:04:00   246.77       21   0   0
#5      B 2015-09-12 10:04:00 2015-09-12 20:14:00   134.42       11   0   1
#6      B 2015-09-12 10:04:00 2015-09-12 20:14:00   399.55       32   0   0
#7      B 2015-09-12 10:04:00 2015-09-12 20:14:00    76.23        8   1   0
#8      B 2015-09-12 10:04:00 2015-09-12 20:14:00   211.36       26   0   0
#13     A 2015-11-12 00:22:00 2015-11-12 08:04:00    93.21        7   1   0
#14     C 2015-08-12 06:56:00 2015-10-12 20:54:00   232.22       15   0   0
#15     C 2015-08-12 06:56:00 2015-10-12 20:54:00   222.91       12   0   0
#16     C 2015-08-12 06:56:00 2015-10-12 20:54:00   111.21        8   0   0

#groupby and sum rows
df= df.groupby(['Stock', 'Start_Date', 'End_Date']).sum()
df = df.drop(['Opening', 'closing'], axis=1)
print df.reset_index()
#  Stock          Start_Date            End_Date  a1  a2
#0     A 2015-09-12 20:04:00 2015-10-12 23:04:00   0   0
#1     A 2015-11-12 00:22:00 2015-11-12 08:04:00   1   0
#2     B 2015-09-12 10:04:00 2015-09-12 20:14:00   1   1
#3     C 2015-08-12 06:56:00 2015-10-12 20:54:00   0   0