Number of concurrent processes in pandas

Time: 2018-07-04 11:26:28

Tags: python pandas dataframe time-series

Given a pandas DataFrame that records when some programs started and finished running (one row per program):

      starts            finishes
2018-01-01 12:00    2018-01-01 15:00
2018-01-01 16:00    2018-01-01 20:00
2018-01-01 16:30    2018-01-01 20:00
2018-01-01 17:00    2018-01-01 21:00
                ...

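I need to compute the number of concurrently running programs at every time that appears in the table. For the table above, the result would be: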

      time             number_of_conc_progs
2018-01-01 12:00                 1
2018-01-01 15:00                 0
2018-01-01 16:00                 1
2018-01-01 16:30                 2               
2018-01-01 17:00                 3
2018-01-01 20:00                 1                
2018-01-01 21:00                 0 
                     ...

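If a program starts at (for example) 12:00 and the current number of running programs is n, then the value at 12:00 is n + 1.

If a program finishes at (for example) 12:00 and the current number of running programs is n, then the value at 12:00 is n - 1.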
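
The two rules above amount to treating every start as a +1 event and every finish as a -1 event, then keeping a running total in time order. A minimal sketch in plain Python, assuming the timestamps are strings in the same `YYYY-MM-DD HH:MM` format so that they sort chronologically; the function name `concurrent_counts` is hypothetical:

```python
from collections import Counter

def concurrent_counts(intervals):
    """Return (time, running) pairs for every distinct start/finish time."""
    delta = Counter()
    for start, finish in intervals:
        delta[start] += 1    # a program starts: n + 1
        delta[finish] -= 1   # a program finishes: n - 1
    running = 0
    out = []
    # Events sharing a timestamp are already netted in delta,
    # so each distinct time appears exactly once.
    for t in sorted(delta):
        running += delta[t]
        out.append((t, running))
    return out

intervals = [
    ("2018-01-01 12:00", "2018-01-01 15:00"),
    ("2018-01-01 16:00", "2018-01-01 20:00"),
    ("2018-01-01 16:30", "2018-01-01 20:00"),
    ("2018-01-01 17:00", "2018-01-01 21:00"),
]
counts = concurrent_counts(intervals)
```

For the four example rows, `counts` matches the expected table: 1, 0, 1, 2, 3, 1, 0.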

1 Answer:

Answer 0: (score: 0)

import pandas as pd

# Build the example dataframe (one row per program)
df = pd.DataFrame([
    ["2018-01-01 12:00", "2018-01-01 15:00"],
    ["2018-01-01 16:00", "2018-01-01 20:00"],
    ["2018-01-01 16:30", "2018-01-01 20:00"],
    ["2018-01-01 17:00", "2018-01-01 21:00"]])
df.columns = ["starts", "finishes"]

# Each start time increases the number of running programs by 1
starts = pd.DataFrame()
starts["time"] = df.starts
starts["number_of_conc_progs"] = 1

# Each finish time decreases the number of running programs by 1
finishes = pd.DataFrame()
finishes["time"] = df.finishes
finishes["number_of_conc_progs"] = -1

# Stack the start and finish events into a single dataframe
result = pd.concat([starts, finishes])
# Sort the events by time
result = result.sort_values(by=['time'])
# If several starts or finishes share the same time, sum their +1/-1 values
result = result.groupby(['time']).sum()
# A cumulative sum gives the actual number of programs running at each time
result.number_of_conc_progs = result.number_of_conc_progs.cumsum()
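
The same idea can be written more compactly. The sketch below is a variant, not the answer's original code: it assumes the `starts` and `finishes` columns can be parsed by `pd.to_datetime`, so that ordering is chronological rather than string-based, and the helper name `count_concurrent` and the intermediate `delta` column are hypothetical.

```python
import pandas as pd

def count_concurrent(df):
    """Return a Series of running-program counts indexed by event time."""
    events = pd.DataFrame({
        # All start times followed by all finish times, parsed as datetimes
        "time": pd.concat(
            [pd.to_datetime(df["starts"]), pd.to_datetime(df["finishes"])],
            ignore_index=True,
        ),
        # +1 for each start, -1 for each finish, in the same order
        "delta": [1] * len(df) + [-1] * len(df),
    })
    # groupby sorts by time and nets simultaneous events; cumsum accumulates
    return (
        events.groupby("time")["delta"].sum().cumsum()
        .rename("number_of_conc_progs")
    )

df = pd.DataFrame(
    [
        ["2018-01-01 12:00", "2018-01-01 15:00"],
        ["2018-01-01 16:00", "2018-01-01 20:00"],
        ["2018-01-01 16:30", "2018-01-01 20:00"],
        ["2018-01-01 17:00", "2018-01-01 21:00"],
    ],
    columns=["starts", "finishes"],
)
result = count_concurrent(df)
```

Calling `result.reset_index()` turns the Series back into a two-column table with `time` and `number_of_conc_progs`, as in the question.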