Given a DataFrame like this:
df = {'Request': [0, 0, 1, 0, 1, 0, 0],
'Time': ['16:00', '17:00', '18:00', '19:00', '20:00', '20:30', '24:00'],
'grant': [3, 0, 0, 5, 0, 0, 5]}
pd.DataFrame(df).set_index('Time')
Out[16]:
       Request  grant
Time
16:00        0      3
17:00        0      0
18:00        1      0
19:00        0      5
20:00        1      0
20:30        0      0
24:00        0      5
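The values in the Request column are booleans indicating whether a request was made: 1 = request, 0 = no request. The values in the grant column are the initial grant amount.
For each request, I want to compute the time between the request and the grant. In this case that would be 19:00 - 18:00 = 1 hour and 24:00 - 20:00 = 4 hours. Is there an easy way to do this with pandas on a large dataset?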
Answer 0 (score: 1)
I would approach it like this:
import pandas as pd

df = {'Request': [0, 0, 1, 0, 1, 0, 0],
'Time': ['16:00', '17:00', '18:00', '19:00', '20:00', '20:30', '24:00'],
'grant': [3, 0, 0, 5, 0, 0, 5]}
df = pd.DataFrame(df)  # create the DataFrame
# drop any rows that have neither a grant nor a request
# (.copy() avoids a SettingWithCopyWarning on the assignments below)
df = df[(df[['grant', 'Request']] != 0).any(axis=1)].copy()
# convert the time from HH:MM to a number of minutes
df['Time'] = df['Time'].str.split(":").apply(lambda x: int(x[0])*60 + int(x[1]))
# difference between each remaining row and the one before it
df['timeElapsed'] = df['Time'].diff()
# keep only the grant rows and their elapsed times;
# dropna removes the first row, which has no preceding row
df = df[df['grant'] != 0].dropna()
# drop all columns except timeElapsed and grant
df = df[['timeElapsed', 'grant']]
The output then looks like this, with timeElapsed in minutes:
   timeElapsed  grant
3         60.0      5
6        240.0      5
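Note that the diff-based approach above measures the gap to whatever row comes immediately before each grant, so it relies on requests and grants alternating. As a hedged alternative (not part of the answer above), the sketch below uses pd.merge_asof to pair each grant with the most recent earlier request. The names to_minutes, request_minutes, grant_minutes and paired are illustrative assumptions. It also assumes the rows are already sorted by time, and it will match one request to several grants if several grants follow it.

```python
import pandas as pd


def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string (including '24:00') to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


data = pd.DataFrame({
    "Request": [0, 0, 1, 0, 1, 0, 0],
    "Time": ["16:00", "17:00", "18:00", "19:00", "20:00", "20:30", "24:00"],
    "grant": [3, 0, 0, 5, 0, 0, 5],
})
data["minutes"] = data["Time"].map(to_minutes)

# Split into request rows and grant rows, each keyed by its time in minutes.
requests = (data.loc[data["Request"] != 0, ["minutes"]]
            .rename(columns={"minutes": "request_minutes"}))
grants = (data.loc[data["grant"] != 0, ["minutes", "grant"]]
          .rename(columns={"minutes": "grant_minutes"}))

# For every grant, look backward for the latest request at or before it.
paired = pd.merge_asof(
    grants,
    requests,
    left_on="grant_minutes",
    right_on="request_minutes",
    direction="backward",
)

# Grants with no earlier request get NaN and are dropped.
paired["timeElapsed"] = paired["grant_minutes"] - paired["request_minutes"]
paired = paired.dropna(subset=["request_minutes"])
print(paired[["timeElapsed", "grant"]])
```

On the sample data this should give the same 60 and 240 minutes as the diff-based version. The two approaches differ only when requests and grants do not strictly alternate.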
Answer 1 (score: 0)
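First, you need to convert the Time index into something that can be subtracted in order to find the time delta. Because 24:00 is not a valid clock time, the index cannot be converted to timestamps (for example with pd.to_datetime). The solution below uses decimal time instead (1:30 PM = 13.5), and assumes df is the DataFrame from the question with Time set as its index: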
# Convert the index into decimal time
df.index = pd.to_timedelta(df.index + ':00') / pd.Timedelta(hours=1)
# Get time when each request was made
r = df[df['Request'] != 0].index.to_series()
# Get time where each grant was made
g = df[df['grant'] != 0].index.to_series()
# `asof` means: for each time in `g`, get the last available value in `r`
tmp = r.asof(g)
df['Delta'] = tmp.index - tmp
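For reference, here is a self-contained sketch of the same idea that can be run end to end. The variable names (frame, request_times, grant_times, last_request) are illustrative, and the subtraction is done on NumPy arrays only to keep the index alignment explicit. It assumes the rows are sorted by time, which Series.asof requires.

```python
import pandas as pd

frame = pd.DataFrame({
    "Request": [0, 0, 1, 0, 1, 0, 0],
    "Time": ["16:00", "17:00", "18:00", "19:00", "20:00", "20:30", "24:00"],
    "grant": [3, 0, 0, 5, 0, 0, 5],
}).set_index("Time")

# "24:00" is not a valid time of day, but "24:00:00" is a valid duration,
# so parse the labels as timedeltas and divide by one hour to get decimal hours.
frame.index = pd.to_timedelta(frame.index + ":00") / pd.Timedelta(hours=1)

# Times (in decimal hours) at which requests and grants occurred.
request_times = frame[frame["Request"] != 0].index.to_series()
grant_times = frame[frame["grant"] != 0].index

# For each grant time, the most recent request time at or before it
# (NaN when no request precedes the grant).
last_request = request_times.asof(grant_times)

# Hours between each grant and the request before it, indexed by grant time.
delta = pd.Series(
    last_request.index.to_numpy() - last_request.to_numpy(),
    index=last_request.index,
)
frame["Delta"] = delta  # aligned on the index; rows without a grant stay NaN
print(frame)
```

On the sample data, Delta should be 1.0 at 19.0 and 4.0 at 24.0, matching the 1 hour and 4 hours expected in the question. It is NaN on every other row, including the 16:00 grant, which has no earlier request.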