My original pandas DataFrame looks like this:
df =
Person_ID | trip_purpose | trip_start_time | trip_end_time
-----------------------------------------------------------
1 | 'Work' | 05:40:00 | 05:42:00
2 | 'School' | 06:40:00 | 06:45:00
1 | 'Leisure' | 05:52:00 | 06:37:00
1 | 'Home' | 06:40:00 | 06:49:00
...
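For reproducibility, the four rows shown above can be built as follows. This is a minimal sketch; it assumes the time columns are stored as plain strings, which the later `','.join(x)` call suggests.

```python
import pandas as pd

# Sample rows copied from the table above.
# Assumption: times are stored as strings, not as datetime/timedelta values.
df = pd.DataFrame({
    "Person_ID": [1, 2, 1, 1],
    "trip_purpose": ["Work", "School", "Leisure", "Home"],
    "trip_start_time": ["05:40:00", "06:40:00", "05:52:00", "06:40:00"],
    "trip_end_time": ["05:42:00", "06:45:00", "06:37:00", "06:49:00"],
})
print(df)
```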
Step 1: group by Person_ID:
df = df.groupby('Person_ID').agg(lambda x : ','.join(x).split(','))
# this is faster than grouping by .agg(list)
Result of the grouping:
Person_ID | trip_purpose | trip_start_time | trip_end_time
---------------------------------------------------------------
| ['Work', | [05:40:00, | [05:42:00,
1 | 'Leisure', | 05:52:00, | 06:37:00,
| 'Home'] | 06:40:00] | 06:49:00]
| | |
2 | ['School'] | [06:40:00 ] | [06:45:00]
...
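As a sanity check on the grouping step, the sketch below compares the join/split aggregation with a plain `.agg(list)` on the sample rows. It only checks that both produce the same nested lists; it does not measure speed. Note that the join/split variant only works for string columns whose values contain no commas.

```python
import pandas as pd

# Sample rows from the question; times assumed to be strings.
df = pd.DataFrame({
    "Person_ID": [1, 2, 1, 1],
    "trip_purpose": ["Work", "School", "Leisure", "Home"],
    "trip_start_time": ["05:40:00", "06:40:00", "05:52:00", "06:40:00"],
    "trip_end_time": ["05:42:00", "06:45:00", "06:37:00", "06:49:00"],
})

cols = ["trip_purpose", "trip_start_time", "trip_end_time"]

# Variant from the question: join the strings, then split them again.
grouped_join = df.groupby("Person_ID")[cols].agg(lambda x: ",".join(x).split(","))

# Plain alternative: collect the values of each group into a list.
grouped_list = df.groupby("Person_ID")[cols].agg(list)

# Compare the two results column by column.
same = all(grouped_join[c].tolist() == grouped_list[c].tolist() for c in cols)
print("identical nested lists:", same)
```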
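In the next step, I want to compute how much time each person spends on each activity between two trips and write the result to dedicated columns.
Since my agents start their day at home, the first activity is always being at home, which means that duration_activity_1 is given by the first start time:
pd.to_timedelta(df['trip_start_time'].apply(lambda x: x[0]))
This also means that a person who makes 3 trips, like person 1, has 4 activities, because he/she starts from home.
The last activity lasts until midnight, so it is calculated as
pd.to_datetime("23:59:59") - pd.to_datetime(df['trip_end_time'].apply(lambda x: x[-1]))
All activity durations between the first and the last one can be calculated by subtracting the end time of the previous trip from the start time of the current trip:
pd.to_timedelta(df['trip_start_time'].apply(lambda x: x[i])) - pd.to_timedelta(df['trip_end_time'].apply(lambda x: x[i - 1]))
The result should look like this: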
Person_ID | trip_purpose | trip_start_time | trip_end_time | duration_activity_1 | duration_activity_2 | duration_activity_3 | duration_activity_4 | ...
------------------------------------------------------------------------------------------------------------------------------------------------------------
| ['Work', | [05:40:00, | [05:42:00, | | | | |
1 | 'Leisure', | 05:52:00, | 06:37:00, | 05:40:00 | 00:10:00 | 00:03:00 | 17:11:00 |
| 'Home'] | 06:40:00] | 06:49:00] | | | | |
| | | | | | | |
2 | ['School'] | [06:40:00 ] | [06:45:00] | 06:40:00 | 17:15:00 | nan | nan |
...
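To make the arithmetic concrete, here is a small worked example for person 1 only. It is a sketch that applies the three rules from the text to hard-coded times, using 23:59:59 as the end of the day as in the formula above.

```python
import pandas as pd

# Trip start and end times of person 1, in chronological order.
starts = pd.to_timedelta(["05:40:00", "05:52:00", "06:40:00"])
ends = pd.to_timedelta(["05:42:00", "06:37:00", "06:49:00"])
end_of_day = pd.Timedelta("23:59:59")

# First activity: at home from midnight until the first trip starts.
durations = [starts[0]]
# Middle activities: start of the current trip minus end of the previous trip.
durations += [starts[i] - ends[i - 1] for i in range(1, len(starts))]
# Last activity: from the end of the last trip until the end of the day.
durations.append(end_of_day - ends[-1])

for k, d in enumerate(durations, start=1):
    print(f"duration_activity_{k}: {d}")
```

Three trips yield four durations, matching the "trips + 1 activities" rule described above.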
Since I want to do the calculation in a vectorized way, I came up with multiple calls to numpy.select():
import numpy as np
import pandas as pd

for i in range(maximum_number_of_activities):
    condlist = [i == 0,  # first activity
                i == df["trip_purpose"].apply(len),  # last activity
                (i > 0) & (i < df["trip_purpose"].apply(len))]  # other activities
    choicelist = [
        # first activity starts at midnight and ends with the first trip
        pd.to_timedelta(df["trip_start_time"].apply(lambda x: x[0])),
        # last activity starts with the last trip and ends at midnight
        pd.to_datetime("23:59:59") - pd.to_datetime(df["trip_end_time"].apply(lambda x: x[-1])),
        # all other activities: start time of the current trip minus end time of the previous trip
        pd.to_timedelta(df["trip_start_time"].apply(lambda x: x[i])) - pd.to_timedelta(df["trip_end_time"].apply(lambda x: x[i - 1]))]
    default = np.nan
    print(pd.DataFrame(np.select(condlist=condlist,
                                 choicelist=choicelist,
                                 default=default),
                       columns=[i]))
# I'm aware that this code is not assigning it to the original DataFrame.
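Here is my problem: I get an IndexError: list index out of range.
I think this is related to the third entry in choicelist. My guess is that even when the i == 0 or the i == df["trip_purpose"].apply(len) case applies, I cannot use the index variable i there, because the third choice is still evaluated and is invalid for some rows?!?
(I get the same result if I write the choicelist directly into the np.select call.)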
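The suspicion can be checked in isolation. Python evaluates every expression in a list before the list is passed to a function, so the `x[i]` lookup runs for all rows regardless of any condition. The sketch below reproduces the error without `np.select` at all; `safe_get` is a hypothetical helper, introduced here only for illustration, that guards the lookup for non-negative indices.

```python
import pandas as pd

# Two people: the first has three trips, the second only one.
start_times = pd.Series([["05:40:00", "05:52:00", "06:40:00"], ["06:40:00"]])
i = 1

error_message = None
try:
    # The lambda runs on every row; the second row has no element at index 1.
    start_times.apply(lambda x: x[i])
except IndexError as exc:
    error_message = str(exc)
print("IndexError:", error_message)


def safe_get(values, idx):
    """Hypothetical helper: return values[idx], or None if idx is past the end."""
    return values[idx] if idx < len(values) else None


# With the guard, rows that are too short yield a missing value instead of raising.
guarded = start_times.apply(lambda x: safe_get(x, i)).tolist()
print(guarded)
```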
Can you think of a way to solve this problem, or of another way to get the DataFrame I want? Any help is much appreciated.
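One possible alternative, sketched here under assumptions rather than offered as the definitive answer: skip the list columns and `np.select`, and compute the durations on the original long-format rows with `groupby(...).shift()`, then pivot to wide columns. The sketch assumes string times, trips that do not overlap, and 23:59:59 as the end of the day; the names `trips`, `prev_end`, `last`, and `durations` are illustrative.

```python
import pandas as pd

# Original long-format rows from the question.
df = pd.DataFrame({
    "Person_ID": [1, 2, 1, 1],
    "trip_purpose": ["Work", "School", "Leisure", "Home"],
    "trip_start_time": ["05:40:00", "06:40:00", "05:52:00", "06:40:00"],
    "trip_end_time": ["05:42:00", "06:45:00", "06:37:00", "06:49:00"],
})

trips = df.copy()
trips["start"] = pd.to_timedelta(trips["trip_start_time"])
trips["end"] = pd.to_timedelta(trips["trip_end_time"])
trips = trips.sort_values(["Person_ID", "start"])

# Activity before each trip: trip start minus the previous trip's end.
# For a person's first trip there is no previous end, so midnight (0) is used.
prev_end = trips.groupby("Person_ID")["end"].shift().fillna(pd.Timedelta(0))
trips["duration"] = trips["start"] - prev_end
trips["activity_no"] = trips.groupby("Person_ID").cumcount() + 1

# Final activity per person: from the end of the last trip to the end of the day.
last = trips.groupby("Person_ID").tail(1).copy()
last["duration"] = pd.Timedelta("23:59:59") - last["end"]
last["activity_no"] += 1

# One row per person, one column per activity; missing activities become NaT.
durations = (
    pd.concat([trips, last])
    .pivot(index="Person_ID", columns="activity_no", values="duration")
    .add_prefix("duration_activity_")
)
print(durations)
```

Because `durations` is indexed by Person_ID, it should be joinable to the grouped frame with something like `grouped.join(durations)`.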