Iterating over lists in a pandas DataFrame

Time: 2019-02-28 14:20:25

Tags: python pandas numpy

My original pandas DataFrame looks like this:

df =

    Person_ID | trip_purpose | trip_start_time | trip_end_time
    -----------------------------------------------------------
         1    |    'Work'    |   05:40:00      |  05:42:00
         2    |   'School'   |   06:40:00      |  06:45:00
         1    |   'Leisure'  |   05:52:00      |  06:37:00
         1    |    'Home'    |   06:40:00      |  06:49:00  
        ...
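
For reference, the four sample rows above can be rebuilt as follows. This is a minimal sketch that assumes the time columns hold plain strings; the question does not state their dtype.

```python
import pandas as pd

# The four rows shown in the table above. Storing the times as strings is an
# assumption, chosen because the grouping step below joins values with ','.
df = pd.DataFrame({
    "Person_ID": [1, 2, 1, 1],
    "trip_purpose": ["Work", "School", "Leisure", "Home"],
    "trip_start_time": ["05:40:00", "06:40:00", "05:52:00", "06:40:00"],
    "trip_end_time": ["05:42:00", "06:45:00", "06:37:00", "06:49:00"],
})
print(df)
```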

Step 1: group by Person_ID:

df = df.groupby('Person_ID').agg(lambda x : ','.join(x).split(',')) 
# this is faster than grouping by .agg(list)
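
As a quick check, the join/split aggregation can be compared with the plain `.agg(list)` on the sample rows. This sketch assumes string columns; the join/split variant would also split any value that itself contains a comma. The speed claim in the comment is the author's and is not measured here.

```python
import pandas as pd

# Sample rows from the question, with times assumed to be strings.
df = pd.DataFrame({
    "Person_ID": [1, 2, 1, 1],
    "trip_purpose": ["Work", "School", "Leisure", "Home"],
    "trip_start_time": ["05:40:00", "06:40:00", "05:52:00", "06:40:00"],
    "trip_end_time": ["05:42:00", "06:45:00", "06:37:00", "06:49:00"],
})

# Aggregation used in the question: join each group into one string, then split.
joined = df.groupby("Person_ID").agg(lambda x: ",".join(x).split(","))

# Plain alternative mentioned in the comment.
listed = df.groupby("Person_ID").agg(list)

# Compare the two results cell by cell via nested dictionaries.
same = joined.to_dict() == listed.to_dict()
print(same)
```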

Result of the grouping:

Person_ID |   trip_purpose   | trip_start_time | trip_end_time
---------------------------------------------------------------
          |    ['Work',      |   [05:40:00,    |  [05:42:00,
     1    |     'Leisure',   |    05:52:00,    |   06:37:00,
          |     'Home']      |    06:40:00]    |   06:49:00]
          |                  |                 |
     2    |   ['School']     |   [06:40:00 ]   |  [06:45:00]   
    ...

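In the next step, I want to calculate the time the person spends on each activity between two trips and write it to a dedicated column. Since my agents start their day at home, the first activity is always being at home, which means duration_activity_1 is given by the first start time: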

pd.to_timedelta(df['trip_start_time'].apply(lambda x: x[0]))
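
A minimal sketch of this first step on the two sample persons. The name `grouped` and the literal frame are assumptions standing in for the grouped DataFrame shown above.

```python
import pandas as pd

# Hypothetical stand-in for the grouped frame (times as strings).
grouped = pd.DataFrame(
    {
        "trip_start_time": [["05:40:00", "05:52:00", "06:40:00"], ["06:40:00"]],
        "trip_end_time": [["05:42:00", "06:37:00", "06:49:00"], ["06:45:00"]],
    },
    index=pd.Index([1, 2], name="Person_ID"),
)

# Every list has at least one trip, so x[0] is always a valid index.
first_duration = pd.to_timedelta(grouped["trip_start_time"].apply(lambda x: x[0]))
print(first_duration)
```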

This means that if person 1 makes 3 trips, he/she has 4 activities, because he/she starts at home.

The last activity lasts until midnight, which means it is calculated by:

pd.to_datetime("23:59:59") - pd.to_datetime(df['trip_end_time'].apply(lambda x: x[-1]))
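
The same quantity can also be sketched with timedeltas only, which avoids depending on which calendar date `pd.to_datetime` attaches to a bare time string. This is a swapped-in variant, not the expression from the question, and `grouped` is again a hypothetical stand-in for the grouped frame.

```python
import pandas as pd

# Hypothetical stand-in for the grouped frame (times as strings).
grouped = pd.DataFrame(
    {
        "trip_start_time": [["05:40:00", "05:52:00", "06:40:00"], ["06:40:00"]],
        "trip_end_time": [["05:42:00", "06:37:00", "06:49:00"], ["06:45:00"]],
    },
    index=pd.Index([1, 2], name="Person_ID"),
)

# x[-1] is the end time of the last trip and exists for every non-empty list.
last_end = pd.to_timedelta(grouped["trip_end_time"].apply(lambda x: x[-1]))

# Time left between the last trip and the end of the day (23:59:59 as in the question).
last_duration = pd.Timedelta("23:59:59") - last_end
print(last_duration)
```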

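All activity durations between the first and the last are calculated by subtracting the end time of the previous trip from the start time of the current trip: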

pd.to_timedelta(df['trip_start_time'].apply(lambda x: x[i])) - pd.to_timedelta(df['trip_end_time'].apply(lambda x: x[i - 1]))
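
As a worked example of this subtraction, here are the intermediate durations for person 1 alone, computed on plain lists. The variable names are illustrative only.

```python
import pandas as pd

# Trip start and end times of person 1 from the grouped table.
starts = ["05:40:00", "05:52:00", "06:40:00"]
ends = ["05:42:00", "06:37:00", "06:49:00"]

# For i = 1 .. len(starts) - 1: start of trip i minus end of trip i - 1.
gaps = [
    pd.Timedelta(starts[i]) - pd.Timedelta(ends[i - 1])
    for i in range(1, len(starts))
]
print(gaps)
```

Restricting `i` to `range(1, len(starts))` is what keeps both `starts[i]` and `ends[i - 1]` inside the list.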

It should look like this:

Person_ID |   trip_purpose   | trip_start_time | trip_end_time | duration_activity_1 | duration_activity_2 | duration_activity_3 | duration_activity_4 | ...
------------------------------------------------------------------------------------------------------------------------------------------------------------
          |    ['Work',      |   [05:40:00,    |  [05:42:00,   |                     |                     |                     |                     |  
     1    |     'Leisure',   |    05:52:00,    |   06:37:00,   |    05:40:00         |     00:10:00        |     00:03:00        |     17:11:00        |
          |     'Home']      |    06:40:00]    |   06:49:00]   |                     |                     |                     |                     |
          |                  |                 |               |                     |                     |                     |                     |
     2    |   ['School']     |   [06:40:00 ]   |  [06:45:00]   |    06:40:00         |     17:15:00        |          nan        |          nan        |
    ...

Since I want to do the calculation in a vectorized way, I came up with multiple numpy.select() calls:

import numpy as np
import pandas as pd

for i in range(maximum_number_of_activities):
    condlist = [i == 0,    # first activity
                i == df["trip_purpose"].apply(len),    # last activity
                (i > 0) & (i < df["trip_purpose"].apply(len))]    # other activities
    choicelist = [pd.to_timedelta(df["trip_start_time"].apply(lambda x: x[0])),    # first activity starts at midnight and ends with the first trip
                  pd.to_datetime("23:59:59") - pd.to_datetime(df["trip_end_time"].apply(lambda x: x[-1])),    # last activity starts with the last trip and ends at midnight
                  pd.to_timedelta(df["trip_start_time"].apply(lambda x: x[i])) - pd.to_timedelta(df["trip_end_time"].apply(lambda x: x[i - 1]))]    # all other activities are calculated by subtracting the end time of the previous trip from the start time of the current trip
    default = np.nan
    print(pd.DataFrame(np.select(condlist=condlist,
                                 choicelist=choicelist,
                                 default=default),
                       columns=[i]))

# I'm aware that this code is not assigning it to the original DataFrame.

Here is my problem: I get an IndexError: list index out of range

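I guess this has to do with the third entry in the choicelist. I suppose that even when I am in the case i == 0 or i == df["trip_purpose"].apply(len), I can't use the index variable i there, because the third expression would be invalid?!? (I get the same result if I write the choicelist directly into the np.select function.)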
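
This guess is consistent with how Python evaluates the call: every entry of the choicelist is computed for all rows before np.select ever looks at the condlist, so an `apply` with `x[i]` fails on any list shorter than `i + 1`. A minimal sketch with a hypothetical two-row Series:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one grouped column: lists of different lengths.
trips = pd.Series([["05:40:00", "05:52:00", "06:40:00"], ["06:40:00"]])
i = 1

try:
    # The choice is built first, for every row; the second list has no x[1].
    choice = trips.apply(lambda x: x[i])
    # Never reached: np.select only picks among values that already exist.
    np.select([trips.apply(len) > i], [choice], default=None)
except IndexError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

One way around this is to make the lookup itself safe, for example by returning a missing value when `i` is out of range, rather than relying on a condition to hide the invalid rows.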

Can you think of a way to solve this problem, or of another way to get the DataFrame I want? Thank you very much for your help.
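
One possible direction, sketched under assumptions: build the full list of durations per person first, pad the shorter lists with NaT, and only then expand them into columns. The names `grouped`, `activity_durations` and `END_OF_DAY` are hypothetical, the times are assumed to be strings, and this is a plain Python loop over persons rather than an np.select solution.

```python
import pandas as pd

# Hypothetical stand-in for the grouped frame (times as strings).
grouped = pd.DataFrame(
    {
        "trip_start_time": [["05:40:00", "05:52:00", "06:40:00"], ["06:40:00"]],
        "trip_end_time": [["05:42:00", "06:37:00", "06:49:00"], ["06:45:00"]],
    },
    index=pd.Index([1, 2], name="Person_ID"),
)

END_OF_DAY = pd.Timedelta("23:59:59")  # same end-of-day value as in the question


def activity_durations(starts, ends):
    """Return the durations of all activities of one person as a list."""
    starts = [pd.Timedelta(t) for t in starts]
    ends = [pd.Timedelta(t) for t in ends]
    # Intermediate activities: next trip start minus previous trip end.
    between = [s - e for s, e in zip(starts[1:], ends[:-1])]
    return [starts[0], *between, END_OF_DAY - ends[-1]]


rows = [
    activity_durations(s, e)
    for s, e in zip(grouped["trip_start_time"], grouped["trip_end_time"])
]

# Pad shorter rows with NaT so every person has the same number of columns.
width = max(len(r) for r in rows)
rows = [r + [pd.NaT] * (width - len(r)) for r in rows]

durations = pd.DataFrame(
    rows,
    index=grouped.index,
    columns=[f"duration_activity_{n}" for n in range(1, width + 1)],
)
result = grouped.join(durations)
print(result.filter(like="duration"))
```

Because each person's list is sliced with `starts[1:]` and `ends[:-1]`, no index beyond the list length is ever requested, so the IndexError cannot occur.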

0 answers:

No answers yet