Avoiding iteration to get occurrence counts in pandas

Time: 2017-11-29 22:08:41

Tags: python pandas dataframe

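I have two DataFrames: df_stops, which contains a list of bus stop numbers, and df_arrivals, which contains bus arrivals with (among others) the columns StopNumber and OnTimeStatus. OnTimeStatus is -1, 0, or 1, meaning the bus was early, on time, or late, respectively.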

I would like to add 3 new columns to the df_stops DataFrame:

  1. PercentEarly
  2. PercentOnTime
  3. PercentLate

I am having a hard time figuring out how to do this without iterating in a loop. If I were doing it iteratively, I would do something like:

    for idx, row in df_stops.iterrows():
        # boolean mask of the arrivals recorded at this stop
        at_stop = df_arrivals['StopNumber'] == row['StopNumber']
        # number of early arrivals / total number of arrivals @ that stop
        df_stops.loc[idx, 'PercentEarly'] = (
            (at_stop & (df_arrivals['OnTimeStatus'] < 0)).sum() / at_stop.sum()
        )

        # same idea for on time and late arrivals

I am fairly new to pandas and to data science in general, so any help is appreciated.

How can I do this without iterating over every row of df_stops?

Edit:

df_arrivals

           RouteNumber  ScheduledUnix  StopNumber OnTimeStatus
    0               44     1511977533       40888            0
    1               44     1511979273       40888            0
    2               44     1511979273       40888            0
    3               44     1511980353       40888            0
    4               44     1511979273       40888            0
    5               44     1511980353       40888            1
    ...            ...            ...         ...          ...
    67538           85     1512005100       40900            0
    67539           85     1512008700       40900            0
    67540           85     1512008700       40900           -1
    67541           85     1512008700       40900            0
    67542           85     1512012300       40900            0
    

df_stops

         StopNumber
    0         40877
    1         40874
    2         40876
    3         40725
    4         40875
    5         40776
    6         40730
    7         40723
    8         40721
    9         40729
    10        40722
    

The desired output would look like:

         StopNumber    EarlyPercent    OnTimePercent    LatePercent
    0         40877            0.14             0.80           0.06
    ...
    

3 Answers:

Answer 0: (score: 0)

You can use groupby:

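    # groupby yields (key, sub-DataFrame) pairs, one per stop
    for stop_number, stop_arrivals in df_arrivals.groupby('StopNumber'):
        # number of arrivals per OnTimeStatus value at this stop
        print(stop_number, stop_arrivals.groupby('OnTimeStatus').size())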

Does it work as expected now?
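The groupby idea can also be taken one step further so that no Python-level loop is needed. The following is a minimal sketch, assuming the StopNumber and OnTimeStatus columns shown in the question; the toy data is invented for illustration and is not the asker's data.

```python
import pandas as pd

# Toy data, invented for illustration only
df_arrivals = pd.DataFrame({
    "StopNumber":   [40888, 40888, 40888, 40888, 40900, 40900],
    "OnTimeStatus": [0, 0, 1, -1, 0, -1],
})
df_stops = pd.DataFrame({"StopNumber": [40888, 40900, 40877]})

# Fraction of each status within each stop: one row per stop, one column per status
shares = (
    df_arrivals.groupby("StopNumber")["OnTimeStatus"]
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
    .reindex(columns=[-1, 0, 1], fill_value=0.0)  # keep all three statuses, even if unseen
    .rename(columns={-1: "PercentEarly", 0: "PercentOnTime", 1: "PercentLate"})
    .rename_axis(columns=None)
    .reset_index()
)

# A left merge keeps every stop; stops with no recorded arrivals get NaN
result = df_stops.merge(shares, on="StopNumber", how="left")
print(result)
```

The values are fractions between 0 and 1, matching the desired output in the question; multiply by 100 if true percentages are wanted.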

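Answer 1: (score: 0)

I never figured out how to do this without iterating. I also decided to store the number of early/on-time/late arrivals rather than percentages. Here is my solution, which seems to be very fast even when given tens of thousands of entries: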

chain
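A minimal sketch of storing counts rather than percentages, assuming the StopNumber and OnTimeStatus columns from the question. It is not this answer's original code: it swaps in pd.crosstab in place of an explicit loop, and the column names EarlyCount, OnTimeCount, and LateCount, as well as the toy data, are made up for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
df_arrivals = pd.DataFrame({
    "StopNumber":   [40888, 40888, 40888, 40888, 40900, 40900],
    "OnTimeStatus": [0, 0, 1, -1, 0, -1],
})
df_stops = pd.DataFrame({"StopNumber": [40888, 40900, 40877]})

count_cols = ["EarlyCount", "OnTimeCount", "LateCount"]  # hypothetical names

# Contingency table: rows are stops, columns are statuses, cells are arrival counts
counts = (
    pd.crosstab(df_arrivals["StopNumber"], df_arrivals["OnTimeStatus"])
    .reindex(columns=[-1, 0, 1], fill_value=0)
    .rename(columns=dict(zip([-1, 0, 1], count_cols)))
    .rename_axis(columns=None)
    .reset_index()
)

# Keep every stop; stops with no recorded arrivals get a count of 0
result = df_stops.merge(counts, on="StopNumber", how="left")
result[count_cols] = result[count_cols].fillna(0).astype(int)
print(result)
```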

Answer 2: (score: 0)

To answer the question about getting the number of occurrences:

What I would do is:

    # Early (-1), on-time (0), and late (1) arrival counts for a single stop.
    # To get these for every stop, group by StopNumber first (see below).
    # Define the stop of interest first, i.e. stop_num = the number
    counts = df_arrivals.loc[df_arrivals.StopNumber == stop_num, 'OnTimeStatus'].value_counts()
    # .get returns 0 when a status never occurs at this stop
    early, ontime, late = counts.get(-1, 0), counts.get(0, 0), counts.get(1, 0)

    # divide by the number of arrivals at this stop, not by rows in df_stops
    total_arrivals = counts.sum()
    EarlyPercent = early / total_arrivals
    OntimePercent = ontime / total_arrivals
    LatePercent = late / total_arrivals

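Keep in mind that this only gives the numbers for one stop at a time. In practice, I think there is a way to avoid iteration without overly complicated code (chaining and so on).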

    # start with NaN so the new columns stay numeric
    df_stops['PercentEarly'] = float('nan')
    df_stops['PercentOnTime'] = float('nan')
    df_stops['PercentLate'] = float('nan')

    # visit each distinct stop once rather than once per arrival row
    for stop_num in df_arrivals.StopNumber.unique():
        counts = df_arrivals.loc[df_arrivals.StopNumber == stop_num, 'OnTimeStatus'].value_counts()
        total_arrivals = counts.sum()
        is_stop = df_stops.StopNumber == stop_num
        df_stops.loc[is_stop, 'PercentEarly'] = counts.get(-1, 0) / total_arrivals
        df_stops.loc[is_stop, 'PercentOnTime'] = counts.get(0, 0) / total_arrivals
        df_stops.loc[is_stop, 'PercentLate'] = counts.get(1, 0) / total_arrivals
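For comparison, a loop-free counterpart to the per-stop code above can be sketched with pd.crosstab and Series.map. This assumes the StopNumber and OnTimeStatus columns from the question; the toy data is invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
df_arrivals = pd.DataFrame({
    "StopNumber":   [40888, 40888, 40888, 40888, 40900, 40900],
    "OnTimeStatus": [0, 0, 1, -1, 0, -1],
})
df_stops = pd.DataFrame({"StopNumber": [40888, 40900, 40877]})

# Row-normalised contingency table: each stop's row sums to 1
shares = pd.crosstab(
    df_arrivals["StopNumber"], df_arrivals["OnTimeStatus"], normalize="index"
).reindex(columns=[-1, 0, 1], fill_value=0.0)

# Look up each stop's share by StopNumber; stops without arrivals become NaN
df_stops["PercentEarly"] = df_stops["StopNumber"].map(shares[-1])
df_stops["PercentOnTime"] = df_stops["StopNumber"].map(shares[0])
df_stops["PercentLate"] = df_stops["StopNumber"].map(shares[1])
print(df_stops)
```

Using map writes the new columns directly into df_stops and leaves its row order and index untouched, which may be convenient when df_stops already carries other columns.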