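I have two DataFrames: df_stops, which contains a list of bus stop numbers, and df_arrivals, which contains bus arrivals (StopNumber and OnTimeStatus = -1, 0 or 1, meaning the bus was early, on time or late, respectively).
I want to add 3 new columns to the df_stops DataFrame:
PercentEarly
PercentOnTime
PercentLate
I'm having a hard time figuring out how to do this without iterating in a loop. If I did it iteratively, I would do something like: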
for idx, row in df_stops.iterrows():
    # number of early arrivals / total number of arrivals at that stop
    at_stop = df_arrivals['StopNumber'] == row['StopNumber']
    early = at_stop & (df_arrivals['OnTimeStatus'] < 0)
    df_stops.loc[idx, 'PercentEarly'] = early.sum() / at_stop.sum()
    # same idea for on time and late arrivals
I'm fairly new to pandas and to data science in general, so any help is appreciated.
How can I do this without iterating over every row of df_stops?
Edit:
df_arrivals
RouteNumber ScheduledUnix StopNumber OnTimeStatus
0 44 1511977533 40888 0
1 44 1511979273 40888 0
2 44 1511979273 40888 0
3 44 1511980353 40888 0
4 44 1511979273 40888 0
5 44 1511980353 40888 1
... ... ... ... ...
67538 85 1512005100 40900 0
67539 85 1512008700 40900 0
67540 85 1512008700 40900 -1
67541 85 1512008700 40900 0
67542 85 1512012300 40900 0
df_stops:
StopNumber
0 40877
1 40874
2 40876
3 40725
4 40875
5 40776
6 40730
7 40723
8 40721
9 40729
10 40722
The desired output would look like:
StopNumber EarlyPercent OnTimePercent LatePercent
0 40877 0.14 0.80 0.06
...
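Answer 0 (score: 0)
You can use groupby:
for stop_number, group in df_arrivals.groupby('StopNumber'):
    # number of arrivals per OnTimeStatus value at this stop
    print(stop_number, group.groupby('OnTimeStatus').size())
Does that work as expected for you?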
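The groupby idea can also be taken one step further so that no Python loop is needed. The following is a minimal sketch, not a tested solution for the real data: the sample values are made up, and the output column names are taken from the question. It relies on value_counts(normalize=True) to get each status's share within a stop, then unstack to turn the statuses into columns.

```python
import pandas as pd

# Small made-up sample in the shape of df_arrivals (values are illustrative only).
df_arrivals = pd.DataFrame({
    "StopNumber":   [40888, 40888, 40888, 40888, 40900, 40900],
    "OnTimeStatus": [0, 0, 1, -1, 0, -1],
})

# Share of each status within each stop; unstack moves the statuses into columns.
shares = (
    df_arrivals.groupby("StopNumber")["OnTimeStatus"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
    # Make sure all three statuses exist as columns, even if one never occurs.
    .reindex(columns=[-1, 0, 1], fill_value=0)
    .rename(columns={-1: "PercentEarly", 0: "PercentOnTime", 1: "PercentLate"})
    .reset_index()
)
print(shares)
```

The values are fractions between 0 and 1, matching the format of the desired output in the question; multiply by 100 if true percentages are wanted.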
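Answer 1 (score: 0)
I never figured out how to do this without iterating. I also decided to store the number of early/on-time/late arrivals rather than percentages. Here is my solution; it seems to be quite fast even when given thousands of entries: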
chain
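For comparison, storing counts per stop rather than percentages can be sketched with pd.crosstab. This is an illustration only and not the answerer's own code: it avoids iteration, the sample data is made up, and the column names EarlyCount, OnTimeCount, LateCount and TotalCount are hypothetical.

```python
import pandas as pd

# Small made-up sample in the shape of df_arrivals (values are illustrative only).
df_arrivals = pd.DataFrame({
    "StopNumber":   [40888, 40888, 40888, 40900, 40900],
    "OnTimeStatus": [0, 0, 1, 0, -1],
})

# One row per stop, one column per status, each cell holding a count of arrivals.
counts = (
    pd.crosstab(df_arrivals["StopNumber"], df_arrivals["OnTimeStatus"])
    .reindex(columns=[-1, 0, 1], fill_value=0)
    .rename(columns={-1: "EarlyCount", 0: "OnTimeCount", 1: "LateCount"})
)
# Hypothetical extra column: total arrivals per stop, handy for later ratios.
counts["TotalCount"] = counts.sum(axis=1)
print(counts)
```

Keeping the raw counts means the percentages can always be recomputed later, and stops with very few arrivals remain easy to spot.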
Answer 2 (score: 0)
To answer the question about the number of occurrences:
What I would do is:
# Early, on-time and late arrivals for a single stop. To get them for every stop, loop or group by StopNumber first (see below).
# Define a specific stop number and store it as stop_num
at_stop = df_arrivals[df_arrivals.StopNumber == stop_num]
counts = at_stop.OnTimeStatus.value_counts()
# .get returns 0 when a status never occurs at this stop
early, ontime, late = counts.get(-1, 0), counts.get(0, 0), counts.get(1, 0)
total_arrivals = len(at_stop)  # divide by the arrivals at this stop, not by rows in df_stops
EarlyPercent = early / total_arrivals
OntimePercent = ontime / total_arrivals
LatePercent = late / total_arrivals
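Keep in mind that this only covers one stop at a time. I suspect there is a way to avoid iterating without overly complicated code (chaining and so on); the loop below fills in every stop: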
# Start with NaN so that stops without any arrivals stay visibly empty
df_stops['PercentEarly'] = float('nan')
df_stops['PercentOnTime'] = float('nan')
df_stops['PercentLate'] = float('nan')
for stop_num in df_arrivals.StopNumber.unique():
    at_stop = df_arrivals[df_arrivals.StopNumber == stop_num]
    counts = at_stop.OnTimeStatus.value_counts()
    total_arrivals = len(at_stop)
    is_stop = df_stops.StopNumber == stop_num
    # .get returns 0 when a status never occurs at this stop
    df_stops.loc[is_stop, 'PercentEarly'] = counts.get(-1, 0) / total_arrivals
    df_stops.loc[is_stop, 'PercentOnTime'] = counts.get(0, 0) / total_arrivals
    df_stops.loc[is_stop, 'PercentLate'] = counts.get(1, 0) / total_arrivals
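One possible way to avoid the loop is to turn each status into a boolean flag, average the flags per stop, and merge the result back onto df_stops. The sketch below assumes the column names from the question; the sample data is made up, and the intermediate names flags, percents and result are only for illustration.

```python
import pandas as pd

# Made-up samples in the shape of the question's DataFrames.
df_stops = pd.DataFrame({"StopNumber": [40888, 40900, 40877]})
df_arrivals = pd.DataFrame({
    "StopNumber":   [40888, 40888, 40888, 40888, 40900, 40900],
    "OnTimeStatus": [-1, 0, 0, 1, 0, 0],
})

# One boolean column per status; the mean of a boolean column is the
# fraction of rows where it is True.
status = df_arrivals["OnTimeStatus"]
flags = pd.DataFrame({
    "StopNumber": df_arrivals["StopNumber"],
    "PercentEarly": status.lt(0),
    "PercentOnTime": status.eq(0),
    "PercentLate": status.gt(0),
})
percents = flags.groupby("StopNumber").mean().reset_index()

# A left merge keeps every stop; stops with no arrivals get NaN.
result = df_stops.merge(percents, on="StopNumber", how="left")
print(result)
```

The left merge is a deliberate choice here: a stop that never appears in df_arrivals (40877 in the sample) keeps its row with NaN values instead of being dropped or shown as 0.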