A faster way to loop over a large DataFrame

Time: 2017-09-21 19:20:48

Tags: python-3.x pandas dataframe python-performance

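I'd like to improve my code to make it more Pythonic and to speed up the data processing. The current code works, but I'm sure it can be improved. The .csv file is 702 MB, so it takes about 7-10 minutes to get the final result: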

def delayed_vs_punctual(self, df):
    # Number of flights per carrier, indexed by carrier code
    filtered_for_carriers = df['UniqueCarrier']
    number_of_entries_each_carrier = filtered_for_carriers.value_counts()
    carriers = number_of_entries_each_carrier.index

    percent_delayed_all = []
    for carrier in carriers:
        total_number_of_carrier = number_of_entries_each_carrier[carrier]
        # Rows belonging to this carrier
        mask = df.loc[df['UniqueCarrier'] == carrier]

        # Count flights that arrived late (ArrDelay > 0)
        d = 0
        for index, row in mask.iterrows():
            ArrDelay = row['ArrDelay']
            if ArrDelay > 0:
                d += 1
        percent_delayed = d/total_number_of_carrier
        percent_delayed_all.append(percent_delayed)

    # Map each carrier code to its fraction of delayed flights
    percentage_delay_dict = dict(zip(carriers, percent_delayed_all))

    return percentage_delay_dict

I'm fairly sure looping is not the best way to do this. Anyway, here is some sample data:

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
2007,1,1,1,1918,1905,2043,2035,WN,462,N370,85,90,74,8,13,SMF,PDX,479,5,6,0,,0,0,0,0,0,0
2007,1,1,1,2206,2130,2334,2300,WN,1229,N685,88,90,73,34,36,SMF,PDX,479,6,9,0,,0,3,0,0,0,31
2007,1,1,1,1230,1200,1356,1330,WN,1355,N364,86,90,75,26,30,SMF,PDX,479,3,8,0,,0,23,0,0,0,3

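Basically, I filter and slice the DataFrame to group it by airline (UniqueCarrier). Then, within each of these smaller (though still fairly large) DataFrames, I check every row against a condition, here whether the flight was delayed. Finally, I compute the fraction of delayed flights out of that airline's total flights. The result is a dict: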

percent_delayed_all = {'YV': 0.42212989448366295, 'US': 0.53435287477314719, 'MQ': 0.46239551225360503, 'AA': 0.49731090766529357, \
    'FL': 0.43394297743949478, 'NW': 0.56168732479989192, 'HA': 0.25596795727636851, 'F9': 0.50444967266775775, \
    'WN': 0.41947657183726861, 'OH': 0.50945518784192445, 'OO': 0.46118130333410273, '9E': 0.41249599190267761, \
    'B6': 0.45879864194306608, 'UA': 0.47631438239027596, 'AS': 0.47851546649186877, 'CO': 0.45207967792146703, \
    'AQ': 0.27577653149266607, 'XE': 0.40724700015870352, 'EV': 0.52604861756464993, 'DL': 0.45727049795225355}

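In the dict, the keys are the airline codes and the values are the fraction of delayed flights, so FL would be 43%. Here, "delayed" means arriving more than 0 minutes after the scheduled arrival time (ArrDelay > 0).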

1 answer:

Answer 0: (score: 1)

df.ArrDelay.gt(0).groupby(df.UniqueCarrier).mean().to_dict()
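
To see why this one-liner can replace the nested loops, here is a minimal sketch on a tiny made-up dataset (the rows below are hypothetical, not taken from the real 702 MB file). `gt(0)` builds a boolean Series marking delayed flights, and the mean of a boolean group is the fraction of `True` values, i.e. the share of delayed flights per carrier. A missing `ArrDelay` compares as `False`, so it still counts toward the carrier's total, which appears to match what the loop in the question does.

```python
import io

import pandas as pd

# Hypothetical miniature dataset with only the two columns that matter here.
# The last AA row has a missing ArrDelay to show how NaN is treated.
csv_text = """UniqueCarrier,ArrDelay
WN,1
WN,8
WN,-3
AA,26
AA,0
AA,
"""
df = pd.read_csv(io.StringIO(csv_text))

# True where the flight arrived late; NaN > 0 evaluates to False.
is_delayed = df["ArrDelay"].gt(0)

# Mean of booleans per carrier = fraction of delayed flights per carrier.
fraction_delayed = is_delayed.groupby(df["UniqueCarrier"]).mean().to_dict()

print(fraction_delayed)
```

For the real file, it may also help to load only the two relevant columns, for example `pd.read_csv(path, usecols=["UniqueCarrier", "ArrDelay"])` (where `path` stands in for your file location), since parsing the CSV is likely a noticeable share of the runtime.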