How can I compare these data sets from a CSV? Python 2.7

Date: 2017-12-13 16:34:38

Tags: python python-2.7 csv

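I have a project where I'm trying to write a program that takes a CSV data set from www.transtats.gov, which is a data set of US airline flights. My goal is to find the flight from one airport to another that has the worst delays overall, which makes it the "worst flight". So far I have this: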

import csv

with open('826766072_T_ONTIME.csv') as csv_infile:  # import and open CSV
    reader = csv.DictReader(csv_infile)
    total_delay = 0
    flight_count = 0
    flight_numbers = []
    delay_totals = []
    dest_list = []  # create empty list of destinations
    for row in reader:
        if row['ORIGIN'] == 'BOS':  # only take flights leaving BOS
            if row['FL_NUM'] not in flight_numbers:
                flight_numbers.append(row['FL_NUM'])
            if row['DEST'] not in dest_list:  # if the dest is not already in the list
                dest_list.append(row['DEST'])  # append the dest to dest_list
    for number in flight_numbers:
        for row in reader:
            if row['ORIGIN'] == 'BOS':  # for flights leaving BOS
                if row['FL_NUM'] == number:
                    if float(row['CANCELLED']) < 1:  # if the flight is not cancelled
                        if float(row['DEP_DELAY']) >= 0:  # and the delay is greater or equal to 0 (some flights had negative delay?)
                            total_delay += float(row['DEP_DELAY'])  # add time of delay to total delay
                            flight_count += 1  # add the flight to total flight count
    for row in reader:
        for number in flight_numbers:
            delay_totals.append(sum(row['DEP_DELAY']))

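My idea was to build a list of flight numbers and a list of the total delay for each of those flight numbers, then compare the two to see which flight has the highest total delay. What is the best way to compare the two lists?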

3 Answers:

Answer 0: (score: 2)

I'm not sure I understood you correctly, but I think you should use a dict for this, where the key is 'FL_NUM' and the value is the total delay.
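A minimal sketch of that idea, assuming the rows look like what csv.DictReader yields in the question. The `rows` list and its values are made up for illustration; in the real program they would come from the CSV file. A single dict replaces the two parallel lists, so there is nothing left to "compare".

```python
from collections import defaultdict

# Hypothetical sample rows, shaped like what csv.DictReader yields
rows = [
    {'ORIGIN': 'BOS', 'FL_NUM': '100', 'CANCELLED': '0.00', 'DEP_DELAY': '15.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '200', 'CANCELLED': '0.00', 'DEP_DELAY': '40.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '100', 'CANCELLED': '0.00', 'DEP_DELAY': '30.00'},
    {'ORIGIN': 'JFK', 'FL_NUM': '300', 'CANCELLED': '0.00', 'DEP_DELAY': '90.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '200', 'CANCELLED': '1.00', 'DEP_DELAY': ''},
]

total_delay = defaultdict(float)  # FL_NUM -> summed departure delay

for row in rows:
    if row['ORIGIN'] != 'BOS':
        continue  # only flights leaving BOS
    if float(row['CANCELLED']) >= 1:
        continue  # cancelled flights have no delay to count
    delay = float(row['DEP_DELAY'])
    if delay >= 0:  # ignore early departures, as the question does
        total_delay[row['FL_NUM']] += delay

# The key with the largest value is the flight with the most total delay
worst_flight = max(total_delay, key=total_delay.get)
print(worst_flight, total_delay[worst_flight])
```

Because everything is accumulated in one pass, the reader is never iterated a second time, which is where the nested loops in the question go wrong: a csv reader is exhausted after the first full loop.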

Answer 1: (score: 1)

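As a general rule, I try to eliminate loops in my Python code. For files that aren't large, I usually read the data file once and build some dicts that I can analyze at the end. The following code is untested, since I don't have the original data, but it follows the general pattern I would use.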

Since a flight is identified by its origin, destination, and flight number, I would capture those as a tuple and use it as the key in my dict.

import csv
from collections import defaultdict

# Maps (origin, dest, flight number) -> list of delays.
# Look up defaultdict if you aren't familiar with it.
flight_delays = defaultdict(list)

with open('826766072_T_ONTIME.csv') as csv_infile:
    reader = csv.DictReader(csv_infile)
    for row in reader:
        if row['ORIGIN'] == 'BOS':  # only take flights leaving BOS
            if float(row['CANCELLED']) < 1:  # skip cancelled flights
                flight = (row['ORIGIN'], row['DEST'], row['FL_NUM'])
                flight_delays[flight].append(float(row['DEP_DELAY']))

# Finished reading through data, now I want to calculate average delays
worst_flight = ""
worst_delay = 0
for flight, delays in flight_delays.items():
    average_delay = sum(delays) / len(delays)
    if average_delay > worst_delay:
        worst_flight = flight[0] + " to " + flight[1] + " on FL#" + flight[2]
        worst_delay = average_delay
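
As a possible variation, the final loop could be replaced by a call to the built-in max() with a key function. This is only a sketch: the `flight_delays` contents below are invented sample data in the same shape as the dict built above, and `average` is a helper name introduced here for illustration.

```python
# Hypothetical data in the same shape as flight_delays:
# (origin, destination, flight number) -> list of departure delays
flight_delays = {
    ('BOS', 'JFK', '100'): [15.0, 30.0],
    ('BOS', 'ORD', '200'): [40.0, 50.0, 60.0],
    ('BOS', 'LAX', '300'): [5.0],
}

def average(values):
    # float() keeps the division exact on Python 2.7 as well
    return sum(values) / float(len(values))

# Pick the key whose list of delays has the highest average
worst_flight = max(flight_delays, key=lambda flight: average(flight_delays[flight]))
worst_delay = average(flight_delays[worst_flight])
print("%s to %s on FL#%s: %.1f minutes on average" % (worst_flight + (worst_delay,)))
```

Two differences from the loop are worth knowing: max() raises ValueError if the dict is empty, and it will return a flight even when every average is zero or negative, whereas the loop leaves worst_flight as an empty string in that case.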

Answer 2: (score: 0)

A very simple solution. Add two new variables:

max_delay = 0
delay_flight = 0
# Replace the line "if float(row['DEP_DELAY']) >= 0:" with:
if float(row['DEP_DELAY']) > max_delay:
    max_delay = float(row['DEP_DELAY'])
    delay_flight = row['FL_NUM']  # save the flight number (or row number) for reference
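
Put together as a runnable sketch, the idea might look like the following. The `rows` list is made-up sample data standing in for the csv.DictReader rows from the question. Note that this tracks the single longest delay seen on any one row, not the total delay per flight number that the question asks about.

```python
# Hypothetical sample rows, shaped like what csv.DictReader yields
rows = [
    {'ORIGIN': 'BOS', 'FL_NUM': '100', 'CANCELLED': '0.00', 'DEP_DELAY': '15.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '200', 'CANCELLED': '0.00', 'DEP_DELAY': '75.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '100', 'CANCELLED': '0.00', 'DEP_DELAY': '-3.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '300', 'CANCELLED': '1.00', 'DEP_DELAY': ''},
]

max_delay = 0        # longest single delay seen so far
delay_flight = None  # flight number that had that delay

for row in rows:
    # Check CANCELLED first: cancelled rows have an empty DEP_DELAY
    if row['ORIGIN'] == 'BOS' and float(row['CANCELLED']) < 1:
        delay = float(row['DEP_DELAY'])
        if delay > max_delay:
            max_delay = delay
            delay_flight = row['FL_NUM']

print(delay_flight, max_delay)
```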