Pandas with df[[]]

Time: 2019-10-10 06:21:27

Tags: python pandas

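I have a large csv file with more than 40 columns. I am trying to sort it using pandas and write only the selected columns to a new file. Here is my code:

Edit: I have assumed that everything up to the last step is correct, which may be wrong, so here is the whole file. I read 10 csv files, append them into one file, filter the rows so that they are unique in the way I need, and then I want to filter again, this time selecting only a few columns. I'm new to Python, so the code may look awful, and I suspect that is the problem.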

if __name__ == "__main__":
    files = ['airOT199701.csv', 'airOT199702.csv', 'airOT199703.csv', 'airOT199704.csv', 'airOT199705.csv', 'airOT199706.csv', 'airOT199707.csv', 'airOT199708.csv', 'airOT199709.csv', 'airOT199710.csv', 'airOT199711.csv', 'airOT199712.csv']
    with open('filterflights.csv', 'w') as outcsv:
        writer = csv.DictWriter(outcsv, fieldnames = ["YEAR","MONTH","DAY_OF_MONTH","DAY_OF_WEEK","FL_DATE","UNIQUE_CARRIER","TAIL_NUM","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR","DEST_AIRPORT_ID","DEST","DEST_STATE_ABR","CRS_DEP_TIME","DEP_TIME","DEP_DELAY","DEP_DELAY_NEW","DEP_DEL15","DEP_DELAY_GROUP","TAXI_OUT","WHEELS_OFF","WHEELS_ON","TAXI_IN","CRS_ARR_TIME","ARR_TIME","ARR_DELAY","ARR_DELAY_NEW","ARR_DEL15","ARR_DELAY_GROUP","CANCELLED","CANCELLATION_CODE","DIVERTED","CRS_ELAPSED_TIME","ACTUAL_ELAPSED_TIME","AIR_TIME","FLIGHTS","DISTANCE","DISTANCE_GROUP","CARRIER_DELAY","WEATHER_DELAY","NAS_DELAY","SECURITY_DELAY","LATE_AIRCRAFT_DELAY","DIFFERENCE"])
        writer.writeheader()
        filewriter = csv.writer(outcsv, delimiter=',')
        for i in range(len(files)):
            reader = csv.reader(open(files[i], 'r'), delimiter=',')
            next(reader, None)
            result = set()
            for r in reader:
                r.append(abs(int(r[8])-int(r[11]))%25)
                key = (r[7],r[8],r[11])
                if key not in result:
                    filewriter.writerow(r)
                    result.add(key)
    df = pd.read_csv('filterflights.csv')
    df.header(3)
    df = df[["FL_DATE","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR", "DEST_AIRPORT_ID","DEST","DEST_STATE_ABR", "DEP_TIME", "ARR_TIME", "DISTANCE", "DIFFERENCE"]]
    df.header(3)
    df.to_csv('filteredflights.csv', index=False)

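I get the error: AttributeError: 'DataFrame' object has no attribute 'header', on line 23. All the csv files are in the same folder as the Python file.

Possible problem: the original csv files do not have a DIFFERENCE column. Could that be causing this? I tried to append the value with r.append, but maybe it does not know what to append it to?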

1 Answer:

Answer 0: (score: 0)

You can use reindex to subset the DataFrame and keep the columns in the given order:

col_subset = ["FL_DATE","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR", "DEST_AIRPORT_ID","DEST","DEST_STATE_ABR", "DEP_TIME", "ARR_TIME", "DISTANCE", "DIFFERENCE"]
df = df.reindex(columns=col_subset)
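
Note that the AttributeError itself comes from df.header(3), not from the column selection: a DataFrame has no header method, and head(n) is presumably what was intended. The following is a minimal sketch on a small made-up frame (the rows and the shortened column list are assumptions for illustration, not the real flight data). It also shows how reindex differs from df[[...]] when a requested column such as DIFFERENCE is absent.

```python
import pandas as pd

# Hypothetical miniature stand-in for filterflights.csv; values are invented.
df = pd.DataFrame({
    "FL_DATE": ["1997-01-01", "1997-01-02"],
    "FL_NUM": [101, 202],
    "ORIGIN": ["JFK", "LAX"],
    "DEST": ["LAX", "SFO"],
})

# DataFrame has no .header attribute; .head(n) returns the first n rows.
preview = df.head(3)

# DIFFERENCE is deliberately missing from this sample frame.
col_subset = ["DEST", "FL_NUM", "DIFFERENCE"]

# reindex keeps the requested order and fills absent columns with NaN.
reindexed = df.reindex(columns=col_subset)

# df[[...]] raises KeyError when any requested label is missing.
try:
    df[col_subset]
    missing_raises = False
except KeyError:
    missing_raises = True
```

If every requested column exists, df[col_subset] and df.reindex(columns=col_subset) give the same result. They differ only in how a missing column is handled, so df[[...]] may be the safer choice when a missing DIFFERENCE column should be reported rather than silently filled with NaN.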