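I have a large csv file with more than 40 columns, and I am trying to use pandas
to process it and write only selected columns to a new file. Here is my code:
Edit: I had assumed that everything up to the last step was done correctly, which may be wrong, so here is the whole file. I read 10 csv files, append them into one file, filter the rows so that they are unique in the way I need, and then I want to filter again, this time selecting only a few columns. I am new to Python, so the code probably looks ugly, and I suspect that is part of the problem.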
import csv

import pandas as pd

if __name__ == "__main__":
    files = ['airOT199701.csv', 'airOT199702.csv', 'airOT199703.csv', 'airOT199704.csv', 'airOT199705.csv', 'airOT199706.csv', 'airOT199707.csv', 'airOT199708.csv', 'airOT199709.csv', 'airOT199710.csv', 'airOT199711.csv', 'airOT199712.csv']
    with open('filterflights.csv', 'w', newline='') as outcsv:
        # DictWriter is used only to write the header row, which includes the extra DIFFERENCE column.
        writer = csv.DictWriter(outcsv, fieldnames = ["YEAR","MONTH","DAY_OF_MONTH","DAY_OF_WEEK","FL_DATE","UNIQUE_CARRIER","TAIL_NUM","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR","DEST_AIRPORT_ID","DEST","DEST_STATE_ABR","CRS_DEP_TIME","DEP_TIME","DEP_DELAY","DEP_DELAY_NEW","DEP_DEL15","DEP_DELAY_GROUP","TAXI_OUT","WHEELS_OFF","WHEELS_ON","TAXI_IN","CRS_ARR_TIME","ARR_TIME","ARR_DELAY","ARR_DELAY_NEW","ARR_DEL15","ARR_DELAY_GROUP","CANCELLED","CANCELLATION_CODE","DIVERTED","CRS_ELAPSED_TIME","ACTUAL_ELAPSED_TIME","AIR_TIME","FLIGHTS","DISTANCE","DISTANCE_GROUP","CARRIER_DELAY","WEATHER_DELAY","NAS_DELAY","SECURITY_DELAY","LATE_AIRCRAFT_DELAY","DIFFERENCE"])
        writer.writeheader()
        filewriter = csv.writer(outcsv, delimiter=',')
        for name in files:
            with open(name, 'r', newline='') as incsv:
                reader = csv.reader(incsv, delimiter=',')
                next(reader, None)  # skip the header row of each input file
                result = set()  # keys already written for the current file
                for r in reader:
                    # DIFFERENCE value, computed from columns 8 and 11 and appended as the last column
                    r.append(abs(int(r[8])-int(r[11]))%25)
                    key = (r[7],r[8],r[11])
                    if key not in result:
                        filewriter.writerow(r)
                        result.add(key)
    # Read the combined file back only after it has been closed and flushed.
    df = pd.read_csv('filterflights.csv')
    df.header(3)  # this call raises the AttributeError quoted below
    df = df[["FL_DATE","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR", "DEST_AIRPORT_ID","DEST","DEST_STATE_ABR", "DEP_TIME", "ARR_TIME", "DISTANCE", "DIFFERENCE"]]
    df.header(3)
    df.to_csv('filteredflights.csv', index=False)
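I get the error: AttributeError: 'DataFrame' object has no attribute 'header', on line 23. All the csv files are in the same folder as the Python file.
Possible problem: the original csv files have no DIFFERENCE
column. Could that be causing this? I try to append the value with r.append, but maybe it does not know what to append it to?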
Answer 0 (score: 0)
You can use reindex to subset the DataFrame and keep the given column order:
col_subset = ["FL_DATE","FL_NUM","ORIGIN_AIRPORT_ID","ORIGIN","ORIGIN_STATE_ABR", "DEST_AIRPORT_ID","DEST","DEST_STATE_ABR", "DEP_TIME", "ARR_TIME", "DISTANCE", "DIFFERENCE"]
df = df.reindex(columns= col_subset)
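As a rough illustration of the two points in this thread, here is a minimal sketch on a made-up four-column frame (the data and the shortened column list are hypothetical, not taken from the flight files). It assumes a recent pandas version: the method for previewing rows is head, not header, and reindex fills a requested column that is absent with NaN, whereas bracket selection raises KeyError for it.

```python
import pandas as pd

# Hypothetical miniature of the flights table (values are made up).
df = pd.DataFrame({
    "FL_NUM": [1, 2],
    "ORIGIN": ["JFK", "LAX"],
    "DEST": ["LAX", "SFO"],
    "DISTANCE": [2475, 337],
})

# DataFrame has no `header` method; `head(n)` returns the first n rows.
preview = df.head(1)

# reindex keeps the requested column order and fills a column that is
# absent from the frame ("DIFFERENCE" here) with NaN instead of raising.
col_subset = ["DEST", "FL_NUM", "DIFFERENCE"]
subset = df.reindex(columns=col_subset)

# Plain bracket selection raises KeyError when a requested column is missing
# (behaviour of recent pandas versions; assumed here).
try:
    df[col_subset]
    bracket_failed = False
except KeyError:
    bracket_failed = True
```

Because reindex does not complain about a missing column, it is worth checking the result for all-NaN columns if DIFFERENCE is expected to come from the file.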