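I have dozens of CSV files with similar (but not always identical) headers. For example, one has:
Year Month Day Hour Minute Direct Diffuse D_Global D_IR Zenith Test_Site
Another has:
Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith Test_Site
(Note that one lacks "U_Global" and "U_IR", and the other has "Diffuse2" instead of "Diffuse".)
I know how to pass multiple CSVs into my script, but how do I have each CSV pass values only to the columns it actually has values for, perhaps putting "Nan" in all the other columns of that row?
Ideally I would end up with something like: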
'Year','Month','Day','Hour','Minute','Direct','Diffuse','Diffuse2','D_Global','D_IR','U_Global','U_IR','Zenith','Test_Site'
1992,1,1,0,3,-999.00,-999.00,"Nan",-999.00,-999.00,"Nan","Nan",122.517,"BER"
2013,5,30,15,55,812.84,270.62,"Nan",1078.06,-999.00,"Nan","Nan",11.542,"BER"
2004,9,1,0,1,1.04,79.40,"Nan",78.67,303.58,61.06,310.95,85.142,"ALT"
2014,12,1,0,1,0.00,0.00,"Nan",-999.00,226.95,0.00,230.16,115.410,"ALT"
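Another caveat is that this DataFrame needs to be appended to. It has to persist as multiple CSV files are passed into it. I think I will probably have it write out to its own CSV at the end (it will eventually go to NETCDF4).
Answer 0 (score: 4)
Assuming you have the following CSV files: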
test1.csv:
year,month,day,Direct
1992,1,1,11
2013,5,30,11
2004,9,1,11
test2.csv:
year,month,day,Direct,Direct2
1992,1,1,21,201
2013,5,30,21,202
2004,9,1,21,203
test3.csv:
year,month,day,File3
1992,1,1,text1
2013,5,30,text2
2004,9,1,text3
2016,1,1,unmatching_date
Solution:
import glob
import pandas as pd

files = glob.glob(r'd:/temp/test*.csv')

def get_merged(files, **kwargs):
    # Read the first file, then outer-merge each remaining file into it.
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df

print(get_merged(files))
Output:
year month day Direct Direct Direct2 File3
0 1992 1 1 11.0 21.0 201.0 text1
1 2013 5 30 11.0 21.0 202.0 text2
2 2004 9 1 11.0 21.0 203.0 text3
3 2016 1 1 NaN NaN NaN unmatching_date
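One thing to keep in mind with this approach: when no `on` argument is given, `DataFrame.merge` joins on every column name the two frames have in common, so a shared data column such as `Direct` becomes part of the join key too. The following is only a minimal sketch of naming the keys explicitly. It assumes that `year`, `month` and `day` are the intended keys; the in-memory CSV strings and the suffix names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for test1.csv and test2.csv.
csv1 = io.StringIO("year,month,day,Direct\n1992,1,1,11\n2013,5,30,11\n")
csv2 = io.StringIO("year,month,day,Direct,Direct2\n1992,1,1,21,201\n2013,5,30,21,202\n")

KEYS = ["year", "month", "day"]  # assumed key columns

left = pd.read_csv(csv1)
right = pd.read_csv(csv2)

# Join on the date columns only; the clashing "Direct" column is kept
# from both files and told apart by the (illustrative) suffixes.
merged = left.merge(right, how="outer", on=KEYS, suffixes=("_file1", "_file2"))
print(merged)
```

With the keys spelled out, rows that share a date line up even when a data column of the same name holds different values in the two files.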
UPDATE: the usual idiomatic pd.concat(list_of_dfs) solution will not work here, because it joins by index:
In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[192]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[193]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
or using index_col=None explicitly:
In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[194]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[195]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
The following more idiomatic solution works, but it changes the original order of the columns and of the rows / data:
In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')]
...:
...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs]))
...:
...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
...:
Out[224]:
month day year Direct Direct Direct2 File3
0 1 1 1992 11.0 21.0 201.0 text1
1 1 1 2016 NaN NaN NaN unmatching_date
2 5 30 2013 11.0 21.0 202.0 text2
3 9 1 2004 11.0 21.0 203.0 text3
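The key columns come out reordered because `common_cols` is built from a `set`, which has no defined order. A possible variation, shown here only as a sketch, derives the common columns from the first frame's column order instead. The in-memory CSV strings are hypothetical stand-ins for the test files, and the row order is still left to pandas.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for test1.csv and test3.csv.
raw = [
    "year,month,day,Direct\n1992,1,1,11\n2013,5,30,11\n",
    "year,month,day,File3\n1992,1,1,text1\n2013,5,30,text2\n2016,1,1,unmatching_date\n",
]
dfs = [pd.read_csv(io.StringIO(text)) for text in raw]

# Walk the first frame's columns in order and keep those present in
# every other frame, instead of intersecting unordered sets.
common_cols = [c for c in dfs[0].columns if all(c in df.columns for df in dfs[1:])]

# Same concat-on-index idea as above, now with ordered key columns.
wide = pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
print(wide)
```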
Answer 1 (score: 3)
Can't pandas handle this automatically?
http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-using-append
Don't forget to add 'ignore_index=True' if your indices overlap.
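As a rough illustration of this suggestion, the sketch below stacks two frames row-wise with `pd.concat`, which fills columns missing from a file with NaN. `pd.concat` is used because `DataFrame.append` was removed in pandas 2.0. The two CSV strings are shortened, made-up versions of the files in the question, and the output goes to an in-memory buffer rather than a real file.

```python
import io

import pandas as pd

# Shortened, hypothetical versions of the two header layouts in the question.
csv_a = "Year,Month,Day,Direct,Diffuse,Test_Site\n1992,1,1,-999.00,-999.00,BER\n"
csv_b = "Year,Month,Day,Direct,Diffuse2,U_Global,Test_Site\n2004,9,1,1.04,79.40,61.06,ALT\n"

frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]

# Stack row-wise: the result has the union of all columns, and cells
# for columns a file does not have are filled with NaN.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)

# Write the accumulated frame out once at the end, spelling NaN explicitly.
out = io.StringIO()
combined.to_csv(out, index=False, na_rep="NaN")
csv_text = out.getvalue()
```

Collecting the frames in a list and concatenating once at the end is generally cheaper than growing a DataFrame file by file.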
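Answer 2 (score: 1)
First, run through all the files to build the full set of headers:
import os

csv_path = './csv_files'
csv_separator = ','
full_headers = []
for fn in os.listdir(csv_path):
    # os.listdir returns bare file names, so join them with the directory.
    with open(os.path.join(csv_path, fn), 'r') as f:
        headers = f.readline().strip().split(csv_separator)
    # Append only the headers not seen yet, keeping first-seen order.
    full_headers += [h for h in headers if h not in full_headers]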
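Then write the header line to the output file, and run through all the files again to fill it.
You can use csv.DictReader(open('myfile.csv')) to simply match the headers to their designated columns.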
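The two passes described above might look roughly like the following sketch, which uses only the standard `csv` module. The function name `merge_csv_files`, the `missing` parameter and the demo file names are hypothetical; `restval` is the `csv.DictWriter` argument that supplies a value for fields a row does not have.

```python
import csv
import os
import tempfile

def merge_csv_files(paths, out_path, missing="NaN"):
    """Hypothetical helper: write all rows under the union of all headers."""
    # Pass 1: collect the union of headers in first-seen order.
    fieldnames = []
    for path in paths:
        with open(path, newline="") as f:
            for name in next(csv.reader(f), []):
                if name not in fieldnames:
                    fieldnames.append(name)
    # Pass 2: copy every row; restval fills the columns a file lacks.
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, restval=missing)
        writer.writeheader()
        for path in paths:
            with open(path, newline="") as f:
                writer.writerows(csv.DictReader(f))
    return fieldnames

# Small demo with two made-up files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, "a.csv")
path_b = os.path.join(tmp_dir, "b.csv")
with open(path_a, "w", newline="") as f:
    f.write("year,Direct\n1992,11\n")
with open(path_b, "w", newline="") as f:
    f.write("year,Direct2\n2004,203\n")

out_path = os.path.join(tmp_dir, "merged.csv")
merge_csv_files([path_a, path_b], out_path)
with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Because rows are streamed from reader to writer, this variant never holds more than one row in memory, which can matter when the files are large.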