Question

I have a few dozen csv files with similar (but not always identical) headers. For example, one has:

Year Month Day Hour Minute Direct Diffuse D_Global D_IR Zenith Test_Site

And another has:

Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith Test_Site

(Note that the first lacks "U_Global" and "U_IR", and the second has "Diffuse2" instead of "Diffuse".)

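I know how to pass multiple csvs into my script, but how do I have each csv pass values only to the columns it actually has values for, perhaps passing "Nan" to all other columns in that row?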

Ideally, I'd have something like:

'Year','Month','Day','Hour','Minute','Direct','Diffuse','Diffuse2','D_Global','D_IR','U_Global','U_IR','Zenith','Test_Site'
1992,1,1,0,3,-999.00,-999.00,"Nan",-999.00,-999.00,"Nan","Nan",122.517,"BER"
2013,5,30,15,55,812.84,270.62,"Nan",1078.06,-999.00,"Nan","Nan",11.542,"BER"
2004,9,1,0,1,1.04,79.40,"Nan",78.67,303.58,61.06,310.95,85.142,"ALT"
2014,12,1,0,1,0.00,0.00,"Nan",-999.00,226.95,0.00,230.16,115.410,"ALT"

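Another caveat is that this dataframe needs to be appended to: it has to persist as multiple csv files are passed into it. I will probably have it write out to its own csv at the end (it will eventually go to NETCDF4).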

Answer 1

Assume you have the following CSV files:

test1.csv:

year,month,day,Direct 
1992,1,1,11
2013,5,30,11
2004,9,1,11

test2.csv:

year,month,day,Direct,Direct2
1992,1,1,21,201
2013,5,30,21,202
2004,9,1,21,203

test3.csv:

year,month,day,File3
1992,1,1,text1
2013,5,30,text2
2004,9,1,text3
2016,1,1,unmatching_date

Solution:

import glob
import pandas as pd

files = glob.glob(r'd:/temp/test*.csv')

def get_merged(files, **kwargs):
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df

print(get_merged(files))

Output:

   year  month  day  Direct   Direct  Direct2            File3
0  1992      1    1     11.0    21.0    201.0            text1
1  2013      5   30     11.0    21.0    202.0            text2
2  2004      9    1     11.0    21.0    203.0            text3
3  2016      1    1      NaN     NaN      NaN  unmatching_date
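The two separate "Direct" columns in this output appear to come from the trailing space after Direct in the header of test1.csv, which pandas keeps as part of the column name. A minimal sketch, using an in-memory copy of the sample data instead of the file on disk (the variable names are illustrative), shows how to inspect and strip such names:

```python
import io
import pandas as pd

# In-memory stand-in for test1.csv; note the trailing space after "Direct".
csv_text = "year,month,day,Direct \n1992,1,1,11\n2013,5,30,11\n"

df = pd.read_csv(io.StringIO(csv_text))
raw_columns = df.columns.tolist()
print(raw_columns)    # the last name still carries the trailing space

# Strip surrounding whitespace from every column name.
df.columns = df.columns.str.strip()
clean_columns = df.columns.tolist()
print(clean_columns)
```

Keep in mind that once the names match, merge() with no explicit keys would also use Direct as a join key, so rows whose Direct values differ between files would no longer line up on the same row.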

UPDATE: the usual idiomatic pd.concat(list_of_dfs) solution does not work here, because it joins on the index rather than on the shared columns:

In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[192]:
   Direct  Direct   Direct2            File3  day  month  year
0     NaN     11.0      NaN              NaN    1      1  1992
1     NaN     11.0      NaN              NaN   30      5  2013
2     NaN     11.0      NaN              NaN    1      9  2004
3    21.0      NaN    201.0              NaN    1      1  1992
4    21.0      NaN    202.0              NaN   30      5  2013
5    21.0      NaN    203.0              NaN    1      9  2004
6     NaN      NaN      NaN            text1    1      1  1992
7     NaN      NaN      NaN            text2   30      5  2013
8     NaN      NaN      NaN            text3    1      9  2004
9     NaN      NaN      NaN  unmatching_date    1      1  2016

In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[193]:
       0    1     2     3       4    5     6     7      8     9   10  11               12
0  1992.0  1.0   1.0  11.0  1992.0  1.0   1.0  21.0  201.0  1992   1   1            text1
1  2013.0  5.0  30.0  11.0  2013.0  5.0  30.0  21.0  202.0  2013   5  30            text2
2  2004.0  9.0   1.0  11.0  2004.0  9.0   1.0  21.0  203.0  2004   9   1            text3
3     NaN  NaN   NaN   NaN     NaN  NaN   NaN   NaN    NaN  2016   1   1  unmatching_date

or, using index_col=None explicitly:

In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[194]:
   Direct  Direct   Direct2            File3  day  month  year
0     NaN     11.0      NaN              NaN    1      1  1992
1     NaN     11.0      NaN              NaN   30      5  2013
2     NaN     11.0      NaN              NaN    1      9  2004
3    21.0      NaN    201.0              NaN    1      1  1992
4    21.0      NaN    202.0              NaN   30      5  2013
5    21.0      NaN    203.0              NaN    1      9  2004
6     NaN      NaN      NaN            text1    1      1  1992
7     NaN      NaN      NaN            text2   30      5  2013
8     NaN      NaN      NaN            text3    1      9  2004
9     NaN      NaN      NaN  unmatching_date    1      1  2016

In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[195]:
       0    1     2     3       4    5     6     7      8     9   10  11               12
0  1992.0  1.0   1.0  11.0  1992.0  1.0   1.0  21.0  201.0  1992   1   1            text1
1  2013.0  5.0  30.0  11.0  2013.0  5.0  30.0  21.0  202.0  2013   5  30            text2
2  2004.0  9.0   1.0  11.0  2004.0  9.0   1.0  21.0  203.0  2004   9   1            text3
3     NaN  NaN   NaN   NaN     NaN  NaN   NaN   NaN    NaN  2016   1   1  unmatching_date

The following more idiomatic solution works, but it changes the original order of the columns and of the rows/data:

In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')]
     ...:
     ...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs]))
     ...:
     ...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
     ...:
Out[224]:
   month  day  year  Direct   Direct  Direct2            File3
0      1    1  1992     11.0    21.0    201.0            text1
1      1    1  2016      NaN     NaN      NaN  unmatching_date
2      5   30  2013     11.0    21.0    202.0            text2
3      9    1  2004     11.0    21.0    203.0            text3

Answer 2

Can't pandas handle this automatically?

http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-using-append

Don't forget to add 'ignore_index=True' if your indexes overlap.
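As a rough sketch of this suggestion: DataFrame.append was removed in pandas 2.0, so the example below uses pd.concat instead, which matches columns by name and fills the gaps with NaN. The two small CSV strings are made-up stand-ins for the files in the question, not real data.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two CSV files with different headers.
csv_a = "Year,Month,Direct,Diffuse,Test_Site\n1992,1,-999.0,-999.0,BER\n"
csv_b = "Year,Month,Direct,Diffuse2,U_Global,Test_Site\n2004,9,1.04,79.4,61.06,ALT\n"

frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]

# Stack the rows; columns are matched by name and missing values become NaN.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)
```

Because the result is an ordinary DataFrame, more files can be added later by concatenating again, and combined.to_csv(...) can write it out at the end.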

Answer 3

First, run through all the files to build the combined list of headers:

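import os

csv_path = './csv_files'
csv_separator = ','

full_headers = []
for fn in os.listdir(csv_path):
    # os.listdir returns bare file names, so join them with the directory
    with open(os.path.join(csv_path, fn), 'r') as f:
        headers = f.readline().strip().split(csv_separator)
        # append only the headers not already seen in earlier files
        full_headers += [h for h in headers if h not in full_headers]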

Then write the header line into your output file, and run through all the files again to fill it.

You can use csv.DictReader(open('myfile.csv')) to match each value to its designated column by header name.
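The two passes described above could be sketched roughly as follows, using only the standard csv module. The function names, the "NaN" placeholder and the file paths are assumptions made for illustration; csv.DictWriter's restval argument supplies the placeholder for columns a given file does not have.

```python
import csv

def collect_headers(paths):
    """First pass: union of all header names, in first-seen order."""
    headers = []
    for path in paths:
        with open(path, newline="") as f:
            for name in next(csv.reader(f), []):
                if name not in headers:
                    headers.append(name)
    return headers

def merge_csvs(paths, out_path, missing="NaN"):
    """Second pass: write every row under the combined header."""
    headers = collect_headers(paths)
    with open(out_path, "w", newline="") as out:
        # restval fills in columns that a given input file does not have.
        writer = csv.DictWriter(out, fieldnames=headers, restval=missing)
        writer.writeheader()
        for path in paths:
            with open(path, newline="") as f:
                writer.writerows(csv.DictReader(f))
    return headers
```

A call such as merge_csvs(['a.csv', 'b.csv'], 'merged.csv') would then write the merged file. This sketch assumes every input file is well formed, with no row longer than its header.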

Reading multiple CSVs with different headers into one dataframe in Python
