Question

I need to merge multiple CSV files located in a folder into a single file.

My raw data looks like this:

y_1980.csv:

     country   y_1980
0        afg    196
1        ago    125
2        alb     23
3          .      .
.          .      .

y_1981.csv:

     country   y_1981
0        afg    192
1        ago    120
2        alb     0
3          .      .
.          .      .

y_20xx.csv:

     country   y_20xx
0        afg    176
1        ago    170
2        alb     76
3          .      .
.          .      .

What I expect to get is something like this:

     country   y_1980   y_1981   ...   y_20xx    
0        afg      196      192   ...      176
1        ago      125      120   ...      170
2        alb       23        0   ...       76
3          .        .        .   ...        .
.          .        .        .   ...        .

This is my code so far, but in the result each file's rows are simply appended after the previous file's rows:

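import glob

interesting_files = glob.glob("/Users/Desktop/Data/*.csv")

# Start as False so the header of the first file gets written once.
header_saved = False

# Text mode ('w'), because the lines read from the input files are strings.
with open('/Users/Desktop/Data/table.csv', 'w') as fout:
    for filename in interesting_files:

        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)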

Answer 1

Pandas makes this easy. By looping over the files and merging as you go, you can simply do:

Code:

import pandas as pd

files = ['file1', 'file2']
dfs = None
for filename in files:
    df = pd.read_csv(filename, sep=r'\s+')
    if dfs is None:
        dfs = df
    else:
        dfs = dfs.merge(df, how='outer')
    print(df)
print(dfs)
dfs.to_csv('file3', sep=' ')

Result:

  country  y_1980
0     afg     196
1     ago     125
2     alb      23

  country  y_1981
0     afg     192
1     ago     120
2     alb       0

  country  y_1980  y_1981
0     afg     196     192
1     ago     125     120
2     alb      23       0
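
The merge above uses how='outer', which only matters when the files do not list exactly the same countries. As a minimal sketch of that case, assuming two small in-memory tables (the names file_1980 and file_1981 and the extra 'and' row are made up for illustration, not taken from the question's data):

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for two whitespace-separated files.
file_1980 = io.StringIO("country y_1980\nafg 196\nago 125\nalb 23\n")
file_1981 = io.StringIO("country y_1981\nafg 192\nalb 0\nand 5\n")

df_1980 = pd.read_csv(file_1980, sep=r"\s+")
df_1981 = pd.read_csv(file_1981, sep=r"\s+")

# An outer merge keeps every country seen in either frame;
# a value that is missing on one side becomes NaN.
outer = df_1980.merge(df_1981, on="country", how="outer")
print(outer)
```

Here 'ago' has no y_1981 value and 'and' has no y_1980 value, so both rows are kept with NaN in the missing column. With how='inner' those two rows would be dropped instead.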

Answer 2

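The sequence of your code appears to be:

Open file #1
Write the header if it has not been saved yet
Write the data rows
Open file #2
...and so on

That concatenates all the data rows into one file. It sounds like you really want to join on the "country" column instead: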

import glob
import pandas as pd
csvs = glob.glob("*.csv")
dfs = []

for csv in csvs:
  dfs.append(pd.read_csv(csv))

merged_df = dfs[0]

for df in dfs[1:]:
  merged_df = pd.merge(merged_df,df,on=['country'])


merged_df.to_csv('out.csv',index=False)
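
The same loop can also be folded into a single expression. The following is only a sketch, assuming comma-separated files that all share a 'country' column; merge_csv_folder is a hypothetical helper name, and how='outer' is my choice so that countries missing from some files are kept rather than dropped:

```python
import glob
import os
from functools import reduce

import pandas as pd


def merge_csv_folder(folder, key="country"):
    """Merge every CSV in `folder` column-wise on `key` (hypothetical helper)."""
    # Sort the paths so the columns come out in a predictable order.
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(path) for path in paths]
    # Fold the list into one frame, merging one file at a time.
    # reduce raises TypeError if the folder contains no CSV files.
    return reduce(
        lambda left, right: pd.merge(left, right, on=key, how="outer"),
        frames,
    )
```

If you save the result, write it somewhere outside the input folder. Otherwise the next run will pick up the merged file as one more input.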

Answer 3

It is much easier if you use pandas. The reason is that it gets rid of the for-loop problem and keeps the memory footprint low.

import pandas as pd

# read the files first

y_1980 = pd.read_csv('y_1980.csv', sep='\t')
y_1981 = pd.read_csv('y_1981.csv', sep='\t')

You can change the sep option if the values are separated by a space (' ') or a comma (',').

# set 'country' as the index to use this value to merge.
y_1980 = y_1980.set_index('country', append=True)
y_1981 = y_1981.set_index('country', append=True)

print(y_1980)
print(y_1981)

            y_1980
    country        
  0 afg         196
  1 ago         125
  2 alb          23


             y_1981
    country        
  0 afg         192
  1 ago         120
  2 alb           0

# set the frames to merge. You can add as many DataFrames as you want.
frames = [y_1980, y_1981]

# now merge the DataFrames
merged_df = pd.concat(frames, axis=1).reset_index(level=['country'])
print(merged_df)

  country  y_1980  y_1981
0     afg     196     192
1     ago     125     120
2     alb      23       0

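Additional note: if you only want to keep the keys that appear in every frame, you can pass join='inner' to pd.concat (the equivalent option for merge is how='inner') and remove rows with missing values with dropna(). If you want to keep all possible data from every frame, use join='outer', which is the default for pd.concat.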

For details, see this link: http://pandas.pydata.org/pandas-docs/stable/merging.html
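
To make the inner versus outer choice concrete: pd.concat takes it through its join parameter. A minimal sketch, assuming two small frames built by hand (the data is illustrative, with 'ago' deliberately left out of the second frame, and the frames are indexed by 'country' alone for brevity):

```python
import pandas as pd

# Hypothetical frames indexed by country; 'ago' is missing from the second.
y_1980 = pd.DataFrame(
    {"country": ["afg", "ago", "alb"], "y_1980": [196, 125, 23]}
).set_index("country")
y_1981 = pd.DataFrame(
    {"country": ["afg", "alb"], "y_1981": [192, 0]}
).set_index("country")

# join='outer' (the default) keeps every country and fills gaps with NaN.
# rename_axis makes sure the index is labeled before it becomes a column again.
outer = (
    pd.concat([y_1980, y_1981], axis=1, join="outer")
    .rename_axis("country")
    .reset_index()
)

# join='inner' keeps only the countries present in every frame.
inner = (
    pd.concat([y_1980, y_1981], axis=1, join="inner")
    .rename_axis("country")
    .reset_index()
)

print(outer)
print(inner)
```

Calling outer.dropna() afterwards gives the same rows as the inner join in this example, since it drops the one row that has a missing value.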
