Merging DataFrames in a loop

Time: 2016-06-01 18:21:29

Tags: python-2.7 pandas

I have a folder containing a large number of CSV files that look like this:

csv1

        2006    Percent       Land_Use
    0     13   5.379564      Developed
    1      8  25.781580  Grass/Pasture
    2      4  54.265050           Crop
    3     15   0.363983          Water
    4     16   6.244104       Wetlands
    5      6   4.691764         Forest
    6      1   3.031494        Alfalfa
    7     11   0.137424      Shrubland
    8      5   0.003671          Vetch
    9      3   0.055412         Barren
    10     7   0.009531          Grass
    11    12   0.036423           Tree

csv2

   2007    Percent       Land_Use
0     13   2.742430      Developed
1      4  56.007242           Crop
2      8  24.227963  Grass/Pasture
3     16   8.839979       Wetlands
4      6   6.181062         Forest
5      1   1.446668        Alfalfa
6     15   0.366116          Water
7      3   0.127760         Barren
8     11   0.034426      Shrubland
9      7   0.000827          Grass
10    12   0.025528           Tree

csv3

    2008    Percent       Land_Use
0    13   1.863809      Developed
1     8  31.455578  Grass/Pasture
2     4  57.896856           Crop
3    16   2.693929       Wetlands
4     6   4.417966         Forest
5     1   1.239176        Alfalfa
6     7   0.130849          Grass
7    15   0.266536          Water
8    11   0.004571      Shrubland
9     3   0.030731         Barren

I want to merge them all into a single DataFrame on Land_Use.

I am reading the files like this:

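import os
import pandas as pd

# a raw string cannot end in a backslash, so the backslash is escaped instead
pth = 'G:\\'
for f in os.listdir(pth):
    df = pd.read_csv(os.path.join(pth, f))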

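But I can't figure out how to merge all the individual DataFrames. I worked out how to concatenate them, but that is not what I want. The kind of merge I want is an outer merge.

If I used the path to each CSV file, I would merge them like this, but I don't want to spell out the path for every file because there are so many of them: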

    one=pd.read_csv(r'G:\one.csv')
    two=pd.read_csv(r'G:\two.csv')
    three=pd.read_csv(r'G:\three.csv')
    merge=pd.merge(one,two, on=['Land_Use'], how='outer')
    mergetwo=pd.merge(merge,three,on=['Land_Use'], how='outer')

2 Answers:

Answer 0: (score: 2)

I think you can use this in Python 3:

import functools

dfs = [df1,df2,df3]

df = functools.reduce(lambda left,right: pd.merge(left,right,on='Land_Use',how='outer'),dfs)
print (df)
    2006  Percent_x       Land_Use  2007  Percent_y  2008    Percent
0     13   5.379564      Developed  13.0   2.742430  13.0   1.863809
1      8  25.781580  Grass/Pasture   8.0  24.227963   8.0  31.455578
2      4  54.265050           Crop   4.0  56.007242   4.0  57.896856
3     15   0.363983          Water  15.0   0.366116  15.0   0.266536
4     16   6.244104       Wetlands  16.0   8.839979  16.0   2.693929
5      6   4.691764         Forest   6.0   6.181062   6.0   4.417966
6      1   3.031494        Alfalfa   1.0   1.446668   1.0   1.239176
7     11   0.137424      Shrubland  11.0   0.034426  11.0   0.004571
8      5   0.003671          Vetch   NaN        NaN   NaN        NaN
9      3   0.055412         Barren   3.0   0.127760   3.0   0.030731
10     7   0.009531          Grass   7.0   0.000827   7.0   0.130849
11    12   0.036423           Tree  12.0   0.025528   NaN        NaN

In Python 2, reduce is a built-in, so no import is needed:

df = reduce(lambda left,right: pd.merge(left,right,on='Land_Use',how='outer'),dfs)
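A note on the printed result above: the 2007 and 2008 columns show 13.0 rather than 13, and Vetch has NaN there. That is how an outer merge behaves: it keeps every Land_Use key from both sides, fills the gaps with NaN, and an integer column that receives a NaN is converted to float. A minimal sketch with two made-up frames (the values and the names left, right and merged are illustrative, not the asker's data):

```python
import pandas as pd

# Two tiny frames shaped like the question's files (illustrative values only)
left = pd.DataFrame({"2006": [13, 5], "Land_Use": ["Developed", "Vetch"]})
right = pd.DataFrame({"2007": [13], "Land_Use": ["Developed"]})

merged = pd.merge(left, right, on="Land_Use", how="outer")

# Vetch exists only in `left`, so its 2007 value is missing (NaN),
# and that NaN forces the whole 2007 column from int to float.
print(merged)
print(merged.dtypes)
```

If the year codes need to stay integers, they would have to be converted back after filling or dropping the missing rows.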

A working solution using glob:

import pandas as pd
import functools
import glob

pth = 'a/*.csv'
files = glob.glob(pth)
dfs = [pd.read_csv(f, sep=';') for f in files]

df = functools.reduce(lambda left,right: pd.merge(left,right,on='Land_Use', how='outer'),dfs)
print (df)
    2006  Percent_x       Land_Use  2008  Percent_y  2007    Percent
0     13   5.379564      Developed  13.0   1.863809  13.0   2.742430
1      8  25.781580  Grass/Pasture   8.0  31.455578   8.0  24.227963
2      4  54.265050           Crop   4.0  57.896856   4.0  56.007242
3     15   0.363983          Water  15.0   0.266536  15.0   0.366116
4     16   6.244104       Wetlands  16.0   2.693929  16.0   8.839979
5      6   4.691764         Forest   6.0   4.417966   6.0   6.181062
6      1   3.031494        Alfalfa   1.0   1.239176   1.0   1.446668
7     11   0.137424      Shrubland  11.0   0.004571  11.0   0.034426
8      5   0.003671          Vetch   NaN        NaN   NaN        NaN
9      3   0.055412         Barren   3.0   0.030731   3.0   0.127760
10     7   0.009531          Grass   7.0   0.130849   7.0   0.000827
11    12   0.036423           Tree   NaN        NaN  12.0   0.025528
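Two things stand out in the result above. The 2008 columns come before 2007 because glob.glob does not guarantee any file order, and the Percent columns end up as Percent_x, Percent_y and Percent, which does not say which year they belong to. With four or more files the default suffixes would also start to repeat, which, as far as I know, recent pandas versions refuse. One possible way around both, as a sketch: sort the file list and rename Percent before merging. The helper tag_percent, the temporary folder and the sample values are all hypothetical, and the sample files are comma-separated, so the default separator is used instead of sep=';'.

```python
import functools
import glob
import os
import tempfile

import pandas as pd

# Write four small stand-in files to a temporary folder (hypothetical data)
folder = tempfile.mkdtemp()
for year in (2006, 2007, 2008, 2009):
    sample = pd.DataFrame({
        str(year): [13, 4],
        "Percent": [5.0, 54.0],
        "Land_Use": ["Developed", "Crop"],
    })
    sample.to_csv(os.path.join(folder, "{}.csv".format(year)), index=False)


def tag_percent(df):
    # Assumes the first column is named after the year, as in the question
    year = df.columns[0]
    return df.rename(columns={"Percent": "Percent_" + year})


# sorted() makes the file order, and so the column order, deterministic
files = sorted(glob.glob(os.path.join(folder, "*.csv")))
dfs = [tag_percent(pd.read_csv(f)) for f in files]

df = functools.reduce(
    lambda left, right: pd.merge(left, right, on="Land_Use", how="outer"), dfs
)
print(df.columns.tolist())
```

Because every Percent column now has a unique name, Land_Use is the only column the frames share and no suffixes are needed.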

Answer 1: (score: 1)

I am not allowed to comment, so I'm not sure exactly what you want. You can try one.merge(two, on=['Land_Use'], how='outer').merge(three,on=['Land_Use'], how='outer'). If you want something else, let me know.

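If you have many DataFrames, you can try the reduce function. First create a list containing all of them: dataframes = [one, two, three, four, ... , twenty]. You can build the list with a list comprehension, or append to it inside a loop. Then, if you want to combine them on Land_Use, you can use df_final = reduce(lambda left,right: pd.merge(left,right,on=['Land_Use'], how='outer'), dataframes)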
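Put together with the os.listdir loop from the question, the "append in a loop, then reduce" idea might look like the sketch below. The temporary folder, the file names and the sample values are stand-ins I made up so the snippet runs anywhere. In practice pth would point at the real folder.

```python
import os
import tempfile
from functools import reduce

import pandas as pd

# Stand-in folder with three small CSV files (hypothetical data)
pth = tempfile.mkdtemp()
samples = {
    "one.csv": pd.DataFrame({"2006": [13, 5], "Land_Use": ["Developed", "Vetch"]}),
    "two.csv": pd.DataFrame({"2007": [13, 4], "Land_Use": ["Developed", "Crop"]}),
    "three.csv": pd.DataFrame({"2008": [13], "Land_Use": ["Developed"]}),
}
for name, frame in samples.items():
    frame.to_csv(os.path.join(pth, name), index=False)

# Collect every frame in a list instead of overwriting a single df variable
dataframes = []
for f in sorted(os.listdir(pth)):
    if f.endswith(".csv"):  # skip anything that is not a CSV file
        dataframes.append(pd.read_csv(os.path.join(pth, f)))

df_final = reduce(
    lambda left, right: pd.merge(left, right, on=["Land_Use"], how="outer"),
    dataframes,
)
print(df_final)
```

The key change from the question's loop is that each frame is kept in the list. Assigning to df on every iteration only leaves the last file behind.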

Note: in Python 3+, the reduce function lives in the functools module.
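For anyone unsure what reduce is doing here: it folds the list from the left, merging the first two frames, then merging that result with the third, and so on. Importing it from functools works on Python 2.6+ as well as Python 3, so the same line covers both. A small sketch with made-up frames (outer_merge and the sample values are illustrative names and data), next to the equivalent explicit loop:

```python
from functools import reduce  # available in Python 2.6+ and Python 3

import pandas as pd

frames = [
    pd.DataFrame({"2006": [13, 4], "Land_Use": ["Developed", "Crop"]}),
    pd.DataFrame({"2007": [13, 4], "Land_Use": ["Developed", "Crop"]}),
    pd.DataFrame({"2008": [13], "Land_Use": ["Developed"]}),
]


def outer_merge(left, right):
    # One outer merge step on the shared Land_Use column
    return pd.merge(left, right, on="Land_Use", how="outer")


# reduce folds from the left: outer_merge(outer_merge(frames[0], frames[1]), frames[2])
with_reduce = reduce(outer_merge, frames)

# The same fold written as an explicit loop
with_loop = frames[0]
for frame in frames[1:]:
    with_loop = outer_merge(with_loop, frame)

print(with_reduce.equals(with_loop))
```

Either form gives the same DataFrame, so the choice is mostly a matter of taste.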