Looping over and merging DataFrames that share the same index and columns (plus a few columns unique to each DataFrame)

Time: 2020-01-26 11:26:09

Tags: pandas loops dataframe merge concat

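Description of the task

I use the code below to merge df and df1 (sample data shown), and it does what I need. However, I need to loop over a large number of DataFrames (df2 here, but there will also be df3, df4 and so on) and I am not sure how to modify the code. My DataFrames share the same index and the same columns, but each one also has a few columns of its own. I would like to change the code so that it loops through df and df1, merges them to create requireddata, and then merges df2 with requireddata. Merging df3 with requireddata would follow the same logic, and so on. Any help would be great!! :)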

Example data

df

        ID   AA  TA  TL
Date
2001  AAPL  1.0  44  50
2002  AAPL  3.0  33  51
2003  AAPL  2.0  22  53
2004  AAPL  5.0  11  76
2005  AAPL  2.0  33  44
2006  AAPL  3.0  22  12

df1

        ID   AA  TA  ML
Date
2001  MSFT  3.5  44  12
2002  MSFT  6.7  33  15
2003  MSFT  2.3  22  19
2004  MSFT  5.5  11  20
2005  MSFT  2.2  33  43
2006  MSFT  3.2  22  23

df2

        ID   AA  TA  PP
Date
2001  TSLA  3.3  48  18
2002  TSLA  6.3  38  18
2003  TSLA  2.6  28  18
2004  TSLA  5.3  18  28
2005  TSLA  2.3  38  48
2006  TSLA  3.3  28  28

Code used:

import pandas as pd

# df and df1 are the DataFrames shown above, already indexed by Date

cols_to_use = df1.columns.difference(df.columns)   # columns in df1 but not in df
cols_to_use1 = df.columns.difference(df1.columns)  # columns in df but not in df1

dataframe = pd.DataFrame(columns=cols_to_use, index=df.index)     # empty columns that df is missing
dataframe1 = pd.DataFrame(columns=cols_to_use1, index=df1.index)  # empty columns that df1 is missing

datatesting = pd.concat([dataframe, df], axis=1)     # add the missing columns to df
datatesting1 = pd.concat([dataframe1, df1], axis=1)  # add the missing columns to df1

diff = datatesting1.columns.difference(datatesting.columns)  # check the difference (should be empty)
print(diff)
frames = [datatesting, datatesting1]  # list of DataFrames
requireddata = pd.concat(frames)      # stack the DataFrames on top of each other

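This creates the output below for df and df1. With a loop, I would like the same result extended to df2, df3 and so on: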

        ID   AA  TA   TL   ML
Date
2001  AAPL  1.0  44   50  NaN
2002  AAPL  3.0  33   51  NaN
2003  AAPL  2.0  22   53  NaN
2004  AAPL  5.0  11   76  NaN
2005  AAPL  2.0  33   44  NaN
2006  AAPL  3.0  22   12  NaN
2001  MSFT  3.5  44  NaN   12
2002  MSFT  6.7  33  NaN   15
2003  MSFT  2.3  22  NaN   19
2004  MSFT  5.5  11  NaN   20
2005  MSFT  2.2  33  NaN   43
2006  MSFT  3.2  22  NaN   23

1 answer:

Answer 0 (score: 1)

I don't think you need to compute the column differences here. concat on its own aligns the columns correctly:

df = pd.concat([df,df1,df2], sort=False)
print (df)
        ID   AA  TA    TL    ML    PP
Date                                 
2001  AAPL  1.0  44  50.0   NaN   NaN
2002  AAPL  3.0  33  51.0   NaN   NaN
2003  AAPL  2.0  22  53.0   NaN   NaN
2004  AAPL  5.0  11  76.0   NaN   NaN
2005  AAPL  2.0  33  44.0   NaN   NaN
2006  AAPL  3.0  22  12.0   NaN   NaN
2001  MSFT  3.5  44   NaN  12.0   NaN
2002  MSFT  6.7  33   NaN  15.0   NaN
2003  MSFT  2.3  22   NaN  19.0   NaN
2004  MSFT  5.5  11   NaN  20.0   NaN
2005  MSFT  2.2  33   NaN  43.0   NaN
2006  MSFT  3.2  22   NaN  23.0   NaN
2001  TSLA  3.3  48   NaN   NaN  18.0
2002  TSLA  6.3  38   NaN   NaN  18.0
2003  TSLA  2.6  28   NaN   NaN  18.0
2004  TSLA  5.3  18   NaN   NaN  28.0
2005  TSLA  2.3  38   NaN   NaN  48.0
2006  TSLA  3.3  28   NaN   NaN  28.0
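If the DataFrames are produced one at a time in a loop, the same idea might look like the sketch below: collect each frame in a list and call concat once at the end. The `make_frame` helper and the way the frames are generated are assumptions made only to keep the example self-contained; in practice each frame would probably come from a file or a query.

```python
import pandas as pd


def make_frame(ticker, unique_col, aa, ta, unique_vals):
    """Build one sample DataFrame indexed by Date (hypothetical helper for this sketch)."""
    idx = pd.Index(range(2001, 2007), name="Date")
    return pd.DataFrame(
        {"ID": ticker, "AA": aa, "TA": ta, unique_col: unique_vals}, index=idx
    )


# Sample data copied from the question.
df = make_frame("AAPL", "TL", [1.0, 3.0, 2.0, 5.0, 2.0, 3.0],
                [44, 33, 22, 11, 33, 22], [50, 51, 53, 76, 44, 12])
df1 = make_frame("MSFT", "ML", [3.5, 6.7, 2.3, 5.5, 2.2, 3.2],
                 [44, 33, 22, 11, 33, 22], [12, 15, 19, 20, 43, 23])
df2 = make_frame("TSLA", "PP", [3.3, 6.3, 2.6, 5.3, 2.3, 3.3],
                 [48, 38, 28, 18, 38, 28], [18, 18, 18, 28, 48, 28])

# Collect the frames in a loop instead of merging inside the loop.
frames = []
for frame in (df, df1, df2):
    frames.append(frame)

# A single concat aligns columns by name and fills the missing ones with NaN.
requireddata = pd.concat(frames, sort=False)
print(requireddata)
```

Concatenating once after the loop is usually preferable to calling concat on every iteration, because each call copies the data accumulated so far.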