pd.merge: trying to merge DataFrames that have the same column names

Time: 2019-12-20 14:39:30

Tags: python pandas merge

I know this is a simple question, but I have been stuck on it for a long time. I have two DataFrames with thousands of rows each; here is a sample:

df1 = 

Name    Value    Date
x        0.04    2014-01-02
x        0.03    2014-01-03
x        0.02    2014-01-05
x        0.02    2014-01-07
(...)    (...)      (...)
y        0.002   2014-01-01
y        0.001   2014-01-02
y        0.003   2014-01-03
y        0.004   2014-01-07
(...)     (...)     (...)
z        0.003   2014-01-02
z        0.003   2014-01-05
z        0.004   2014-01-07
(...)     (...)      (...)

The other DataFrame:

df2 = 

  Name    Value    Date
    x        0.04    2015-01-02
    x        0.03    2015-01-03
    x        0.02    2015-01-05
    x        0.02    2015-01-07
    (...)    (...)      (...)
    y        0.002   2015-01-01
    y        0.001   2015-01-02
    y        0.003   2015-01-03
    y        0.004   2015-01-07
    (...)     (...)     (...)
    z        0.003   2015-01-02
    z        0.003   2015-01-05
    z        0.004   2015-01-07
    (...)     (...)      (...)

What I want:

df3=
   Name    Value    Date
    x        0.04    2014-01-02
    x        0.03    2014-01-03
    x        0.02    2014-01-05
    x        0.02    2014-01-07
    x        0.04    2015-01-02
    x        0.03    2015-01-03
    x        0.02    2015-01-05
    x        0.02    2015-01-07
    (...)    (...)      (...)
    y        0.002   2014-01-01
    y        0.001   2014-01-02
    y        0.003   2014-01-03
    y        0.004   2014-01-07
    y        0.002   2015-01-01
    y        0.001   2015-01-02
    y        0.003   2015-01-03
    y        0.004   2015-01-07
    (...)     (...)     (...)
    z        0.003   2014-01-02
    z        0.003   2014-01-05
    z        0.004   2014-01-07
    z        0.003   2015-01-02
    z        0.003   2015-01-05
    z        0.004   2015-01-07
    (...)     (...)      (...)

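1) When I merge, if a "Name" does not exist in the 2014 data, I want it to be absent from df3, and the same applies to the 2015 data.

In other words, I only want the "Name" values that have data in both DataFrames.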

What I have tried:

a = df1.merge(df2, how="inner") and also:

frames= [df1,df2]
df3= pd.concat([frames],axis=1)

But the output I get is:

df3 = 

Value_x     Date_x    Name    Value_y    Date_y 
  0.03    2014-01-02    x        0.04    2015-01-02
  0.02    2014-01-05    x        0.03    2015-01-03
  0.03    2014-01-06    x        0.02    2015-01-05
  0.03    2014-01-07    x        0.02    2015-01-07
  (...)     (...)     (...)      (...)     (...)    
   0.02   2014-01-03    y        0.002   2015-01-01
   0.01   2014-01-07    y        0.001   2015-01-02
   0.02   2014-01-06    y        0.003   2015-01-03
   00.2   2014-01-07    y        0.004   2015-01-07
  (...)     (...)     (...)      (...)     (...)
   0.03   2014-01-02   z        0.003   2015-01-02
   0.01   2014-01-04   z        0.003   2015-01-05
   0.03   2014-01-05   z        0.004   2015-01-07
  (...)      (...)     (...)     (...)   (...)

3 Answers:

Answer 0: (score: 0)

Use DataFrame.append. You can do:

#...

df = df1.append(df2, ignore_index=True)

# or more dfs list
df = df1.append([df2, df3], ignore_index=True)

For more information, see the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
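Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install the same row-wise stacking is usually written with pd.concat. Below is a minimal sketch that also keeps only the names present in both frames, as the question asks. The helper name stack_common_names and the toy data (including the extra names w and v) are made up for illustration.

```python
import pandas as pd


def stack_common_names(df1, df2, key="Name"):
    """Stack two frames row-wise, keeping only keys found in both (hypothetical helper)."""
    common = set(df1[key]) & set(df2[key])
    stacked = pd.concat(
        [df1[df1[key].isin(common)], df2[df2[key].isin(common)]],
        ignore_index=True,
    )
    # ISO-formatted date strings sort chronologically, so 2014 rows precede 2015 rows.
    return stacked.sort_values([key, "Date"]).reset_index(drop=True)


# Toy data: 'w' exists only in df1 and 'v' only in df2.
df1 = pd.DataFrame({
    "Name": ["x", "y", "w"],
    "Value": [0.04, 0.002, 0.5],
    "Date": ["2014-01-02", "2014-01-01", "2014-01-03"],
})
df2 = pd.DataFrame({
    "Name": ["x", "y", "v"],
    "Value": [0.04, 0.002, 0.7],
    "Date": ["2015-01-02", "2015-01-01", "2015-01-03"],
})

df3 = stack_common_names(df1, df2)
print(df3)
```

Here the rows for w and v are dropped because each appears in only one of the two frames.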

Answer 1: (score: 0)

Can you try:

df3 = pd.merge(df1, df2, left_on='Value', right_on='Value')
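For context, here is a small sketch of what a merge on 'Value' returns. pd.merge joins the two frames side by side, so the non-key columns that share a name get the default _x and _y suffixes instead of being stacked as rows. The two-row frames are made up for illustration.

```python
import pandas as pd

# Toy frames with the same column names as in the question.
df1 = pd.DataFrame({
    "Name": ["x", "y"],
    "Value": [0.04, 0.002],
    "Date": ["2014-01-02", "2014-01-01"],
})
df2 = pd.DataFrame({
    "Name": ["x", "y"],
    "Value": [0.04, 0.002],
    "Date": ["2015-01-02", "2015-01-01"],
})

merged = pd.merge(df1, df2, left_on="Value", right_on="Value")

# The key column appears once; the other shared columns are suffixed.
print(sorted(merged.columns))
print(len(merged))
```

The result has one row per matching 'Value' pair, with the 2014 and 2015 columns next to each other, which is the wide shape shown in the question rather than the desired df3.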

Answer 2: (score: 0)

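If I understand you correctly, you want to match the days of 2014 against the days of 2015. If a day is missing in either 2014 or 2015, that date should not appear in the resulting frame.

Note that in this example I added 2014-01-08 for the name z to df1. It does not appear in the final DataFrame, because 2015-01-08 does not exist for this name in df2: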

import pandas as pd

name_1 = ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'z']
value_1 = [0.04, 0.03, 0.02, 0.02, 0.002, 0.001, 0.003, 0.004, 0.003, 0.003, 0.004, 0.009]
date_1 = ['2014-01-02', '2014-01-03', '2014-01-05', '2014-01-07', '2014-01-01', '2014-01-02', '2014-01-03', '2014-01-07', '2014-01-02', '2014-01-05', '2014-01-07', '2014-01-08']

name_2 = ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z']
value_2 = [0.04, 0.03, 0.02, 0.02, 0.002, 0.001, 0.003, 0.004, 0.003, 0.003, 0.004]
date_2 = ['2015-01-02', '2015-01-03', '2015-01-05', '2015-01-07', '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-07', '2015-01-02', '2015-01-05', '2015-01-07']

df1 = pd.DataFrame({'Name':name_1, 'Value':value_1, 'Date': date_1})
df2 = pd.DataFrame({'Name':name_2, 'Value':value_2, 'Date': date_2})

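# Extract the month-day part of each date (e.g. '01-02') so rows can be matched across years.
df1['days'] = df1['Date'].str.split(r'\d{4}-(\d+-\d+)', expand=True)[1]
df2['days'] = df2['Date'].str.split(r'\d{4}-(\d+-\d+)', expand=True)[1]

# Inner merge: keeps only the (Name, days) pairs present in both frames.
df = pd.merge(df1, df2, on=['Name', 'days'])

# Select the rows of df1 and df2 that took part in the merge, then stack them.
# pd.concat is used here because DataFrame.append was removed in pandas 2.0.
keep_1 = df1.set_index(['Name', 'Date']).index.isin(df.set_index(['Name', 'Date_x']).index)
keep_2 = df2.set_index(['Name', 'Date']).index.isin(df.set_index(['Name', 'Date_y']).index)
df = pd.concat([df1[keep_1], df2[keep_2]]).sort_values(['Name', 'Date']).reset_index(drop=True)
del df['days']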

print(df)

Prints:

   Name  Value        Date
0     x  0.040  2014-01-02
1     x  0.030  2014-01-03
2     x  0.020  2014-01-05
3     x  0.020  2014-01-07
4     x  0.040  2015-01-02
5     x  0.030  2015-01-03
6     x  0.020  2015-01-05
7     x  0.020  2015-01-07
8     y  0.002  2014-01-01
9     y  0.001  2014-01-02
10    y  0.003  2014-01-03
11    y  0.004  2014-01-07
12    y  0.002  2015-01-01
13    y  0.001  2015-01-02
14    y  0.003  2015-01-03
15    y  0.004  2015-01-07
16    z  0.003  2014-01-02
17    z  0.003  2014-01-05
18    z  0.004  2014-01-07
19    z  0.003  2015-01-02
20    z  0.003  2015-01-05
21    z  0.004  2015-01-07
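As a variation on the same idea, the month-day key can be derived by parsing the dates rather than with a regular expression, and the two frames can be filtered against each other without an intermediate merge. This is only a sketch under the assumption that the dates are clean ISO strings; the helper name month_day and the reduced toy data are made up for illustration.

```python
import pandas as pd


def month_day(dates):
    """Return the 'MM-DD' part of each date so different years can be compared (hypothetical helper)."""
    return pd.to_datetime(dates).dt.strftime("%m-%d")


# Toy data: 2014-01-08 for 'z' has no 2015 counterpart.
df1 = pd.DataFrame({
    "Name": ["z", "z", "z"],
    "Value": [0.003, 0.004, 0.009],
    "Date": ["2014-01-05", "2014-01-07", "2014-01-08"],
})
df2 = pd.DataFrame({
    "Name": ["z", "z"],
    "Value": [0.003, 0.004],
    "Date": ["2015-01-05", "2015-01-07"],
})

# Build (Name, month-day) keys for both frames.
key1 = pd.MultiIndex.from_arrays([df1["Name"], month_day(df1["Date"])])
key2 = pd.MultiIndex.from_arrays([df2["Name"], month_day(df2["Date"])])

# Keep each frame's rows whose key also exists in the other frame, then stack.
result = (
    pd.concat([df1[key1.isin(key2)], df2[key2.isin(key1)]])
    .sort_values(["Name", "Date"])
    .reset_index(drop=True)
)
print(result)
```

Because no helper column is added to df1 or df2, there is nothing to delete afterwards, and the original frames are left unchanged.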