Merging DataFrames with different column names and aggregating column values

Time: 2017-01-17 12:57:10

Tags: python pandas numpy

Merging two DataFrames: I have two DataFrames that I need to merge under certain conditions, but I haven't figured out how to do it.

df1 : 

id            positive_action    date             volume  
id_1          user 1                  2016-12-12       19720.735
              user 2                  2016-12-12       14740.800

df2 :
id            negative_action        date             volume  
id_1          user 1                  2016-12-12       10.000
              user 3                  2016-12-12       10.000     

I want : 

id            action        date             volume  
id_1          user 1         2016-12-12       19730.735
              user 2         2016-12-12       14740.800   
              user 3         2016-12-12       10.000 

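Here:

  1. volume is summed across the two DataFrames
  2. the merge is on id, date, and action (positive_action and negative_action combined into a single column)

How can I achieve this?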

3 answers:

Answer 0 (score: 3)

You can also concatenate your DataFrames after renaming the positive_action and negative_action columns to action, and then do a groupby.

df1.rename(columns={'positive_action':'action'}, inplace=True)
df2.rename(columns={'negative_action':'action'}, inplace=True)
pd.concat([df1, df2]).groupby(['id', 'action', 'date']).sum().reset_index()


     id  action        date     volume
0  id_1  user 1  2016-12-12  19730.735
1  id_1  user 2  2016-12-12  14740.800
2  id_1  user 3  2016-12-12     10.000
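
For reference, here is a self-contained sketch of the same idea. It assumes id is an ordinary column (the sample frames are rebuilt from the question, not taken from the asker's real data) and calls rename without inplace, so df1 and df2 are left unchanged.

```python
import pandas as pd

# Sample data rebuilt from the question; `id` is assumed to be a regular column.
df1 = pd.DataFrame({
    "id": ["id_1", "id_1"],
    "positive_action": ["user 1", "user 2"],
    "date": ["2016-12-12", "2016-12-12"],
    "volume": [19720.735, 14740.800],
})
df2 = pd.DataFrame({
    "id": ["id_1", "id_1"],
    "negative_action": ["user 1", "user 3"],
    "date": ["2016-12-12", "2016-12-12"],
    "volume": [10.0, 10.0],
})

# rename() returns new frames here, so df1 and df2 keep their original columns.
stacked = pd.concat(
    [
        df1.rename(columns={"positive_action": "action"}),
        df2.rename(columns={"negative_action": "action"}),
    ],
    ignore_index=True,
)

# Rows sharing (id, action, date) collapse into one row with the summed volume.
result = stacked.groupby(["id", "action", "date"], as_index=False)["volume"].sum()
print(result)
```

Passing as_index=False keeps id, action and date as columns, which plays the same role as the reset_index call above.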

Answer 1 (score: 2)

This should work:

# Not sure what indexing you are using, so let's remove it
# to get on the same page, so to speak ;).
df1 = df1.reset_index()
df2 = df2.reset_index()

# do an outer merge to allow mismatches on the actions.
df = df1.merge(
    df2, left_on=['id', 'positive_action', 'date'],
    right_on=['id', 'negative_action', 'date'],
    how='outer',
)


# fill the missing actions from one with the other.
# (Will only happen when one is missing due to the way we merged.)
df['action'] = df['positive_action'].fillna(df['negative_action'])

# drop the old actions (the `columns=` keyword is used because
# pandas 2.0 no longer accepts a positional axis argument)
df = df.drop(columns=['positive_action', 'negative_action'])

# aggregate the volumes (I'm assuming you mean a simple sum)
df['volume'] = df['volume_x'].fillna(0) + df['volume_y'].fillna(0)

# drop the old volumes
df = df.drop(columns=['volume_x', 'volume_y'])

print(df)

The output is:

     id        date     volume  action
0  id_1  2016-12-12  19730.735  user_1
1  id_1  2016-12-12  14740.800  user_2
2  id_1  2016-12-12     10.000  user_3

You can then restore whatever index I may have removed.
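
As a rough sketch of that last step, assuming id was the index of both frames (the question does not confirm this), the merge can be run end to end and the index put back with set_index. The sample data and the suffixes _pos and _neg are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Hypothetical input: the question's data with `id` assumed to be the index.
df1 = pd.DataFrame({
    "id": ["id_1", "id_1"],
    "positive_action": ["user 1", "user 2"],
    "date": ["2016-12-12", "2016-12-12"],
    "volume": [19720.735, 14740.800],
}).set_index("id")
df2 = pd.DataFrame({
    "id": ["id_1", "id_1"],
    "negative_action": ["user 1", "user 3"],
    "date": ["2016-12-12", "2016-12-12"],
    "volume": [10.0, 10.0],
}).set_index("id")

# Outer merge on plain columns so unmatched actions from either side survive.
merged = df1.reset_index().merge(
    df2.reset_index(),
    left_on=["id", "positive_action", "date"],
    right_on=["id", "negative_action", "date"],
    how="outer",
    suffixes=("_pos", "_neg"),
)

# Take whichever action is present and treat a missing volume as 0.
merged["action"] = merged["positive_action"].fillna(merged["negative_action"])
merged["volume"] = merged["volume_pos"].fillna(0) + merged["volume_neg"].fillna(0)

# Keep the final columns and restore `id` as the index.
result = merged[["id", "action", "date", "volume"]].set_index("id")
print(result)
```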

Answer 2 (score: 2)

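  • set_index on the columns you want to "merge" on
  • rename_axis, because when we add, pandas will complain if the index levels are not named consistently
  • use pd.Series.add with the argument fill_value=0
  • rename_axis again with the desired names
  • reset_index, and you're in business
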
v1 = df1.set_index(['positive_action', 'date']).volume.rename_axis([None, None])
v2 = df2.set_index(['negative_action', 'date']).volume.rename_axis([None, None])
v1.add(v2, fill_value=0).rename_axis(['action', 'date']).reset_index()

   action       date     volume
0  user 1 2016-12-12  19730.735
1  user 2 2016-12-12  14740.800
2  user 3 2016-12-12     10.000
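
To see why fill_value=0 matters in the steps above, here is a small sketch on two toy Series (single-level index, values taken from the question). With the plain + operator, a label present on only one side yields NaN; with add and fill_value=0, the missing side counts as zero.

```python
import pandas as pd

# Toy Series indexed by action only; the values come from the question.
v1 = pd.Series([19720.735, 14740.800], index=["user 1", "user 2"])
v2 = pd.Series([10.0, 10.0], index=["user 1", "user 3"])

# Plain addition aligns on the index; labels missing on either side become NaN.
plain = v1 + v2

# With fill_value=0 the missing side is treated as 0 before adding.
filled = v1.add(v2, fill_value=0)

print(plain)
print(filled)
```

Here plain keeps a real value only for user 1, the one label that appears in both Series, while filled has a value for all three users.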

Setup

import pandas as pd

df1 = pd.DataFrame(dict(
    positive_action=['user 1', 'user 2'],
    date=pd.to_datetime(['2016-12-12', '2016-12-12']),
    volume=[19720.735, 14740.800]
))

df2 = pd.DataFrame(dict(
    negative_action=['user 1', 'user 3'],
    date=pd.to_datetime(['2016-12-12', '2016-12-12']),
    volume=[10, 10]
))