I want to merge two data frames and compute a third column that is derived from the past observations for a given ID.
Here is an example:
contracts_data = np.array([
[1, '2015-01-01', 15000],
[2, '2015-01-01', 1500],
[1, '2015-08-01', 16000],
[2, '2015-08-01', 1800],
[1, '2015-10-01', 17000],
[1, '2016-01-01', 18000],
[1, '2016-03-01', 20000]])
historique_data = np.array([[1, '2015-01-01'],
[2, '2015-01-01'],
[1, '2015-02-01'],
[2, '2015-02-01'],
[1, '2015-03-01'],
[2, '2015-03-01'],
[1, '2015-04-01'],
[2, '2015-04-01'],
[1, '2015-05-01'],
[2, '2015-05-01'],
[1, '2015-06-01'],
[2, '2015-06-01'],
[1, '2015-07-01'],
[2, '2015-07-01'],
[1, '2015-08-01'],
[2, '2015-08-01'],
[1, '2015-09-01'],
[2, '2015-09-01'],
[1, '2015-10-01'],
[2, '2015-10-01'],
[1, '2015-11-01'],
[2, '2015-11-01'],
[1, '2015-12-01'],
[2, '2015-12-01'],
[1, '2016-01-01'],
[2, '2016-01-01'],
[1, '2016-02-01'],
[2, '2016-02-01'],
[1, '2016-03-01'],
[2, '2016-03-01'],
[1, '2016-04-01'],
[2, '2016-04-01'],
[1, '2016-05-01'],
[2, '2016-05-01']])
historique_data_expected = np.array([[1, '2015-01-01', 15000],
[2, '2015-01-01', 1500],
[1, '2015-02-01', 15000],
[2, '2015-02-01', 1500],
[1, '2015-03-01', 15000],
[2, '2015-03-01', 1500],
[1, '2015-04-01', 15000],
[2, '2015-04-01', 1500],
[1, '2015-05-01', 15000],
[2, '2015-05-01', 1500],
[1, '2015-06-01', 15000],
[2, '2015-06-01', 1500],
[1, '2015-07-01', 15000],
[2, '2015-07-01', 1500],
[1, '2015-08-01', 15500],
[2, '2015-08-01', 1650],
[1, '2015-09-01', 15500],
[2, '2015-09-01', 1650],
[1, '2015-10-01', 16000],
[2, '2015-10-01', 1650],
[1, '2015-11-01', 16000],
[2, '2015-11-01', 1650],
[1, '2015-12-01', 16000],
[2, '2015-12-01', 1650],
[1, '2016-01-01', 16500],
[2, '2016-01-01', 1650],
[1, '2016-02-01', 16500],
[2, '2016-02-01', 1650],
[1, '2016-03-01', 17200],
[2, '2016-03-01', 1650],
[1, '2016-04-01', 17200],
[2, '2016-04-01', 1650],
[1, '2016-05-01', 17200],
[2, '2016-05-01', 1650]])
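I want to join the two datasets, and for the Salary column I want the mean of the past salaries for the same ID.
How can I do this with pandas and numpy, or even another framework?
Thanks in advance.
=====UPDATE====
I have added a simpler example here, with two data frames and the expected result: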
ID DATE SALARY
1 2015-01-01 1500
2 2015-01-01 1000
1 2015-03-01 1600
1 2015-04-01 1700
ID DATE
1 2015-01-01
2 2015-01-01
1 2015-02-01
2 2015-02-01
1 2015-03-01
2 2015-03-01
1 2015-04-01
2 2015-04-01
Expected result:
ID DATE SALARY
1 2015-01-01 1500
2 2015-01-01 1000
1 2015-02-01 1500
2 2015-02-01 1000
1 2015-03-01 1550
2 2015-03-01 1000
1 2015-04-01 1600
2 2015-04-01 1000
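In other words, I want to average the past salaries while merging the two datasets. For example, for ID 1 on 2015-03-01 the expected value is (1500 + 1600) / 2 = 1550.
Answer 0 (score: 1)
Consider using apply with a function that computes a conditional running average: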
from io import StringIO
import pandas as pd
import numpy as np
data = '''
ID DATE SALARY
1 2015-01-01 1500
2 2015-01-01 1000
1 2015-03-01 1600
1 2015-04-01 1700
'''
df1 = pd.read_table(StringIO(data), sep=r"\s+", parse_dates=[1])
data = '''
ID DATE
1 2015-01-01
2 2015-01-01
1 2015-02-01
2 2015-02-01
1 2015-03-01
2 2015-03-01
1 2015-04-01
2 2015-04-01
'''
df2 = pd.read_table(StringIO(data), sep=r"\s+", parse_dates=[1])
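# Outer merge keeps dates without a contract (SALARY is NaN there);
# sorting by DATE then ID gives a deterministic row order.
df = pd.merge(df1, df2, on=['ID', 'DATE'], how='outer')\
       .sort_values(['DATE', 'ID']).reset_index(drop=True)
# For each row, average the known salaries of the same ID up to that date
# (Series.mean() skips NaN).
df['AVGSALARY'] = df.apply(lambda x: df.loc[(df['ID'] == x['ID']) &
                                            (df['DATE'] <= x['DATE']), 'SALARY'].mean(), axis=1)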
print(df)
# ID DATE SALARY AVGSALARY
# 0 1.0 2015-01-01 1500.0 1500.0
# 1 2.0 2015-01-01 1000.0 1000.0
# 2 1.0 2015-02-01 NaN 1500.0
# 3 2.0 2015-02-01 NaN 1000.0
# 4 1.0 2015-03-01 1600.0 1550.0
# 5 2.0 2015-03-01 NaN 1000.0
# 6 1.0 2015-04-01 1700.0 1600.0
# 7 2.0 2015-04-01 NaN 1000.0
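The row-wise apply above rescans the whole frame for every row. A possible alternative, sketched here and not part of the original answer, is a per-ID expanding mean. The names `running_mean_salary`, `contracts`, and `history` are hypothetical, and the sketch assumes at most one contract per ID and date.

```python
import pandas as pd


def running_mean_salary(contracts: pd.DataFrame, history: pd.DataFrame) -> pd.DataFrame:
    """For each (ID, DATE) row, compute the mean of all salaries recorded
    for that ID on or before that date. (Hypothetical helper.)"""
    # Outer merge keeps every history row; months without a contract get NaN.
    merged = (
        pd.merge(history, contracts, on=["ID", "DATE"], how="outer")
        .sort_values(["ID", "DATE"])
        .reset_index(drop=True)
    )
    # Expanding mean within each ID; NaN salaries are skipped by the mean.
    merged["AVGSALARY"] = merged.groupby("ID")["SALARY"].transform(
        lambda s: s.expanding().mean()
    )
    # Restore the DATE-then-ID ordering used in the question.
    return merged.sort_values(["DATE", "ID"]).reset_index(drop=True)


# The simpler example from the question's update.
contracts = pd.DataFrame({
    "ID": [1, 2, 1, 1],
    "DATE": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-03-01", "2015-04-01"]),
    "SALARY": [1500, 1000, 1600, 1700],
})
history = pd.DataFrame({
    "ID": [1, 2, 1, 2, 1, 2, 1, 2],
    "DATE": pd.to_datetime([
        "2015-01-01", "2015-01-01", "2015-02-01", "2015-02-01",
        "2015-03-01", "2015-03-01", "2015-04-01", "2015-04-01",
    ]),
})

result = running_mean_salary(contracts, history)
print(result)
```

A row whose date precedes the first contract for its ID gets NaN in AVGSALARY, since there is no past salary to average. On larger inputs this should scale better than the apply version, because each group is traversed once.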