I have two DataFrames:
PROJECT1
   key  name    deadline   delivered
0  AA1   Tom  01/05/2018  02/05/2018
1  AA2   Sue  01/05/2018  30/04/2018
2  AA4  Jack  01/05/2018  04/05/2018
PROJECT2
   key name    deadline   delivered
0  AA1  Tom  01/05/2018  30/04/2018
1  AA2  Sue  01/05/2018  30/04/2018
2  AA3  Jim  01/05/2018  03/05/2018
Is it possible to create a column named 'In PROJECT1' in PROJECT2 and fill it according to the condition below?
Pseudocode:
for row in PROJECT2:
    if in the same row based on key column PROJECT1['delivered'] >= PROJECT2['deadline']:
        PROJECT2['In PROJECT1'] = 'project delivered before deadline'
    else:
        PROJECT2['In PROJECT1'] = 'Project delayed'
Expected result:
   key name    deadline   delivered                        In PROJECT1
0  AA1  Tom  01/05/2018  30/04/2018                    Project delayed
1  AA2  Sue  01/05/2018  30/04/2018  project delivered before deadline
2  AA3  Jim  01/05/2018  03/05/2018                                NaN
I'm not sure how to approach this (iterrows(), a for loop, df.loc[conditions], np.where(), or maybe I need to define some function to use in df.apply()). Any help is appreciated.
Answer 0 (score: 1)
You can use numpy.select to add a Series built from a list of conditions and a matching list of values.
Note that I believe your desired criteria are reversed: delivering before the deadline should give 'project delivered before deadline', not the other way around.
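The answer's own code is not preserved here, so the following is only a minimal sketch of how the numpy.select approach might look. It assumes the frames are named project1 and project2 (hypothetical names), that the dates are day-first strings, and that the intended rule is the corrected one: a PROJECT1 delivery on or before the PROJECT2 deadline counts as on time.

```python
import numpy as np
import pandas as pd

# Sample data from the question (frame names are assumptions).
project1 = pd.DataFrame({
    "key": ["AA1", "AA2", "AA4"],
    "name": ["Tom", "Sue", "Jack"],
    "deadline": ["01/05/2018"] * 3,
    "delivered": ["02/05/2018", "30/04/2018", "04/05/2018"],
})
project2 = pd.DataFrame({
    "key": ["AA1", "AA2", "AA3"],
    "name": ["Tom", "Sue", "Jim"],
    "deadline": ["01/05/2018"] * 3,
    "delivered": ["30/04/2018", "30/04/2018", "03/05/2018"],
})

# Look up PROJECT1's delivery date for each PROJECT2 key (NaT when the key is absent).
p1_delivered = project2["key"].map(
    pd.to_datetime(project1.set_index("key")["delivered"], format="%d/%m/%Y")
)
p2_deadline = pd.to_datetime(project2["deadline"], format="%d/%m/%Y")

# One condition per label; comparisons against NaT are False, so missing keys hit the default.
conditions = [p1_delivered <= p2_deadline, p1_delivered > p2_deadline]
choices = ["project delivered before deadline", "Project delayed"]
labels = np.select(conditions, choices, default="")

# Blank out rows whose key does not exist in PROJECT1.
project2["In PROJECT1"] = pd.Series(labels, index=project2.index).where(p1_delivered.notna())
print(project2)
```

Parsing the strings with pd.to_datetime matters here: comparing 'dd/mm/yyyy' values as plain strings orders them by day first, which does not match chronological order.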
Answer 1 (score: 0)
Here is an alternative approach that merges the two data sets. It avoids the need for any loop and is faster.
import numpy as np
import pandas as pd

## join the two data sets
# p1 = Project 1
# p2 = Project 2
p3 = p2.merge(p1.loc[:, ['key', 'delivered']], on='key', how='left', suffixes=['_p2', '_p1'])
# parse the day-first date strings so the comparison is chronological, not lexical
delivered_p1 = pd.to_datetime(p3['delivered_p1'], format='%d/%m/%Y')
deadline = pd.to_datetime(p3['deadline'], format='%d/%m/%Y')
# compare Project 1's delivery date against the deadline
p3['In PROJECT1'] = np.where(delivered_p1 <= deadline, 'project delivered before deadline', 'Project delayed')
# handle rows with a missing date (e.g. keys absent from Project 1)
p3.loc[delivered_p1.isna() | deadline.isna(), 'In PROJECT1'] = np.nan
## remove unwanted columns and rename
p3 = p3.drop(columns='delivered_p1').rename(columns={'delivered_p2': 'delivered'})
print(p3)
   key name    deadline   delivered                        In PROJECT1
0  AA1  Tom  01/05/2018  30/04/2018                    Project delayed
1  AA2  Sue  01/05/2018  30/04/2018  project delivered before deadline
2  AA3  Jim  01/05/2018  03/05/2018                                NaN