Apply a value in a column based on a condition while cross-evaluating 2 datasets

Time: 2018-07-15 13:49:33

Tags: python pandas numpy dataframe

I have 2 dataframes:

PROJECT1

  key   name   deadline     delivered
0 AA1   Tom    01/05/2018   02/05/2018
1 AA2   Sue    01/05/2018   30/04/2018
2 AA4   Jack   01/05/2018   04/05/2018

PROJECT2

  key   name   deadline     delivered
0 AA1   Tom    01/05/2018   30/04/2018
1 AA2   Sue    01/05/2018   30/04/2018
2 AA3   Jim    01/05/2018   03/05/2018

Is it possible to create a column named 'In PROJECT1' in PROJECT2 and apply a condition like the one below?

Pseudocode

for row in PROJECT2: 
    if in the same row based on key column PROJECT1['delivered'] >= PROJECT2['deadline']:
        PROJECT2['In PROJECT1'] = 'project delivered before deadline'
    else: 
        PROJECT2['In PROJECT1'] = 'Project delayed'

Expected result

  key   name   deadline     delivered    In PROJECT1
0 AA1   Tom    01/05/2018   30/04/2018   Project delayed
1 AA2   Sue    01/05/2018   30/04/2018   project delivered before deadline
2 AA3   Jim    01/05/2018   03/05/2018   NaN

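I'm not sure how to approach this (iterrows(), a for loop, df.loc[conditions], np.where(), or maybe I need to define some kind of function to use in df.apply()). Any help is appreciated.

2 Answers:

Answer 0: (score: 1)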

You can use numpy.select to add a series from a list of conditions and a list of values.

Note that I believe your expected criteria are reversed: delivering before the deadline should give 'project delivered before deadline', not the other way around.
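A minimal sketch of how numpy.select could be applied here, not necessarily the code this answer originally contained. The frame names `p1` and `p2`, the `map` lookup on `key`, and the placeholder default `''` are assumptions for illustration. The dates are parsed first so that they compare chronologically rather than as dd/mm/yyyy strings, and the comparison follows the corrected reading of the criteria (delivered on or before the deadline counts as on time).

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question; the names p1/p2 are assumptions.
p1 = pd.DataFrame({
    'key': ['AA1', 'AA2', 'AA4'],
    'name': ['Tom', 'Sue', 'Jack'],
    'deadline': ['01/05/2018'] * 3,
    'delivered': ['02/05/2018', '30/04/2018', '04/05/2018'],
})
p2 = pd.DataFrame({
    'key': ['AA1', 'AA2', 'AA3'],
    'name': ['Tom', 'Sue', 'Jim'],
    'deadline': ['01/05/2018'] * 3,
    'delivered': ['30/04/2018', '30/04/2018', '03/05/2018'],
})

# Look up PROJECT1's delivery date for each PROJECT2 key (NaT when the key is absent).
p1_delivered = pd.to_datetime(
    p2['key'].map(p1.set_index('key')['delivered']), format='%d/%m/%Y'
)
p2_deadline = pd.to_datetime(p2['deadline'], format='%d/%m/%Y')

# One label per condition; rows matching neither condition get the placeholder default.
conditions = [p1_delivered <= p2_deadline, p1_delivered > p2_deadline]
choices = ['project delivered before deadline', 'Project delayed']
labels = np.select(conditions, choices, default='')

# Keys missing from PROJECT1 become NaN instead of the placeholder.
p2['In PROJECT1'] = pd.Series(labels, index=p2.index).where(p1_delivered.notna())

print(p2)
```

Both choices are kept as strings and the missing rows are masked afterwards, because mixing a NaN choice with string choices inside numpy.select can fail or silently produce the string 'nan' depending on the NumPy version.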

Answer 1: (score: 0)

Here is an alternative that merges the two data sets. It avoids any need for a loop and is faster.

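import numpy as np
import pandas as pd

## join the two data sets
#  p1 = Project 1
#  p2 = Project 2
p3 = p2.merge(p1.loc[:, ['key', 'delivered']], on='key', how='left', suffixes=('_p2', '_p1'))

# parse the dd/mm/yyyy strings so the comparison is chronological, not lexicographic
deadline = pd.to_datetime(p3['deadline'], format='%d/%m/%Y')
delivered_p1 = pd.to_datetime(p3['delivered_p1'], format='%d/%m/%Y')

# Project 1 delivery on or before the Project 2 deadline counts as on time
p3['In PROJECT1'] = np.where(delivered_p1 <= deadline, 'project delivered before deadline', 'Project delayed')

# handle cases with NA (key not present in Project 1); .loc avoids chained assignment
p3.loc[delivered_p1.isna(), 'In PROJECT1'] = np.nan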

## remove unwanted columns and rename
p3.drop('delivered_p1', axis=1, inplace=True)
p3.rename(columns={'delivered_p2':'delivered'}, inplace=True)

print(p3)

   key name    deadline   delivered                        In PROJECT1
0  AA1  Tom  01/05/2018  30/04/2018                    Project delayed
1  AA2  Sue  01/05/2018  30/04/2018  project delivered before deadline
2  AA3  Jim  01/05/2018  03/05/2018                                NaN