Remove rows from one DataFrame based on a date condition in another DataFrame

Time: 2019-02-16 03:46:57

Tags: python pandas

I have the following DataFrame, df1:

id        date_col      No. of leaves
100       2018-10-05      4
100       2018-10-14      4
100       2018-10-19      4
100       2018-11-15      4
101       2018-10-05      3
101       2018-10-08      3
101       2018-12-05      3

df2

id        date_col       leaves_availed
100       2018-11-28       2
100       2018-11-29       2
101       2018-11-19       2
101       2018-11-24       2

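For each row in df2, I want to take the rows in df1 that have the same id and an earlier date than that df2 row, delete the one with the earliest date, and subtract leaves_availed from "No. of leaves".

For the example above, the resulting DataFrame should be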

id        date_col      No. of leaves
100       2018-10-19      2
100       2018-11-15      2
101       2018-12-05      1

For the df2 row with id = 100 and date 2018-11-28, the df1 rows with an earlier date are

id        date_col      No. of leaves
100       2018-10-05      4
100       2018-10-14      4
100       2018-10-19      4
100       2018-11-15      4

The earliest date in that subset is 2018-10-05, so the row 100 2018-10-05 4 is deleted, and so on for the remaining df2 rows.

So far, I have sorted both DataFrames
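The filtering step described here can be sketched as a boolean mask followed by `idxmin`. This only illustrates a single df2 row; the names `target_id`, `target_date`, `subset`, `earliest_idx`, and `remaining` are assumptions made for the example and do not come from the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [100, 100, 100, 100, 101, 101, 101],
    "date_col": pd.to_datetime([
        "2018-10-05", "2018-10-14", "2018-10-19", "2018-11-15",
        "2018-10-05", "2018-10-08", "2018-12-05",
    ]),
    "No. of leaves": [4, 4, 4, 4, 3, 3, 3],
})

# One df2 row, hard-coded for illustration (hypothetical names)
target_id = 100
target_date = pd.Timestamp("2018-11-28")

# Rows of df1 with the same id and a strictly earlier date
subset = df1[(df1["id"] == target_id) & (df1["date_col"] < target_date)]

# Index label of the earliest date in that subset
earliest_idx = subset["date_col"].idxmin()

# df1 without that single row
remaining = df1.drop(index=earliest_idx)
```

Repeating this for every row of df2, each time against what is left of df1, would implement the rule as stated.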

df1.sort_values(by=['id','date_col'],inplace=True)
df2.sort_values(by=['id','date_col'],inplace=True)

and I am trying to delete the first few rows of df1 based on the number of rows in df2, but that has not gotten me anywhere.
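A count-based version of that attempt could look like the sketch below. It assumes every df2 row for an id removes exactly one of the earliest df1 rows for that id and it ignores the date comparison entirely, so it is only equivalent to the stated rule when every df2 date is later than the rows being dropped. The names `n_drop`, `position`, and `trimmed` are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [100, 100, 100, 100, 101, 101, 101],
    "date_col": pd.to_datetime([
        "2018-10-05", "2018-10-14", "2018-10-19", "2018-11-15",
        "2018-10-05", "2018-10-08", "2018-12-05",
    ]),
    "No. of leaves": [4, 4, 4, 4, 3, 3, 3],
})
df2 = pd.DataFrame({
    "id": [100, 100, 101, 101],
    "date_col": pd.to_datetime(
        ["2018-11-28", "2018-11-29", "2018-11-19", "2018-11-24"]
    ),
    "leaves_availed": [2, 2, 2, 2],
})

df1 = df1.sort_values(["id", "date_col"])

# Number of df2 rows per id, mapped onto df1 (0 for ids absent from df2)
n_drop = df1["id"].map(df2.groupby("id").size()).fillna(0)

# Position of each row within its id group (0 = earliest date)
position = df1.groupby("id").cumcount()

# Keep only the rows past the first n_drop rows of each id
trimmed = df1[position >= n_drop]
```

This only drops rows; the subtraction of leaves_availed would still have to be applied afterwards.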

1 Answer:

Answer 0: (score: 0)

This follows the logic described, but it has not been tested against every edge case.

import pandas as pd

# Recreate the two DataFrames from the question
df1 = pd.DataFrame({
    'id': [100, 100, 100, 100, 101, 101, 101],
    'date_col': pd.to_datetime(['2018-10-05', '2018-10-14', '2018-10-19', '2018-11-15',
                                '2018-10-05', '2018-10-08', '2018-12-05']),
    'No. of leaves': [4, 4, 4, 4, 3, 3, 3],
})
df2 = pd.DataFrame({
    'id': [100, 100, 101, 101],
    'date_col': pd.to_datetime(['2018-11-28', '2018-11-29', '2018-11-19', '2018-11-24']),
    'leaves_availed': [2, 2, 2, 2],
})
df1.sort_values(by=['id', 'date_col'], inplace=True)
df2.sort_values(by=['id', 'date_col'], inplace=True)

# Loop over each row of df2
for i in range(len(df2)):
    current = df2.iloc[i]
    # Keep the df1 rows with the same id and an earlier date
    df3 = df1[(df1['date_col'] < current['date_col']) & (df1['id'] == current['id'])]
    # Drop the oldest row; copy so the assignment below does not touch df1
    df3 = df3.iloc[1:].copy()
    # Subtract the leaves availed in the current df2 row (not always the first row)
    df3['No. of leaves'] = df3['No. of leaves'] - current['leaves_availed']
    print(f"result for date {current['date_col']} and id =  {current['id']}")
    print(df3)
    print('-----------------\n')

Final output displayed

result for date 2018-11-28 00:00:00 and id =  100
    id   date_col  No. of leaves
1  100 2018-10-14              2
2  100 2018-10-19              2
3  100 2018-11-15              2
-----------------
result for date 2018-11-29 00:00:00 and id =  100
    id   date_col  No. of leaves
1  100 2018-10-14              2
2  100 2018-10-19              2
3  100 2018-11-15              2
-----------------
result for date 2018-11-19 00:00:00 and id =  101
    id   date_col  No. of leaves
5  101 2018-10-08              1
-----------------
result for date 2018-11-24 00:00:00 and id =  101
    id   date_col  No. of leaves
5  101 2018-10-08              1
-----------------
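The loop above prints one filtered view per df2 row and always filters the original df1, so it does not yield the single result frame shown in the question. One possible way to get there is to carry the deletions forward from one df2 row to the next. This is a sketch under two assumptions that the question does not confirm: leaves_availed is the same for every df2 row of an id, and it is subtracted once per id. The names `result`, `earlier`, and `availed` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [100, 100, 100, 100, 101, 101, 101],
    "date_col": pd.to_datetime([
        "2018-10-05", "2018-10-14", "2018-10-19", "2018-11-15",
        "2018-10-05", "2018-10-08", "2018-12-05",
    ]),
    "No. of leaves": [4, 4, 4, 4, 3, 3, 3],
})
df2 = pd.DataFrame({
    "id": [100, 100, 101, 101],
    "date_col": pd.to_datetime(
        ["2018-11-28", "2018-11-29", "2018-11-19", "2018-11-24"]
    ),
    "leaves_availed": [2, 2, 2, 2],
})

result = df1.sort_values(["id", "date_col"]).reset_index(drop=True)

for row in df2.sort_values(["id", "date_col"]).itertuples(index=False):
    # Rows still present with the same id and a strictly earlier date
    earlier = result[(result["id"] == row.id) & (result["date_col"] < row.date_col)]
    if not earlier.empty:
        # Remove the earliest of those rows before handling the next df2 row
        result = result.drop(index=earlier["date_col"].idxmin())

# Assumption: one leaves_availed value per id, subtracted once
availed = df2.groupby("id")["leaves_availed"].first()
result["No. of leaves"] = (
    result["No. of leaves"] - result["id"].map(availed).fillna(0).astype(int)
)
result = result.reset_index(drop=True)
print(result)
```

On the sample data this leaves the three rows listed as the expected result in the question. If leaves_availed can differ between rows of the same id, the subtraction step would need a different rule.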