Compare 2 dataframes based on column values

Time: 2020-09-07 13:44:59

Tags: python pandas dataframe

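I have some data in df1 and df2. Based on the Interval column value in df1, I want to pick from df2 the specific Start/End row that matches the interval value in df1.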

df1:
ID     Interval
1      annual
2      quarterly
3      semiannual
df2:
ID  Start       End
1   AUG-FY21    JAN-FY22
1   AUG-FY21    OCT-FY21
1   AUG-FY21    JUL-FY22
2   AUG-FY21    JAN-FY22
2   AUG-FY21    OCT-FY21
3   AUG-FY21    JAN-FY22
3   AUG-FY21    OCT-FY21
3   AUG-FY21    JUL-FY22

output:
ID  Interval    Start       End
1   annual      AUG-FY21    JUL-FY22
2   quarterly   AUG-FY21    OCT-FY21
3   semiannual  AUG-FY21    JAN-FY22

2 answers:

Answer 0 (score: 0)

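A solution that uses pandas to merge the two dataframes after computing the difference in days between Start and End. The thresholds used to assign the interval labels are chosen arbitrarily.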

# reproduce the test case
import pandas as pd
data_1 = {'ID': [1, 2, 3],
          'Interval': ['annual', 'quarterly', 'semiannual']}
df1 = pd.DataFrame(data_1)
data_2 = {'ID': [1, 1, 1, 2, 2, 3, 3, 3],
          'Start': ['AUG-FY21', 'AUG-FY21', 'AUG-FY21', 'AUG-FY21', 'AUG-FY21', 'AUG-FY21', 'AUG-FY21', 'AUG-FY21'],
          'End': ['JAN-FY22', 'OCT-FY21', 'JUL-FY22', 'JAN-FY22', 'OCT-FY21', 'JAN-FY22', 'OCT-FY21', 'JUL-FY22']}
df2 = pd.DataFrame(data_2)

# compute the interval in days between Start and End
df2['Days_interval'] = (pd.to_datetime(df2.End.str.replace('-FY', ' 20')) - pd.to_datetime(df2.Start.str.replace('-FY', ' 20'))).abs().dt.days
df2['Interval'] = ''

# assign labels based on days interval
df2.loc[df2['Days_interval'] < 100, 'Interval'] = 'quarterly'
df2.loc[(df2['Days_interval'] >= 100) & (df2['Days_interval'] <= 300), 'Interval'] = 'semiannual'
df2.loc[df2['Days_interval'] > 300, 'Interval'] = 'annual'

# exclude helper columns
df2.drop('Days_interval', axis = 1, inplace = True)

# merge both dfs by ID and interval
output = pd.merge(df1, df2, how='inner', on = ['ID', 'Interval'])
# exclude helper columns from original df
df2.drop('Interval', axis = 1, inplace = True)

output
    ID  Interval    Start       End
0   1   annual      AUG-FY21    JUL-FY22
1   2   quarterly   AUG-FY21    OCT-FY21
2   3   semiannual  AUG-FY21    JAN-FY22
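
The threshold-based labeling in this answer could also be written with `pd.cut`, which bins the day counts in one step. This is only a minimal sketch, not part of the original answer: the names `days`, `labels` and `binned` are illustrative, the bin edges reuse the same arbitrary cut-offs (below 100, 100 to 300, above 300), and it assumes `FY21` can be read as the calendar year 2021, as the answer does.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'Interval': ['annual', 'quarterly', 'semiannual']})
df2 = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3, 3],
                    'Start': ['AUG-FY21'] * 8,
                    'End': ['JAN-FY22', 'OCT-FY21', 'JUL-FY22', 'JAN-FY22',
                            'OCT-FY21', 'JAN-FY22', 'OCT-FY21', 'JUL-FY22']})

# parse labels such as 'AUG-FY21' (assumption: FY21 is treated as calendar year 2021)
start = pd.to_datetime(df2['Start'], format='%b-FY%y')
end = pd.to_datetime(df2['End'], format='%b-FY%y')
days = (end - start).dt.days

# right-closed bins (-1, 99], (99, 300], (300, inf] reproduce the answer's
# arbitrary thresholds for whole-day counts
labels = pd.cut(days, bins=[-1, 99, 300, float('inf')],
                labels=['quarterly', 'semiannual', 'annual'])

# attach the labels without modifying df2, then keep only the rows matching df1
binned = df1.merge(df2.assign(Interval=labels.astype(str)),
                   on=['ID', 'Interval'], how='inner')
print(binned)
```

Because `assign` returns a copy, `df2` keeps its original columns and no helper columns need to be dropped afterwards.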

Answer 1 (score: 0)

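You can convert the Start and End columns to dates, count the number of months between them, and then use a dictionary to replace each month count with the desired label. Merge while the columns are still datetimes, then convert them back to strings.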

import pandas as pd

offsets = {11: 'annual',
           2: 'quarterly',
           5: 'semiannual'}

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Interval': ['annual', 'quarterly', 'semiannual']})

df2 = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3, 3],
 'Start': ['AUG-FY21','AUG-FY21','AUG-FY21','AUG-FY21','AUG-FY21','AUG-FY21','AUG-FY21','AUG-FY21'],
 'End': ['JAN-FY22','OCT-FY21','JUL-FY22','JAN-FY22','OCT-FY21','JAN-FY22','OCT-FY21','JUL-FY22']})


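# parse the 'AUG-FY21' style labels into datetimes
df2['Start'] = pd.to_datetime(df2['Start'], format='%b-FY%y')
df2['End'] = pd.to_datetime(df2['End'], format='%b-FY%y')
# count the month-ends between Start and End (pandas >= 2.2 prefers freq='ME' over 'M')
df2['Interval'] = df2.apply(lambda x: len(pd.date_range(start=x['Start'], end=x['End'], freq='M')), axis=1)

# replace the month counts with the interval labels
df2['Interval'] = df2['Interval'].replace(offsets)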

output = df1.merge(df2, on=['ID','Interval'], how='left')

output['Start'] = output['Start'].dt.strftime(date_format='%b-FY%y').str.upper()
output['End'] = output['End'].dt.strftime(date_format='%b-FY%y').str.upper()

Output

    ID    Interval     Start         End
0    1      annual  AUG-FY21    JUL-FY22
1    2   quarterly  AUG-FY21    OCT-FY21
2    3  semiannual  AUG-FY21    JAN-FY22
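
The month count in this answer could also be computed arithmetically from the year and month components, which avoids the row-wise `apply` and never overwrites the original Start and End strings. This is a hedged sketch rather than the answer's own method: `months`, `labelled` and `result` are illustrative names, and the mapping reuses the answer's `offsets` dictionary.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'Interval': ['annual', 'quarterly', 'semiannual']})
df2 = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3, 3],
                    'Start': ['AUG-FY21'] * 8,
                    'End': ['JAN-FY22', 'OCT-FY21', 'JUL-FY22', 'JAN-FY22',
                            'OCT-FY21', 'JAN-FY22', 'OCT-FY21', 'JUL-FY22']})

# parse into temporary Series; df2 itself keeps its string columns
start = pd.to_datetime(df2['Start'], format='%b-FY%y')
end = pd.to_datetime(df2['End'], format='%b-FY%y')

# whole months from Start to End, vectorized over the entire column
months = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)

# same month-count-to-label mapping as in the answer; unmapped counts become NaN
offsets = {11: 'annual', 2: 'quarterly', 5: 'semiannual'}
labelled = df2.assign(Interval=months.map(offsets))

result = df1.merge(labelled, on=['ID', 'Interval'], how='left')
print(result)
```

Since Start and End are never converted in place, no `strftime` step is needed to restore the original formatting.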