I have two dataframes containing depth-interval data and attributes
df1
ID Top Bottom Value_1
A 0 2 CC
A 2 8 DD
A 10 15 EE
B 3 20 FF
df2
ID Top Bottom Value_2
A 0 4 XX
A 4 6 YY
A 8 20 ZZ
B 0 10 NN
B 10 50 MM
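I want to use pandas to combine them into a new dataframe that, for each ID, is split into the smallest intervals, and to create a column with the combined value (taken from the smaller source interval). Like the following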
df_combine
ID Top Bottom Value_1 Value_2 Value_combined
A 0 2 CC XX CC
A 2 4 DD XX XX
A 4 6 DD YY YY
A 6 8 DD - DD
A 8 10 - ZZ ZZ
A 10 15 EE ZZ EE
A 15 20 - ZZ ZZ
B 0 3 - NN NN
B 3 10 FF NN NN
B 10 20 FF MM FF
B 20 50 - MM MM
Answer 0 (score: 0)
I suggest the following steps:
import pandas as pd
import numpy as np
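The steps below assume that df1 and df2 already exist. A minimal sketch of how the two frames from the question could be built (the values are copied from the question; integer depths are an assumption that the later steps rely on):

```python
import pandas as pd

# Input frames copied from the question; depths are assumed to be integers.
df1 = pd.DataFrame({
    "ID": ["A", "A", "A", "B"],
    "Top": [0, 2, 10, 3],
    "Bottom": [2, 8, 15, 20],
    "Value_1": ["CC", "DD", "EE", "FF"],
})
df2 = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B"],
    "Top": [0, 4, 8, 0, 10],
    "Bottom": [4, 6, 20, 10, 50],
    "Value_2": ["XX", "YY", "ZZ", "NN", "MM"],
})
```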
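1/ Explode each dataframe (repeat each row Bottom - Top times, so that every row covers one unit of depth)
def explode_df(df, value):
    # Width of the source interval plus 1; used in step 3 to pick the smaller interval
    df['interval'] = df['Bottom'] - df['Top'] + 1
    # Repeat each row (Bottom - Top) times
    df = pd.DataFrame(df.values.repeat(df.Bottom - df.Top, axis=0), columns=df.columns)
    # Shift Top by 0, 1, 2, ... within each original interval
    df['Top'] = df['Top'] + df.groupby(['ID','Top','Bottom']).cumcount()
    return df[['ID', 'Top', value, 'interval']]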
df1_expl = explode_df(df1, 'Value_1')
df2_expl = explode_df(df2, 'Value_2')
df1_expl.head(10)
[Out]:
ID Top Value_1 interval
0 A 0 CC 3
1 A 1 CC 3
2 A 2 DD 7
3 A 3 DD 7
4 A 4 DD 7
5 A 5 DD 7
6 A 6 DD 7
7 A 7 DD 7
8 A 10 EE 6
9 A 11 EE 6
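Rebuilding the frame from df.values turns every column into object dtype. If that matters, the same explode step can probably be written with Index.repeat, which keeps the numeric dtypes. This is only a sketch: explode_intervals is a hypothetical name, not part of the steps above, and it still assumes integer depths with Bottom > Top.

```python
import pandas as pd

def explode_intervals(df, value):
    # Hypothetical helper: one output row per unit depth step, dtypes preserved.
    out = df.reset_index(drop=True)  # fresh 0..n-1 index, returns a copy
    out["interval"] = out["Bottom"] - out["Top"] + 1
    # Repeat each index label (Bottom - Top) times and select those rows.
    out = out.loc[out.index.repeat(out["Bottom"] - out["Top"])]
    # Position inside each repeated block: 0, 1, 2, ...
    offset = out.groupby(level=0).cumcount().to_numpy()
    out["Top"] = out["Top"].to_numpy() + offset
    return out[["ID", "Top", value, "interval"]].reset_index(drop=True)

df1 = pd.DataFrame({
    "ID": ["A", "A", "A", "B"],
    "Top": [0, 2, 10, 3],
    "Bottom": [2, 8, 15, 20],
    "Value_1": ["CC", "DD", "EE", "FF"],
})
df1_expl = explode_intervals(df1, "Value_1")
print(df1_expl.head(10))
```

Unlike explode_df, this variant does not add the interval column to the frame that is passed in.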
2/ Merge df1_expl and df2_expl on ID and Top
dfcomb_expl = pd.merge(df1_expl, df2_expl, on=['ID','Top'], how='outer')\
.fillna(0).sort_values(['ID','Top']).reset_index(drop=True)
dfcomb_expl.head(10)
[Out]:
ID Top Value_1 interval_x Value_2 interval_y
0 A 0.0 CC 3 XX 5
1 A 1.0 CC 3 XX 5
2 A 2.0 DD 7 XX 5
3 A 3.0 DD 7 XX 5
4 A 4.0 DD 7 YY 3
5 A 5.0 DD 7 YY 3
6 A 6.0 DD 7 0 0
7 A 7.0 DD 7 0 0
8 A 8.0 0 0 ZZ 13
9 A 9.0 0 0 ZZ 13
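3/ Aggregate
a) Create a column equal to 1 for each row to keep (the first row of each run where ID, Value_1 and Value_2 stay the same)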
dfcomb_expl['keep'] = ((dfcomb_expl['ID'] != dfcomb_expl['ID'].shift()) \
| (dfcomb_expl['Value_1'] != dfcomb_expl['Value_1'].shift()) \
| (dfcomb_expl['Value_2'] != dfcomb_expl['Value_2'].shift()))\
.astype(int)
b) Filter the rows to keep
dfcomb = dfcomb_expl[dfcomb_expl['keep']==1].reset_index()
c) Compute Bottom and Value_combined, using the interval sizes computed in step 1
# One run length per kept row; .values avoids index alignment against dfcomb_expl
dfcomb['Bottom'] = dfcomb['Top'] + dfcomb_expl.groupby(dfcomb_expl['keep'].cumsum()).size().values
dfcomb['Value_combined'] = np.where(dfcomb['interval_x'] == 0, dfcomb['Value_2'], \
np.where(dfcomb['interval_y'] == 0, dfcomb['Value_1'], \
np.where(dfcomb['interval_x'] < dfcomb['interval_y'], dfcomb['Value_1'], dfcomb['Value_2'])))
df_combine = dfcomb[['ID','Top','Bottom','Value_1','Value_2','Value_combined']]
df_combine
[Out]:
ID Top Bottom Value_1 Value_2 Value_combined
0 A 0.0 2.0 CC XX CC
1 A 2.0 4.0 DD XX XX
2 A 4.0 6.0 DD YY YY
3 A 6.0 8.0 DD 0 DD
4 A 8.0 10.0 0 ZZ ZZ
5 A 10.0 15.0 EE ZZ EE
6 A 15.0 20.0 0 ZZ ZZ
7 B 0.0 3.0 0 NN NN
8 B 3.0 10.0 FF NN NN
9 B 10.0 20.0 FF MM FF
10 B 20.0 50.0 0 MM MM
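Step 3 relies on a common pandas pattern: compare a column with its shifted copy to flag where a run starts, then take the cumulative sum of the flags to label the runs. A toy illustration on a made-up Series (not the real data):

```python
import pandas as pd

# Toy column: consecutive equal values form one "run".
s = pd.Series(["CC", "CC", "DD", "DD", "DD", "EE"])

# True on the first row of each run (the first row is compared with NaN).
starts = s != s.shift()

# Cumulative sum of the flags labels the runs: 1, 1, 2, 2, 2, 3.
run_id = starts.cumsum()

# One length per run, in run order.
run_len = s.groupby(run_id).size()
print(run_len.tolist())  # [2, 3, 1]
```

Note that size() returns one value per run, whereas transform('count') returns one value per original row, so only the former lines up with the filtered "keep" rows.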
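4/ Remarks: this solution does not handle special cases, namely rows where "Top" > "Bottom", and cases where, for a given ID, two rows in the same dataframe (df1 or df2) cover the same range.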
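Because the steps above explode every interval into unit steps, they also assume integer depths. A different technique that may be worth trying is to cut each ID at every boundary found in either frame and look up which source row covers each piece. The sketch below is not the method described above; combine_intervals and lookup are hypothetical names, and it assumes Top < Bottom and no overlapping rows within one frame for the same ID. Ties in interval width go to Value_2, as in step 3.

```python
import numpy as np
import pandas as pd

def lookup(df, value, id_, top, bottom):
    # Value and width of the row of df that covers [top, bottom], if any.
    hit = df[(df["ID"] == id_) & (df["Top"] <= top) & (df["Bottom"] >= bottom)]
    if hit.empty:
        return None, np.inf  # infinite width so the other frame always wins
    row = hit.iloc[0]
    return row[value], row["Bottom"] - row["Top"]

def combine_intervals(df1, df2):
    rows = []
    for id_ in sorted(set(df1["ID"]) | set(df2["ID"])):
        a = df1[df1["ID"] == id_]
        b = df2[df2["ID"] == id_]
        # Sorted distinct boundaries of this ID across both frames.
        cuts = np.unique(np.concatenate([
            a["Top"].to_numpy(), a["Bottom"].to_numpy(),
            b["Top"].to_numpy(), b["Bottom"].to_numpy(),
        ]))
        for top, bottom in zip(cuts[:-1], cuts[1:]):
            v1, w1 = lookup(df1, "Value_1", id_, top, bottom)
            v2, w2 = lookup(df2, "Value_2", id_, top, bottom)
            if v1 is None and v2 is None:
                continue  # piece covered by neither frame
            combined = v1 if w1 < w2 else v2
            rows.append((id_, top, bottom,
                         "-" if v1 is None else v1,
                         "-" if v2 is None else v2,
                         combined))
    cols = ["ID", "Top", "Bottom", "Value_1", "Value_2", "Value_combined"]
    return pd.DataFrame(rows, columns=cols)

df1 = pd.DataFrame({"ID": ["A", "A", "A", "B"],
                    "Top": [0, 2, 10, 3],
                    "Bottom": [2, 8, 15, 20],
                    "Value_1": ["CC", "DD", "EE", "FF"]})
df2 = pd.DataFrame({"ID": ["A", "A", "A", "B", "B"],
                    "Top": [0, 4, 8, 0, 10],
                    "Bottom": [4, 6, 20, 10, 50],
                    "Value_2": ["XX", "YY", "ZZ", "NN", "MM"]})
result = combine_intervals(df1, df2)
print(result)
```

The row-by-row lookup is simple rather than fast; for large frames an interval-based join would likely scale better.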