Conditions between different DataFrames

Time: 2018-03-30 16:46:32

Tags: python pandas conditional

            A   B    C
0  2002-01-13  18  120
1  2002-01-13   7  150
2  2002-01-13  11  130
3  2002-01-13  26  140
4  2002-01-14  13  180
5  2002-01-14  25  165
6  2002-01-14   9  150
7  2002-01-14   4  190

The table above is my DataFrame, df.

I apply this code:

df2 = df.loc[df['B'].sub(10).abs().groupby(df['A']).idxmin()]

which gives df2:

            A   B    C
2  2002-01-13  11  130
6  2002-01-14   9  150
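
To see why this picks rows 2 and 6, the one-liner can be unpacked into its intermediate steps. This is only a sketch: it assumes df is rebuilt from the table above with A stored as strings, and the names `dist` and `closest_idx` are illustrative, not part of the original code.

```python
import pandas as pd

# Rebuild the example frame from the question (A kept as strings here).
df = pd.DataFrame({
    "A": ["2002-01-13"] * 4 + ["2002-01-14"] * 4,
    "B": [18, 7, 11, 26, 13, 25, 9, 4],
    "C": [120, 150, 130, 140, 180, 165, 150, 190],
})

# Step 1: distance of each B value from 10.
dist = df["B"].sub(10).abs()

# Step 2: for each A group, the index label of the smallest distance.
closest_idx = dist.groupby(df["A"]).idxmin()

# Step 3: pull those rows out of the original frame.
df2 = df.loc[closest_idx]
print(df2)
```

Because `idxmin` returns index labels of df rather than positions, the selected rows keep their original labels.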

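Now I want to create a new DataFrame, df3, by selecting, for each group in A, the row of df that meets the following conditions:

  • df["C"] == df2["C"] + 20 (for the 2002-01-13 group, 130 + 20 = 150).
  • If no row in df satisfies df["C"] == df2["C"] + 20, take the next lower value (for the 2002-01-14 group, 150 + 20 = 170; since 170 does not exist, pick the next lower value, which is 165).

The expected df3 output is: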

            A   B    C
1  2002-01-13   7  150
5  2002-01-14  25  165

2 answers:

Answer 0 (score: 2)

You can use merge_asof:

pd.merge_asof(df.sort_values('C'),
              df2.assign(C=df2.C+20).sort_values('C'),
              on='C', by='A', direction='forward').dropna().drop_duplicates('A', keep='last')
Out[553]: 
            A  B_x    C   B_y
3  2002-01-13    7  150  11.0
5  2002-01-14   25  165   9.0

Update, to keep the original index of df:

pd.merge_asof(df.sort_values('C').reset_index(),
              df2.assign(C=df2.C+20).sort_values('C'),
              on='C', by='A', direction='forward').dropna().drop_duplicates('A', keep='last').set_index('index')
Out[606]: 
                A  B_x    C   B_y
index                            
1      2002-01-13    7  150  11.0
5      2002-01-14   25  165   9.0
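
For readers who want to run this end to end, here is a self-contained sketch of the same approach, split into named steps. It assumes df is rebuilt from the question's table with A as strings; the names `left`, `right`, and `merged` are illustrative. With `direction='forward'`, each row of df is matched to the target (df2's C + 20) in its group only if the target is greater than or equal to its own C, so rows above the target get NaN and are dropped. Since the left frame is sorted by C, the last surviving row per group is the largest C not exceeding the target.

```python
import pandas as pd

# Rebuild the example frame from the question (A kept as strings here).
df = pd.DataFrame({
    "A": ["2002-01-13"] * 4 + ["2002-01-14"] * 4,
    "B": [18, 7, 11, 26, 13, 25, 9, 4],
    "C": [120, 150, 130, 140, 180, 165, 150, 190],
})
df2 = df.loc[df["B"].sub(10).abs().groupby(df["A"]).idxmin()]

# Left side: every row of df, sorted by C, original index kept as a column.
left = df.sort_values("C").reset_index()

# Right side: one target per group, namely df2's C shifted by 20.
right = df2.assign(C=df2["C"] + 20).sort_values("C")

# Forward search: a left row matches only if the target is >= its C.
merged = pd.merge_asof(left, right, on="C", by="A", direction="forward")

# Drop unmatched rows, keep the largest remaining C per group,
# and restore the original index labels.
df3 = (
    merged.dropna()
    .drop_duplicates("A", keep="last")
    .set_index("index")
)
print(df3)
```

Both inputs to `merge_asof` must be sorted on the `on` key, which is why each side is sorted by C first.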

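Answer 1 (score: 0)

Use a lambda with an if statement to get the indices, then pull the values. If there is no match for C + 20, take the largest value below C + 20.

Full, copy-pasteable example:

import pandas as pd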

# build op data frame
df = pd.DataFrame(columns=['A', 'B', 'C'])
A = [pd.Timestamp('2002-01-13'), pd.Timestamp('2002-01-13'), pd.Timestamp('2002-01-13'), pd.Timestamp('2002-01-13'),
     pd.Timestamp('2002-01-14'), pd.Timestamp('2002-01-14'), pd.Timestamp('2002-01-14'), pd.Timestamp('2002-01-14')]
B = [18, 7, 11, 26, 13, 25, 9, 4]
C = [120, 150, 130, 140, 180, 165, 150, 190]

df['A'] = A
df['B'] = B
df['C'] = C
print(df)

# build df2
df2 = df.loc[df['B'].sub(10).abs().groupby(df['A']).idxmin()]
print(df2)

# find indices in df that meet op criteria
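df_ind = df2.apply(
    # exact match on C + 20 within the same A group, if one exists ...
    lambda row: ((df.A == row.A) & (df.C == row.C + 20)).idxmax()
    if ((df.A == row.A) & (df.C == row.C + 20)).sum() > 0
    # ... otherwise the largest C below C + 20 within the same A group
    else (df.C.loc[(df.C < row.C + 20) & (df.A == row.A)]).idxmax(),
    axis=1,
)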
print(df_ind)

# prints:
# 2    1
# 6    5

# Build df3
df3 = df.loc[df_ind.tolist(), :]
print(df3)

Result:

           A   B    C
1 2002-01-13   7  150
5 2002-01-14  25  165
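
As a possible alternative to the row-wise lambda, the same rule ("exact match on C + 20, else the largest C below it") can be written as "largest C that is less than or equal to C + 20 within the group". The sketch below is not taken from either answer; it assumes df is rebuilt from the question's table with A as strings and that df2 has exactly one row per group. The names `target` and `candidates` are illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question (A kept as strings here).
df = pd.DataFrame({
    "A": ["2002-01-13"] * 4 + ["2002-01-14"] * 4,
    "B": [18, 7, 11, 26, 13, 25, 9, 4],
    "C": [120, 150, 130, 140, 180, 165, 150, 190],
})
df2 = df.loc[df["B"].sub(10).abs().groupby(df["A"]).idxmin()]

# Per-row target: the group's df2 value of C plus 20, aligned to df via A.
target = df["A"].map(df2.set_index("A")["C"] + 20)

# Keep only rows whose C does not exceed the group's target.
candidates = df[df["C"] <= target]

# Within each group, the row with the largest remaining C.
df3 = df.loc[candidates.groupby("A")["C"].idxmax()]
print(df3)
```

If a group has no row at or below its target, it simply drops out of `candidates` and does not appear in df3, which may or may not be the behavior you want.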