Question

I have two dataframes. One dataframe (A) looks like:

Name     begin    stop    ID      
Peter     30       150    1      
Hugo     4500     6000    2      
Jennie    300      700    3

The other dataframe (B) looks like:

entry     string      
89         aa      
568        bb     
938437     cc

I want to accomplish two tasks here:

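I want to get a list of the indices of the rows of dataframe B whose entry value falls within an interval (specified by the begin and stop columns) of dataframe A. The result of this task would be: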

lst = [0,1]. ### because row 0 of B falls in interval of row 1 in A and row 1 of B falls in interval of row 3 of A.

I want to drop the indices obtained in task 1 from dataframe B to create a new dataframe. The new dataframe would look like:

entry     string
938437     cc

How can I accomplish these two tasks?

Answer 1

Use the between() and tolist() methods to get the list of indices:

lst=B[B['entry'].between(A.loc[0,'begin'],A.loc[len(A)-1,'stop'])].index.tolist()

Finally, use the isin() method with a boolean mask:

result=B[~B.index.isin(lst)]
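
One thing to be aware of: the between() call above tests a single range, from the first begin to the last stop (30 to 700 for the sample data), which gives the expected answer here but does not look at each interval separately. Below is a minimal sketch of a per-interval check using NumPy broadcasting. It assumes inclusive bounds (the default behaviour of between()) and frames small enough that a len(B) x len(A) boolean array fits in memory; the names `entries`, `inside` and `mask` are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question.
A = pd.DataFrame({
    "Name": ["Peter", "Hugo", "Jennie"],
    "begin": [30, 4500, 300],
    "stop": [150, 6000, 700],
    "ID": [1, 2, 3],
})
B = pd.DataFrame({"entry": [89, 568, 938437], "string": ["aa", "bb", "cc"]})

# Column vector of entries compared against the row vectors of begin/stop:
# inside[i, j] is True when B's row i lies in A's interval j (bounds inclusive).
entries = B["entry"].to_numpy()[:, None]
inside = (entries >= A["begin"].to_numpy()) & (entries <= A["stop"].to_numpy())

# A row of B matches if it falls inside at least one interval of A.
mask = inside.any(axis=1)

lst = B.index[mask].tolist()   # task 1: index labels of matching rows
result = B[~mask]              # task 2: rows of B outside every interval
```

Because the mask is built per interval, this also copes with intervals that are unsorted or that overlap.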

Answer 2

You can use merge_asof:

import pandas as pd

l = (pd.merge_asof(dfB['entry'].reset_index() #to keep original index after merge
                      .sort_values('entry'), #mandatory to use this merge_asof
                   dfA[['begin','stop']].sort_values('begin'),
                   left_on='entry', right_on='begin',
                   direction='backward') # begin lower than entry
       .query('stop >= entry') # keep only where entry lower than stop
       ['index'].tolist()
    )
print(l)
# [0, 1]

new_df = dfB.loc[dfB.index.difference(l)]
print(new_df)
#     entry string
# 2  938437     cc
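
To see why this works, it can help to look at the intermediate merged frame. The sketch below rebuilds the sample data from the question and splits the chain into steps; the variable names `left`, `right` and `merged` are introduced here for illustration. With direction='backward', each entry is paired with the last interval whose begin is less than or equal to it, so a single comparison against stop is enough to decide membership, assuming the intervals in dfA do not overlap.

```python
import pandas as pd

# Sample frames rebuilt from the question.
dfA = pd.DataFrame({
    "Name": ["Peter", "Hugo", "Jennie"],
    "begin": [30, 4500, 300],
    "stop": [150, 6000, 700],
    "ID": [1, 2, 3],
})
dfB = pd.DataFrame({"entry": [89, 568, 938437], "string": ["aa", "bb", "cc"]})

# merge_asof requires both sides to be sorted on their merge keys.
left = dfB["entry"].reset_index().sort_values("entry")   # columns: index, entry
right = dfA[["begin", "stop"]].sort_values("begin")

# Pair each entry with the nearest interval that starts at or before it.
merged = pd.merge_asof(left, right, left_on="entry", right_on="begin",
                       direction="backward")
print(merged)

# Keep the original index labels of entries that also end before 'stop'.
l = merged.loc[merged["stop"] >= merged["entry"], "index"].tolist()
new_df = dfB.loc[dfB.index.difference(l)]
```

In the printed frame, entry 938437 is paired with the interval that begins at 4500, and it is excluded from `l` because that interval stops at 6000.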

Now, if you do not need the list of indices and your real goal is new_df, you can build it directly:

new_df = (pd.merge_asof(dfB.sort_values('entry'), 
                        dfA[['begin','stop']].sort_values('begin'),
                        left_on='entry', right_on='begin',
                        direction='backward')
            .query('~(stop >= entry)') #negated condition; also keeps entries with no earlier begin (NaN stop)
            .drop(['begin','stop'], axis=1) #clean the result
            .reset_index(drop=True)
         )
print(new_df)
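
One edge case worth checking with the direct approach: an entry smaller than every begin has no backward match, so merge_asof fills its begin and stop with NaN. Any comparison with NaN is False, so a filter written as stop < entry drops such a row, while negating stop >= entry keeps it. The sketch below illustrates this with a hypothetical extra row (entry 5, string "zz") that is not part of the question's data, and it restores the original index of dfB instead of resetting it.

```python
import pandas as pd

dfA = pd.DataFrame({"begin": [30, 4500, 300], "stop": [150, 6000, 700]})
# The first row (entry 5) is a hypothetical addition, not in the question's data.
dfB = pd.DataFrame({"entry": [5, 89, 568, 938437],
                    "string": ["zz", "aa", "bb", "cc"]})

merged = pd.merge_asof(
    dfB.reset_index().sort_values("entry"),        # keep original labels in 'index'
    dfA[["begin", "stop"]].sort_values("begin"),
    left_on="entry", right_on="begin", direction="backward",
)

# Entry 5 has no begin <= 5, so its begin/stop are NaN and this is False for it.
in_interval = merged["stop"] >= merged["entry"]

# Negating the membership test keeps both the NaN row and entries past 'stop'.
new_df = (merged[~in_interval]
          .drop(columns=["begin", "stop"])
          .set_index("index")
          .rename_axis(None))
print(new_df)
```

Here `new_df` keeps the rows with original labels 0 and 3, whereas a strict `stop < entry` filter would keep only label 3.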

Extract rows from a pandas dataframe based on values in another dataframe
