Conditionally match chromosome and position in one DataFrame to chromosome and intervals in another DataFrame

Time: 2019-07-13 05:13:27

Tags: python pandas

import pandas as pd

df1 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250]})
df2 = pd.DataFrame({'Chr': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '3', '3'],
                    'start': [0, 100, 1000, 2000, 3000, 0, 100, 1000, 2000, 3000, 0, 100, 1000, 2000, 3000],
                    'end': [100, 1000, 2000, 3000, 4000, 100, 1000, 2000, 3000, 4000, 100, 1000, 2000, 3000, 4000],
                    'logr': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.5, 0.5, 0.2, 0.2, 0.1, 0.2, 0.1, 0.5, 0.5, 0.9, 0.3]})

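I want to match each 'Chr' and 'position' in df1 against 'Chr' and the intervals in df2 (the position from df1 must fall between 'start' and 'end'), and then add the 'logr' and 'seg' columns to df1.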

My desired output is:

df3 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250],
                    'logr': [3, 4, 10, 11, 18, 13, "NA"],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.3, 0.1, "NA"]})

Thanks.

3 answers:

Answer 0 (score: 2)

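Use DataFrame.merge with an outer join to build all combinations per chromosome, then filter with Series.between and boolean indexing, using DataFrame.pop to extract the 'start' and 'end' columns. Finally, a left join adds back the rows that had no match: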

df3 = df1.merge(df2, on='Chr', how='outer')
# between is inclusive by default (>=, <=); for a strict comparison (>, <) pass inclusive=False (inclusive='neither' in pandas 1.3 and later)
df3 = df3[df3['position'].between(df3.pop('start'), df3.pop('end'))]
# if one bound should be exclusive and the other inclusive (e.g. >, <=)
#df3 = df3[(df3['position'] > df3.pop('start')) & (df3['position'] <= df3.pop('end'))]
df3 = df1.merge(df3, how='left')
print (df3)
  Chr  position  logr  seg
0   1        50   3.0  0.2
1   1       500   4.0  0.5
2   2      1030  10.0  0.2
3   2      2005  11.0  0.1
4   3      3575  18.0  0.3
5   3        50  13.0  0.1
6   4       250   NaN  NaN

Another solution:

df3 = df1.merge(df2, on='Chr', how='outer')
s = df3.pop('start')
e = df3.pop('end')
df3 = df3[df3['position'].between(s, e) | s.isna() | e.isna()]
# for a half-open interval instead (>, <=)
#df3 = df3[(df3['position'] > s) & (df3['position'] <= e) | s.isna() | e.isna()]
print (df3)
   Chr  position  logr  seg
0    1        50   3.0  0.2
6    1       500   4.0  0.5
12   2      1030  10.0  0.2
18   2      2005  11.0  0.1
24   3      3575  18.0  0.3
25   3        50  13.0  0.1
30   4       250   NaN  NaN
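
The choice between an inclusive and a half-open comparison matters when a position sits exactly on a shared boundary, since adjacent intervals in df2 reuse the same value as 'end' and 'start'. A minimal sketch of that edge case follows; the frames `pos` and `segs` and the position 100 are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical boundary case: position 100 is the end of one interval
# and the start of the next one.
pos = pd.DataFrame({'Chr': ['1'], 'position': [100]})
segs = pd.DataFrame({'Chr': ['1', '1'],
                     'start': [0, 100],
                     'end': [100, 1000],
                     'logr': [3, 4]})

m = pos.merge(segs, on='Chr', how='left')

# Closed on both sides (the default behaviour of Series.between):
# both intervals match, so the position would be duplicated.
closed = m[(m['position'] >= m['start']) & (m['position'] <= m['end'])]

# Half-open (start, end]: only the first interval matches.
half_open = m[(m['position'] > m['start']) & (m['position'] <= m['end'])]

print(len(closed))     # 2
print(len(half_open))  # 1
```

With the question's sample data no position lies on a boundary, so both variants give the same result there.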

Answer 1 (score: 0)

Try using pd.merge() and np.where():

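import pandas as pd
import numpy as np

res_df = pd.merge(df1, df2, on=['Chr'], how='outer')

# Flag rows whose position lies inside [start, end]
res_df['check_between'] = np.where((res_df['position'] >= res_df['start']) & (res_df['position'] <= res_df['end']), True, False)

# Keep the matches plus the rows that had no interval at all (NaN start/end)
df3 = res_df[res_df['check_between'] |
             res_df['start'].isnull() |
             res_df['end'].isnull()]

# Reassign instead of using inplace=True to avoid a SettingWithCopyWarning on the filtered frame
df3 = df3.drop(['check_between', 'start', 'end'], axis=1)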

   Chr  position  logr  seg
0    1        50   3.0  0.2
6    1       500   4.0  0.5
12   2      1030  10.0  0.2
18   2      2005  11.0  0.1
24   3      3575  18.0  0.3
25   3        50  13.0  0.1
30   4       250   NaN  NaN
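
Merging on 'Chr' alone pairs every position with every interval on the same chromosome before any filtering happens, which is why the surviving index labels run up to 30. A small sketch with the question's data shows the size of that intermediate frame; the variable name `pairs` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250]})
df2 = pd.DataFrame({'Chr': ['1'] * 5 + ['2'] * 5 + ['3'] * 5,
                    'start': [0, 100, 1000, 2000, 3000] * 3,
                    'end': [100, 1000, 2000, 3000, 4000] * 3})

# Every position is paired with every interval of its chromosome:
# 3 chromosomes x 2 positions x 5 intervals, plus 1 unmatched row for Chr '4'.
pairs = df1.merge(df2, on='Chr', how='outer')
print(len(pairs))  # 31
```

For large inputs this intermediate frame can grow quickly, roughly as positions times intervals per chromosome. If the leftover index labels are not wanted, `reset_index(drop=True)` on the final result renumbers the rows.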

Answer 2 (score: 0)

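Do a left merge with indicator=True. Next, use query to keep rows where position lies between start and end, or where _merge has the value left_only. Finally, drop the unneeded columns: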

df1.merge(df2, how='left', indicator=True).query('(start<=position<=end) | _merge.eq("left_only")') \
                                          .drop(columns=['start', 'end', '_merge'])

Out[364]:
   Chr  position  logr  seg
0    1        50   3.0  0.2
6    1       500   4.0  0.5
12   2      1030  10.0  0.2
18   2      2005  11.0  0.1
24   3      3575  18.0  0.3
25   3        50  13.0  0.1
30   4       250   NaN  NaN
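
The merge-then-filter idea from the answers can be wrapped in a small function and checked against the desired output from the question. This is only a sketch: the function name `annotate_positions` is hypothetical, it uses closed intervals, and it assumes each (Chr, position) pair appears once in df1 and falls into at most one interval.

```python
import pandas as pd

def annotate_positions(positions, segments):
    """Attach the segment columns to each (Chr, position) row.

    Hypothetical helper: rows without a matching interval keep NaN.
    """
    merged = positions.merge(segments, on='Chr', how='inner')
    inside = merged['position'].between(merged['start'], merged['end'])
    hits = merged.loc[inside].drop(columns=['start', 'end'])
    # Left join back so that unmatched positions survive with NaN values.
    return positions.merge(hits, on=['Chr', 'position'], how='left')

df1 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250]})
df2 = pd.DataFrame({'Chr': ['1'] * 5 + ['2'] * 5 + ['3'] * 5,
                    'start': [0, 100, 1000, 2000, 3000] * 3,
                    'end': [100, 1000, 2000, 3000, 4000] * 3,
                    'logr': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.5, 0.5, 0.2, 0.2,
                            0.1, 0.2, 0.1, 0.5, 0.5, 0.9, 0.3]})

out = annotate_positions(df1, df2)
print(out)
```

The result has one row per row of df1, in the original order, with a clean 0..6 index because the final left join starts from df1.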