I have the following data frame:
id start end score
C1 2 592 157
C1 179 592 87
C1 113 553 82
C2 152 219 350
C2 13 70 319
C2 13 70 188
C2 15 70 156
C2 87 139 130
C2 92 140 102
C3 18 38 348
C3 20 35 320
C3 31 57 310
C4 347 51 514
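The data is sorted by id and score.
id denotes a DNA sequence.
start and end are positions within that sequence. I want to keep only non-overlapping slices and, among overlapping ones, keep only the highest-scoring one: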
id start end score
C1 2 592 157
C2 152 219 350
C2 13 70 319
C2 87 139 130
C3 18 38 348
C4 347 51 514
Any ideas?
Thanks
Answer 0 (score: 1)
This one is shorter and meets all the requirements. You need:
All of that is done using logic and groupby:
# from Ned Batchelder
# http://nedbatchelder.com/blog/201310/range_overlap_in_two_compares.html
def overlap(start1, end1, start2, end2):
    """
    Does the range (start1, end1) overlap with (start2, end2)?
    """
    return end1 >= start2 and end2 >= start1

def compare_rows(group):
    winners = []
    skip = []
    # A group with a single row has nothing to compare against
    if len(group) == 1:
        return group[['start', 'end', 'score']]
    for i in group.index:
        if i in skip:
            continue
        for j in group.index:
            last = j == group.index[-1]
            istart = group.loc[i, 'start']
            iend = group.loc[i, 'end']
            jstart = group.loc[j, 'start']
            jend = group.loc[j, 'end']
            if overlap(istart, iend, jstart, jend):
                # Of the two overlapping rows, keep the higher score
                winner = group.loc[[i, j], 'score'].idxmax()
                if winner == j:
                    winners.append(winner)
                    skip.append(i)
                    break
            # Reached the end without being beaten: row i survives
            if last:
                winners.append(i)
    return group.loc[winners, ['start', 'end', 'score']].drop_duplicates()

grouped = df.groupby('id')
print(grouped.apply(compare_rows))
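To make the two-comparison test concrete, here is a minimal sketch that applies `overlap` to a few (start, end) pairs taken from the question's data. It assumes closed intervals, which is what the `>=` comparisons imply.

```python
def overlap(start1, end1, start2, end2):
    """Closed-interval overlap test using two comparisons."""
    return end1 >= start2 and end2 >= start1

# (start, end) pairs copied from the question's C2 and C4 rows
print(overlap(13, 70, 15, 70))     # True: 15..70 lies inside 13..70
print(overlap(87, 139, 92, 140))   # True: the ranges share 92..139
print(overlap(13, 70, 87, 139))    # False: 70 < 87, so there is a gap
print(overlap(347, 51, 347, 51))   # False: start > end, so the test fails
```

The last call shows that a row whose start is larger than its end (like C4) does not even overlap itself under this test, which is the case where the `if last` branch in `compare_rows` is what keeps the row.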
Answer 1 (score: 1)
Here is a shorter version.
This first part is just setup to make the example easy to run.
import io

import numpy as np
import pandas as pd

data = """
id,start,end,score
C1,2,592,157
C1,179,592,87
C1,113,553,82
C2,152,219,350
C2,13,70,319
C2,13,70,188
C2,15,70,156
C2,87,139,130
C2,92,140,102
C3,18,38,348
C3,20,35,320
C3,31,57,310
C4,347,51,514"""
# io.StringIO replaces the Python 2 StringIO module
data = pd.read_csv(io.StringIO(data))
The next block does the work.
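data['range'] = data.end - data.start
# Note: the sorted frame is not assigned back, so this line has no effect
# and the rows keep their original order (by id, then descending score).
data.sort_values(['id','range'])
g = data.groupby('id')

def f(df):
    keep = []
    while df.shape[0] > 0:
        # Take the first remaining row as the reference interval
        widest = df.iloc[0]
        # Rows fully contained in the reference interval (including itself)
        nested = (df.start >= widest.start) & (df.end <= widest.end)
        retain = df.loc[nested]
        # Keep only the highest-scoring row among the nested ones
        loc = retain.score.values.argmax()
        keep.append(retain.iloc[[loc]])
        df = df.loc[np.logical_not(nested)]
    return pd.concat(keep, axis=0)

out = g.apply(f).drop(columns='range')
out.index = np.arange(out.shape[0])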
With the data above, the output is:
In [3]: out
Out[3]:
   id  start  end  score
0  C1      2  592    157
1  C2    152  219    350
2  C2     13   70    319
3  C2     87  139    130
4  C2     92  140    102
5  C3     18   38    348
6  C3     31   57    310
7  C4    347   51    514
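Note that this output still contains C2 92 140 and C3 31 57, because `f` only discards rows that are nested inside the reference row, not rows that partially overlap it. Below is a minimal sketch of a greedy alternative that removes partial overlaps as well. It is not the method of either answer: it assumes rows are visited from highest to lowest score and a row is kept only if it overlaps nothing kept so far. The names `keep_non_overlapping`, `csv_text`, and `result` are hypothetical.

```python
import io

import pandas as pd

csv_text = """id,start,end,score
C1,2,592,157
C1,179,592,87
C1,113,553,82
C2,152,219,350
C2,13,70,319
C2,13,70,188
C2,15,70,156
C2,87,139,130
C2,92,140,102
C3,18,38,348
C3,20,35,320
C3,31,57,310
C4,347,51,514"""
df = pd.read_csv(io.StringIO(csv_text))

def keep_non_overlapping(group):
    """Greedy pass (an assumption, not from the answers): visit rows by
    descending score and keep a row only if it overlaps no kept row."""
    kept = []      # (start, end) pairs accepted so far
    kept_idx = []  # index labels of the accepted rows
    ordered = group.sort_values('score', ascending=False, kind='stable')
    for idx, row in ordered.iterrows():
        # Two closed ranges are disjoint when one ends before the other starts
        if all(row['end'] < s or e < row['start'] for s, e in kept):
            kept.append((row['start'], row['end']))
            kept_idx.append(idx)
    return group.loc[kept_idx]

# Process each id separately, preserving the order ids first appear in
result = pd.concat(
    [keep_non_overlapping(g) for _, g in df.groupby('id', sort=False)]
).reset_index(drop=True)
print(result)
```

On the sample data this keeps six rows, the same ones listed as the desired result in the question. The loop compares each row against every kept row, so it is quadratic per group; that should be acceptable for small groups but may need an interval-based structure for large ones.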