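This is my first post on SO and I'm still fairly new to Python, so apologies if this question is trivial or has already been answered (I couldn't find it).
I have a pandas DataFrame, df, of genomic coordinates in the following format: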
chrom start end
0 chr22 10510357 10510357
1 chr22 10512304 10512304
2 chr22 10516109 10516109
3 chr22 10516111 10516111
4 chr22 10516129 10516129
5 chr22 10516130 10516130
6 chr22 10516131 10516131
7 chr22 10516133 10516133
8 chr22 10516161 10516161
9 chr22 10516162 10516162
10 chr22 10516163 10516163
11 chr22 10516164 10516164
12 chr22 10516165 10516165
13 chr22 10516166 10516166
14 chr22 10516167 10516167
15 chr22 10516168 10516168
16 chr22 10516169 10516169
17 chr22 10516170 10516170
18 chr22 10516171 10516171
19 chr22 10516172 10516172
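What I want to do is merge rows where the previous row's "end" position is 1 base pair away from the current row's "start" position, so that I end up with something like this: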
chrom start end
0 chr22 10510357 10510357
1 chr22 10512304 10512304
2 chr22 10516109 10516109
3 chr22 10516111 10516111
4 chr22 10516129 10516129
5 chr22 10516130 10516133
6 chr22 10516161 10516172
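I have been working with a small test dataset that only contains positions on chr22, but my actual script will run on the whole genome, so it is also important to check that adjacent positions are on the same chromosome. This is what I have tried so far, without success: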
    for i in range(0, len(df)-1):
        if df.loc[i, 'chrom'] == df.loc[i+1, 'chrom'] and df.loc[i, 'end'] == df.loc[i+1, 'start']:
            df.loc[i, 'end'] = df.loc[i+1, 'end']
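Thanks in advance for any help or guidance!
Answer 0 (score: 0)
I'm assuming "1 base pair away" means that the current row's start position equals the previous row's end position plus 1.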
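    import pandas as pd

    # Find the last row of the run of consecutive positions starting at `index`.
    # Returns the run's end position and the index of its last row.
    def findEnd(df, index):
        while index < len(df) - 1:
            # Only extend the run while the next row is on the same chrom
            same_chrom = df.iloc[index]['chrom'] == df.iloc[index + 1]['chrom']
            if same_chrom and (df.iloc[index]['end'] + 1) == df.iloc[index + 1]['start']:
                index += 1
            else:
                return (df.iloc[index]['end'], index)
        return (df.iloc[index]['end'], index)

    lst = []
    i = 0
    genLen = len(df)
    # Traverse the entire dataframe
    while i < genLen:
        # Check if we have at least one more row
        if i < genLen - 1:
            # Check that the next row is on the same chrom
            if df.iloc[i]['chrom'] == df.iloc[i + 1]['chrom']:
                start = df.iloc[i]['start']
                # Jump to the last row of the run that starts at row i
                end, i = findEnd(df, i)
                lst.append([df.iloc[i]['chrom'], start, end])
            else:
                # The next row is on a different chrom, so keep this row as-is
                lst.append(list(df.iloc[i]))
        elif i == genLen - 1:
            # Last row: nothing left to merge with
            lst.append(list(df.iloc[i]))
        i += 1
    chrom = pd.DataFrame(lst, columns=['chrom', 'start', 'end'])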
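For a whole-genome table, calling iloc row by row can get slow. Here is a minimal vectorized sketch of the same rule (same chrom, and start equal to the previous end plus 1). The function name merge_adjacent and the sample rows are illustrative assumptions, including the made-up chr21 row, and the sketch assumes the rows are already sorted by chrom and start.

```python
import pandas as pd


def merge_adjacent(df):
    """Merge rows whose start is exactly the previous end + 1 on the same chrom.

    Hypothetical helper; assumes df is already sorted by chrom and start.
    """
    df = df.reset_index(drop=True)
    # True where the row is on the same chromosome as the row before it.
    same_chrom = df["chrom"].eq(df["chrom"].shift())
    # True where the row starts one base pair after the previous row ends.
    adjacent = df["start"].eq(df["end"].shift() + 1)
    # Every row that does not continue the previous one opens a new run.
    run_id = (~(same_chrom & adjacent)).cumsum()
    return (
        df.groupby(run_id, sort=False)
        .agg(chrom=("chrom", "first"), start=("start", "first"), end=("end", "last"))
        .reset_index(drop=True)
    )


# A few rows from the question plus one made-up chr21 row to exercise the chrom check.
sample = pd.DataFrame({
    "chrom": ["chr21", "chr22", "chr22", "chr22", "chr22", "chr22"],
    "start": [10516128, 10516129, 10516130, 10516131, 10516133, 10516161],
    "end": [10516128, 10516129, 10516130, 10516131, 10516133, 10516161],
})
merged = merge_adjacent(sample)
print(merged)
```

Under this strict rule, 10516129 to 10516131 collapse into one row, 10516133 stays on its own, and the chr21 row is not merged into chr22 even though its position is numerically adjacent.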
Answer 1 (score: 0)
Try this:
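    import pandas as pd

    df = pd.DataFrame([[1],[3],[4],[5],[7], [11],[12],[13],[14],[18]])
    # Rows whose value is not followed by value + 1 are the ends of runs
    df_end = df[~(df[0] == df[0].shift(-1) - 1)]
    # Rows whose value is not preceded by value - 1 are the starts of runs
    df_start = df[~(df[0] == df[0].shift(+1) + 1)]
    for start, end in zip(df_start[0], df_end[0]):
        print(start, end)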
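The toy example above uses a single unnamed column. As a rough sketch, the same shift idea might be carried over to the chrom/start/end layout from the question like this. The names continues_prev, continued_by_next, run_starts, run_ends and result are made up for illustration, the chrX row is invented to show the chromosome check, and the rows are assumed to be sorted by chrom and start.

```python
import pandas as pd

# A few rows from the question plus one made-up chrX row.
df = pd.DataFrame({
    "chrom": ["chr22"] * 6 + ["chrX"],
    "start": [10516109, 10516111, 10516129, 10516130, 10516131, 10516133, 10516134],
    "end": [10516109, 10516111, 10516129, 10516130, 10516131, 10516133, 10516134],
})

# True where a row continues the run of the row before it:
# same chromosome, and start equal to the previous end + 1.
continues_prev = (df["chrom"] == df["chrom"].shift(1)) & (
    df["start"] == df["end"].shift(1) + 1
)
# True where the following row continues this row's run.
continued_by_next = continues_prev.shift(-1, fill_value=False)

# Run starts are rows that do not continue a previous row;
# run ends are rows that are not continued by the next row.
run_starts = df.loc[~continues_prev, ["chrom", "start"]].reset_index(drop=True)
run_ends = df.loc[~continued_by_next, "end"].reset_index(drop=True)

# There is exactly one end per start, in the same order, so they pair up by position.
result = run_starts.assign(end=run_ends)
print(result)
```

Pairing starts with ends by position relies on the rows being sorted; on unsorted input the boundaries would not line up with real intervals.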