The first 15 rows of my table look like this:
ch S-MART utr5 313 516 . + . blabla
ch GenBank gene 517 1878 . + 1 ID=SAOUHSC_00001;Name=SAOUHSC_00001
ch GenBank mRNA 517 1878 . + 1 ID=SAOUHSC_00001.t01;Parent=SAOUHSC_00001
ch GenBank CDS 517 1878 . + 1 ID=SAOUHSC_00001.p01;Name=SAOUHSC_00001;product=chromosomal
ch GenBank exon 517 1878 . + 1 Parent=SAOUHSC_00001.t01
ch S-MART gene 517 1878 . + . blabla
ch S-MART operon 1879 2155 . + . blabla
ch GenBank gene 2156 3289 . + 1 ID=SAOUHSC_00002;Name=SAOUHSC_00002
ch GenBank mRNA 2156 3289 . + 1 ID=SAOUHSC_00002.t01;Parent=SAOUHSC_00002
ch GenBank CDS 2156 3289 . + 1 ID=SAOUHSC_00002.p01;Parent=SAOUHSC_00002.t01;Name=SAOUHSC_00002;product=DNA polymerase
ch S-MART utr3 3290 3331 . + . blabla
ch S-MART utr5 3649 3669 . + . blabla
ch GenBank gene 3670 3915 . + 1 ID=SAOUHSC_00003;Name=SAOUHSC_00003
ch GenBank CDS 3670 3915 . + 1 ID=SAOUHSC_00003.p01;Parent=SAOUHSC_00003.t01;Name=SAOUHSC_00003;product=conserved
ch S-MART gene 3670 5024 . + . blabla
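Standard files of this kind are called GFF; for reference, the format is defined at http://www.sequenceontology.org/gff3.shtml.
The fourth and fifth columns of this table hold the start and stop coordinates. I am trying to collapse the table based on these start and stop coordinates.
I do this with the following code: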
import pandas as pd

staphInputGff = pd.read_table("myTable", sep='\t', names=["seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes"])
start = staphInputGff.start
end = staphInputGff.end

def consolidate(start, end):
    _start = start[:]  # Make a copy since we're modifying the list
    result = []
    for i in range(len(_start)-1):  # Iterate until the second-to-last pair
        if _start[i+1] <= end[i]+1 and (start[i+1] != _start[i] or end[i+1] != end[i]):  # If two pairs are contiguous,
            _start[i+1] = _start[i]  # replace the start value with the previous one
        elif start[i+1] != _start[i] or end[i+1] != end[i]:  # Otherwise
            result.append((_start[i], end[i]))  # add the current pair to the result
    result.append((_start[i+1], end[i+1]))  # the ultimate pair
    return result

consolidate(start, end)
It gives the following result:
(313, 3331), (3649, 5024)
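That is, rows whose start and end coordinates are contiguous get merged.
For example, the first row has start and end coordinates (313, 516). Because the next row has start and end coordinates (517, 1878), the two are merged into (313, 1878). This continues over the following rows, through (1879, 2155) and on to (3290, 3331). The final result is (313, 3331).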
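The merging rule described above can be sketched in plain Python, without pandas. This is only an illustration, not the code from the question: the function name `merge_contiguous` and its `gap` parameter are made up here, and the coordinates are copied by hand from the 15 rows shown above.

```python
def merge_contiguous(intervals, gap=1):
    """Merge (start, end) pairs that overlap or lie within `gap` of each other."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + gap:
            # Overlapping or directly adjacent to the previous run: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Otherwise open a new run
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


# Start/end coordinates of the 15 rows shown above
coords = [(313, 516), (517, 1878), (517, 1878), (517, 1878), (517, 1878),
          (517, 1878), (1879, 2155), (2156, 3289), (2156, 3289), (2156, 3289),
          (3290, 3331), (3649, 3669), (3670, 3915), (3670, 3915), (3670, 5024)]
print(merge_contiguous(coords))  # [(313, 3331), (3649, 5024)]
```

Taking the maximum of the two end values means a row nested inside an earlier, longer row cannot shrink the merged interval.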
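I need to print a table for one specific key: CDS alone is enough (I don't want the information from the other rows).
For the CDS entries only, the final result should look like this: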
ch GenBank CDS 313 3331 . + . ID=SAOUHSC_00001.p01;Name=SAOUHSC_00001;product=chromosomal
ch GenBank CDS 3649 5024 . + . ID=SAOUHSC_00003.p01;Parent=SAOUHSC_00003.t01;Name=SAOUHSC_00003;product=conserved
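To do this, I think I need to use merge, join, concatenate, boolean indexing, or groupby in pandas.
What is the right way to do this?
Answer 0 (score: 1)
Here is a possible but very ugly solution. It is basically a series of hacks. Hopefully it will prompt someone who really knows what they are doing to post a better one.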
# start with the data in a data frame df
df
a b c Start End f g h i
0 ch S-MART utr5 313 516 . + . blabla
1 ch GenBank gene 517 1878 . + 1 ID=SAOUHSC_00001;Name=SAOUHSC_00001
2 ch GenBank mRNA 517 1878 . + 1 ID=SAOUHSC_00001.t01;Parent=SAOUHSC_00001
3 ch GenBank CDS 517 1878 . + 1 ID=SAOUHSC_00001.p01;Name=SAOUHSC_00001;produc...
4 ch GenBank exon 517 1878 . + 1 Parent=SAOUHSC_00001.t01
5 ch S-MART gene 517 1878 . + . blabla
6 ch S-MART operon 1879 2155 . + . blabla
7 ch GenBank gene 2156 3289 . + 1 ID=SAOUHSC_00002;Name=SAOUHSC_00002
8 ch GenBank mRNA 2156 3289 . + 1 ID=SAOUHSC_00002.t01;Parent=SAOUHSC_00002
9 ch GenBank CDS 2156 3289 . + 1 ID=SAOUHSC_00002.p01;Parent=SAOUHSC_00002.t01;...
10 ch S-MART utr3 3290 3331 . + . blabla
11 ch S-MART utr5 3649 3669 . + . blabla
12 ch GenBank gene 3670 3915 . + 1 ID=SAOUHSC_00003;Name=SAOUHSC_00003
13 ch GenBank CDS 3670 3915 . + 1 ID=SAOUHSC_00003.p01;Parent=SAOUHSC_00003.t01;...
14 ch S-MART gene 3670 5024 . + . blabla
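# If there are duplicate start/end values, discard those that repeat the CDS
cdsSE = df[df['c']=='CDS'][['Start','End']].values
droprows = []
for idx in df.index:
    # Note: `in` on a 2-D NumPy array is true if any single element matches,
    # so a non-CDS row sharing only its Start (or only its End) with a CDS row
    # is dropped as well (row 14 here).
    if (df.loc[idx][['Start','End']].values in cdsSE) and (df.loc[idx]['c'] != 'CDS'):
        droprows.append(idx)
df2 = df.drop(df.index[droprows])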
# Walk through the new data frame. If the next row is contiguous,
# move up the start value and mark the row for deletion
droprows = []
for i in range(len(df2.index[:-1])):
    if (df2.iloc[i]['End'] + 1) >= df2.iloc[i+1]['Start']:  # to include lesser start values
        # If the present row is CDS,
        # save its information by also moving it up a row
        if df2.loc[df2.index[i], 'c'] == 'CDS':
            df2.loc[df2.index[i+1], ['a', 'b', 'c', 'f', 'g', 'h', 'i']] = df2.loc[df2.index[i], ['a', 'b', 'c', 'f', 'g', 'h', 'i']]
        # Then replace the start values with the present row
        # and mark the row for deletion
        df2.loc[df2.index[i+1], 'Start'] = df2.loc[df2.index[i], 'Start']
        droprows.append(i)
# And make a new data frame by deleting the unwanted rows
df3 = df2.drop(df2.index[droprows])
df3
a b c Start End f g h i
10 ch GenBank CDS 313 3331 . + 1 ID=SAOUHSC_00001.p01;Name=SAOUHSC_00001;produc...
13 ch GenBank CDS 3649 3915 . + 1 ID=SAOUHSC_00003.p01;Parent=SAOUHSC_00003.t01;...
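For comparison, here is a sketch of a loop-free alternative built on `cummax`, `cumsum` and `groupby`. It is a different technique from the code above, not a cleanup of it. It assumes the GFF column names used in the question; the `block` column is a helper introduced here, the sample rows are a hand-picked subset with shortened attributes, and a block that contains no CDS row is simply dropped.

```python
import pandas as pd

cols = ["seqid", "source", "type", "start", "end",
        "score", "strand", "phase", "attributes"]
# Subset of the rows from the question, attributes shortened
rows = [
    ("ch", "S-MART", "utr5", 313, 516, ".", "+", ".", "blabla"),
    ("ch", "GenBank", "gene", 517, 1878, ".", "+", "1", "ID=SAOUHSC_00001"),
    ("ch", "GenBank", "CDS", 517, 1878, ".", "+", "1", "ID=SAOUHSC_00001.p01"),
    ("ch", "S-MART", "operon", 1879, 2155, ".", "+", ".", "blabla"),
    ("ch", "GenBank", "CDS", 2156, 3289, ".", "+", "1", "ID=SAOUHSC_00002.p01"),
    ("ch", "S-MART", "utr3", 3290, 3331, ".", "+", ".", "blabla"),
    ("ch", "S-MART", "utr5", 3649, 3669, ".", "+", ".", "blabla"),
    ("ch", "GenBank", "CDS", 3670, 3915, ".", "+", "1", "ID=SAOUHSC_00003.p01"),
    ("ch", "S-MART", "gene", 3670, 5024, ".", "+", ".", "blabla"),
]
df = pd.DataFrame(rows, columns=cols)

# Sort by start; mergesort is stable, so ties keep their file order
df = df.sort_values("start", kind="mergesort").reset_index(drop=True)

# A row opens a new block when its start lies more than one base past the
# largest end seen so far; the cumulative sum then numbers the blocks.
prev_max_end = df["end"].cummax().shift()
df["block"] = (df["start"] > prev_max_end + 1).cumsum()

# Extent of each block: smallest start, largest end
extent = df.groupby("block").agg(start=("start", "min"), end=("end", "max"))

# The first CDS row of each block supplies the remaining columns
first_cds = df[df["type"] == "CDS"].groupby("block").first()

result = first_cds.drop(columns=["start", "end"]).join(extent)[cols]
print(result.to_csv(sep="\t", header=False, index=False))
```

On this subset the two resulting rows span (313, 3331) and (3649, 5024), the coordinates asked for in the question, because the S-MART gene row ending at 5024 still contributes to its block's extent.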