I want to trim this: gi|1168222|sp|P46098.1|5HT3A_HUMAN
to get this: P46098
but for any header of the form gi|"RANDOM"|sp|"SEQUENCE"|"RANDOM".
Here is an example:
gi|1168222|sp|P46098.1|5HT3A_HUMAN
gi|1168223|sp|P35563.2|5HT3A_RAT
gi|112809|sp|P23979.1|5HT3A_MOUSE
gi|24211440|sp|O70212.1|5HT3A_CAVPO
gi|113067|sp|P22770|ACHA7_CHICK
I only want the text between sp| and the next . (or the next | if there is no ., as in the last line). This is what I have so far:
from Bio import SeqIO
import re
handle = open("seqdumpsp.txt", "rU")
for record in SeqIO.parse(handle, "fasta") :
    line = record.id
    i1 = line.index('sp|')
    i2 = line.index('.')
    line = line.replace(line[:i1], '', line)
    line = line.replace(x[i2:], '')
    print line
handle.close()
However, this does not work, because I cannot use i1 and i2 in replace.
Answer 0 (score: 2)
>>> line = 'gi|1168222|sp|P46098.1|5HT3A_HUMAN'
>>> line.split('|')
['gi', '1168222', 'sp', 'P46098.1', '5HT3A_HUMAN']
>>> line.split('|')[3]
'P46098.1'
>>> line.split('|')[3].split('.')
['P46098', '1']
>>> line.split('|')[3].split('.')[0]
'P46098'
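The interactive session above can be wrapped in a small helper so it applies to every header. This is only a sketch, written for Python 3: the name accession_from_header is made up for illustration, and it assumes the accession is always the fourth |-separated field, as in the sample headers.

```python
def accession_from_header(header):
    """Return the accession without its version suffix, e.g. 'P46098'."""
    fields = header.split("|")
    # Assumption: the accession is always the fourth pipe-separated field.
    accession = fields[3]
    # partition() returns the whole string as its first item when there is no ".".
    return accession.partition(".")[0]


headers = [
    "gi|1168222|sp|P46098.1|5HT3A_HUMAN",
    "gi|113067|sp|P22770|ACHA7_CHICK",
]
print([accession_from_header(h) for h in headers])  # ['P46098', 'P22770']
```

Using partition rather than split(".")[0] is a matter of taste; both leave an accession without a version suffix unchanged.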
Answer 1 (score: 1)
You could use line.split('|')[3].
Answer 2 (score: 1)
By string processing:
Use a for loop to iterate over each line of the content.
Demo:
content = """gi|1168222|sp|P46098.1|5HT3A_HUMAN
gi|1168223|sp|P35563.2|5HT3A_RAT
gi|112809|sp|P23979.1|5HT3A_MOUSE
gi|24211440|sp|O70212.1|5HT3A_CAVPO
gi|113067|sp|P22770|ACHA7_CHICK"""
result = []
for line in content.split("\n"):
    start_index = line.find("sp|")
    if start_index == -1:
        continue
    # +3 because the length of "sp|" is 3
    end_index1 = line.find(".", start_index + 3)
    end_index2 = line.find("|", start_index + 3)
    if end_index1 == -1 and end_index2 == -1:
        continue
    elif end_index2 == -1:
        end_index = end_index1
    elif end_index1 == -1:
        end_index = end_index2
    elif end_index1 < end_index2:
        end_index = end_index1
    else:
        end_index = end_index2
    result.append(line[start_index + 3:end_index])
print(result)
Output:
['P46098', 'P35563', 'P23979', 'O70212', 'P22770']
By CSV
Demo:
import csv
input_file = "dp-input1.csv"
with open(input_file) as fp:
    # Treat "|" as the column delimiter, so the accession is column 3
    root = csv.reader(fp, delimiter='|')
    result = [row[3].split(".")[0] for row in root]
    #for row in root:
    #    tmp = row[3].split(".")[0]
    #    result.append(tmp)
print("Final result:-", result)
Output:
Final result:- ['P46098', 'P35563', 'P23979', 'O70212', 'P22770']
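To try the CSV approach without creating dp-input1.csv first, the same reader can be fed from an in-memory string. This is a sketch for Python 3; the names sample and accessions are illustrative, and the length check is an added safeguard that the original demo does not have.

```python
import csv
import io

sample = """gi|1168222|sp|P46098.1|5HT3A_HUMAN
gi|113067|sp|P22770|ACHA7_CHICK
"""

# csv.reader accepts any iterable of lines, so a StringIO stands in for the file.
reader = csv.reader(io.StringIO(sample), delimiter="|")
# Skip rows that are too short to have a fourth field (e.g. blank lines).
accessions = [row[3].split(".")[0] for row in reader if len(row) > 3]
print(accessions)  # ['P46098', 'P22770']
```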
Answer 3 (score: 1)
You can use re.search:
lines = """gi|1168222|sp|P46098.1|5HT3A_HUMAN
gi|1168223|sp|P35563.2|5HT3A_RAT
gi|112809|sp|P23979.1|5HT3A_MOUSE
gi|24211440|sp|O70212.1|5HT3A_CAVPO
gi|113067|sp|P22770|ACHA7_CHICK
"""
import re
# Raw string so the backslashes reach the regex engine unchanged
r = re.compile(r"(?<=\|sp\|)\w+")
for s in lines.splitlines():
    print(r.search(s).group(0))
Output:
P46098
P35563
P23979
O70212
P22770
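One caveat with the loop above: r.search(s) returns None for a line that does not contain |sp|, so calling .group(0) on it raises AttributeError. A guarded variant might look like the following sketch, for Python 3. The names ACCESSION_RE and find_accession are made up for illustration, and the second header is a hypothetical non-matching example, not real data.

```python
import re

# Lookbehind: match word characters that directly follow "|sp|".
ACCESSION_RE = re.compile(r"(?<=\|sp\|)\w+")


def find_accession(header):
    """Return the accession after '|sp|', or None when there is no match."""
    match = ACCESSION_RE.search(header)
    return match.group(0) if match else None


print(find_accession("gi|1168222|sp|P46098.1|5HT3A_HUMAN"))  # P46098
# Hypothetical header without an "|sp|" field, to show the fallback.
print(find_accession("gi|000000|xx|ABC123.1|EXAMPLE"))  # None
```

Because \w does not match . or |, the match stops at the version suffix or at the next field separator, whichever comes first.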