Thanks for your earlier advice.
I have another regular expression question.
I now have a list of lines in this format:
*7 3 279 0
*33 2 254 0.0233918128654971
*39 2 276 0.027431421446384
and a DNA sequencing file in FASTA format:
(Edit: lines reformatted)
>OCTU1
GCTTGTCTCAAAGATTAAGCCATGCATGTATAAGCACAAGCCTAAAATGGTGAAGCCGCGAATAGCTCATTACAACAGTCGTAGTTTATTGGAAAGTTCACTATGGATAACTGTGGTAATTCTAGAGCTAATACATGTTCCAATCCTCGACTCACGGAGAGGTGCATTTATTAGAACAAAGCTGATCAGACTATGTCTGTCTCAGGTTGACTCTGAATAACTTTGCTAATCGCACAGTCTTTGTACTGGCGATGTATCTTTCATGCTATGTA
>OCTU2
GCTGCTTCCTTGGATGTGGTAGCCGTTTCTCAGGCTCCCTCTCCGGAATCGAACCCTATTCCCCGTTACCCGTTCAACCATGGTAGGCCCTACTACCATCAAAGTTGATAGGGCAGATATTTGAAAGACATCGCCGCACAAAGGCTATGCGATTAGCAAAGTTATTAGATCAACGACGCAGCGATCGGCTTTGACTAATAAATCACCCCTCCAGTTGGGGACTTTTACATGTATTAGCTCTAGAATTACCACAGTTATCCATTAGTGAAGTACCTTCCAATAAACTATACTGTTTAATGAGCCATTCGCGGTTTCACCGTAAAATTAGGTTGTCTTAGACATGCATGGCTTAATCTTTGTAGACAAGC
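I need to take the numbers marked with * in the list (for example, 7 or 33), find the matching records in the FASTA file (for example, >OCTU7 and >OCTU33), and copy only those FASTA sequences into another file. Here is my script: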
regex=re.compile(r'.+\d+\s+')
OCTU=b.readlines()
while OCTU:
    for line in a:
        if regex.match(OCTU)==line:
            c.write(OCTU)
The script seems to run, but I think the pattern is wrong, because the file it creates is empty.
Thanks in advance for your valuable input.
Answer 0 (score: 1)
You can first collect the ID numbers from file a into a set, so that later lookups are fast:
import re

seta = set()
regexa = re.compile(r'\*(\d+)')  # matches an asterisk followed by digits, captures the digits
for line in a:
    m = regexa.match(line)  # looks for a match at the start of the line
    if m:
        seta.add(m.group(1))  # stored as a string, e.g. '7'
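As a quick check of this step, here is a minimal sketch run on the three sample lines from the question. The names `collect_ids` and `sample_lines` are hypothetical and used only for illustration. Note that the resulting set holds digit strings, not integers, which matters for any later membership test.

```python
import re

def collect_ids(lines):
    """Return the set of digit strings that follow a leading asterisk."""
    pattern = re.compile(r'\*(\d+)')
    ids = set()
    for line in lines:
        m = pattern.match(line)  # match() only looks at the start of the line
        if m:
            ids.add(m.group(1))
    return ids

# Sample lines copied from the question
sample_lines = [
    "*7 3 279 0\n",
    "*33 2 254 0.0233918128654971\n",
    "*39 2 276 0.027431421446384\n",
]
ids = collect_ids(sample_lines)
print(sorted(ids, key=int))  # ['7', '33', '39']
```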
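Then loop over b. Inside the loop, call next(b) (b.next() in Python 2) to fetch the second line of the record, which holds the sequence.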
regexb = re.compile(r'>OCTU(\d+)')  # matches ">OCTU" followed by digits, captures the digits
for line in b:
    m = regexb.match(line)
    if m:
        sequence = next(b)  # b.next() in Python 2; assumes the sequence sits on one line
        if m.group(1) in seta:
            c.write(line)
            c.write(sequence)
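Putting the two steps together, the following is a minimal, self-contained sketch in Python 3. It uses `io.StringIO` objects in place of real files, and the function name `filter_fasta` and the shortened toy data are assumptions made for illustration. It also assumes each sequence occupies exactly one line after its header.

```python
import io
import re

def filter_fasta(list_file, fasta_file, out_file):
    """Copy FASTA records whose OCTU number is starred in list_file."""
    id_pattern = re.compile(r'\*(\d+)')
    header_pattern = re.compile(r'>OCTU(\d+)')
    # Collect the starred IDs as strings for fast membership tests
    wanted = {m.group(1) for m in map(id_pattern.match, list_file) if m}
    for line in fasta_file:
        m = header_pattern.match(line)
        if m:
            # Assumption: one sequence line directly follows each header
            sequence = next(fasta_file, "")
            if m.group(1) in wanted:
                out_file.write(line)
                out_file.write(sequence)

# Toy stand-ins for the files a, b and c (not the real data)
a = io.StringIO("*7 3 279 0\n*33 2 254 0.02\n")
b = io.StringIO(">OCTU1\nGCTT\n>OCTU7\nACGT\n>OCTU33\nTTGA\n")
c = io.StringIO()
filter_fasta(a, b, c)
print(c.getvalue())
```

With real files, the same function could be called with three handles opened via `open(...)`, the last one in write mode.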
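Answer 1 (score: 0)
You may want to use Biopython to parse the FASTA file.
You can then slice the number out of the record name, look it up in the set, and access the sequences and sequence names more reliably. If a FASTA file wraps its sequences across several lines, the approach above may run into problems.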
import collections
from Bio import SeqIO

infile = "yourfastafile.fasta"
outfile = "desired_outfilename.fasta"

dct = collections.OrderedDict()
for record in SeqIO.parse(infile, "fasta"):
    # description is an attribute, not a method
    dct[record.description] = str(record.seq).upper()

# Open the output once, rather than re-opening it in append mode per record
with open(outfile, "w") as handle:
    for k, v in dct.items():
        # k looks like "OCTU7"; seta (from the answer above) holds digit strings
        if k[4:] in seta:
            handle.write(">" + k + "\n" + v + "\n")
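To illustrate the line-break concern without depending on Biopython, here is a sketch of a standard-library-only reader that joins sequences wrapped over several lines. `read_fasta` is a hypothetical helper written for this example, not part of Biopython, and the toy input is made up.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs, joining sequences wrapped over several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the final record

wanted = {"7", "33"}  # digit strings, as collected from the list file
fasta = io.StringIO(">OCTU1\nGCTT\nGTCT\n>OCTU7\nACGT\nTTGA\n")
selected = [(h, s) for h, s in read_fasta(fasta) if h[4:] in wanted]
print(selected)  # [('OCTU7', 'ACGTTTGA')]
```

The slice `h[4:]` relies on every header starting with the four characters `OCTU`; headers with a different prefix would need a regular expression instead.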
Answer 2 (score: 0)
import re
regex = r">.+\n[acgtnACGTN\n]+"
test_str = (">AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368\n"
"ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC\n"
"CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC\n"
"CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG\n"
"AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC\n"
"CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG\n"
"TTTAATTACAGACCTGAA")
matches = re.finditer(regex, test_str)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    # The pattern has no capturing groups, so this inner loop prints nothing here
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
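This answer only demonstrates matching whole FASTA records. One possible way to apply the same idea to the question is sketched below. It assumes the whole FASTA file fits in memory as one string, and the pattern with two capturing groups, the name `record_pattern`, and the toy data are all additions made for illustration.

```python
import re

# Group 1: the digits after ">OCTU"; group 2: the sequence, possibly wrapped
record_pattern = re.compile(r">OCTU(\d+)[^\n]*\n([acgtnACGTN\n]+)")

# Toy FASTA text with wrapped sequences (not the real data)
fasta_text = ">OCTU1\nGCTT\nGTCT\n>OCTU7\nACGT\nTTGA\n>OCTU33\nNNAC\n"
wanted = {"7", "33"}  # digit strings taken from the starred list

selected = []
for match in record_pattern.finditer(fasta_text):
    number, raw_sequence = match.groups()
    if number in wanted:
        # Drop the line breaks inside a wrapped sequence
        selected.append((number, raw_sequence.replace("\n", "")))
print(selected)  # [('7', 'ACGTTTGA'), ('33', 'NNAC')]
```

The character class only accepts A, C, G, T and N, so a record containing other IUPAC codes would be cut short at the first such character.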