Based on Python

Date: 2018-12-21 15:49:07

Tags: python

I have a text file like this:

Example:

>chr9:128683-128744
GGATTTCTTCTTAGTTTGGATCCATTGCTGGTGAGCTAGTGGGATTTTTTGGGGGGTGTTA
>chr16:134222-134283
AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG
>chr16:134226-134287
GGAAGCAGCGTGGGAATCACAGAATGGACGGCCGATTAAAGGCTTTGCTTGGCCTGGATTT
>chr1:134723-134784
AAGTGATTCACCCTGCCTTTCCGACCTTCCCCAGAACAGAACACGTTGATCGTGGGCGATA
>chr16:135770-135831
GCCTGAGCAAAGGGCCTGCCCAGACAAGATTTTTTAATTGTTTAAAAACCGAATAAATGTT

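This file is divided into sections of 2 lines each. The first line starts with > (this line is called the ID), and the second line is a sequence of letters. I want to search the letter sequences for 2 short motifs (AATAAA and GGAC) and, if a sequence contains both, get the ID and the sequence of that section. The catch is that AATAAA must come first and GGAC must come after it. There is a gap between them, and this gap can be 2 letters or more.

Expected output: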

>chr16:134222-134283
AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG

I am trying to do this in Python with the following code:

infile = open('infile.txt', 'r')
mot1 = 'AATAAA'
mot2 = 'GGAC'
new = []
for line in range(len(infile)):
    if not infile[line].startswith('>'):
        for match in pattern.finder(mot1) and pattern.finder(mot2):
            new.append(infile[line-1])


with open('outfile.txt', "w") as f:
    for item in new:
        f.write("%s\n" % item)

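This code does not return what I want. Do you know how to fix it?

5 answers:

Answer 0: (score: 0)

You can pair each ID with the sequence that follows it, in order, and then use re.findall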

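import re

with open('filename.txt') as f:  # the file is closed when the block ends
    data = [line.strip('\n') for line in f]
# Pair each ID line with the sequence line that follows it
new_data = [[data[i], data[i+1]] for i in range(0, len(data) - 1, 2)]
# Keep pairs where AATAAA is followed by 2+ characters and then GGAC
final_result = [[a, b] for a, b in new_data if re.findall(r'AATAAA\w{2,}GGAC', b)]

Output: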

[['>chr16:134222-134283', 'AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG']]
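
To make the role of the `\w{2,}` part concrete, here is a small sketch on made-up strings. The strings and the names `pattern` and `results` are illustrative assumptions, not part of the question's data; the point is only that the gap between the two motifs has to be at least two characters and that the order matters.

```python
import re

# Same pattern as in the answer: AATAAA, a gap of 2+ word characters, then GGAC
pattern = re.compile(r'AATAAA\w{2,}GGAC')

# Hypothetical strings chosen only to show the gap and order rules
candidates = [
    'AATAAAGGAC',       # no gap between the motifs
    'AATAAATGGAC',      # 1-letter gap
    'AATAAATTGGAC',     # 2-letter gap
    'GGACTTTTAATAAA',   # both motifs present, but in the wrong order
]

# True where the pattern is found somewhere in the string
results = [bool(pattern.search(s)) for s in candidates]
print(results)
```

Only the third string satisfies the pattern, since it is the only one with AATAAA followed by at least two letters before GGAC.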

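Answer 1: (score: 0)

I am not sure I understand what "this distance can be 2 letters or more" means, or whether it has to be checked, but the following code gives you the desired output: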

mot1 = 'AATAAA'
mot2 = 'GGAC'

with open('infile.txt', 'r') as inp:
    last_id = None
    for line in inp:
        if line.startswith('>'):
            last_id = line
        else:
            if mot1 in line and mot2 in line:
                print(last_id)
                print(line)

You can redirect the output to a file if needed.

Answer 2: (score: 0)

You can use a regular expression and a dict comprehension:
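
The loop above accepts a sequence whenever both motifs occur anywhere in it, so it would also accept GGAC appearing before AATAAA. A minimal sketch of one way to enforce the order and the minimum gap with `str.find` follows. The names `has_ordered_motifs`, `filter_records` and `min_gap`, as well as the `>example_reversed` record, are assumptions introduced here for illustration.

```python
def has_ordered_motifs(seq, first='AATAAA', second='GGAC', min_gap=2):
    """Return True if `second` starts at least `min_gap` letters after the end of `first`."""
    start = seq.find(first)
    if start == -1:
        return False
    # Look for `second` only in the part of the sequence beyond the gap
    return seq.find(second, start + len(first) + min_gap) != -1


def filter_records(lines):
    """Yield (id_line, sequence) pairs whose sequence passes the check."""
    last_id = None
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            last_id = line
        elif last_id is not None and has_ordered_motifs(line):
            yield last_id, line


# One record from the question plus one made-up record with the motifs reversed
sample = [
    '>chr16:134222-134283\n',
    'AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG\n',
    '>example_reversed\n',   # hypothetical ID, not from the question
    'GGACTTTTAATAAATT\n',
]

kept = list(filter_records(sample))
for record_id, sequence in kept:
    print(record_id)
    print(sequence)
```

Checking only the first occurrence of AATAAA is enough here: if GGAC lies far enough after a later occurrence, it also lies far enough after the first one.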

import re

with open('test.txt', 'r') as f:
    lines = f.readlines()
    data = dict(zip(lines[::2],lines[1::2]))

{k.strip(): v.strip() for k,v in data.items() if re.findall(r'AATAAA\w{2,}GGAC', v)}

Returns:

{'>chr16:134222-134283': 'AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG'}
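
As a variant, the pairing can be done without slicing a list of all lines. The sketch below relies on the fact that `zip(handle, handle)` draws two consecutive lines from the same iterator at each step. `io.StringIO` stands in for an open file so that the example is self-contained; that substitution and the names `handle`, `pattern` and `matches` are assumptions made for illustration.

```python
import io
import re

# io.StringIO plays the role of the opened text file (assumption for this sketch)
handle = io.StringIO(
    ">chr16:134222-134283\n"
    "AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG\n"
    ">chr16:134226-134287\n"
    "GGAAGCAGCGTGGGAATCACAGAATGGACGGCCGATTAAAGGCTTTGCTTGGCCTGGATTT\n"
)

pattern = re.compile(r'AATAAA\w{2,}GGAC')

# Each step of zip(handle, handle) yields one ID line and the sequence line after it
matches = {
    id_line.strip(): seq.strip()
    for id_line, seq in zip(handle, handle)
    if pattern.search(seq)
}
print(matches)
```

Like the dictionary in the answer, this keeps one entry per ID, so records that share the same ID line would overwrite one another.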

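Answer 3: (score: 0)

If mot1 is found in the string, you can slice off the part of the string that is no longer relevant and search the rest for mot2. Here is one way to do it: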

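# Read all lines and drop the trailing newlines
with open('infile.txt', 'r') as infile:
    text = [line.strip() for line in infile]

mot1 = 'AATAAA'
mot2 = 'GGAC'

# Pair each ID line (even index) with the sequence line that follows it
check = [(text[x], text[x+1]) for x in range(0, len(text) - 1, 2)]

# Keep records where mot2 occurs at least 2 letters after the end of the first mot1
result = [(x + '\n' + y) for (x, y) in check if mot1 in y and mot2 in y[(y.find(mot1)+len(mot1)+2):]]

with open('outfile.txt', "w") as f:
    for item in result:
        f.write("%s\n" % item)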

Answer 4: (score: 0)

If the file is not too large, you can read it all at once and then use re.findall():

    import re
    with open("infile.txt") as finp:
        data=finp.read()
    with open('outfile.txt', "w") as f:
        for item in re.findall(r">.+?[\r\n\f][AGTC]*?AATAAA[AGTC]{2,}GGAC[AGTC]*", data):
            f.write(item+"\n")

"""
+? and *?       means non-greedy process;
>.+?[\r\n\f]    matches a line starting with '>' and followed by any characters to the end of the line; 
[AGTC]*?AATAAA  matches any number of A,G,T,C characters, followed by the AATAAA pattern; 
[AGTC]{2,}      matches two or more characters of A,G,T,C;
GGAC            matches the GGAC pattern;
[AGTC]*         matches the empty string or any number of A,G,T,C characters.
"""
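
To see how the pieces of this pattern work together, here is a sketch that applies it to an in-memory string instead of a file. The string holds two records copied from the question's example; the names `data`, `record_re` and `found` are assumptions made for illustration.

```python
import re

# Two records from the question's example, laid out as they would be in the file
data = (
    ">chr9:128683-128744\n"
    "GGATTTCTTCTTAGTTTGGATCCATTGCTGGTGAGCTAGTGGGATTTTTTGGGGGGTGTTA\n"
    ">chr16:134222-134283\n"
    "AGCTGGAAGCAGCGTGAATAAAACAGAATGGCCGGGACCTTAAAGGCTTTGCTTGGCCTGG\n"
)

# Same pattern as in the answer: ID line, then a sequence containing
# AATAAA, a gap of 2+ bases, and GGAC
record_re = re.compile(r">.+?[\r\n\f][AGTC]*?AATAAA[AGTC]{2,}GGAC[AGTC]*")

# Each match spans the ID line, the line break, and the whole sequence line
found = record_re.findall(data)
for item in found:
    print(item)
```

Because `.` does not match a line break and the character class `[AGTC]` does not either, a match cannot run from one record into the next, so only the second record is returned.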