Using a regular expression to return tuples or a list from FASTA format

Posted: 2014-12-06 11:22:46

Tags: python regex

I have a text file that looks like this:

>gene_name_1
FYCVLAHWG
GGGGGGGGG
>gene_name_2
FYCVLAHWG
>gene_name_3
FYCVLAHWG
>gene_name_4
FYCVLAHG
>gene_name_5
FCVLAHWG
>gene_name_6
YCVLAHWG

I did the following:

from sys import argv
import re

script, input_file = argv

opened_file = open(input_file).read()

test = re.findall('>.*\n|\n.*\n>', opened_file)

print(test)

I get the following:

['>gene_name_1\n', '\nGGGGGGGGG\n>', '\nFYCVLAHWG\n>', '\nFYCVLAHWG\n>', '\nFYCVLAHG\n>', '\nFCVLAHWG\n>']

But I would like to get the following:

['>gene_name_1\n', '\nGGGGGGGGG\nFYCVLAHWG\n>', '>gene_name_2\n', '\nFYCVLAHWG\n>', '>gene_name_3\n', '\nFYCVLAHG\n>', '>gene_name_4\n', '\nFCVLAHWG\n>', '>gene_name_5\n', '\nFYCVLAHWG\n>', '>gene_name_6\n', '\nYCVLAHWG\n>']

Why is the rest missing?

2 Answers:

Answer 0 (score: 1)

s = r""">gene_name_1
FYCVLAHWG
GGGGGGGGG
>gene_name_2
FYCVLAHWG
>gene_name_3
FYCVLAHWG
>gene_name_4
FYCVLAHG
>gene_name_5
FCVLAHWG
>gene_name_6
YCVLAHWG"""

import re

x = re.findall(">.*?\n|(?:[^>].*?\n)+", s)
print(x)

produces

['>gene_name_1\n', 'FYCVLAHWG\nGGGGGGGGG\n', '>gene_name_2\n', 'FYCVLAHWG\n', '>gene_name_3\n', 'FYCVLAHWG\n', '>gene_name_4\n', 'FYCVLAHG\n', '>gene_name_5\n', 'FCVLAHWG\n', '>gene_name_6\n']

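This is almost what you asked for.

EDIT: I just realized that this misses the last line :( [because the last line does not end with a newline].

EDIT 2: Some more details about the regular expression: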

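  • [^>] matches any character that is not >.
  • re.findall(regexpr, input) returns all substrings that match the pattern if no capturing group is defined in regexpr; otherwise it returns only the contents of the matched groups. That is why I used (?:).
  • (?:) is a non-capturing group, i.e. its contents are not added to the list of matched groups.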
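
To make the notes on re.findall above concrete, here is a small sketch that is not part of the original answer. The miniature FASTA string and all variable names are made up for illustration. It shows how groups change what findall returns, and it also suggests why the pattern in the question loses headers: findall scans for non-overlapping matches, so once `\n.*\n>` has consumed a `>`, the header that follows can no longer match `>.*\n`. The last pattern is one possible variant, offered as an assumption rather than a tested fix for every FASTA file, that also accepts a final sequence line without a trailing newline.

```python
import re

# Hypothetical miniature FASTA text; the first record spans two lines.
text = ">a\nAAA\nCCC\n>b\nBBB\n>c\nDDD\n"

# No group in the pattern: findall returns the whole matches.
whole = re.findall(r">.*\n", text)
# Capturing group: findall returns only the group contents.
captured = re.findall(r">(.*)\n", text)
# Non-capturing group: whole matches are returned again.
non_capturing = re.findall(r">(?:.*)\n", text)

# The pattern from the question: each "\n.*\n>" match swallows the ">" of
# the next header, so that header can no longer match ">.*\n".
lossy = re.findall(r">.*\n|\n.*\n>", text)

# A variant of the answer's pattern that also accepts a last sequence line
# without a trailing newline (\Z matches only at the very end of the string).
tolerant = re.findall(r">.*\n|(?:[^>].*(?:\n|\Z))+", ">a\nAAA\nCCC")

print(whole, captured, non_capturing, lossy, tolerant, sep="\n")
```

In this sketch the `lossy` list contains only the first header, because every later `>` is taken by a preceding sequence match.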

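But I have to say: do you really want the ">" and the newlines in the output? As already mentioned in the comments, you may be better off reading the file line by line (or splitting on \n).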

Another way would be:

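result = []
cache = []
# input_file is the path to the FASTA file, as in the question
with open(input_file) as f:
  for line in f:
    if line.startswith('>'):
      if cache:  # flush the sequence lines of the previous record
        result.append(''.join(cache))
      result.append(line)
      cache = []
    else:
      cache.append(line)  # each line still carries its trailing newline
  if cache:  # flush the last record
    result.append(''.join(cache))

print(result)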

Or:

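import collections

key = '<unknown>' # required in case the first line is not '>...'
result = collections.defaultdict(list)
# input_file is the path to the FASTA file, as in the question
with open(input_file) as f:
  for line in f:
    if line.startswith('>'):
      key = line[1:].strip()  # drop the '>' and the trailing newline
    else:
      result[key].append(line.strip())

print(result)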

Output:

defaultdict(<class 'list'>, {'gene_name_1': ['FYCVLAHWG', 'GGGGGGGGG'], 'gene_name_4': ['FYCVLAHG'], 'gene_name_3': ['FYCVLAHWG'], 'gene_name_2': ['FYCVLAHWG'], 'gene_name_6': ['YCVLAHWG'], 'gene_name_5': ['FCVLAHWG']})
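
Since the question title asks for tuples or a list, the line-by-line idea above could also be written as a generator that yields (name, sequence) tuples. This is only a sketch: the function name `read_fasta` and the sample data are made up for illustration. It also makes two assumptions that differ from the dictionary version: the lines of one record are joined into a single string, and any lines before the first header are silently ignored.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)  # emit the previous record
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequence line of the current record
    if name is not None:
        yield name, ''.join(chunks)  # emit the last record

# io.StringIO stands in for an open file; note the missing final newline.
sample = io.StringIO(">gene_name_1\nFYCVLAHWG\nGGGGGGGGG\n>gene_name_2\nFYCVLAHWG")
records = list(read_fasta(sample))
print(records)
```

With a real file, `list(read_fasta(open(input_file)))` would give a list of tuples, and `dict(...)` over the same generator would give a name-to-sequence mapping.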

Answer 1 (score: 1)

Try this regular expression:

import re

pattern = r'(^>[^\n]*)([^>]*)'
flags = re.M|re.S
test_string = '''
>gene_name_1
FYCVLAHWG
GGGGGGGGG
>gene_name_2
FYCVLAHWG
>gene_name_3
FYCVLAHWG
>gene_name_4
FYCVLAHG
>gene_name_5
FCVLAHWG
>gene_name_6
YCVLAHWG
'''
print(list(re.findall(pattern, test_string, flags=flags)))
# [('>gene_name_1', '\nFYCVLAHWG\nGGGGGGGGG\n'), ('>gene_name_2', '\nFYCVLAHWG\n'), ('>gene_name_3', '\nFYCVLAHWG\n'), ('>gene_name_4', '\nFYCVLAHG\n'), ('>gene_name_5', '\nFCVLAHWG\n'), ('>gene_name_6', '\nYCVLAHWG\n')]
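
If the leading ">" and the embedded newlines are not wanted, the tuples returned above could be cleaned up in a second step. The following is a sketch under that assumption, reusing the same pattern; the name `records` and the shortened sample text are made up for illustration. Only re.M is passed here, because the pattern contains no "." and a negated character class such as [^>] already matches newlines.

```python
import re

pattern = r'(^>[^\n]*)([^>]*)'
text = ">gene_name_1\nFYCVLAHWG\nGGGGGGGGG\n>gene_name_2\nFYCVLAHWG\n"

records = [
    # drop the leading '>' from the header and all whitespace from the body
    (header[1:].strip(), ''.join(body.split()))
    for header, body in re.findall(pattern, text, flags=re.M)
]
print(records)
```

The result is a list of (name, sequence) tuples in file order, with multi-line sequences joined into one string.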