I have a text file that looks like this:
>gene_name_1
FYCVLAHWG
GGGGGGGGG
>gene_name_2
FYCVLAHWG
>gene_name_3
FYCVLAHWG
>gene_name_4
FYCVLAHG
>gene_name_5
FCVLAHWG
>gene_name_6
YCVLAHWG
This is what I tried:
from sys import argv
import re
script, input_file = argv
opened_file = open(input_file).read()
test = re.findall('>.*\n|\n.*\n>', opened_file)
print(test)
I get the following:
['>gene_name_1\n', '\nGGGGGGGGG\n>', '\nFYCVLAHWG\n>', '\nFYCVLAHWG\n>', '\nFYCVLAHG\n>', '\nFCVLAHWG\n>']
But I would like to get this:
['>gene_name_1\n', '\nGGGGGGGGG\nFYCVLAHWG\n>', '>gene_name_2\n', '\nFYCVLAHWG\n>', '>gene_name_3\n', '\nFYCVLAHG\n>', '>gene_name_4\n', '\nFCVLAHWG\n>', '>gene_name_5\n', '\nFYCVLAHWG\n>', '>gene_name_6\n', '\nYCVLAHWG\n>']
Why is the rest missing?
Answer 0 (score: 1)
s = r""">gene_name_1
FYCVLAHWG
GGGGGGGGG
>gene_name_2
FYCVLAHWG
>gene_name_3
FYCVLAHWG
>gene_name_4
FYCVLAHG
>gene_name_5
FCVLAHWG
>gene_name_6
YCVLAHWG"""
import re
x = re.findall(">.*?\n|(?:[^>].*?\n)+", s)
print(x)
produces
['>gene_name_1\n', 'FYCVLAHWG\nGGGGGGGGG\n', '>gene_name_2\n', 'FYCVLAHWG\n', '>gene_name_3\n', 'FYCVLAHWG\n', '>gene_name_4\n', 'FYCVLAHG\n', '>gene_name_5\n', 'FCVLAHWG\n', '>gene_name_6\n']
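This is almost what you asked for.
Edit: I just realized that this misses the last line, because the last line does not end with a newline.
Edit 2: Some more details about the regex:
[^>] matches any character that is not >.
re.findall(regexpr, input) returns all substrings that match the pattern if no capturing group is defined in regexpr; otherwise it returns only the matched groups. That is why I used (?:...), which is a non-capturing group, i.e. its contents are not added to the list of matched groups.
But I have to ask: do you really want the ">" and the newlines in the output? As noted in an earlier comment, you would probably be better off reading the file line by line (or splitting on \n).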
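To make the points about groups and the missing last line concrete, here is a minimal sketch. The `sample` string is a made-up miniature of the input, and the `\Z` variant is only one possible way to pick up a final line that has no trailing newline; it is an assumption, not part of the original answer.

```python
import re

# Hypothetical miniature of the input; note there is no trailing newline.
sample = ">a\nXX\nYY\n>b\nZZ"

# Non-capturing group: findall returns each whole match.
whole = re.findall(r">.*?\n|(?:[^>].*?\n)+", sample)

# Capturing group: findall returns only what the group captured
# (its last repetition), and '' where the group took no part in the match.
groups = re.findall(r">.*?\n|([^>].*?\n)+", sample)

# Assumed fix: also accept end-of-string (\Z) as a line terminator,
# so the final line is matched even without a trailing newline.
fixed = re.findall(r">.*?\n|(?:[^>].*?(?:\n|\Z))+", sample)

print(whole)
print(groups)
print(fixed)
```

Comparing the three printed lists shows why the non-capturing form is needed here and how the last line can be recovered.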
Another approach would be:
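result = []
cache = []
# input_file is the path taken from argv in the question's script
with open(input_file) as f:
    for line in f:
        line = line.rstrip('\n')  # drop the trailing newline
        if line.startswith('>'):
            # flush the sequence lines collected for the previous header
            if cache:
                result.append('\n'.join(cache))
            result.append(line)
            cache = []
        else:
            cache.append(line)
# flush the lines that belong to the last header
if cache:
    result.append('\n'.join(cache))
print(result)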
Or:
import collections
key = '<unknown>'  # required in case the first line is not '>...'
result = collections.defaultdict(list)
with open(input_file) as f:
    for line in f:
        line = line.rstrip('\n')  # drop the trailing newline
        if line.startswith('>'):
            key = line[1:]  # header without the leading '>'
        else:
            result[key].append(line)
print(result)
Output:
defaultdict(<class 'list'>, {'gene_name_1': ['FYCVLAHWG', 'GGGGGGGGG'], 'gene_name_4': ['FYCVLAHG'], 'gene_name_3': ['FYCVLAHWG'], 'gene_name_2': ['FYCVLAHWG'], 'gene_name_6': ['YCVLAHWG'], 'gene_name_5': ['FCVLAHWG']})
Answer 1 (score: 1)
Try this regex:
import re
pattern = r'(^>[^\n]*)([^>]*)'
flags = re.M|re.S
test_string = '''
>gene_name_1
FYCVLAHWG
GGGGGGGGG
>gene_name_2
FYCVLAHWG
>gene_name_3
FYCVLAHWG
>gene_name_4
FYCVLAHG
>gene_name_5
FCVLAHWG
>gene_name_6
YCVLAHWG
'''
print(list(re.findall(pattern, test_string, flags=flags)))
# [('>gene_name_1', '\nFYCVLAHWG\nGGGGGGGGG\n'), ('>gene_name_2', '\nFYCVLAHWG\n'), ('>gene_name_3', '\nFYCVLAHWG\n'), ('>gene_name_4', '\nFYCVLAHG\n'), ('>gene_name_5', '\nFCVLAHWG\n'), ('>gene_name_6', '\nYCVLAHWG\n')]
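If the (header, body) tuples are more convenient as a lookup table, they could be post-processed along these lines. This is only a sketch: the `sample` string and the `records` name are made up for illustration, and joining multi-line sequences into one string is an assumption about what is wanted, not something the question asks for.

```python
import re

# Same pattern and flags as in the answer above.
pattern = r'(^>[^\n]*)([^>]*)'
flags = re.M | re.S

# Hypothetical shortened input for illustration.
sample = ">gene_name_1\nFYCVLAHWG\nGGGGGGGGG\n>gene_name_2\nFYCVLAHWG\n"

records = {}
for header, body in re.findall(pattern, sample, flags=flags):
    name = header[1:]                       # drop the leading '>'
    records[name] = ''.join(body.split())   # strip newlines, join the lines

print(records)
```

Each key is a gene name without the ">" and each value is the sequence with its line breaks removed.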