Question

This is the structure of the txt file (a repeating unit of CDS, text, ORIGIN):

     CDS             311..>428
                     /gene="PNR"
                     /codon_start=1
                     /product="photoreceptor-specific nuclear receptor"
                     /protein_id="AAD28302.1"
                     /db_xref="GI:4726077"
                     /translation="METRPTALMSSTVAAAAPAAGAASRKESPGRWGLGEDPT"
 ORIGIN

I want to pull the text from 311..>428 through GEDPT" out as a string. My regex so far is:

compiler = re.compile(r"^\s+CDS\s+(.+)ORIGIN.+", re.DOTALL|re.MULTILINE)

Then I use a loop to append each string to a list:

for line in file:
    match = compiler.match(line)
    if match:
        list.append(str(match.group(1)))

But I keep getting an empty list! Any ideas why?

Any help is much appreciated, I'm a newbie!

Answer 1

I'm assuming file is a file pointer such as file = open('filename.txt'). If that's the case, then using:

for line in file:

will split the input on newlines and yield one line per iteration. So the first three lines will be:

1: '     CDS             311..>428\n'
2: '                     /gene="PNR"\n'
3: '                     /codon_start=1\n'
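A quick way to confirm this is to iterate over the sample the same way and try the question's pattern on each line. This is only a sketch: io.StringIO stands in for the opened file, and the sample text is abbreviated from the question.

```python
import io
import re

# Abbreviated sample; io.StringIO is a stand-in for open('filename.txt').
sample = (
    '     CDS             311..>428\n'
    '                     /gene="PNR"\n'
    '                     /codon_start=1\n'
    ' ORIGIN\n'
)

# The pattern from the question, unchanged.
compiler = re.compile(r"^\s+CDS\s+(.+)ORIGIN.+", re.DOTALL | re.MULTILINE)

# Iterating a file-like object yields one line at a time.
lines = list(io.StringIO(sample))

# No single line contains both CDS and ORIGIN, so every match attempt fails.
per_line = [compiler.match(line) for line in lines]
print(len(lines), per_line)
```

Every entry in per_line is None, which is why the list in the question never receives anything.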

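Because each line is handled separately, a multi-line pattern will never match unless the lines are combined. You may want to consider using: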

import re

compiler = re.compile(r"^\s+CDS\s+(.+?)ORIGIN", re.DOTALL|re.MULTILINE)
with open('filename.txt') as fp:
    all_text = fp.read()              # this reads all the text without splitting on newlines
matches = compiler.findall(all_text)  # returns a list of all matches
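To see the whole-text approach end to end without a file on disk, here is a minimal sketch that runs the same pattern over an in-memory string. The second record (gene "XYZ") and the sequence line under ORIGIN are made up for illustration, and the strip() step is an optional addition, not part of the answer above.

```python
import re

# Two abbreviated CDS...ORIGIN records; the second record is invented for this example.
all_text = (
    '     CDS             311..>428\n'
    '                     /gene="PNR"\n'
    '                     /translation="METRPTALMSSTVAAAAPAAGAASRKESPGRWGLGEDPT"\n'
    ' ORIGIN\n'
    '        1 acgt\n'
    '     CDS             1..30\n'
    '                     /gene="XYZ"\n'
    '                     /translation="MKV"\n'
    ' ORIGIN\n'
)

# Lazy (.+?) stops at the first ORIGIN after each CDS, so records stay separate.
compiler = re.compile(r"^\s+CDS\s+(.+?)ORIGIN", re.DOTALL | re.MULTILINE)

# With one capture group, findall returns the group-1 text of every match.
# strip() drops the newline and indentation left in front of ORIGIN.
records = [m.strip() for m in compiler.findall(all_text)]

for record in records:
    print(repr(record))
```

The lazy quantifier is the important difference from the pattern in the question: a greedy (.+) with re.DOTALL would run to the last ORIGIN in the file and merge all records into a single match.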

Python regex - capture text between two words as a string, then append to a list
