Question

I'm new to Python and want to use it to extract the text between two matching patterns on each line of a tab-delimited text file (mydata).

mydata.txt:

Sequence                                                                                                            tRNA    Bounds  tRNA    Anti    Intron Bounds   Cove
Name                                                                                                            tRNA #  Begin   End Type    Codon   Begin   End Score
--------                                                                                                        ------  ----    ------  ----    -----   -----   ----    ------
lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33                                                 1   1   71  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33                                                 1   1   73  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_72[locus_tag=SS1G_20130][db_xref=GeneID:33                                                 1   1   71  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_71[locus_tag=SS1G_20129][db_xref=GeneID:33                                                 1   1   72  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_62[locus_tag=SS1G_20127][db_xref=GeneID:33                                                 1   1   71  Pseudo  ??? 0   0   -1

The code I tried:

lines = [] #Declare an empty list named "lines"
with open('/media/owner/c3c5fbb4-73f6-45dc-a475-988ad914056e/phasing/trna/test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        # print(line)
        if line.strip() == "locus_tag=":  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file
        if line.strip() == "][db":
            break
        print(line)  # Line is extracted (or block_of_lines.append(line), etc.)

I want to grab the text between [locus_tag= and ][db_xre and get this as the result:

SS1G_20133
SS1G_20131
SS1G_20130
SS1G_20129
SS1G_20127

Answer 1

If I understand correctly, this should work for a given line of data:

data = line.split("locus_tag=")[1].split("][db_xref")[0]

The idea is to split the string on locus_tag= and take the second element, then split that on ][db_xref and take the first element.
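To make the two splits concrete, here is a minimal step-by-step sketch on one row copied from mydata.txt. The variable names `after_marker` and `locus_tag` are only illustrative, and the wide column padding of the real file is shortened to single tabs here.

```python
# One sample row from mydata.txt (column padding shortened to single tabs).
line = "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33\t1\t1\t71\tPseudo\t???\t0\t0\t-1"

# Step 1: split on the opening marker; element [1] is everything after it.
after_marker = line.split("locus_tag=")[1]

# Step 2: split the remainder on the closing marker; element [0] is everything before it.
locus_tag = after_marker.split("][db_xref")[0]

print(locus_tag)  # SS1G_20133
```

Note that indexing with [1] raises an IndexError on lines that do not contain locus_tag= (such as the header rows), so such lines need to be skipped first.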

If you need help with the outer loop, it could look like:

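# file_path is the path to your data file, e.g. mydata.txt
with open(file_path) as input_data:  # the with block closes the file when done
    for line in input_data:
        if "locus_tag=" in line:  # skip header rows that lack the marker
            data = line.split("locus_tag=")[1].split("][db_xref")[0]
            print(data)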
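If some lines might contain the opening marker but not the closing one, a slightly more defensive variant is possible. The following is only a sketch: the helper name `extract_between` and the sample rows are assumptions made for illustration, and it uses `str.partition` instead of `split` so that a missing marker yields `None` rather than an exception or a wrong value.

```python
def extract_between(line, start="[locus_tag=", end="][db_xref"):
    """Return the text between start and end, or None if either marker is missing.

    Hypothetical helper, not part of the original answer.
    """
    _, found_start, rest = line.partition(start)
    if not found_start:
        return None
    value, found_end, _ = rest.partition(end)
    return value if found_end else None


# Sample rows modelled on mydata.txt (padding shortened to single tabs).
sample_lines = [
    "Name\ttRNA #\tBegin\tEnd\tType\tCodon\tBegin\tEnd\tScore",
    "--------\t------\t----\t------\t----\t-----\t-----\t----\t------",
    "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33\t1\t1\t71\tPseudo\t???\t0\t0\t-1",
    "lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33\t1\t1\t73\tPseudo\t???\t0\t0\t-1",
]

tags = [tag for tag in map(extract_between, sample_lines) if tag is not None]
print(tags)  # ['SS1G_20133', 'SS1G_20131']
```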

Answer 2

You can use re.search with positive lookbehind and positive lookahead patterns:

import re
...
for line in input_data:
    match = re.search(r'(?<=\[locus_tag=).*(?=\]\[db_xre)', line)
    if match:
        print(match.group())
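As an alternative sketch (not the method given in this answer), a capturing group can do the same job without lookarounds. The pattern name `LOCUS_TAG` and the sample rows are assumptions for illustration, and the character class `[^\]]+` assumes the tag itself never contains a `]`. The lookaround pattern from the answer is run alongside it to show that both give the same result on this data.

```python
import re

# Capture everything after "[locus_tag=" up to (not including) the next "]".
LOCUS_TAG = re.compile(r"\[locus_tag=([^\]]+)\]")

# Sample rows modelled on mydata.txt (padding shortened to single tabs).
sample_lines = [
    "Name\ttRNA #\tBegin\tEnd\tType\tCodon\tBegin\tEnd\tScore",
    "lcl|NC_035155.1_gene_72[locus_tag=SS1G_20130][db_xref=GeneID:33\t1\t1\t71\tPseudo\t???\t0\t0\t-1",
    "lcl|NC_035155.1_gene_71[locus_tag=SS1G_20129][db_xref=GeneID:33\t1\t1\t72\tPseudo\t???\t0\t0\t-1",
]

group_tags = []
lookaround_tags = []
for line in sample_lines:
    match = LOCUS_TAG.search(line)
    if match:
        group_tags.append(match.group(1))  # group(1) is the captured tag only
    # The lookaround version from the answer, for comparison.
    match = re.search(r'(?<=\[locus_tag=).*(?=\]\[db_xre)', line)
    if match:
        lookaround_tags.append(match.group())

print(group_tags)  # ['SS1G_20130', 'SS1G_20129']
```

One difference worth knowing: `.*` in the lookaround version is greedy, so if a line ever contained `][db_xre` twice it would match up to the last occurrence, whereas the character class stops at the first `]`.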

How to extract text between matching patterns in Python
