How to extract text between matching patterns in Python

Time: 2019-03-15 17:37:23

Tags: python string

I'm new to Python and want to use it to extract the text between two matching patterns in each line of a tab-delimited text file (mydata).

mydata.txt:

Sequence                                                                                                            tRNA    Bounds  tRNA    Anti    Intron Bounds   Cove
Name                                                                                                            tRNA #  Begin   End Type    Codon   Begin   End Score
--------                                                                                                        ------  ----    ------  ----    -----   -----   ----    ------
lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33                                                 1   1   71  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33                                                 1   1   73  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_72[locus_tag=SS1G_20130][db_xref=GeneID:33                                                 1   1   71  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_71[locus_tag=SS1G_20129][db_xref=GeneID:33                                                 1   1   72  Pseudo  ??? 0   0   -1
lcl|NC_035155.1_gene_62[locus_tag=SS1G_20127][db_xref=GeneID:33                                                 1   1   71  Pseudo  ??? 0   0   -1

The code I tried:

lines = [] #Declare an empty list named "lines"
with open('/media/owner/c3c5fbb4-73f6-45dc-a475-988ad914056e/phasing/trna/test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        # print(line)
        if line.strip() == "locus_tag=":  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file
        if line.strip() == "][db":
            break
        print(line)  # Line is extracted (or block_of_lines.append(line), etc.)

I want to grab the text between [locus_tag= and ][db_xre and get the following as the result:

SS1G_20133
SS1G_20131
SS1G_20130
SS1G_20129
SS1G_20127

2 answers:

Answer 0 (score: 1)

If I understand correctly, this should work for a given line of your data:

data = line.split("locus_tag=")[1].split("][db_xref")[0]

The idea is to split the string on locus_tag= and take the second element, then split that on ][db_xref and take the first element.
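To make the two steps concrete, here is a small sketch that applies them to one line copied from the question's mydata.txt. The variable names after_marker and locus_tag are only for illustration.

```python
# First column of one data line from mydata.txt.
line = "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33"

# Step 1: split on the opening marker; element [1] is everything after it.
after_marker = line.split("locus_tag=")[1]

# Step 2: split the remainder on the closing marker; element [0] is everything before it.
locus_tag = after_marker.split("][db_xref")[0]

print(after_marker)  # SS1G_20133][db_xref=GeneID:33
print(locus_tag)     # SS1G_20133
```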

If you need help with the outer loop, it could look like this:

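file_path = "mydata.txt"  # placeholder: set this to the path of your input file
with open(file_path) as input_data:  # the with block closes the file when done
    for line in input_data:
        if "locus_tag=" in line:  # skip header lines that lack the marker
            data = line.split("locus_tag=")[1].split("][db_xref")[0]
            print(data)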
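Note that the chained split raises IndexError on a line without locus_tag= (such as the header lines of mydata.txt) if the membership check is left out. A more defensive variant could use str.partition, which never raises on a missing separator. This is only a sketch: the helper name extract_between and its parameters are hypothetical and not part of the answer above.

```python
def extract_between(line, start, end):
    """Return the text between start and end, or None if either marker is missing."""
    # partition returns (before, separator, after); separator is "" when not found.
    _, found_start, rest = line.partition(start)
    if not found_start:
        return None
    value, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return value


data_line = "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33"
header_line = "Name    tRNA #  Begin   End Type"

print(extract_between(data_line, "locus_tag=", "][db_xref"))    # SS1G_20133
print(extract_between(header_line, "locus_tag=", "][db_xref"))  # None
```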

Answer 1 (score: 1)

You can use re.search with positive lookbehind and positive lookahead patterns:

import re
...
for line in input_data:
    match = re.search(r'(?<=\[locus_tag=).*(?=\]\[db_xre)', line)
    if match:
        print(match.group())
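For a version that runs without the input file, the sketch below applies the same pattern to a few sample lines. The list sample_lines is made up from the question's data (columns shortened), and compiling the pattern up front is just a convenience. The lookbehind and lookahead only assert that the markers are present, so match.group() returns the text between them without the markers themselves.

```python
import re

# Same pattern as above, compiled once so it can be reused for every line.
pattern = re.compile(r'(?<=\[locus_tag=).*(?=\]\[db_xre)')

# Sample lines standing in for the file contents (one header, two data lines).
sample_lines = [
    "Name    tRNA #  Begin   End Type",
    "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33    1   1   71  Pseudo",
    "lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33    1   1   73  Pseudo",
]

results = []
for line in sample_lines:
    match = pattern.search(line)
    if match:  # header lines produce no match and are skipped
        results.append(match.group())

print(results)  # ['SS1G_20133', 'SS1G_20131']
```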