I'm new to Python and am trying to use it to extract the text between two matching patterns in each line of a tab-delimited text file (mydata).
mydata.txt:
Sequence tRNA Bounds tRNA Anti Intron Bounds Cove
Name tRNA # Begin End Type Codon Begin End Score
-------- ------ ---- ------ ---- ----- ----- ---- ------
lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33 1 1 73 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_72[locus_tag=SS1G_20130][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_71[locus_tag=SS1G_20129][db_xref=GeneID:33 1 1 72 Pseudo ??? 0 0 -1
lcl|NC_035155.1_gene_62[locus_tag=SS1G_20127][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1
The code I tried:
lines = []  # Declare an empty list named "lines"
with open('/media/owner/c3c5fbb4-73f6-45dc-a475-988ad914056e/phasing/trna/test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        # print(line)
        if line.strip() == "locus_tag=":  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file
        if line.strip() == "][db":
            break
        print(line)  # Line is extracted (or block_of_lines.append(line), etc.)
I want to grab the text between [locus_tag= and ][db_xre and get this as the result:
SS1G_20133
SS1G_20131
SS1G_20130
SS1G_20129
SS1G_20127
Answer 0 (score: 1)
If I understand correctly, this should work for a given line of data:
data = line.split("locus_tag=")[1].split("][db_xref")[0]
The idea is to split the string on locus_tag= and take the second element, then split that on ][db_xref and take the first element.
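To make the two steps concrete, here is a minimal sketch that runs them one at a time on a single row. The sample string is the first column of the first data row in the question; the variable names `after_marker` and `locus_tag` are only illustrative.

```python
# First column of one row from the question's mydata.txt.
line = "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33"

# Step 1: split on the opening marker; element [1] is everything after it.
after_marker = line.split("locus_tag=")[1]

# Step 2: split that on the closing marker; element [0] is everything before it.
locus_tag = after_marker.split("][db_xref")[0]

print(after_marker)
print(locus_tag)
```

Note that indexing with [1] raises an IndexError on lines that do not contain locus_tag=, such as the header rows, so such lines need to be skipped first.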
If you need help with the outer loop, it might look like this:
for line in open(file_path, 'r'):
    if "locus_tag" in line:
        data = line.split("locus_tag=")[1].split("][db_xref")[0]
        print(data)
Answer 1 (score: 1)
You can use re.search with a positive lookbehind and a positive lookahead pattern:
import re
...
for line in input_data:
    match = re.search(r'(?<=\[locus_tag=).*(?=\]\[db_xre)', line)
    if match:
        print(match.group())
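As a self-contained variation on the same idea, the sketch below uses a capture group instead of lookarounds and collects the values into a list. The helper name `extract_locus_tags` and the `sample` list are assumptions made for illustration; the rows are copied from the question. The character class `[^\]]+` stops at the first closing bracket, so the match cannot run past the tag.

```python
import re

# Variation on the answer's pattern: capture the tag instead of using lookarounds.
LOCUS_TAG = re.compile(r"\[locus_tag=([^\]]+)\]\[db_xref")


def extract_locus_tags(lines):
    """Return the locus_tag value from every line that contains one."""
    tags = []
    for line in lines:
        match = LOCUS_TAG.search(line)
        if match:  # header and separator rows have no tag and are skipped
            tags.append(match.group(1))
    return tags


# A header row plus two data rows from the question.
sample = [
    "Name tRNA # Begin End Type Codon Begin End Score",
    "lcl|NC_035155.1_gene_75[locus_tag=SS1G_20133][db_xref=GeneID:33 1 1 71 Pseudo ??? 0 0 -1",
    "lcl|NC_035155.1_gene_73[locus_tag=SS1G_20131][db_xref=GeneID:33 1 1 73 Pseudo ??? 0 0 -1",
]
print(extract_locus_tags(sample))
```

Because the function only iterates over its argument, an open file object can be passed in place of the list.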