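macOS, Python 2.7
I am trying to parse a .txt file and extract the strings I need to build a tab-delimited table. I will have to do this for many files, but I am having trouble selecting certain strings.
Here is a sample input file: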
# Assembly name: ASM1844v1
# Organism name: Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name: strain=ACICU
# Taxid: 405416
# BioSample: SAMN02603140
# BioProject: PRJNA17827
# Submitter: CNR - National Research Council
# Date: 2008-4-15
# Assembly type: n/a
# Release type: major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession RefSeq Unit Accession Assembly-Unit name
## GCA_000018455.1 GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
ANONYMOUS assembled-molecule na Chromosome CP000863.1 = NC_010611.1 Primary Assembly 3904116 na
pACICU1 assembled-molecule pACICU1 Plasmid CP000864.1 = NC_010605.1 Primary Assembly 28279 na
pACICU2 assembled-molecule pACICU2 Plasmid CP000865.1 = NC_010606.1 Primary Assembly 64366 na
So far my code looks like this, where Headstring holds the column headers:
# Open the input file for reading
InFile = open(InFileName, 'r')
#f = open(InFileName, 'r')
# Write the header
Headstring= "GenBank_Assembly_ID RefSeq_Assembly_ID Assembly_level Chromosome Plasmid Refseq_chromosome Refseq_plasmid1 Refseq_plasmid2 Refseq_plasmid3 Refseq_plasmid4 Refseq_plasmid5"
# Set up chromosome and plasmid count
ccount = 0
pcount = 0
# Look for corresponding data from each file
with open(InFileName, 'r') as searchfile:
    for line in searchfile:
        if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
            GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
            print GCA.group(1)
            GCA = GCA.group(1)
        if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
            GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
            print GCF.group(1)
            GCF = GCF.group(1)
        if re.search ( r'level: (.+$)', line, re.M|re.I):
            assembly = re.search( r'level: (.+$)', line, re.M|re.I)
            print assembly.group(1)
            assembly = assembly.group(1)
        if "Chromosome" in line:
            ccount += 1
            print ccount
        if "Plasmid" in line:
            pcount += 1
            print pcount
OutputString = "%s\t%s\t%s\t%s\t%s\t" % (GCA, GCF, assembly, ccount, pcount)
OutFile=open(OutFileName, 'w')
OutFile.write(Headstring+'\n'+OutputString)
InFile.close()
OutFile.close()
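My main problem is that I want to extract the strings NC_010611.1, NC_010605.1 and NC_010606.1 and put them on the same row, separated by tabs, so that they end up under the headers Refseq_chromosome, Refseq_plasmid1 and Refseq_plasmid2 respectively. However, I only want the script to search for them if assembly is "Chromosome" or "Complete Genome". I am not sure how to search for the strings only when that condition holds.
I know the regular expression to capture these strings could be something like '=\t(\w+..)', but that is as far as I have got.
I am very new to Python, so an explanation would be great. Thanks in advance!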
Answer 0 (score: 3)
Take a look at this example:
import re
InFileName = 'YOUR_INPUT_FILE_NAME'
OutFileName = 'YOUR_OUTPUT_FILE_NAME'
# Write the header
Headstring= "GenBank_Assembly_ID\tRefSeq_Assembly_ID\tAssembly_level\tChromosome\tPlasmid\tRefseq_chromosome\tRefseq_plasmid1\tRefseq_plasmid2\tRefseq_plasmid3\tRefseq_plasmid4\tRefseq_plasmid5"
# Look for corresponding data from each file
with open(InFileName, 'r') as InFile, open(OutFileName, 'w') as OutFile:
    chromosomes = []
    plasmids = []
    for line in InFile:
        if line.lstrip()[0] == '#':
            # Process header part of the file differently from the data part
            if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
                GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
                print GCA.group(1)
                GCA = GCA.group(1)
            if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
                GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
                print GCF.group(1)
                GCF = GCF.group(1)
            if re.search ( r'level: (.+$)', line, re.M|re.I):
                assembly = re.search( r'level: (.+$)', line, re.M|re.I)
                print assembly.group(1)
                assembly = assembly.group(1)
        elif assembly in ['Chromosome', 'Complete Genome']:
            # Process each data line separately
            split_line = line.split()
            Type = split_line[3]
            RefSeq_Accn = split_line[6]
            if Type == "Chromosome":
                chromosomes.append(RefSeq_Accn)
            if Type == "Plasmid":
                plasmids.append(RefSeq_Accn)
    # Merge names of up to N chromosomes
    N = 1
    cstr = ''
    for i in range(N):
        if i < len(chromosomes):
            nextChromosome = chromosomes[i]
        else:
            nextChromosome = ''
        cstr += '\t' + nextChromosome
    # Merge names of up to M plasmids
    M = 5
    pstr = ''
    for i in range(M):
        if i < len(plasmids):
            nextPlasmid = plasmids[i]
        else:
            nextPlasmid = ''
        pstr += '\t' + nextPlasmid
    OutputString = "%s\t%s\t%s\t%s\t%s" % (GCA, GCF, assembly, len(chromosomes), len(plasmids))
    OutputString += cstr
    OutputString += pstr
    OutFile.write(Headstring+'\n'+OutputString)
Input:
# Assembly name: ASM1844v1
# Organism name: Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name: strain=ACICU
# Taxid: 405416
# BioSample: SAMN02603140
# BioProject: PRJNA17827
# Submitter: CNR - National Research Council
# Date: 2008-4-15
# Assembly type: n/a
# Release type: major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession RefSeq Unit Accession Assembly-Unit name
## GCA_000018455.1 GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
ANONYMOUS assembled-molecule na Chromosome CP000863.1 = NC_010611.1 Primary Assembly 3904116 na
pACICU1 assembled-molecule pACICU1 Plasmid CP000864.1 = NC_010605.1 Primary Assembly 28279 na
pACICU2 assembled-molecule pACICU2 Plasmid CP000865.1 = NC_010606.1 Primary Assembly 64366 na
Output:
GenBank_Assembly_ID RefSeq_Assembly_ID Assembly_level Chromosome Plasmid Refseq_chromosome Refseq_plasmid1 Refseq_plasmid2 Refseq_plasmid3 Refseq_plasmid4 Refseq_plasmid5
GCA_000018445.1 GCF_000018445.1 Complete Genome 1 2 NC_010611.1 NC_010605.1 NC_010606.1
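The two padding loops in the script above (up to N chromosomes, up to M plasmids) can also be written more compactly. The following is only a sketch of that idea, not part of the original answer: pad and build_row are hypothetical helper names, and the column limits of 1 chromosome and 5 plasmids are taken from the header string.

```python
def pad(values, width):
    """Return exactly `width` items: drop extras, fill gaps with ''."""
    return (list(values) + [''] * width)[:width]


def build_row(gca, gcf, level, chromosomes, plasmids,
              max_chrom=1, max_plasmid=5):
    """Build one tab-separated output row (hypothetical helper)."""
    fields = [gca, gcf, level, str(len(chromosomes)), str(len(plasmids))]
    fields += pad(chromosomes, max_chrom)   # Refseq_chromosome column
    fields += pad(plasmids, max_plasmid)    # Refseq_plasmid1..5 columns
    return '\t'.join(fields)


row = build_row('GCA_000018445.1', 'GCF_000018445.1', 'Complete Genome',
                ['NC_010611.1'], ['NC_010605.1', 'NC_010606.1'])
print(row)
```

Because every row always has the same number of fields, rows from many files can be appended under a single header line without the columns drifting.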
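Main differences from your script:
- I use the condition
    if line.lstrip()[0] == '#'
  to process the "header" lines (lines starting with a hash character) differently from the "table rows" at the bottom (the lines that actually contain the data for each sequence).
- I use the condition
    elif assembly in ['Chromosome', 'Complete Genome']
  - this is the condition you specified in your question.
- I split each line into values with
    split_line = line.split()
  After that,
    Type = split_line[3]
  gives me the type (the fourth column of the table data), and
    RefSeq_Accn = split_line[6]
  gives me the seventh column of the table.
Answer 1 (score: 0)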
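You could first read all the data into a pandas DataFrame and work from there. You can then process one column conditionally on another column (whichever one contains 'NC_010611.1'). See the example here: Pandas conditional creation of a series/dataframe column.
It may be possible to get what you want in a single pass over the data, but it will probably be easier to write and read if you make two passes over the data.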
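The two-pass idea can be sketched as follows. This version deliberately uses only the standard library rather than pandas, and it is an illustration under assumptions: parse_report is a hypothetical name, the table columns are assumed to be tab-separated in the real file, and the header keys ("Assembly level" and so on) are taken from the sample input in the question.

```python
import re


def parse_report(lines):
    """Two-pass parse of an assembly report (hypothetical helper)."""
    lines = [ln.rstrip('\r\n') for ln in lines if ln.strip()]  # skip blanks

    # Pass 1: collect "# key: value" header lines into a dict.
    header = {}
    for ln in lines:
        m = re.match(r'#\s*([^:#]+):\s*(.+)$', ln)
        if m:
            header[m.group(1).strip()] = m.group(2).strip()

    # Pass 2: read the table rows, but only for the wanted assembly levels.
    chromosomes, plasmids = [], []
    if header.get('Assembly level') in ('Chromosome', 'Complete Genome'):
        for ln in lines:
            if ln.lstrip().startswith('#'):
                continue
            cols = ln.split('\t')  # assumption: tab-separated columns
            if len(cols) < 7:
                continue           # not a complete table row
            if cols[3] == 'Chromosome':
                chromosomes.append(cols[6])
            elif cols[3] == 'Plasmid':
                plasmids.append(cols[6])
    return header, chromosomes, plasmids


sample = [
    '# Assembly level: Complete Genome\n',
    '# GenBank assembly accession: GCA_000018445.1\n',
    '# RefSeq assembly accession: GCF_000018445.1\n',
    '## GCA_000018455.1\tGCF_000018455.1\tPrimary Assembly\n',
    '\n',
    'ANONYMOUS\tassembled-molecule\tna\tChromosome\tCP000863.1\t=\tNC_010611.1\tPrimary Assembly\t3904116\tna\n',
    'pACICU1\tassembled-molecule\tpACICU1\tPlasmid\tCP000864.1\t=\tNC_010605.1\tPrimary Assembly\t28279\tna\n',
    'pACICU2\tassembled-molecule\tpACICU2\tPlasmid\tCP000865.1\t=\tNC_010606.1\tPrimary Assembly\t64366\tna\n',
]
header, chromosomes, plasmids = parse_report(sample)
print('%s %s %s' % (header['GenBank assembly accession'], chromosomes, plasmids))
```

Splitting on tabs rather than on any whitespace keeps multi-word fields such as "Primary Assembly" in a single column, and collecting the whole header first means the assembly level is known before any table row is examined.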