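Sorry about the title; I didn't know how else to phrase my question.
I wrote a script that extracts data from a fastq file (a plain-text file of genomic reads). In each record, the first line is a header and the second line is a string of bases; the third and fourth lines are not needed.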
filename = 'C0_GGCTAC_R1_no_adapter_trimming.fastq'
new_filename = filename[:-9] + '_new.fastq'
with open(filename) as f_obj:
    file_contents = f_obj.readlines()
extracted_lines = ''
line_count = 0
# Pull header and base lines
for line in file_contents:
    line_count += 1
    # Headers
    if line_count == 1:
        extracted_lines += line
    # Reads ending in A
    elif line_count == 2 and line[-2] == 'A':
        extracted_lines += line
    # Reads ending in G
    elif line_count == 2 and line[-2] == 'G':
        extracted_lines += line
    # Reset counter
    elif line_count == 4:
        line_count = 0
with open(new_filename, 'w') as f_obj:
    f_obj.write(extracted_lines)
print(new_filename + " was created.")
The script is meant to extract each read's header together with its string of bases whenever the bases end in A or G. An example of the input file:
@HWI-D00461:137:C9H2FACXX:3:1101:1239:1968 1:N:0:GGCTAC
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACAGGGTGGAAGAGAGATTTGA
+
#1=BDDFFHHHFHIJJJJJJJJJJJJJJJJJJJJJIJJIJJJJJHJIIJHGIJJJJJJIHJJBGHJHIIJJJHHHHFFFFEEEDD;?BACDDDA?@CDDDC
@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACAGA
+
#1=DDDFDFHHHGHIIGJJJJHIJIHHDIHHIJGGEI@GFGHIHIJHEFHIIIIGIJGHHGECFGIDHGIHIIEGIIJHHEEFFF7?ACEECCBBDEDDDC
The output file looks like this:
@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACAGA
@HWI-D00461:137:C9H2FACXX:3:1101:1200:1972 1:N:0:GGCTAC
@HWI-D00461:137:C9H2FACXX:3:1101:1087:1973 1:N:0:GGCTAC
NTAATCCAACTAACTAAAAATAAAAAGATTCAAATAGGTACAGAAAACAATGAAGGTGTAGAGGTGAGAAATCAACAGGATGTTCAGAAGCCTGTGTATGA
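Although this contains all the data I need, it also pulls out every header line (the ones starting with "@"), which is unnecessary.
How can I modify my code so that it pulls out a header line only if it is followed by a string of bases ending in A or G?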
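Answer 0 (score: 1)
The problem is that you are adding the id for every record, not just for the records you are interested in. A quick solution is to keep the id in a variable and add it only when necessary: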
filename = 'C0_GGCTAC_R1_no_adapter_trimming.fastq'
new_filename = filename[:-9] + '_new.fastq'
with open(filename) as f_obj:
    file_contents = f_obj.readlines()
extracted_lines = ''
line_count = 0
# Pull header and base lines
for line in file_contents:
    line_count += 1
    # Headers
    if line_count == 1:
        id_string = line
    # Reads ending in A
    elif line_count == 2 and line[-2] == 'A':
        extracted_lines += id_string
        extracted_lines += line
    # Reads ending in G
    elif line_count == 2 and line[-2] == 'G':
        extracted_lines += id_string
        extracted_lines += line
    # Reset counter
    elif line_count == 4:
        line_count = 0
with open(new_filename, 'w') as f_obj:
    f_obj.write(extracted_lines)
print(new_filename + " was created.")
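I should also say that the code is not very efficient, particularly in terms of memory: you are reading a (usually) very large file into memory, but you only need one record at a time.
A minor point is that the conditions can be condensed, and you can use the modulo operator to tell which type of line you are on: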
filename = 'C0_GGCTAC_R1_no_adapter_trimming.fastq'
new_filename = filename[:-9] + '_new.fastq'
with open(filename) as in_f_obj, open(new_filename, 'w') as out_f_obj:
    # Process the file one line at a time
    line_count = 0
    for line in in_f_obj:
        line_count += 1
        # Extract the information for each record
        if line_count % 4 == 1:
            id_string = line
        elif line_count % 4 == 2:
            seq = line
        elif line_count % 4 == 3:
            extra = line
        elif line_count % 4 == 0:
            # Last part of the record (every fourth line gives remainder 0).
            # Here we have all the information and we can decide if we want
            # to output something and what we want to output
            qual = line
            # seq still ends with its newline, so the last base is seq[-2]
            if seq[-2] == 'A' or seq[-2] == 'G':
                out_f_obj.write(id_string)
                out_f_obj.write(seq)
print(new_filename + " was created.")
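In this code you keep only one record in memory at a time. The line_count variable holds the actual number of lines processed, and you have all of the record's data from the input available, so it is easy to change the output later.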
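To illustrate why having the whole record available makes the output easy to change, here is a minimal sketch that moves the modulo counter into a generator and leaves the filtering to the caller. The name iter_records and the short sample reads are made up for illustration and are not part of the original code; the sketch assumes every record has exactly four lines.

```python
def iter_records(lines):
    """Yield (id_string, seq, extra, qual) for each four-line record.

    Hypothetical helper: lines can be an open file or any iterable of
    strings. Lines are yielded unchanged, newline included.
    """
    line_count = 0
    for line in lines:
        line_count += 1
        position = line_count % 4
        if position == 1:
            id_string = line
        elif position == 2:
            seq = line
        elif position == 3:
            extra = line
        else:
            # position == 0: fourth line, so the record is complete
            qual = line
            yield id_string, seq, extra, qual


# Made-up sample data: two short records
sample = [
    "@read1\n", "NTGA\n", "+\n", "#1=B\n",
    "@read2\n", "NAAT\n", "+\n", "#1=D\n",
]

# The filter now lives outside the parsing code, so changing what gets
# written only means changing this one expression.
kept = [(rid, seq) for rid, seq, _, _ in iter_records(sample)
        if seq[-2] in ("A", "G")]
print(kept)
```

With a real file you would pass the open file object instead of the sample list, and only one record would be held in memory at a time.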
One extra detail I would add: strip the newline from each line as you read it, and add it back as needed when writing:
# Extract the information for each record
if line_count % 4 == 1:
    id_string = line.rstrip()
elif line_count % 4 == 2:
    seq = line.rstrip()
elif line_count % 4 == 3:
    extra = line.rstrip()
elif line_count % 4 == 0:
    # Last part of the record (every fourth line gives remainder 0).
    # Here we have all the information and we can decide if we want
    # to output something and what we want to output
    qual = line.rstrip()
    # The newline is gone, so the last base is now seq[-1]
    if seq[-1] == 'A' or seq[-1] == 'G':
        out_f_obj.write("{}\n{}\n".format(id_string, seq))
That way your data is clean, with no newline formatting carried over from the input file.
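A small sketch of why stripping first is safer than indexing with line[-2]: that index assumes every line ends with exactly one newline character, which fails, for example, when the last line of the file has no trailing newline. The helper name ends_in_a_or_g is made up for this illustration.

```python
def ends_in_a_or_g(line):
    """Return True if the bases on this line end in A or G."""
    # rstrip("\r\n") removes "\n", "\r\n", or nothing at all, so whatever
    # is left ends with a base (or is empty, which endswith handles).
    return line.rstrip("\r\n").endswith(("A", "G"))


# "ACTA" has no trailing newline: "ACTA"[-2] is "T", so the line[-2]
# test would wrongly skip this read.
for raw in ["ACGA\n", "ACGG\r\n", "ACTA", "ACGT\n", "\n"]:
    print(repr(raw), ends_in_a_or_g(raw))
```

Using str.endswith with a tuple also avoids an IndexError on an empty line, which seq[-1] would raise.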
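Answer 1 (score: 0)
I think your task would be easier if you iterated over the file four lines at a time instead of line by line, at least assuming that there really are always four lines that belong together. You can then filter on the bases before appending the corresponding header line, for example: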
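# file_contents is the list returned by f_obj.readlines() in the question
extracted_lines = []
for i in range(0, len(file_contents), 4):
    # Unpacking fails if the last record has fewer than four lines
    header, bases, comment1, comment2 = file_contents[i:i+4]
    # readlines() keeps the trailing newline, so strip it before testing
    if bases.rstrip()[-1] in ["A", "G"]:
        extracted_lines.append(header)
        extracted_lines.append(bases)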
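The same four-lines-at-a-time idea can also be sketched without reading the whole file into a list first, by letting zip pull four consecutive lines from a single iterator. This is only an illustration under the same assumption that records are always four lines long; the function name filter_reads and the sample reads are made up.

```python
def filter_reads(lines):
    """Return header and base lines for reads whose bases end in A or G."""
    extracted_lines = []
    line_iter = iter(lines)
    # Passing the same iterator four times makes zip consume four
    # consecutive lines per step. zip stops at the shortest input, so a
    # truncated final record is dropped silently rather than raising.
    for header, bases, comment1, comment2 in zip(line_iter, line_iter,
                                                 line_iter, line_iter):
        if bases.rstrip("\r\n").endswith(("A", "G")):
            extracted_lines.append(header)
            extracted_lines.append(bases)
    return extracted_lines


# Made-up sample data: two complete records and one truncated record
sample = [
    "@r1\n", "ACGA\n", "+\n", "IIII\n",
    "@r2\n", "ACGT\n", "+\n", "IIII\n",
    "@r3\n", "TTGG\n", "+\n",
]
print(filter_reads(sample))
```

Because lines can be an open file object, this variant reads one record at a time. The trade-off is that an incomplete last record goes unnoticed, so add an explicit check if that matters for your data.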