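My lab generates very large files of mass spectrometry data. After an update to the manufacturer's software, some of the data is being written out with duplicated content, like this: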
BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
120.0028 2794.253
---lots more numbers of this format--
END IONS
BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
120.0028 2794.253
---lots more duplicate numbers---
END IONS
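All chunks have this format. I tried to write a program that reads the whole file (1-2 million lines), puts the lines in a set, and compares each new line against that set to see whether it has already been copied. The resulting array of lines is then written to a new file. Duplicate chunks are supposed to be skipped in the conditional statement, but when I run the program that branch is never entered, and every line that was read is printed: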
print('Enter file name to be cleaned (including extension, must be in same folder)')
fileinput = raw_input()
print('Enter output file name including extension')
fileoutput = raw_input()
with open (fileoutput, 'w') as fo:
    with open(fileinput) as f:
        largearray=[]
        j=0
        linecount=0
        #read file over, append array
        for line in f:
            largearray.append(line)
            linecount+=1
        while j<linecount:
            #initialize set
            seen = set()
            if largearray[j] not in seen:
                seen.add(largearray[j])
            # if the first line of the next chunk is a duplicate:
            if 'BEGIN' in largearray[j] and largearray[j+5] in seen:
                while 'END IONS' not in largearray[j]:
                    j+=1 #skip through all lines in the array until the next chunk is reached
            print('writing: ',largearray[j])
            fo.write(largearray[j])
            j+=1
Any help is greatly appreciated.
Answer 0: (score: 0)
So just to clarify,
BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
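is repeated for the duplicate numbers and so on?
If so, you can check whether these header lines are repeated and, if they are, skip ahead to the next END IONS.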
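The idea above could be sketched in Python roughly as follows. This is only a minimal sketch: it assumes that every block has exactly the four header lines shown (TITLE, RTINSECONDS, PEPMASS, CHARGE) directly after BEGIN IONS and that these lines identify a block uniquely. The name dedupe_by_header and the header_size parameter are hypothetical.

```python
def dedupe_by_header(lines, header_size=4):
    """Yield lines, dropping any block whose header was already seen.

    Assumption: the `header_size` lines that follow 'BEGIN IONS'
    identify a block uniquely.
    """
    seen_headers = set()
    lines = iter(lines)
    for line in lines:
        if line.strip() != 'BEGIN IONS':
            yield line              # text outside any block passes through
            continue
        # Read the header lines that follow BEGIN IONS.
        header = []
        for _ in range(header_size):
            nxt = next(lines, None)
            if nxt is None:         # file ended in the middle of a header
                break
            header.append(nxt)
        key = tuple(h.strip() for h in header)
        duplicate = key in seen_headers
        seen_headers.add(key)
        if not duplicate:
            yield line
            for h in header:
                yield h
        # Copy or skip the rest of the block, up to and including END IONS.
        for body in lines:
            if not duplicate:
                yield body
            if body.strip() == 'END IONS':
                break
```

Because it is a generator over any iterable of lines, it can be fed an open file object directly, and the result can be passed to writelines on the output file.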
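Answer 1: (score: 0)
The reason the duplicates are not skipped is this line:
seen = set()
It is in the wrong place. Inside the loop, the set is emptied on every iteration, so it never remembers earlier lines. If it is moved out of the loop, the code works as expected: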
with open (fileoutput, 'w') as fo:
    with open(fileinput) as f:
        largearray=list(f) #read file
        seen = set() #initialize set before loop
        j=0
        while j<len(largearray):
            if largearray[j] not in seen:
                seen.add(largearray[j])
            # if the first line of the next chunk is a duplicate:
            if 'BEGIN' in largearray[j] and largearray[j+5] in seen:
                while 'END IONS' not in largearray[j]:
                    j+=1 #skip through all lines in the array until the next chunk is reached
                j+=1 # Skip over `END IONS`
            else:
                print('writing: ',largearray[j])
                fo.write(largearray[j])
                j+=1
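To see why the placement of seen = set() matters, here is a minimal sketch on toy data (the function names are hypothetical). When the set is recreated on every iteration, it is always empty at the moment of the membership test, so nothing is ever recognized as a duplicate.

```python
def count_duplicates_set_inside(items):
    """Buggy variant: the set is recreated on every iteration."""
    duplicates = 0
    for item in items:
        seen = set()            # reset each time, so it is always empty here
        if item in seen:
            duplicates += 1
        seen.add(item)
    return duplicates


def count_duplicates_set_outside(items):
    """Fixed variant: the set is created once and persists across iterations."""
    seen = set()
    duplicates = 0
    for item in items:
        if item in seen:
            duplicates += 1
        seen.add(item)
    return duplicates


lines = ['a', 'b', 'a', 'a']
print(count_duplicates_set_inside(lines))   # 0
print(count_duplicates_set_outside(lines))  # 2
```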
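I made two other adjustments:
The loop over the lines of f that saved them in a list was unnecessary. It was replaced with:
largearray=list(f)
Ideally, to handle large files, we would not read the whole file at once but only one BEGIN/END block at a time. I will leave that as an exercise for the reader.
The original code printed END IONS even for duplicate sections. This is avoided by (a) incrementing j once more, and (b) using an else clause so that only non-duplicate sections are printed.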
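One possible way to approach the exercise mentioned above is a generator that yields one BEGIN/END block at a time, so the input is never loaded as a whole. This is a sketch rather than the answer's own code: iter_blocks and unique_blocks are hypothetical names, and two blocks count as duplicates only when all of their lines match, which is stricter than the single-line check used above. Note that the set still keeps one copy of every distinct block.

```python
def iter_blocks(lines):
    """Yield each BEGIN IONS ... END IONS block as a list of lines."""
    block = []
    for line in lines:
        block.append(line)
        if line.strip() == 'END IONS':
            yield block
            block = []
    if block:                   # trailing lines without a closing END IONS
        yield block


def unique_blocks(lines):
    """Yield only the first occurrence of each block, in input order."""
    seen = set()
    for block in iter_blocks(lines):
        key = tuple(block)      # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            yield block
```

With the file objects f and fo from the code above, each block returned by unique_blocks(f) could be written with fo.writelines(block).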
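awk
The same problem can be solved with a one-line awk command:
awk -F'\n' -v RS="BEGIN IONS\n" '$5 in seen || NF==0 {next} {seen[$5]++; printf "%s%s", RS, $0}' infile >outfile
Explanation:
-F'\n' -v RS="BEGIN IONS\n"
awk reads one record at a time. Here, the record separator is BEGIN IONS followed by a newline, so each record is the text of one block after its BEGIN IONS line. A multi-character RS is supported by GNU awk but is not guaranteed by POSIX. awk then splits each record into fields. Here the field separator is a newline, so each line becomes a field.
$5 in seen || NF==0 {next}
If the fifth line of this record has already been seen, we skip the remaining commands and jump to the next record. We do the same for any empty record that contains no lines.
seen[$5]++; printf "%s%s", RS, $0
If we reach this command, the record has not been seen before. We add its fifth line to the array seen and print the record, preceded by the BEGIN IONS line that awk removed as the separator. printf writes the two strings back to back, so no extra space or blank line is inserted.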
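Answer 2: (score: 0)
If the file is large, you should read it line by line and keep only the data you are interested in. Here is a line-by-line approach: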
# fileinput and fileoutput are the file names read in the question's code
end_chunk = 'END IONS'
already_read_chunks = set()
with open(fileinput) as f_in:
    current_chunk = []
    for line in f_in: #read iteratively, save only the data you need
        line = line.strip() #remove leading and trailing whitespace
        if line: #skip empty lines
            current_chunk.append(line)
            if line == end_chunk:
                entire_chunk = '\n'.join(current_chunk) #rebuild chunk as string
                if entire_chunk not in already_read_chunks: #check its existence
                    already_read_chunks.add(entire_chunk) #add if we haven't read it before
                current_chunk = [] #reset current_chunk to start the next chunk
with open (fileoutput, 'w') as f_out:
    #note: a set is unordered, so chunks may be written in a different order than they were read
    for chunk in already_read_chunks:
        f_out.write(chunk)
        f_out.write('\n')
        f_out.write('\n')
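The approach above writes the chunks by iterating over a set, so the output order is not guaranteed to match the input, and every distinct chunk stays in memory until the end. A possible variation is sketched below. It writes each new chunk as soon as it is complete and remembers only a digest of it, on the assumption that a SHA-1 digest of the chunk text is an acceptable stand-in for the text itself. The function name dedupe_file is hypothetical.

```python
import hashlib


def dedupe_file(in_path, out_path, end_chunk='END IONS'):
    """Copy in_path to out_path, keeping the first copy of each chunk.

    Returns the number of chunks written.
    """
    seen_digests = set()
    written = 0
    with open(in_path) as f_in, open(out_path, 'w') as f_out:
        current_chunk = []
        for line in f_in:
            line = line.strip()
            if not line:            # skip empty lines, as in the code above
                continue
            current_chunk.append(line)
            if line == end_chunk:
                text = '\n'.join(current_chunk)
                # Assumption: digest collisions are negligible for this data.
                digest = hashlib.sha1(text.encode('utf-8')).digest()
                if digest not in seen_digests:
                    seen_digests.add(digest)
                    f_out.write(text + '\n\n')   # blank line after each chunk
                    written += 1
                current_chunk = []
    return written
```

Since chunks are written in the order they are first encountered, the output keeps the order of the input file.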