Question

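My lab generates very large files of mass-spectrometry data. After a software update from the manufacturer, some of the data was written out with duplicated content, like this: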

BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
120.0028    2794.253
---lots more numbers of this format--
END IONS

BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+
120.0028    2794.253
---lots more duplicate numbers---
END IONS

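All chunks have this format. I tried writing a program that reads the whole file (1-2 million lines), puts the lines in a set, and compares each new line against that set to see whether it has already been copied. The resulting array of lines is then written to a new file. Duplicate chunks should be skipped inside the conditional, but when I run the program the conditional is never entered and every line that was read is written out: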

print('Enter file name to be cleaned (including extension, must be in same folder)')
fileinput = raw_input()
print('Enter output file name including extension')
fileoutput = raw_input()

with open (fileoutput, 'w') as fo:
    with open(fileinput) as f:
        largearray=[]
        j=0
        linecount=0
    #read file over, append array
        for line in f:
            largearray.append(line)
            linecount+=1
        while j<linecount:
    #initialize set
            seen = set()
            if largearray[j] not in seen:
                seen.add(largearray[j])
    # if the first line of the next chunk is a duplicate:
            if 'BEGIN' in largearray[j] and largearray[j+5] in seen: 
                while 'END IONS' not in largearray[j]:
                    j+=1 #skip through all lines in the array until the next chunk is reached
            print('writing: ',largearray[j])
            fo.write(largearray[j])
            j+=1

Any help is greatly appreciated.

Answer 1

So, just to clarify: does this header

BEGIN IONS
TITLE=IgA_OTHCD_uni.3.3.2
RTINSECONDS=0.6932462
PEPMASS=702.4431
CHARGE=19+

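repeat along with the duplicated numbers?

If so, you can check whether this initial section is a repeat and, if it is, skip ahead to the next END IONS.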
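
A minimal sketch of that idea in Python. It assumes the header is the run of `KEY=value` lines directly after `BEGIN IONS`, and that two chunks with identical headers are duplicates; the question does not confirm either point. The name `dedupe_by_header` is made up for illustration.

```python
def dedupe_by_header(lines):
    """Yield lines, dropping any chunk whose header was already seen.

    Assumption: the header is the run of KEY=value lines that directly
    follows 'BEGIN IONS', and identical headers mean identical chunks.
    """
    seen_headers = set()
    header = []          # 'BEGIN IONS' plus the header lines read so far
    in_header = False    # True between 'BEGIN IONS' and the first peak line
    skipping = False     # True while inside a duplicate chunk

    for line in lines:
        text = line.strip()
        if text == 'BEGIN IONS':
            header, in_header, skipping = [line], True, False
            continue
        if in_header:
            if '=' in text:
                header.append(line)
                continue
            # First non-header line: the header is complete, so decide now.
            in_header = False
            key = tuple(h.strip() for h in header[1:])
            skipping = key in seen_headers
            if not skipping:
                seen_headers.add(key)
                for h in header:
                    yield h
        if skipping:
            if text == 'END IONS':
                skipping = False   # the duplicate chunk ends here
            continue
        yield line
```

Because it is a generator, it can be fed an open file and drained straight into the output, for example `fo.writelines(dedupe_by_header(f))`, without holding the whole file in memory.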

Answer 2

The reason it does not skip the duplicates is this line:

seen = set()

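It is in the wrong place. Because it sits inside the loop, the set is reset to empty on every iteration, so no line is ever found in it. If it is moved out of the loop, the code works as intended: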

with open(fileoutput, 'w') as fo:
    with open(fileinput) as f:
        largearray = list(f)  # read the whole file into a list of lines
        seen = set()  # initialize the set once, before the loop
        j = 0
        while j < len(largearray):
            if largearray[j] not in seen:
                seen.add(largearray[j])
            # if the first data line of this chunk was seen before, the chunk is a duplicate:
            if 'BEGIN' in largearray[j] and j + 5 < len(largearray) and largearray[j+5] in seen:
                while j < len(largearray) and 'END IONS' not in largearray[j]:
                    j += 1  # skip through the lines of the duplicate chunk
                j += 1  # skip over `END IONS` as well
            else:
                print('writing: ', largearray[j])
                fo.write(largearray[j])
                j += 1

I made two other changes:

Looping over the lines of f just to store them in a list is unnecessary. That loop was replaced with:
```
largearray=list(f)
```
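Ideally, to handle large files, we would not read the whole file at once but only one BEGIN/END block at a time. I will leave that as an exercise for the reader.
The original code wrote END IONS even for duplicate sections. This is avoided by (a) incrementing j once more after the skip loop, and (b) using an else clause so that only lines from non-duplicate sections are written.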
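
The block-at-a-time reading left as an exercise might look something like the following sketch. `iter_blocks` and `write_unique_blocks` are illustrative names, not part of the answer. The duplicate test reuses the same key as the code above, the fifth line after `BEGIN IONS`, which is assumed to be the first line of peak data and to identify the block.

```python
def iter_blocks(lines):
    """Yield each BEGIN IONS ... END IONS block as a list of lines."""
    block = []
    for line in lines:
        if 'BEGIN IONS' in line:
            block = [line]            # start a new block
        elif block:
            block.append(line)
            if 'END IONS' in line:
                yield block
                block = []            # lines between blocks are ignored


def write_unique_blocks(lines, out):
    """Write each block to `out` unless its key line was seen before."""
    seen = set()
    for block in iter_blocks(lines):
        # Assumption: index 5 of a block is its first peak line and
        # identifies the block; fall back to the whole block if it is short.
        key = block[5] if len(block) > 5 else ''.join(block)
        if key not in seen:
            seen.add(key)
            out.writelines(block)
            out.write('\n')           # blank line between blocks
```

With the two open file objects from the code above, the call would be `write_unique_blocks(f, fo)`. Only one block plus the set of key lines is held in memory at any time.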

Using `awk`

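The same problem can be solved with an awk one-liner (a multi-character RS is not portable to every awk; gawk and mawk accept it):

awk -F'\n' -v RS="BEGIN IONS\n" '$5 in seen || NF==0 {next;} {seen[$5]++; print RS $0}' infile >outfile

Explanation:

-F'\n' -v RS="BEGIN IONS\n"

awk reads one record at a time. Here, RS makes BEGIN IONS followed by a newline the record separator, so each record is the text that follows one BEGIN IONS line. awk then splits each record into fields. Because the field separator is a newline, each line becomes one field.

$5 in seen || NF==0 {next;}

If the fifth line of this record (the first line of peak data) has already been seen, we skip the remaining commands and jump to the next record. We do the same for any empty record that contains no lines, such as the one before the first BEGIN IONS.

seen[$5]++; print RS $0

If we reach this command, the record has not been seen before. We add its fifth line to the array seen and print the record with RS put back in front of it. RS and $0 are concatenated rather than separated by a comma, because a comma would insert the output field separator (a space) in front of the first line of the record.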
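
For readers less familiar with awk, the record and field logic can be mirrored roughly in Python. This is only an illustration of what the one-liner does, not a replacement for it: the function name `awk_like_dedupe` is invented here, it reads the whole text into memory, and unlike awk's print it does not append an extra newline after each record.

```python
def awk_like_dedupe(text, record_sep='BEGIN IONS\n'):
    """Split text into records and keep the first record per fifth line."""
    seen = set()
    kept = []
    for record in text.split(record_sep):    # RS="BEGIN IONS\n"
        if not record:                        # NF==0: empty record
            continue
        fields = record.split('\n')           # -F'\n': one field per line
        key = fields[4] if len(fields) > 4 else ''   # $5 (empty if missing)
        if key in seen:
            continue                          # next
        seen.add(key)                         # seen[$5]++
        kept.append(record_sep + record)      # separator, then the record
    return ''.join(kept)
```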

Answer 3

If the file is large, you should read it line by line and keep only the data you are interested in. Here is a line-by-line approach:

end_chunk           = 'END IONS'
already_read_chunks = set()
unique_chunks       = []                                    #first-seen order; iterating the set would scramble it

with open(fileinput) as f_in:
    current_chunk   = []
    for line in f_in:                                       #read iteratively, keep only the data you need
        line = line.strip()                                 #remove leading and trailing whitespace
        if line:                                            #skip empty lines
            current_chunk.append(line)
            if line == end_chunk:
                entire_chunk = '\n'.join(current_chunk)     #rebuild chunk as string
                if entire_chunk not in already_read_chunks: #check its existence
                    already_read_chunks.add(entire_chunk)   #add if we haven't read it before
                    unique_chunks.append(entire_chunk)
                current_chunk = []                          #reset current_chunk to restart the process

with open(fileoutput, 'w') as f_out:
    for chunk in unique_chunks:
        f_out.write(chunk)
        f_out.write('\n\n')                                 #blank line between chunks
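
The approach above still keeps the text of every unique chunk in memory until the end. One possible variant, sketched here under the assumption that only exact repeats need to be removed, writes each chunk as soon as it is first seen and remembers only a SHA-256 digest of it. The function name `copy_unique_chunks` is hypothetical.

```python
import hashlib


def copy_unique_chunks(f_in, f_out, end_chunk='END IONS'):
    """Copy chunks from f_in to f_out in input order, skipping exact repeats."""
    seen_digests = set()
    current_chunk = []
    for line in f_in:
        line = line.strip()
        if not line:
            continue                      # skip empty lines
        current_chunk.append(line)
        if line == end_chunk:
            entire_chunk = '\n'.join(current_chunk)
            # Remember a fixed-size digest rather than the chunk text itself.
            digest = hashlib.sha256(entire_chunk.encode('utf-8')).digest()
            if digest not in seen_digests:
                seen_digests.add(digest)
                f_out.write(entire_chunk + '\n\n')
            current_chunk = []
```

Each digest takes 32 bytes regardless of chunk size, so memory use grows with the number of unique chunks rather than with their length, and the output keeps the order of the input file.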

Removing duplicate chunks from a very large file in Python
