How to remove duplicate lines that start with a specific word in Python

Asked: 2015-07-22 19:46:15

Tags: python

I have an input file of the form shown below.

All tests begin with the word "Test" and all errors begin with the word "Error":

Test1
Error1
Error1 
Error2
Test1
Error3

Test2
Error1
Error4 

Test2
Error5
Error1

Test3
Error1

I want it in the format:
Test1
Error1
Error1
Error2
Error3 // Removed test1 

Test2
Error1
Error4
Error5
Error1

Test3
Error1 

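Basically, while going through the file, it should remove the duplicate test names and write everything else to the output file in the same order. Below is my code: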

import os
import sys
import optparse

def delete_duplicate(inputfile,outputfile): 
    output = open(outputfile, "w")
    from collections import OrderedDict
    input = open(inputfile, "r")
    lines = (line.strip() for line in input)
    unique_lines = OrderedDict.fromkeys((line for line in lines if line))
    for unique_line in unique_lines:
        output.write(unique_line)
        output.write("\n") 

My code removes duplicate lines and gives result as below: 
Test1
Error1
Error2
Error3 

Test2
Error4
Error5

Test3 

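It works correctly for the test names but not for the errors: errors that repeat, even under a different test, are removed as well. Can anyone help?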

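2 answers:

Answer 0: (score: 0)

You only need to keep the lines that start with Test in a set, and check whether a line has already been written to the output file: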

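def delete_duplicate(inputfile, outputfile):
    seen = set()  # test names already written to the output
    with open(outputfile, "w") as output, open(inputfile, "r") as infile:
        for raw in infile:
            line = raw.rstrip()  # drop the newline and any trailing spaces
            if line in seen:
                continue  # repeated test name: skip it
            output.write(line + '\n')
            if line.startswith('Test'):
                seen.add(line)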

The advantage of a set is that checking membership and adding an item both take O(1) time on average.
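
To try the idea without creating files, here is a minimal in-memory sketch. It assumes the lines have already been read into a list; the helper name `drop_repeated_tests` and its `prefix` parameter are hypothetical and only used for illustration.

```python
def drop_repeated_tests(lines, prefix="Test"):
    """Return the lines with repeated `prefix` lines removed, order preserved.

    Hypothetical helper for illustration; `prefix` is an assumed parameter.
    """
    seen = set()  # test names already emitted
    result = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(prefix):
            if line in seen:
                continue  # repeated test name: drop it
            seen.add(line)
        result.append(line)  # errors are always kept, even when repeated
    return result


sample = ["Test1", "Error1", "Error1 ", "Error2", "Test1", "Error3"]
print(drop_repeated_tests(sample))
# ['Test1', 'Error1', 'Error1', 'Error2', 'Error3']
```

Only lines that start with the prefix ever go into the set, so repeated errors pass through untouched.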

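Answer 1: (score: 0)

At the moment your code simply inserts each line into a dictionary if it has not seen that line before. It looks like you also want to keep track of the errors for each test separately. You can do this with an OrderedDict, which would look something like this: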

output_dict = {
    'Test1' : ['Error1','Error1','Error2','Error3'],
    'Test2' : ['Error1','Error4','Error5','Error1'],
    'Test3' : ['Error1']
}

The code to handle this looks like the following.

from collections import OrderedDict


def delete_duplicate(inputfile, outputfile):
    output_dict = OrderedDict()  # test name -> list of its errors, in file order
    current_test = None  # the test we are currently collecting errors for

    # Read the input and group the errors under their test
    with open(inputfile, "r") as infile:
        for raw in infile:
            line = raw.strip()
            if line.startswith('Test'):  # a new test is starting
                current_test = line
                if current_test not in output_dict:
                    output_dict[current_test] = []
            elif line.startswith('Error') and current_test is not None:
                # add the error to the current test
                output_dict[current_test].append(line)

    # Write each test once, followed by all of its errors
    with open(outputfile, "w") as outfile:
        for test, errors in output_dict.items():
            outfile.write(test + '\n')
            for error in errors:
                outfile.write(error + '\n')
            outfile.write('\n')  # blank line between tests
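
The grouping step can also be tried on its own, without files. The sketch below assumes Python 3.7 or later, where a plain `dict` preserves insertion order, so `OrderedDict` is not strictly required; the function name `group_errors_by_test` is made up for this illustration.

```python
def group_errors_by_test(lines):
    """Group 'Error' lines under the most recent 'Test' line.

    Hypothetical helper; relies on dicts keeping insertion order (Python 3.7+).
    """
    groups = {}
    current = None  # name of the test currently being read
    for raw in lines:
        line = raw.strip()
        if line.startswith("Test"):
            current = line
            groups.setdefault(current, [])  # create the list only once
        elif line.startswith("Error") and current is not None:
            groups[current].append(line)
    return groups


sample = ["Test1", "Error1", "Error1 ", "Error2", "Test1", "Error3",
          "", "Test2", "Error1", "Error4 ", "", "Test2", "Error5", "Error1",
          "", "Test3", "Error1"]
for test, errors in group_errors_by_test(sample).items():
    print(test, errors)
```

Because a repeated test name reuses the existing list, its later errors are appended to the first occurrence, which matches the output requested in the question.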