Find duplicate records by searching for specific column values in a file

Time: 2016-04-21 02:46:02

Tags: unix

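I am new to Unix. Can you help me find the duplicate records in a file, where a duplicate is any record that shares its Name, EmpId and designation with another record?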

Input file:

"Name","Address","EmpId","designation","office location"
"NameValue","AddressValue","EmpIdValue","designationValue","office locationValue"
"NameValue1","AddressValue1","EmpIdValue1","designationValue1","office locationValue1"
"NameValue","AddressValue1","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressValue2","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressVal4ue","EmpIdValue1","designationValue","office locationValue"

Output file:

"NameValue","AddressValue","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressValue1","EmpIdValue","designationValue","office locationValue"
"NameValue","AddressValue2","EmpIdValue","designationValue","office locationValue"

1 answer:

Answer 0 (score: 0)

A Python script is probably the best fit for this:

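import fileinput

seen = {}  # key -> first line with that key, or "" once it has been printed

for line in fileinput.input():
    tokens = line.split(",")
    if len(tokens) < 4:
        continue  # skip blank or short lines
    # Key on Name, EmpId and designation (columns 0, 2 and 3).
    key = tokens[0] + "###" + tokens[2] + "###" + tokens[3]
    if key in seen:
        # Print the first occurrence if it has not been printed yet.
        if seen[key]:
            print(seen[key], end="")
            seen[key] = ""
        print(line, end="")
    else:
        seen[key] = line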

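For production use you will probably want a more robust way to build a unique key (a value that itself contains "###" or a comma would break this one), but the general idea is the same.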
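
One possible way to make the key more robust, as a rough sketch rather than a definitive solution: parse each line with Python's csv module so that quoted fields containing commas stay intact, and use a tuple of the three column values as the key instead of a joined string. The names find_duplicates and key_columns are made up for this illustration, and the sketch assumes the whole file fits in memory, since it reads all rows before filtering.

```python
import csv
from collections import Counter


def find_duplicates(lines, key_columns=(0, 2, 3)):
    """Return rows whose key columns occur more than once, in input order."""
    # csv.reader understands quoted fields, so a comma inside quotes
    # does not split a value. Rows too short to hold the key are dropped.
    rows = [row for row in csv.reader(lines, skipinitialspace=True)
            if len(row) > max(key_columns)]

    def key_of(row):
        # A tuple key cannot collide the way a "###"-joined string can.
        return tuple(row[i].strip() for i in key_columns)

    # First pass counts each key; second pass keeps rows seen more than once.
    counts = Counter(key_of(row) for row in rows)
    return [row for row in rows if counts[key_of(row)] > 1]


sample = [
    '"Name","Address","EmpId","designation","office location"',
    '"NameValue","AddressValue","EmpIdValue","designationValue","office locationValue"',
    '"NameValue1","AddressValue1","EmpIdValue1","designationValue1","office locationValue1"',
    '"NameValue","AddressValue1","EmpIdValue","designationValue","office locationValue"',
    '"NameValue","AddressValue2","EmpIdValue","designationValue","office locationValue"',
    '"NameValue","AddressVal4ue","EmpIdValue1","designationValue","office locationValue"',
]

for row in find_duplicates(sample):
    print(row[1])
# prints AddressValue, AddressValue1 and AddressValue2, one per line
```

To run it against a real file, pass an open file object instead of the sample list, since csv.reader accepts any iterable of lines. Unlike the streaming script above, this version makes two passes over the rows, which keeps the logic simple at the cost of holding every row in memory.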