Question

A job runs on the server and creates a file like the following:

1000727888004
522101 John Smith
522101 John Smith
522188 Shelly King
522188 Shelly King
1000727888002
522990 John Doe
522990 John Doe
9000006000000

We are fixing the code, but that will take about a month. In the meantime, I am using the command below to remove the duplicate records.

perl -ne 'print unless $dup{$_}++;' old_file.txt > new_file.txt

After running the above command, the duplicate entries are removed, but the counts stay the same, as shown below:

1000727888004
522101 John Smith
522188 Shelly King
1000727888002
522990 John Doe
9000006000000

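The last digit of each line starting with 1 is the record count (so the 4 on the first line should be 2, the 2 on the fourth line should be 1, and the 6 on the last line, which starts with 9, should be 3). It should look like this: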

1000727888002
522101 John Smith
522188 Shelly King
1000727888001
522990 John Doe
9000003000000

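I can't come up with any logic that would fix this, and I need help. Can I run another command, or add something to my perl command, to correct the counts? Yes, I could open the file in Notepad++ and fix the numbers by hand, but I am trying to automate this.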

Thanks!

Answer 1

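In awk. It handles dupes within the "blocks" between count records, i.e. it does not consider duplicates across the whole file. Let me know if that is an incorrect assumption.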

$ awk '
NF==1 {          # for the count record 
    if(c!="")    # this fixes leading empty row
        print c  # print count
    for(i in a)  # all deduped data records
        print i  # print them
    delete a     # empty hash
    c=$0         # store count (well, you could use just the first count record)
    next         # for this record don't process further
}
{
    if($0 in a)  # if current record is already in a
        c--      # decrease count
    else a[$0]   # else hash it
}
END {            # last record handling
    print c      # print the last record
    for(i in a)  # just in case last record would be missing
        print i  # this and above could be removed
}' file

Output:

1000727888002
522101 John Smith
522188 Shelly King
1000727888001
522990 John Doe
9000006000000
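
Note that the output above still ends in 9000006000000, and that for(i in a) does not guarantee the original record order. Here is a rough sketch of one way to handle both, under assumptions that go beyond the question: count records are single-field lines starting with 1, the trailer is a single-field line starting with 9 whose count sits in the millions position (so one record equals 1000000), and duplicates are only dropped within a block. The names fix_counts and flush_block are made up for illustration.

```shell
# Hypothetical helper: dedupe each block, then rewrite header and trailer counts.
fix_counts() {
    awk '
    # Print the buffered block: corrected header first, then unique rows in input order.
    function flush_block(   i) {
        if (header == "") return
        printf "%.0f\n", header - dups     # %.0f keeps large values out of exponent form
        for (i = 1; i <= n; i++) print rows[i]
        total_dups += dups
        header = ""; dups = 0; n = 0
        split("", seen)                    # portable way to empty an array
    }
    NF == 0 { next }                       # ignore blank lines
    NF == 1 && /^1/ { flush_block(); header = $0; next }
    NF == 1 && /^9/ {                      # trailer: assumed to count in millions
        flush_block()
        printf "%.0f\n", $0 - total_dups * 1000000
        next
    }
    {
        if ($0 in seen) dups++             # duplicate inside the current block
        else { seen[$0]; rows[++n] = $0 }  # first occurrence: keep it, in order
    }
    END { flush_block() }                  # in case the trailer is missing
    ' "$@"
}

# Usage (file names taken from the question):
# fix_counts old_file.txt > new_file.txt
```

The block is buffered because the corrected header can only be printed once all of its records have been seen. If the trailer count is laid out differently in the real file, the 1000000 multiplier needs adjusting.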

If dupes are removed across the whole file, and the last record is also a count:

awk '
NF==1 {
    if(NR==1)
        c=$0
    print c
} 
NF>1 {
    if($0 in a)
        c--
    else {
        a[$0]
        print
    }
}' file

Output:

1000727888004
522101 John Smith
522188 Shelly King
1000727888002
522990 John Doe
1000727888001
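
If the goal is whole-file deduplication (as the perl one-liner in the question does) while still getting the counts the question asks for, a two-pass read is one option. This is only a sketch under assumptions not confirmed in the question: each single-field line starting with 1 counts the records that follow it, the single-field line starting with 9 is a trailer whose count sits in the millions position, and the input is a regular file that can be read twice. The name fix_counts_global is hypothetical.

```shell
# Hypothetical helper: whole-file dedupe with per-block and trailer counts corrected.
fix_counts_global() {
    awk '
    # Pass 1 (NR == FNR): note where each record first appears and count the drops.
    NR == FNR {
        if (NF == 1 && /^1/) blk++
        else if (NF > 1) {
            if ($0 in first) { drop[blk]++; total++ }  # repeat: will be dropped
            else first[$0] = FNR                       # remember first occurrence
        }
        next
    }
    # Pass 2: headers minus the drops in their block, trailer minus all drops.
    NF == 1 && /^1/ { b++; printf "%.0f\n", $0 - drop[b]; next }
    NF == 1 && /^9/ { printf "%.0f\n", $0 - total * 1000000; next }
    NF > 1 && first[$0] == FNR                         # print first occurrences only
    ' "$1" "$1"
}

# Usage (file names taken from the question):
# fix_counts_global old_file.txt > new_file.txt
```

Reading the file twice avoids buffering a block in memory, at the cost of needing a real file rather than a pipe. On the sample file from the question this should give the output the question asks for, including 9000003000000 on the last line.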
