Remove duplicate lines that differ only slightly

Time: 2018-08-14 12:56:02

Tags: linux shell unix text centos

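I have a dynamic text file that has lines written to it automatically. The problem is that it ends up with near-duplicate entries.

For example: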

1111 2222 3333 4444 <- I want this line
5555 6666 7777 8888 <- And this line too
1111 2222 3333 4444
5555 6666 7777 9999 <- Note: 9999 is the only word that changed

Expected result:

1111 2222 3333 4444
5555 6666 7777 8888

Real data, before:

exten => 01272786170,1,Set(CALLERID(num)=821)
    same => n,Dial(SIP/port21/01272786170,60,rt)
    same => n,Set(thereis=yes01272786170)
    same => n,Set(calledid=01272786170)
    same => n,GotoIf("calledid" = "01272786170"?ejoin,01272786170,1)
exten => 01272786170,1,Set(CALLERID(num)=826) <- duplicated here with one number change
    same => n,Dial(SIP/port26/01272786170,60,rt) <-
exten => 01272786170,1,Set(CALLERID(num)=827) <-
    same => n,Dial(SIP/port27/01272786170,60,rt) <-

Expected result:

exten => 01272786170,1,Set(CALLERID(num)=821)
    same => n,Dial(SIP/port21/01272786170,60,rt)
    same => n,Set(thereis=yes01272786170)
    same => n,Set(calledid=01272786170)
    same => n,GotoIf("calledid" = "01272786170"?ejoin,01272786170,1)

Note: I would like to do this with the Linux shell.

Thank you very much.

1 answer:

Answer 0 (score: 0)

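Using awk and your first sample data:

If you use a Levenshtein algorithm (for example, here) and settle on a suitable edit-distance threshold (4 below), you can use the following simple approach: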

awk '
function levdist(str1, str2 ...)  # see the above link for working implementation
{
    ...
}
{
    for(i in a) {                 # iterate all previous stored strings
        l=levdist($0,a[i])        # compute the edit distance
        if(l<=4)                  # if below threshold
            next                  # skip to next string 
    }
    print $0                      # output where threshold was not met
    a[NR]=$0                      # store
}' file

Output:

1111 2222 3333 4444
5555 6666 7777 8888
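
The levdist body in the answer is elided and defers to a linked implementation. As a rough, self-contained sketch of the same idea, the script below fills it in with a textbook two-row dynamic-programming Levenshtein distance. This is an assumption, not necessarily the linked implementation, and the names near_dedup, min3, kept and max are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# near_dedup THRESHOLD [FILE...]   (hypothetical helper name)
# Prints a line only if its edit distance to every previously printed
# line is greater than THRESHOLD. Reads stdin when no FILE is given.
near_dedup() {
    threshold=$1
    shift
    awk -v max="$threshold" '
    # Smallest of three numbers.
    function min3(a, b, c) {
        if (b < a) a = b
        return (c < a) ? c : a
    }
    # Levenshtein distance between s and t, keeping two DP rows.
    # Names after the extra spaces are awk-style local variables.
    function levdist(s, t,    m, n, i, j, cost, prev, cur) {
        m = length(s)
        n = length(t)
        for (j = 0; j <= n; j++)
            prev[j] = j                      # distance from empty prefix
        for (i = 1; i <= m; i++) {
            cur[0] = i
            for (j = 1; j <= n; j++) {
                cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
                cur[j] = min3(prev[j] + 1,         # deletion
                              cur[j - 1] + 1,      # insertion
                              prev[j - 1] + cost)  # substitution
            }
            for (j = 0; j <= n; j++)
                prev[j] = cur[j]
        }
        return prev[n]
    }
    {
        for (k in kept)                      # compare with every kept line
            if (levdist($0, kept[k]) <= max)
                next                         # near-duplicate: drop it
        print
        kept[NR] = $0                        # remember what was printed
    }' "$@"
}
```

It could be called as `near_dedup 4 file`, or with input on stdin. The threshold is a tuning knob: set too low, a line that differs in a few characters slips through; set too high, short unrelated lines may be merged. Each line is compared against every line kept so far, so the cost grows quickly on large files.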