Question

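I have a dynamic text file that has lines appended to it automatically. The problem is that it ends up containing near-duplicate entries:

For example: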

1111 2222 3333 4444 <- I want this line
5555 6666 7777 8888 <- And this line too
1111 2222 3333 4444
5555 6666 7777 9999 <- Note: 9999 is the only word that changed

Expected result:

1111 2222 3333 4444
5555 6666 7777 8888

Real data, before:

exten => 01272786170,1,Set(CALLERID(num)=821)
    same => n,Dial(SIP/port21/01272786170,60,rt)
    same => n,Set(thereis=yes01272786170)
    same => n,Set(calledid=01272786170)
    same => n,GotoIf("calledid" = "01272786170"?ejoin,01272786170,1)
exten => 01272786170,1,Set(CALLERID(num)=826) <- duplicated here with one number change
    same => n,Dial(SIP/port26/01272786170,60,rt) <-
exten => 01272786170,1,Set(CALLERID(num)=827) <-
    same => n,Dial(SIP/port27/01272786170,60,rt) <-

Expected result:

exten => 01272786170,1,Set(CALLERID(num)=821)
    same => n,Dial(SIP/port21/01272786170,60,rt)
    same => n,Set(thereis=yes01272786170)
    same => n,Set(calledid=01272786170)
    same => n,GotoIf("calledid" = "01272786170"?ejoin,01272786170,1)

Note: I would like to do this in the Linux shell.

Thank you very much.

Answer 1

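Using awk and your first sample data:

If you use the Levenshtein algorithm (for example here) and settle on a suitable edit-distance threshold (4 below), you can use the following simple approach: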

awk '
function levdist(str1, str2 ...)  # see the above link for working implementation
{
    ...
}
{
    for(i in a) {                 # iterate all previous stored strings
        l=levdist($0,a[i])        # compute the edit distance
        if(l<=4)                  # if below threshold
            next                  # skip to next string 
    }
    print $0                      # output where threshold was not met
    a[NR]=$0                      # store
}' file

Output:

1111 2222 3333 4444
5555 6666 7777 8888
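
The body of levdist is elided in the script above. As a rough, self-contained sketch of the same idea, the following wraps the awk program in a shell function and fills in a plain two-row dynamic-programming Levenshtein distance. The names near_dedup, max, and kept are made up for this illustration and are not from the linked implementation; the default threshold of 4 is simply the value used above.

```shell
# Near-duplicate filter: print a line only if it is more than `max`
# single-character edits away from every line printed so far.
# near_dedup, max and kept are illustrative names (assumptions).
near_dedup() {
    awk -v max="${1:-4}" '
    # Levenshtein distance between s and t, two-row dynamic programming.
    # Names after the extra spaces are local variables (prev, cur are arrays).
    function levdist(s, t,    m, n, i, j, cost, v, prev, cur) {
        m = length(s); n = length(t)
        for (j = 0; j <= n; j++) prev[j] = j
        for (i = 1; i <= m; i++) {
            cur[0] = i
            for (j = 1; j <= n; j++) {
                cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
                v = prev[j - 1] + cost                       # substitution
                if (prev[j] + 1 < v) v = prev[j] + 1         # deletion
                if (cur[j - 1] + 1 < v) v = cur[j - 1] + 1   # insertion
                cur[j] = v
            }
            for (j = 0; j <= n; j++) prev[j] = cur[j]
        }
        return prev[n]
    }
    {
        for (k in kept)                     # compare with every kept line
            if (levdist($0, kept[k]) <= max)
                next                        # close enough: treat as duplicate
        print                               # no near match found: keep it
        kept[NR] = $0
    }'
}
```

It can be used as near_dedup 4 < file. Every input line is compared against every line kept so far, so the running time grows roughly quadratically with the number of distinct lines, and the right threshold depends on how long the lines are and how much they are allowed to differ.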
