awk仅当前面的字段相同时才能删除字段中的重复项

时间:2011-01-24 18:20:47

标签: awk duplicates

我试图从字段中删除重复项(并用空格替换它们),前提是前面的字段是相同的。例如:

示例输入:

France  Paris      Museum of Fine Arts          blabala
France  Paris      Museum of Fine Arts          blajlk
France  Paris      Yet another museum           lqmsjdf
France  Paris      Museum of National History            mlqskjf
France  Bordeaux   Museum of Fine Arts          qsfsqf
France  Bordeaux   City Hall                lmqjflqsk
France  Bordeaux   City Hall                    lqkjfqlskjflqskfj
Spain   Madrid     Museum of Fine Arts          lqksjfh
Spain   Madrid     Museum of Fine Arts          qlmfjlqsjf
Spain   Barcelona  City Hall                nvqjvvnqk
Spain   Barcelona  Museum of Fine Arts          lmkqjflqksfj

期望的输出:

France    Paris        Museum of FineArts                    blabala
                                                             blajlk
                       Yet another museum                    lqmsjdf
                       Museum of National History            mlqskjf
          Bordeaux     Museum of Fine Arts                   qsfsqf
                       City Hall                             lmqjflqsk
                                                             lqkjfqlskjflqskfj
Spain     Madrid       Museum of Fine Arts                   lqksjfh
                                                             qlmfjlqsjf
          Barcelona   City Hall                              nvqjvvnqk
                      Museum of Fine Arts                    lmkqjflqksfj

提前感谢您提供任何帮助。

2 个答案:

答案 0 :(得分:1)

尝试一下:

awk -F '\t' 'BEGIN {OFS=FS} {if ($1 == prev1) $1 = ""; else prev1 = $1; if ($2 == prev2) $2 = ""; else prev2 = $2; if ($3 == prev3) $3 = ""; else prev3 = $3; print}' inputfile

这是一个适用于任意数量字段的较短版本(始终打印最后一个字段):

awk -F '\t' 'BEGIN {OFS=FS} {for (i=1; i<=NF-1;i++) if ($i == prev[i]) $i = ""; else prev[i] = $i; print}' inputfile

输出不会与屏幕使用对齐,但会有正确数量的标签。

输出将如下所示:

field1 TAB field2 TAB field3 TAB field4
TAB TAB TAB field4
TAB TAB field3 TAB field4
TAB field2 TAB field3 TAB field4
etc.

如果您需要对齐列,也可以。

修改

此版本允许您指定要进行重复数据删除的字段:

#!/usr/bin/awk -f
BEGIN {
    FS="\t"; OFS=FS
    deduplist=ARGV[1]
    ARGV[1]=""
    split(deduplist,tmp," ")
    for (i in tmp) dedup[tmp[i]]=1
}
{
    for (i=1; i<=NF;i++)
        if (i in dedup) {
            if ($i == prev[i])
                $i = ""
            else
                prev[i] = $i
        }
    # prevent printing lines that are completely blank because 
    # it's an exact duplicate of the preceding line and all fields 
    # are being deduplicated
    if ($0 !~ /^[[:blank:]]*$/) 
        print
}

运行如下:./script.awk "2 3" inputfile重复删除第2和第3个字段。

答案 1 :(得分:0)

试试这个Perl one-liner:

perl  -F"\t" -nae '@O=@F;if(!$x){$x=1}else{for($i=0;$i<=$#S;$i++){$F[$i]=""if($S[$i] eq "" || $S[$i] eq $F[$i])}};print join "\t",@F;@S=@O;'

See it

我假设这些字段是制表符分隔的。