Join two lines in the same file that have similar content

Time: 2014-11-13 14:58:11

Tags: linux shell unix awk

I'm trying to solve this particular problem in shell, but I haven't gotten anywhere... please help!

I have a file.txt with more than 30K lines in this format:

phoneNumber|ID|CITY|NAME|SURNAME1|SURNAME2|NAME SURNAME1 SURNAME2|

For example, I have this input file:

 558000003|11111113B|LONDON|NAME FAKE3|SURNAME FAKE3|SURNAMEFAKE_3|NAME SURNAME1 SURNAME2|
 558000002|11111112B|LONDON|NAME FAKE2|SURNAME FAKE2|SURNAMEFAKE_2|NAME SURNAME1 SURNAME2|
 558000001|11111111B|LONDON|NAME FAKE1|SURNAME FAKE1|SURNAMEFAKE_1|NAME SURNAME1 SURNAME2|
 558000003|11111113B|BERLIN|NAME FAKE3|SURNAME FAKE3|SURNAMEFAKE_3|NAME SURNAME1 SURNAME2|
 557000002|11111112A|BERLIN|NAME FAKE2|SURNAME FAKE2|SURNAMEFAKE_2|NAME SURNAME1 SURNAME2|
 557000001|11111111A|BERLIN|NAME FAKE1|SURNAME FAKE1|SURNAMEFAKE_1|NAME SURNAME1 SURNAME2|

As you can see, lines 1 and 4 are identical except for column 3. The output I want is:

 558000003|11111113B|LONDON,BERLIN|NAME FAKE3|SURNAME FAKE3|SURNAMEFAKE_3|NAME SURNAME1 SURNAME2|
 558000002|11111112B|LONDON|NAME FAKE2|SURNAME FAKE2|SURNAMEFAKE_2|NAME SURNAME1 SURNAME2|
 558000001|11111111B|LONDON|NAME FAKE1|SURNAME FAKE1|SURNAMEFAKE_1|NAME SURNAME1 SURNAME2|
 557000002|11111112A|BERLIN|NAME FAKE2|SURNAME FAKE2|SURNAMEFAKE_2|NAME SURNAME1 SURNAME2|
 557000001|11111111A|BERLIN|NAME FAKE1|SURNAME FAKE1|SURNAMEFAKE_1|NAME SURNAME1 SURNAME2|

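I don't care about the order of the output lines. I tried to solve this with the awk command in a shell script, but nothing worked...

Is it possible to join lines when they match in one field?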

2 answers:

Answer 0: (score: 2)

Assuming the combination of $1 and $2 forms a unique key:

$ awk '
BEGIN { FS=OFS="|" }
{
    key = $1 SUBSEP $2
    keys[key]
    for (i=1; i<=NF; i++) {
        # record each distinct value of field i once per key, comma-separated
        if ( !seen[key,i,$i]++ ) {
            fld[key,i] = ( ((key,i) in fld) ? fld[key,i] "," : "" ) $i
        }
    }
}
END {
    for (key in keys) {
        for (i=1; i<=NF; i++) {
            printf "%s%s", fld[key,i], (i<NF?OFS:ORS)
        }
    }
}
' file
 558000002|11111112B|LONDON|NAME FAKE2|SURNAME FAKE2|SURNAMEFAKE_2|NAME SURNAME1 SURNAME2|
 558000001|11111111B|LONDON|NAME FAKE1|SURNAME FAKE1|SURNAMEFAKE_1|NAME SURNAME1 SURNAME2|
 558000003|11111113B|LONDON,BERLIN|NAME FAKE3|SURNAME FAKE3|SURNAMEFAKE_3|NAME SURNAME1 SURNAME2|
 557000002|11111112A|BERLIN|NAME FAKE2|SURNAME FAKE2|SURNAMEFAKE_2|NAME SURNAME1 SURNAME2|
 557000001|11111111A|BERLIN|NAME FAKE1|SURNAME FAKE1|SURNAMEFAKE_1|NAME SURNAME1 SURNAME2|
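Since the script above relies on $1 and $2 identifying a record, it can help to first list which phoneNumber|ID pairs occur more than once, i.e. which lines the merge will touch. A minimal sketch, assuming a small hypothetical file named sample.txt in the question's format:

```shell
# sample.txt is a hypothetical file name; its lines are taken from the question.
cat > sample.txt <<'EOF'
558000003|11111113B|LONDON|NAME FAKE3|SURNAME FAKE3|SURNAMEFAKE_3|NAME SURNAME1 SURNAME2|
558000002|11111112B|LONDON|NAME FAKE2|SURNAME FAKE2|SURNAMEFAKE_2|NAME SURNAME1 SURNAME2|
558000003|11111113B|BERLIN|NAME FAKE3|SURNAME FAKE3|SURNAMEFAKE_3|NAME SURNAME1 SURNAME2|
EOF

# Count the lines per phoneNumber|ID pair and list the pairs seen more than once.
dupes=$(awk -F'|' '
    { count[$1 FS $2]++ }
    END { for (k in count) if (count[k] > 1) print k ": " count[k] " lines" }
' sample.txt)
printf '%s\n' "$dupes"
```

Pairs that do not show up here pass through the merge unchanged.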

Answer 1: (score: 1)

awk way

Prints everything in order of first occurrence (could probably be improved/shortened)

 awk -F'|' -vOFS="|" 'b[$2]{split(a[$2],c,"|");gsub(/.*/,c[3]",&",$3)}{a[$2]=$0;if(!b[$2])d[NR]=$2;b[$2]++}END{for(i=1;i<=NR;i++)if(d[i])print a[d[i]]}' file

Broken down:

 awk -F'|' -vOFS="|" '
      b[$2]{split(a[$2],c,"|")
            gsub(/.*/,c[3]",&",$3)
     }
     {a[$2]=$0
     if(!b[$2])d[NR]=$2
     b[$2]++
     }
     END{for(i=1;i<=NR;i++)if(d[i])print a[d[i]]}' file

If the single-character array names are a problem:

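 awk -F'|' -vOFS="|" '
     # Repeated ID: prepend the cities collected so far to $3 of this line
     Count[$2] {
         split(Line[$2], Arr, "|")
         gsub(/.*/, Arr[3] ",&", $3)
     }
     {
         Line[$2] = $0                   # latest (merged) line for this ID
         if (!Count[$2]) Key[NR] = $2    # remember where the ID first appeared
         Count[$2]++
     }
     # Print the merged lines in order of first appearance
     END { for (i=1; i<=NR; i++) if (Key[i]) print Line[Key[i]] }' file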

Output:

558000003|11111113B|LONDON,BERLIN|NAME FAKE3|SURNAME FAKE3|SURNAMEFAKE_3|NAME SURNAME1 SURNAME2|
558000002|11111112B|LONDON|NAME FAKE2|SURNAME FAKE2|SURNAMEFAKE_2|NAME SURNAME1 SURNAME2|
558000001|11111111B|LONDON|NAME FAKE1|SURNAME FAKE1|SURNAMEFAKE_1|NAME SURNAME1 SURNAME2|
557000002|11111112A|BERLIN|NAME FAKE2|SURNAME FAKE2|SURNAMEFAKE_2|NAME SURNAME1 SURNAME2|
557000001|11111111A|BERLIN|NAME FAKE1|SURNAME FAKE1|SURNAMEFAKE_1|NAME SURNAME1 SURNAME2|
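The two ideas in the answers (a $1/$2 key, and output in order of first occurrence) can also be combined. The following is only a sketch, under the assumptions that column 3 is the only column that differs between lines sharing a key and that each city should be listed once; sample.txt and the array names are hypothetical:

```shell
# sample.txt is a hypothetical file name; the last line repeats LONDON on purpose.
cat > sample.txt <<'EOF'
558000003|11111113B|LONDON|NAME FAKE3|SURNAME FAKE3|SURNAMEFAKE_3|NAME SURNAME1 SURNAME2|
558000002|11111112B|LONDON|NAME FAKE2|SURNAME FAKE2|SURNAMEFAKE_2|NAME SURNAME1 SURNAME2|
558000003|11111113B|BERLIN|NAME FAKE3|SURNAME FAKE3|SURNAMEFAKE_3|NAME SURNAME1 SURNAME2|
558000003|11111113B|LONDON|NAME FAKE3|SURNAME FAKE3|SURNAMEFAKE_3|NAME SURNAME1 SURNAME2|
EOF

merged=$(awk '
BEGIN { FS = OFS = "|" }
{
    key = $1 FS $2
    if (!(key in line)) {               # first line for this key
        order[++n] = key                # remember first-seen order
        line[key] = $0
        cities[key] = $3
        seen[key, $3] = 1
    } else if (!seen[key, $3]++) {      # new city for a known key
        cities[key] = cities[key] "," $3
    }
}
END {
    for (i = 1; i <= n; i++) {
        key = order[i]
        nf = split(line[key], f, "|")   # re-split the stored first line
        f[3] = cities[key]              # swap in the collected city list
        out = f[1]
        for (j = 2; j <= nf; j++) out = out OFS f[j]
        print out
    }
}' sample.txt)
printf '%s\n' "$merged"
```

On this sample the first key is printed once with LONDON,BERLIN in column 3, and the repeated LONDON is not added a second time.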