Find duplicate groups and output each group to its own file?

Asked: 2014-05-08 03:06:58

Tags: sorting awk uniq

I have a text file containing the following lines:

/user$ cat ORIGFILE 
se832p41iEC.200289_EDI832I140401232506.txt 
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt 
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt 
se832p41iEC.200289_EDI832I140401232507.txt 
xe832p41iEC.201687_EDI832I140401232513.txt 
xe832p41iEC.201687_EDI832I140401232511.txt

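If a session number is duplicated (for example 200289), each group of duplicates should be written to its own file, and the lines whose session number occurs only once should go to NEWFILE, like this: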

 /user$ cat se832p41iEC.200289
 se832p41iEC.200289_EDI832I140401232506.txt
 se832p41iEC.200289_EDI832I140401232507.txt 
 se832p41iEC.200289_EDI832I140401232508.txt

 /user$ cat xe832p41iEC.201687
 xe832p41iEC.201687_EDI832I140401232511.txt
 xe832p41iEC.201687_EDI832I140401232512.txt
 xe832p41iEC.201687_EDI832I140401232513.txt

 /user$ cat NEWFILE
 pt832p41iEC.213631_EDI832I140401232501.txt
 pt832p41iEC.213632_EDI832I140401232502.txt

Thanks in advance.

Update: figured it out after a hint from @Jaypal (thanks, man):

  First  - sort ORIGFILE | uniq -u > NEWFILE
  Second - sort ORIGFILE | uniq -D > AWKFILE
  Last   - awk -F_ '{print $0 > $1}' AWKFILE
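
One caveat with the three steps above: `uniq` compares whole lines by default, and every full line in the sample is distinct, so a plain `uniq -u` would pass all lines through and `uniq -D` would print nothing. A minimal sketch of the same idea, assuming GNU `uniq` (for `-w` and `-D`) and assuming the key is always the first 18 characters of the line (as in `se832p41iEC.200289`); the sample file is recreated here only to keep the example self-contained:

```shell
#!/bin/sh
# Sample input from the question.
cat > ORIGFILE <<'EOF'
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
EOF

# Assumption: every key (prefix plus session number) is exactly 18 characters.
# -w18 makes GNU uniq compare only those characters; LC_ALL=C keeps the
# sort order independent of the locale.
LC_ALL=C sort ORIGFILE | uniq -w18 -u > NEWFILE   # keys that occur once
LC_ALL=C sort ORIGFILE | uniq -w18 -D > AWKFILE   # every line whose key repeats

# Split the repeated lines into one file per key (the text before the "_").
awk -F_ '{ print $0 > $1 }' AWKFILE
```

If the key length varies between lines, the fixed `-w18` no longer applies, and grouping on the `_`-delimited first field in awk is the more robust choice.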

1 Answer:

Answer 0 (score: 1)

Now that you have added your attempt, here is one way to do this with awk:

$ ls
file

$ cat file
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt

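$ awk -F_ '{
    # collect the lines for each key (the text before the first "_")
    a[$1] = (a[$1] ? a[$1] RS $0 : $0)
    # count how many lines share that key
    b[$1]++
}
END {
    # repeated keys get a file named after the key; singletons go to NEWFILE
    for(x in a) print a[x] > (b[x]>1 ? x : "NEWFILE")
}' file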

$ ls
NEWFILE  file  se832p41iEC.200289  xe832p41iEC.201687

$ head *
==> NEWFILE <==
pt832p41iEC.213631_EDI832I140401232501.txt
pt832p41iEC.213632_EDI832I140401232502.txt

==> file <==
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt

==> se832p41iEC.200289 <==
se832p41iEC.200289_EDI832I140401232506.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt

==> xe832p41iEC.201687 <==
xe832p41iEC.201687_EDI832I140401232512.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
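
Note that the per-session files above keep the input order (506, 508, 507) rather than the sorted order shown in the question, and that the traversal order of `for (x in a)` is unspecified in awk, so the order of lines in NEWFILE may differ between implementations. If sorted output matters, a two-pass variant over sorted input is one possible sketch. The intermediate file name `sorted.tmp` is a hypothetical choice, and `file` is recreated here only to keep the example self-contained:

```shell
#!/bin/sh
# Sample input from the question, under the name used in the answer.
cat > file <<'EOF'
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
EOF

# Sort once so that every output file ends up in sorted order.
LC_ALL=C sort file > sorted.tmp

# The same file is read twice.
# Pass 1 (NR == FNR): count the lines per key, the text before the first "_".
# Pass 2: write each line to a file named after its key when the key repeats,
# otherwise to NEWFILE.
awk -F_ '
    NR == FNR { count[$1]++; next }
    { print > (count[$1] > 1 ? $1 : "NEWFILE") }
' sorted.tmp sorted.tmp

rm -f sorted.tmp
```

Because the second pass prints each line as it is read, this variant only keeps the per-key counts in memory instead of the concatenated lines.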