Remove duplicates, but keep only the last occurrence of each, in a file on Linux

Asked: 2016-06-10 13:07:34

Tags: linux shell awk

INPUT FILE:

5,,OR1,1000,Nawras,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,,user,,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,C
5,,OR1,1000,Nawras,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,20160217T01:45:18+0400,,user,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,H
5,,OR2,2000,Nawras,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,,user,,f660818af5625b3be61fe12489689601,50328589469,,,30002,C
5,,OR2,2000,Nawras,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,20160216T06:30:18+0400,,user,f660818af5625b3be61fe12489689601,50328589469,,,30002,H
5,,OR1,1000,Nawras,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,,user,,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,C
5,,OR1,1000,Nawras,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,20150328T03:00:13+0400,,user,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,H
0,,OR5,5000,Nawras,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,,user,,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,C
0,,OR5,5000,Nawras,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,20160421T02:45:16+0400,,user,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,H
0,,OR1,1000,Nawras,OR,20160330T02:00:14+0400,20181231T23:59:59+0400,,user,,d4ea749306717ec5201d264fc8044201,50285524333,,,11001,C

EXPECTED OUTPUT:

5,,OR1,1000,UY,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,20160217T01:45:18+0400,,user,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,H 
5,,OR2,2000,UY,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,20160216T06:30:18+0400,,user,f660818af5625b3be61fe12489689601,50328589469,,,30002,H    
5,,OR1,1000,UY,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,20150328T03:00:13+0400,,user,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,H    
0,,OR5,5000,UY,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,20160421T02:45:16+0400,,user,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,H
0,,OR1,1000,UY,OR,20160330T02:00:14+0400,20181231T23:59:59+0400,,user,,d4ea749306717ec5201d264fc8044201,50285524333,,,11001,C*

CODE USED:

for i in `cat file | awk -F, '{print $13}' | sort | uniq`
do
grep $i file | tail -1 >> TESTINGGGGGGG_SV
done

This takes a very long time, because the file has 300 million records and about 65 million unique values in column 13.

So what I need is, for each column-13 value, the last line in the file on which it occurs.

2 Answers:

Answer 0 (score: 1)

awk to the rescue!

awk -F, 'p!=$13 && p0 {print p0} {p=$13; p0=$0} END{print p0}' file

This expects the input to be sorted (grouped) by column 13.
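If the file is not already grouped by column 13, one way to produce suitable input (my own sketch, not part of this answer) is a stable sort on that field; the -s flag keeps the original relative order within equal keys, so the last line of each group is still the last occurrence from the original file. This assumes GNU sort and enough temporary disk space, and the output then comes out in key order rather than original file order:

sort -t, -k13,13 -s file | awk -F, 'p!=$13 && p0 {print p0} {p=$13; p0=$0} END{print p0}'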

If you can run the script successfully, please post the timings.

If sorting is not an option, an alternative is

tac file | awk -F, '!a[$13]++' | tac

Reverse the file, take the first entry for each $13 value, and reverse the result back.
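A tiny demonstration of the idea (my own illustration on a toy two-column input, keyed on field 2 instead of field 13): '!a[$2]++' keeps only the first line seen for each key, so sandwiching it between two tac calls keeps the last one while preserving file order.

printf 'a,1\nb,1\nc,2\n' | tac | awk -F, '!a[$2]++' | tac
# prints "b,1" then "c,2" — the last occurrence of each field-2 value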

Answer 1 (score: 0)

Here is a solution that should work:

awk -F, '{rows[$13]=$0} END {for (i in rows) print rows[i]}' file

Explanation:

  • rows is an associative array indexed by field 13 ($13); every time a duplicate value of field 13 is seen, the array element indexed by $13 is overwritten. Its value is the whole line ($0)

However, this is inefficient in terms of memory because of the space needed to hold the array.
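As a rough back-of-envelope estimate (mine, not from the answer): with about 65 million distinct keys and lines of roughly 150 bytes, the rows array would hold on the order of 10 GB of line text alone, before counting awk's per-element hash overhead.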

An improvement on the above solution, which still does not use sorting, is to keep only the line number in the associative array:

awk -F, '{rows[$13]=NR} END {for (i in rows) print rows[i]}' file | while read lN; do sed "${lN}q;d" file; done

Explanation:

  • rows is used as before, but its values are line numbers rather than whole lines
  • awk -F, '{rows[$13]=NR} END {for (i in rows) print rows[i]}' file outputs the list of line numbers of the lines we are looking for
  • sed "${lN}q;d" file fetches line number lN from the file
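A further refinement along the same lines (my own sketch, not part of the original answer): keep the line-number idea, but replace the per-line sed calls with a single second pass over the file, printing a line only when its line number matches the recorded last occurrence of its key. This assumes a POSIX awk and that reading the file twice sequentially is acceptable; the output preserves the original file order:

# Pass 1 (NR==FNR): remember the last line number on which each $13 appears.
# Pass 2: print a line only if it is that last occurrence.
awk -F, 'NR==FNR { last[$13] = FNR; next } FNR == last[$13]' file file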