I have a large file containing links like these:
http://aaaa1.com/weblink/link1.html#XXX
http://aaaa1.com/weblink/link1.html#XYZ
http://aaaa1.com/weblink/link2.html#XXX
http://bbbb.com/web1/index.php?index=1
http://bbbb.com/web1/index1.php?index=1
http://bbbb.com/web1/index1.php?index=2
http://bbbb.com/web1/index1.php?index=3
http://bbbb.com/web1/index1.php?index=4
http://bbbb.com/web1/index1.php?index=5
http://bbbb.com/web1/index2.php?index=666
http://bbbb.com/web1/index3.php?index=666
http://bbbb.com/web1/index4.php?index=5
I want to remove all the duplicate links and keep only:
http://aaaa1.com/weblink/link1.html#XXX
http://aaaa1.com/weblink/link2.html#XXX
http://bbbb.com/web1/index.php?index=1
http://bbbb.com/web1/index1.php?index=1
http://bbbb.com/web1/index2.php?index=666
http://bbbb.com/web1/index3.php?index=666
http://bbbb.com/web1/index4.php?index=5
How can I do this?
Answer 0 (score: 1)
Could you please try the following:
awk -F'[#?]' '!a[$1]++' Input_file
Explanation of the above code:
awk -F'[#?]' ' ##Start the awk script and set the field separator to the literal characters # and ?, per the OP's sample Input_file.
!a[$1]++       ##Index array a by $1 (the part of the line before any # or ?). a[$1]++ evaluates to 0 the first time a given $1 is seen and to a positive number afterwards.
               ##With the leading !, the condition is therefore true only on the first occurrence; a true condition triggers awk's default action of printing the line, so each base URL is printed exactly once and all later duplicates are skipped.
' Input_file   ##Mention the Input_file name here.
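Note that the command above only prints the surviving lines to stdout. If you want to rewrite Input_file in place, a temp-file round trip is the usual shell idiom; tmp_file below is just a hypothetical scratch name:
awk -F'[#?]' '!a[$1]++' Input_file > tmp_file && mv tmp_file Input_file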
Answer 1 (score: 0)
This should clear all duplicate links from the file, but only if the duplicates are exactly identical lines.
sort -u your-link-file.txt
If you want to store the result in another file, use this:
sort -u your-link-file.txt > result.txt
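Keep in mind that sort -u only collapses lines that are byte-for-byte identical, so links that differ only after # or ? all survive. A quick check with two of the sample links (both lines come back, because the fragments differ):
printf '%s\n' 'http://aaaa1.com/weblink/link1.html#XXX' 'http://aaaa1.com/weblink/link1.html#XYZ' | sort -u
To deduplicate on the base URL instead, use the awk approach from the answer above.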