I have a series of large files (200 GB in total). Each file is sorted and contains duplicates, like this:
50.21.180.100|a.ac
50.21.180.100|a.ac
50.21.180.100|a.ac
50.21.180.100|a.ac
50.21.180.100|a.ac
50.21.180.100| b.ac
50.21.180.100| b.ac
50.21.180.100|b.ac
50.21.180.100|b.ac
50.21.180.100|b.ac
50.21.180.100| c.ac
50.21.180.100| c.ac
50.21.180.100|c.ac
50.21.180.100|c.ac
50.21.180.100|c.ac
50.21.180.100|c.ac
50.21.180.100| d.ac
Expected output:
50.21.180.100|a.ac
50.21.180.100|b.ac
50.21.180.100|c.ac
50.21.180.100|d.ac
Does anyone have a suggestion for the best way (in time and memory) to remove these duplicates? Should I use Linux bash, Python, or another language?
Answer 0 (score: 2):
First strip the spaces, then run uniq. Since each file is already sorted, uniq only needs to compare adjacent lines, so this streams the data in constant memory:

tr -d ' ' < infile.txt | uniq > outfile.txt

Note that this deletes every space on each line, which is fine here because spaces only appear as stray padding after the | separator.
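If you prefer Python, the same idea can be sketched as a generator that streams the file line by line, so memory use stays constant regardless of file size. The function name `dedupe_sorted` and the space-stripping normalization are my own choices, not part of the original answer; like uniq, it relies on the input already being sorted:

```python
import sys

def dedupe_sorted(lines):
    """Yield each distinct line from an already-sorted iterable,
    ignoring spaces so '50.21.180.100| b.ac' and
    '50.21.180.100|b.ac' count as the same entry."""
    prev = None
    for line in lines:
        norm = line.replace(" ", "")  # normalize stray padding
        if norm != prev:              # sorted input: dupes are adjacent
            yield norm
            prev = norm

if __name__ == "__main__":
    # Stream stdin to stdout without loading 200 GB into memory:
    #   python dedupe.py < infile.txt > outfile.txt
    sys.stdout.writelines(dedupe_sorted(sys.stdin))
```

Because it only keeps the previous line in memory, this handles the 200 GB inputs as a pure stream, just like the tr | uniq pipeline.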