Question

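I have a 12 GB file of combined hash lists. I need to find the duplicates in it, but I've been having some issues.

Some 920 (uniq'd) lists were merged using cat *.txt > _uniq_combined.txt, resulting in a huge list of hashes. Once merged, the final list contains duplicates.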

I thought I had it figured out with awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt && say finished ya jabroni

awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt results in a file with a size of 4574766572 bytes.

I was told that a file that large is not possible and to try again.

sort _uniq_combined.txt | uniq -c | grep -v '^ *1 ' > _SORTEDC_duplicates.txt results in a file with a size of 1624577643 bytes. Significantly smaller.

sort _uniq_combined.txt | uniq -d > _UNIQ_duplicates.txt results in a file with a size of 1416298458 bytes.

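I'm starting to think I don't know what these commands do, since the file sizes should be the same.

Again, the goal is to look through a giant list and save the instances of hashes seen more than once. Which (if any) of these results are correct? I thought they all did the same thing.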

Answer 1

sort is designed to cope with huge files. You could do:

cat *.txt | sort >all_sorted 
uniq all_sorted >unique_sorted
sdiff -sld all_sorted unique_sorted | uniq >all_duplicates
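
sort copes with large inputs by spilling sorted runs to temporary files instead of holding everything in memory. The following is a minimal sketch of options that often help with a file of this size. The sample data, the 64M buffer size, and the temporary directory are illustrative assumptions, not values taken from the question.

```shell
# Hypothetical sample data standing in for the 12 GB list.
workdir=$(mktemp -d)
printf '%s\n' c3 a1 b2 a1 c3 a1 > "$workdir/combined.txt"

# LC_ALL=C compares raw bytes, which is usually faster for hex hashes.
# -S caps the in-memory buffer; -T says where the spill files go.
LC_ALL=C sort -S 64M -T "$workdir" -o "$workdir/all_sorted" "$workdir/combined.txt"

# uniq -d keeps one copy of each line that occurs more than once.
dups=$(uniq -d "$workdir/all_sorted")
printf '%s\n' "$dups"
```

Pointing -T at a disk with enough free space matters more than the exact buffer size, because the temporary files can grow to roughly the size of the input.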

Answer 2

The sort command should work fine with a 12 GB file. And uniq will output only the duplicated lines if you specify the -d or -D option. That is:

sort all_combined | uniq -d > duplicates

or

sort all_combined > all_sorted
uniq -d all_sorted > duplicates

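The -d option prints one line for each duplicated element. So if "foo" occurs 12 times, it prints "foo" once. -D prints every occurrence of each duplicate.

uniq -D all_sorted > all_duplicates will give you more information.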
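
To make the difference between the two options concrete, here is a small sketch on a hypothetical, already sorted sample. It assumes a uniq that supports -D, such as the one in GNU coreutils; the file and variable names are made up for illustration.

```shell
# Hypothetical sorted sample: "bar" once, "baz" twice, "foo" three times.
sorted=$(mktemp)
printf '%s\n' bar baz baz foo foo foo > "$sorted"

once_each=$(uniq -d "$sorted")   # one line per duplicated value
every_copy=$(uniq -D "$sorted")  # every line that belongs to a duplicate group

printf 'uniq -d:\n%s\n' "$once_each"
printf 'uniq -D:\n%s\n' "$every_copy"
```

With -d the output holds two lines (baz, foo); with -D it holds five (baz twice, foo three times), so the -D output is always at least as large.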

Answer 3

Maybe you could split that big file into smaller files, sort --unique each of them, and then try to merge them with sort --merge:

$ cat > test1
1
1
2
2
3
3
$ cat > test2
2
3
3
4
4
$ sort -m -u test1 test2
1
2
3
4

I think merging sorted files doesn't happen in memory?
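
Note that sort -m -u, as in the transcript above, yields the de-duplicated list rather than the duplicates. A minimal sketch of the same split-and-merge idea adapted to report duplicates might look like this. The chunk size of 3 lines, the chunk_ prefix, and the sample data are assumptions for illustration only.

```shell
workdir=$(mktemp -d)
printf '%s\n' d4 a1 c3 a1 b2 d4 a1 e5 > "$workdir/big.txt"

# Split into chunks of 3 lines each (a real run would use millions of lines).
split -l 3 "$workdir/big.txt" "$workdir/chunk_"

# Sort each chunk in place. No --unique here: dropping repeats now
# would hide duplicates that only occur inside a single chunk.
for f in "$workdir"/chunk_*; do
    LC_ALL=C sort -o "$f" "$f"
done

# Merge the already sorted chunks and keep the lines seen more than once.
merged_dups=$(LC_ALL=C sort -m "$workdir"/chunk_* | uniq -d)
printf '%s\n' "$merged_dups"
```

The merge step only needs to look at the current head line of each chunk, so its memory use stays small regardless of the total size.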

Answer 4

I think your awk script is incorrect, and your uniq -c command includes the occurrence counts of the duplicates, while sort _uniq_combined.txt | uniq -d is the correct one :).
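
The three commands from the question can be compared on a tiny sample to see why their output sizes differ. This is only a sketch: the sample lines are made up, and the last awk variant is an addition for comparison, not something from the question.

```shell
sample=$(mktemp)
printf '%s\n' aaa bbb aaa ccc bbb aaa > "$sample"

# Prints each distinct line the first time it is seen: a de-duplicated
# list, not a list of duplicates.
awk_out=$(awk '!seen[$0]++' "$sample")

# Counts per line, with the lines that occur exactly once filtered out.
counted=$(sort "$sample" | uniq -c | grep -v '^ *1 ')

# One copy of every line that occurs more than once.
dups=$(sort "$sample" | uniq -d)

# Variant (not in the question): print a line on its second occurrence only.
# Like the original awk command, it keeps every distinct line in memory.
awk_dups=$(awk 'seen[$0]++ == 1' "$sample")

printf 'awk:\n%s\nuniq -c:\n%s\nuniq -d:\n%s\n' "$awk_out" "$counted" "$dups"
```

The awk output contains every distinct hash, which explains why it is the largest file. The uniq -c output contains the same lines as uniq -d plus a count prefix on each, which explains why it is somewhat larger than the uniq -d output.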

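Note that you can directly sort *.txt > sorted_hashes or sort *.txt -o sorted_hashes.

If you only have two files at hand, consider using comm (info coreutils to the rescue), which can give you columnar output of "lines only in the first file", "lines only in the second file", and "lines in both files". If you only need some of those columns, you can suppress the others with comm's options. Or use the generated output as a basis and continue working on it with cut, e.g. cut -f 1 my_three_colum_file to get the first column.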
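
As a small sketch of the comm options mentioned above, assuming two hypothetical files that are already sorted (comm requires sorted input):

```shell
workdir=$(mktemp -d)
printf '%s\n' a1 b2 c3 > "$workdir/first_sorted"
printf '%s\n' b2 c3 d4 > "$workdir/second_sorted"

# -1 and -2 suppress the "only in first" and "only in second" columns,
# leaving just the lines common to both files.
in_both=$(comm -12 "$workdir/first_sorted" "$workdir/second_sorted")

# -2 and -3 leave only the lines unique to the first file.
only_first=$(comm -23 "$workdir/first_sorted" "$workdir/second_sorted")

printf 'in both:\n%s\nonly in first:\n%s\n' "$in_both" "$only_first"
```

Suppressing columns this way avoids the leading tabs that the full three-column output uses for indentation.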

How to pull duplicates from a huge list using sort, uniq or awk?
