Question

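I'm trying to find the unique and the duplicated data in a list of data with two columns. I really only want to compare the data in column 1.

The data might look like this (separated by a tab):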

What are you doing?     Che cosa stai facendo?
WHAT ARE YOU DOING?     Che diavolo stai facendo?
what are you doing?     Qual è il tuo problema amico?

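So I've been playing around with the following:

1. Sorting without ignoring case (just "sort", no -f option) gives me fewer duplicates:

gawk '{ FS = "\t" ; print $1 }' EN-IT_Corpus.txt | sort | uniq -i -D > dupes

2. Sorting with ignore case ("sort -f") gives me more duplicates:

gawk '{ FS = "\t" ; print $1 }' EN-IT_Corpus.txt | sort -f | uniq -i -D > dupes

If I want to find duplicates ignoring case, am I right to think that #2 is more accurate, because it ignores case while sorting and then finds the duplicates in the sorted data?

As far as I know, I can't combine the sort and uniq commands, because sort has no option for displaying duplicates.

Thanks, Steve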

Answer 1

I think the key is to preprocess the data:

file="EN-IT_Corpus.txt"
dups="dupes.$$"
sed 's/        .*//' $file | sort -f | uniq -i -D > $dups
fgrep -i -f $dups $file

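The sed command generates just the English phrases (column 1); these are sorted case-insensitively and then run through uniq case-insensitively, printing only the duplicated entries. The data file is then processed again, looking for those duplicated keys with fgrep or grep -F, with the patterns to look for given by the file in -f $dups. Obviously (I hope) the big white space in the sed command is a tab; depending on your shell and your sed, you may be able to write \t instead.

In fact, with GNU grep, you can do: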
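If typing a literal tab is awkward, one way around it is to let printf produce the tab and splice it into the sed expression. This is only a minimal sketch: the sample file and the variable names (tab, keys, keys_cut) are made up for illustration, and cut -f1 is shown as an alternative because cut splits on tabs by default.

```shell
# Hypothetical sample file; the two columns are separated by one tab.
file=$(mktemp)
printf 'What are you doing?\tChe cosa stai facendo?\n' > "$file"
printf 'WHAT ARE YOU DOING?\tChe diavolo stai facendo?\n' >> "$file"
printf 'Ambiguous\tAmbiguere\n' >> "$file"

# Build a real tab character without relying on sed understanding \t.
tab=$(printf '\t')

# Strip everything from the first tab onwards, leaving only column 1.
keys=$(sed "s/$tab.*//" "$file")
printf '%s\n' "$keys"

# cut uses tab as its default delimiter, so this yields the same keys.
keys_cut=$(cut -f1 "$file")

rm -f "$file"
```

Either form can feed the sort -f | uniq -i -D stage in the same way.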

sed 's/        .*//' $file |
sort -f |
uniq -i -D |
fgrep -i -f - $file

If the number of duplicates is really big, you can squish them down with:

sed 's/        .*//' $file |
sort -f |
uniq -i -D |
sort -f -u |
fgrep -i -f - $file

Given the input data:

What a surprise?        Vous etes surpris?
What are you doing?        Che cosa stai facendo?
WHAT ARE YOU DOING?        Che diavolo stai facendo?
Provacation         Provacatore
what are you doing?        Qual è il tuo problema amico?
Ambiguous        Ambiguere

The output from all of these is:

What are you doing?        Che cosa stai facendo?
WHAT ARE YOU DOING?        Che diavolo stai facendo?
what are you doing?        Qual è il tuo problema amico?

Answer 2

Or this:

Unique:

awk -F'\t' '!arr[tolower($1)]++' inputfile > unique.txt

Duplicates:

awk -F'\t' '{arr[tolower($1)]++; next}
END{for (i in arr) {if (arr[i] > 1) {print i, "count:", arr[i]}}}' inputfile > dup.txt
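The duplicates command above reports lower-cased keys and their counts. If you want the original rows instead, the same counting idea can be extended into a two-pass awk. The following is just a sketch under my own assumptions: the sample file is made up from the question's data, the names count and dup_rows are mine, and the file is read twice by listing it twice on the command line.

```shell
# Hypothetical sample file (tab-separated), modelled on the question's data.
file=$(mktemp)
printf '%s\t%s\n' \
  'What a surprise?'    'Vous etes surpris?' \
  'What are you doing?' 'Che cosa stai facendo?' \
  'WHAT ARE YOU DOING?' 'Che diavolo stai facendo?' \
  'what are you doing?' 'Qual è il tuo problema amico?' \
  'Ambiguous'           'Ambiguere' > "$file"

# Pass 1 (NR == FNR): count each lower-cased first column.
# Pass 2: print the full row when its key occurred more than once.
dup_rows=$(awk -F'\t' '
  NR == FNR { count[tolower($1)]++; next }
  count[tolower($1)] > 1
' "$file" "$file")

printf '%s\n' "$dup_rows"
rm -f "$file"
```

Because the comparison is on the whole first column, a key that merely appears inside a longer phrase or in column 2 is not reported, and the rows come out in their original order.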

Answer 3

You can keep it simple:

sort -uf
# where sort -u = keep only unique lines
#       sort -f = ignore case

Use the Linux commands "sort -f | uniq -i" together to ignore case.
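The reason the two belong together is that uniq only compares adjacent lines, so case variants have to be sorted next to each other before uniq -i can group them. Here is a small sketch of the difference. The sample keys are made up from the question's data, and forcing LC_ALL=C is my assumption to get plain byte ordering; in some locales a plain sort may already place case variants side by side, which would hide the effect.

```shell
# Hypothetical first-column keys, modelled on the question's data.
keys=$(printf '%s\n' \
  'What a surprise?' \
  'What are you doing?' \
  'WHAT ARE YOU DOING?' \
  'Ambiguous' \
  'what are you doing?')

# Plain byte-order sort: "What a surprise?" lands between the
# upper-case variant and the other two, so uniq -i misses one.
plain=$(printf '%s\n' "$keys" | LC_ALL=C sort | LC_ALL=C uniq -i -D | wc -l)

# Case-folding sort: all three variants end up adjacent.
folded=$(printf '%s\n' "$keys" | LC_ALL=C sort -f | LC_ALL=C uniq -i -D | wc -l)

echo "sort    | uniq -i -D: $plain lines"
echo "sort -f | uniq -i -D: $folded lines"
```

With GNU sort and uniq this should report 2 lines for the plain sort and 3 for sort -f, which matches what the question observed: the -f version finds more duplicates.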
