Question

I have a tab-separated file that looks like this:

A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234

Expected output:
A 3
B 2
C 1

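Basically, I need to count the unique second-column values that belong to each first-column value, all in a single command using pipes. As you can see, there may be duplicates, such as "A 1234". I had some ideas using awk or cut, but neither seems to work. They just print all the unique pairs, whereas I need the count of unique values in the second column for each value in the first column.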

awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci

Any help is much appreciated. Thanks in advance!

Answer 1

With a pure awk solution, you could try the following.

awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file

Explanation: a detailed breakdown of the command above.

awk '                  ## Start of the awk program.
BEGIN{
  FS=OFS="\t"          ## Use tab as both the input and output field separator.
}
!found[$0]++{          ## If this whole line ($0) has not been seen before, do the following.
  val[$1]++            ## Increment the counter in val indexed by the 1st column.
}
END{                   ## The END block runs after all input has been read.
  for(i in val){       ## Loop over every index of val.
    print i,val[i]     ## Print the index (1st column value) and its count.
  }
}
'  Input_file          ## Name of the input file.
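
One caveat worth noting: awk does not guarantee any particular traversal order for `for(i in val)`, so the groups may not come out as A, B, C. A minimal sketch, assuming sorted output is wanted, pipes the result through `sort`. The temporary sample file and the `sample` and `result` variable names are made up for illustration; the data mirrors the question.

```shell
# Build a small tab-separated sample (hypothetical temp file mirroring the question's data).
sample=$(mktemp)
printf 'A\t1234\nA\t123245\nA\t4546\nA\t1234\nB\t24234\nB\t4545\nC\t1234\nC\t1234\n' > "$sample"

# Same awk program as in the answer; the trailing sort makes the group order deterministic.
result=$(awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' "$sample" | sort)

printf '%s\n' "$result"
rm -f "$sample"
```

The counts themselves are unaffected by the extra `sort`; only the order of the printed lines changes.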

Answer 2

Using GNU awk:

$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file

Output:

A 3
B 2
C 1

Explanation:

 $ gawk -F\\t '{               # using GNU awk and tab as delimiter
    a[$1][$2]                  # hash to 2D array
 }
 END {                         
     for(i in a)               # for all values in first field
         print i,length(a[i])  # output value and the size of related array
 }' file
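
The nested `a[$1][$2]` syntax is a gawk extension (arrays of arrays, gawk 4.0 or later). Where only a POSIX awk is available, a rough equivalent can be sketched with a composite key: `seen[$1,$2]` joins the two fields with `SUBSEP`, which emulates a two-dimensional array. This is a swapped-in technique rather than the answer's own method, and the `seen`, `count`, `sample`, and `result` names are made up for illustration.

```shell
# Build a small tab-separated sample (hypothetical temp file mirroring the question's data).
sample=$(mktemp)
printf 'A\t1234\nA\t123245\nA\t4546\nA\t1234\nB\t24234\nB\t4545\nC\t1234\nC\t1234\n' > "$sample"

# Count each (column 1, column 2) pair only the first time it appears.
# sort is added because for-in traversal order is unspecified in awk.
result=$(awk -F'\t' '!seen[$1,$2]++ { count[$1]++ } END { for (k in count) print k, count[k] }' "$sample" | sort)

printf '%s\n' "$result"
rm -f "$sample"
```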

Answer 3

You could try this:

cat file.tsv | sort | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'

It works for your example. (But I'm not sure whether it works in other cases. Please let me know if it doesn't!)
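
One way to probe the "other cases" concern is to feed the pipeline input whose first column is not already grouped. The sketch below, which uses a made-up shuffled copy of the question's data, suggests that this case is fine, because `sort` orders whole lines and so brings equal first-column values together before `uniq -c` runs. A case it would not handle is a first column containing spaces, since `awk '{print $1}'` splits on any whitespace by default.

```shell
# Hypothetical temp file: the question's rows in shuffled order.
sample=$(mktemp)
printf 'B\t4545\nA\t1234\nC\t1234\nA\t1234\nA\t4546\nB\t24234\nA\t123245\nC\t1234\n' > "$sample"

# Same pipeline as above, reading the file directly instead of through cat.
result=$(sort "$sample" | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}')

printf '%s\n' "$result"
rm -f "$sample"
```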

Answer 4

$ sort -u file | cut -f1 | uniq -c
   3 A
   2 B
   1 C

Answer 5

Another approach, using the handy GNU datamash utility:

$ datamash -g1 countunique 2 < input.txt
A   3
B   2
C   1

This requires the input file to be sorted on the first column, as your sample is. If the real file isn't, add -s to the options.
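
To illustrate the -s option, here is a minimal sketch that runs datamash on a shuffled copy of the question's data. The temp file and variable names are made up for illustration, and the `command -v` guard is only there because datamash is often not installed by default.

```shell
# Hypothetical temp file: the question's rows in shuffled order.
sample=$(mktemp)
printf 'B\t4545\nA\t1234\nC\t1234\nA\t1234\nA\t4546\nB\t24234\nA\t123245\nC\t1234\n' > "$sample"

if command -v datamash >/dev/null 2>&1; then
    # -s sorts the input by the grouping column before grouping with -g1.
    result=$(datamash -s -g1 countunique 2 < "$sample")
    printf '%s\n' "$result"
else
    echo "datamash is not installed" >&2
fi
rm -f "$sample"
```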

Count the number of unique values based on two columns in bash
