How to count the entries in one column for each distinct entry in another column

Time: 2011-07-08 11:09:05

Tags: perl r aggregate

The data in my file looks like this, with multiple columns:

A                B             

Tiger         Animal         
Parrot        Bird
Lion          Animal
Elephant      Animal
Crow          Bird
Horse         Animal
Man           Human
Dog           Animal

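I want to find the number of entries in column A that correspond to each distinct entry in column B, preferably in R, or possibly with a Perl script.

The output would be: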

Animal 5
Bird   2
Human  1

Also, if possible, I would like to find out whether an entry in column A is repeated for the same entry in column B, such as

A              B   
Tiger         Animal         
Tiger         Animal

10 answers:

Answer 0: (score: 5)

tapply from base R will solve this nicely.

with(anm, tapply(A, B, function(x) length(unique(x))))

Answer 1: (score: 3)

Here is a solution done in R. Is this what you are looking for?

> anm <- data.frame(A = c("Tiger", "Parrot", "Lion", "Elephant", "Crow", "Horse", "Man", "Dog", "Tiger"),
+       B = c("Animal", "Bird", "Animal", "Animal", "Bird", "Animal", "Human", "Animal", "Animal"))
> anm
         A      B
1    Tiger Animal
2   Parrot   Bird
3     Lion Animal
4 Elephant Animal
5     Crow   Bird
6    Horse Animal
7      Man  Human
8      Dog Animal
9    Tiger Animal
> (col.anm <- colSums(table(anm)))
Animal   Bird  Human 
     6      2      1 
> table(anm)
          B
A          Animal Bird Human
  Crow          0    1     0
  Dog           1    0     0
  Elephant      1    0     0
  Horse         1    0     0
  Lion          1    0     0
  Man           0    0     1
  Parrot        0    1     0
  Tiger         2    0     0 # you can see how many times entry from A comes up

Edit

To get the desired output format mentioned in the comments, wrap the result in data.frame:

> data.frame(col.anm)
       col.anm
Animal       6
Bird         2
Human        1

Answer 2: (score: 2)

If your data is in R, you can use table() to get what you need. First, some sample data:

dat <- data.frame(A=c("tiger","parrot","lion","tiger"),B=c("animal","bird","animal","animal"))

Then we can get the counts for B with:

table(dat$B)

and the number of co-occurrences with:

table(dat)

To get the table you specified (the repeated A/B pairs), we can use the plyr package:

library("plyr")
tab <- ddply(dat,.(A,B),nrow)
tab[tab$V1>1,]
      A      B V1
3 tiger animal  2

Answer 3: (score: 2)

I am not sure I have understood the full structure of the data in your file, but if you are on UNIX:

tr -s ' ' | sort -u | awk '{ print $2}' | sort | uniq -c


5 Animal
2 Bird
1 Human

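The above works even if I add the line "Tiger Animal" at the end, because of the first sort -u.

tr -s squeezes runs of spaces into one (so the sort command behaves as expected).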
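One caveat with the pipeline above: a header row is counted like any other row, and a line with trailing spaces is not treated as a duplicate of the same line without them. A minimal sketch of a variant that sidesteps both, assuming the header sits on the first line and the file is whitespace-separated (the function name count_distinct_by_group is made up for illustration):

```shell
# Hypothetical helper: count distinct column-A values per column-B value.
# Assumes $1 is a whitespace-separated file whose first line is a header.
count_distinct_by_group() {
  # Re-print only the first two fields so spacing differences disappear,
  # skipping the header (NR > 1) and blank or short lines (NF >= 2).
  awk 'NR > 1 && NF >= 2 { print $1, $2 }' "$1" |
    sort -u |                    # drop repeated A/B pairs
    awk '{ print $2 }' |         # keep only the category column
    sort | uniq -c |             # count rows per category
    awk '{ print $2, $1 }'       # print as "category count"
}
```

Called as count_distinct_by_group FILE on the question's data, with or without an extra "Tiger Animal" line, this should print Animal 5, Bird 2 and Human 1 on separate lines.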

Answer 4: (score: 2)

In case anyone else ends up here, here are a couple of other approaches.

myout   <-  lapply(split(anm,list(anm$B)),function(x)
            list(length(unique(x[,"A"])),x[duplicated(x),"A"])
        )
unlist(sapply(myout,function(x)x[1])) # counts in each category
sapply(myout,function(x)x[-1]) # list of duplicated names

...or

library(data.table)
mydt <- data.table(anm,key="B")
mydt[,.N,by=key(mydt)]
mydt[,.N,by="B,A"][N>1]

where....

anm = read.table(textConnection(
    "Tiger    Animal         
    Parrot    Bird
    Lion      Animal
    Elephant  Animal
    Crow      Bird
    Horse     Animal
    Man       Human
    Dog       Animal
    Tiger     Animal"))
names(anm) <- c("A","B")

EDIT: Edited in response to a comment by Matthew Dowle (the author of data.table).

Answer 5: (score: 1)

You can do the first part easily with awk:

awk '{ myarray[$2]++ } END { for ( key in myarray ) { print key ": " myarray[key] } }' FILE

The second part is a bit trickier... (http://ideone.com/xdKcs)

awk '{ myarray[$2]++ ; myarray2[$2, $1]++ } 
     END { for ( key in myarray ) { print key ": " myarray[key] } 
           print
           print "Duplicates: "
           for (key in myarray2) { 
               split(key,sep,SUBSEP)
               if (myarray2[sep[1], sep[2]]>1)
                   { print sep[1] ": " sep[2] " " myarray2[sep[1], sep[2]]
     }}}' FILE
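Note that awk's for (key in array) visits keys in an unspecified order, and the scripts above also count a header row if the file has one. A small sketch of the duplicate check with stable output, assuming the header is on the first line (the function name list_duplicate_pairs is hypothetical):

```shell
# Hypothetical helper: list A/B pairs that occur more than once.
# Assumes $1 is a whitespace-separated file whose first line is a header.
list_duplicate_pairs() {
  awk 'NR > 1 && NF >= 2 { seen[$2 SUBSEP $1]++ }   # key = category, name
       END {
         for (key in seen)
           if (seen[key] > 1) {
             split(key, part, SUBSEP)               # part[1]=B, part[2]=A
             print part[1], part[2], seen[key]
           }
       }' "$1" | sort                               # sort for a stable order
}
```

For the question's data with one extra "Tiger Animal" line, list_duplicate_pairs FILE should print the single line Animal Tiger 2.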

Answer 6: (score: 1)

Here is a way to do it with the plyr package in R.

mydf = read.table(textConnection(
"Tiger    Animal         
Parrot    Bird
Lion      Animal
Elephant  Animal
Crow      Bird
Horse     Animal
Man       Human
Dog       Animal
Tiger     Animal"))

library(plyr)
ddply(mydf, .(V2), summarize, V3 = length(V1))

    V2    V3
1 Animal  6
2   Bird  2
3  Human  1

ddply(mydf, .(V2, V1), summarize, V3 = length(V1))

    V2         V1  V3
1 Animal      Dog  1
2 Animal Elephant  1
3 Animal    Horse  1
4 Animal     Lion  1
5 Animal    Tiger  2
6   Bird     Crow  1
7   Bird   Parrot  1
8  Human      Man  1

EDIT: Adding the names of the animals in each category.

 ddply(mydf, .(V2), summarize, 
    V3 = length(V1), 
    V4 = do.call("paste", as.list(unique(V1))))

      V2 V3                            V4
1 Animal  6 Tiger Lion Elephant Horse Dog
2   Bird  2                   Parrot Crow
3  Human  1                           Man

Answer 7: (score: 1)

If you are more comfortable with SQL, you can use the sqldf package in R to solve this:

anm <- data.frame(A = c("Tiger", "Parrot", "Lion", "Elephant", "Crow", "Horse", "Man", "Dog", "Tiger"),
      B = c("Animal", "Bird", "Animal", "Animal", "Bird", "Animal", "Human", "Animal", "Animal"))

library(sqldf)      
sqldf("select B,count(distinct A) tot from anm group by B")

sqldf("select B,A,count(*) num from anm group by B,A HAVING num > 1")

Answer 8: (score: 1)

In Perl (strict and warnings implied):

my ( %uniq, %count_for );
# here $fh = some input source
while ( <$fh> ) {
    s/^\s+//; # trim left
    s/\s*$//; # trim right (and chomp)
    # This split allows for spaces between words in a single column
    # allows also for tab-delimited record
    my @cols = split /(?:\t|\s{2,})/;
    # Normalize the text and test for uniqueness:
    #
    # By these manipulations: 
    #     Tiger   Animal
    # matches
    #     Tiger      Animal
    # for any column irregularities
    # Post-increment records the row, so any repeat is skipped.
    next if $uniq{join('-',@cols)}++;

    # count occurrence.
    $count_for{$cols[1]}++;
}

# Print the count for each category.
print "$_ $count_for{$_}\n" for sort keys %count_for;

Answer 9: (score: 0)

#!/usr/bin/env perl

use strict;
use warnings;
use File::Slurp qw(slurp);

exit unless $ARGV[0];

my @data = slurp($ARGV[0]);
my (%h);

for (@data) {
  chomp;
  map { next if /^(A|B)$/; $h{$_}++ } split ' ', $_;
}

map { print $_, ": ", $h{$_}, "\n" } keys %h;

Usage:

$ perl script.pl columns.txt