The data in my file looks like this, with multiple columns:
A B
Tiger Animal
Parrot Bird
Lion Animal
Elephant Animal
Crow Bird
Horse Animal
Man Human
Dog Animal
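For each distinct entry in column B, I want to find the number of corresponding entries in column A, in R if possible, or perhaps with a Perl script.
The output would be: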
Animal 5
Bird 2
Human 1
Also, if possible, can it be determined whether an entry in column A has been repeated for a distinct entry in column B, such as:
A B
Tiger Animal
Tiger Animal
Answer 0: (score: 5)
tapply will solve this nicely.
with(anm, tapply(A, B, function(x) length(unique(x))))
Answer 1: (score: 3)
Here is a solution done in R. Is this what you are looking for?
> anm <- data.frame(A = c("Tiger", "Parrot", "Lion", "Elephant", "Crow", "Horse", "Man", "Dog", "Tiger"),
+ B = c("Animal", "Bird", "Animal", "Animal", "Bird", "Animal", "Human", "Animal", "Animal"))
> anm
A B
1 Tiger Animal
2 Parrot Bird
3 Lion Animal
4 Elephant Animal
5 Crow Bird
6 Horse Animal
7 Man Human
8 Dog Animal
9 Tiger Animal
> (col.anm <- colSums(table(anm)))
Animal Bird Human
6 2 1
> table(anm)
B
A Animal Bird Human
Crow 0 1 0
Dog 1 0 0
Elephant 1 0 0
Horse 1 0 0
Lion 1 0 0
Man 0 0 1
Parrot 0 1 0
Tiger 2 0 0 # you can see how many times entry from A comes up
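Edit
To get the desired output format mentioned in the comments, wrap the result in data.frame. Note that colSums(table(anm)) counts every row, so the duplicated Tiger makes Animal 6 rather than 5; colSums(table(anm) > 0) counts each distinct entry only once.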
> data.frame(col.anm)
col.anm
Animal 6
Bird 2
Human 1
Answer 2: (score: 2)
If your data is in R, you can use table() to get what you need. First, some sample data:
dat <- data.frame(A=c("tiger","parrot","lion","tiger"),B=c("animal","bird","animal","animal"))
Then we can get the counts of B with:
table(dat$B)
and the number of co-occurrences:
table(dat)
To get the table you specified, we can use the plyr package:
library("plyr")
tab <- ddply(dat,.(A,B),nrow)
tab[tab$V1>1,]
A B V1
3 tiger animal 2
Answer 3: (score: 2)
Not sure I have the complete data structure of your file, but if you are on UNIX:
tr -s ' ' < FILE | sort -u | awk '{ print $2}' | sort | uniq -c
5 Animal
2 Bird
1 Human
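The above works even if I add the line "Tiger Animal" at the end, because of the first sort -u.
tr -s squeezes runs of spaces into one (so that the sort command behaves as expected).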
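The pipeline above can be tried on the question's sample data as a small self-contained script. This is only a sketch: the function name count_distinct, the tail -n +2 step that skips an "A B" header line, and the final awk that reorders the output to "B count" are illustrative additions and are not part of the original answer. It assumes space-separated columns with no trailing whitespace.

```shell
#!/bin/sh
# count_distinct (hypothetical helper): reads rows "A B" on stdin and prints
# each B value followed by the number of distinct A values seen with it.
# Steps: drop the header, squeeze spaces, drop repeated rows, keep column B,
# count rows per B value, then reorder "count B" to "B count".
count_distinct() {
    tail -n +2 |
        tr -s ' ' |
        sort -u |
        awk '{ print $2 }' |
        sort |
        uniq -c |
        awk '{ print $2, $1 }'
}

# Sample data from the question, with "Tiger Animal" repeated at the end.
count_distinct <<'EOF'
A B
Tiger Animal
Parrot Bird
Lion Animal
Elephant Animal
Crow Bird
Horse Animal
Man Human
Dog Animal
Tiger Animal
EOF
# prints:
# Animal 5
# Bird 2
# Human 1
```

If the file has no header line, the tail -n +2 step should be removed, otherwise the first data row is lost.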
Answer 4: (score: 2)
In case anyone else lands here, here are a few ways to do it.
myout <- lapply(split(anm,list(anm$B)),function(x)
list(length(unique(x[,"A"])),x[duplicated(x),"A"])
)
unlist(sapply(myout,function(x)x[1])) # counts in each category
sapply(myout,function(x)x[-1]) # list of duplicated names
...or
library(data.table)
mydt <- data.table(anm,key="B")
mydt[,.N,by=key(mydt)]
mydt[,.N,by="B,A"][N>1]
where....
anm = read.table(textConnection(
"Tiger Animal
Parrot Bird
Lion Animal
Elephant Animal
Crow Bird
Horse Animal
Man Human
Dog Animal
Tiger Animal"))
names(anm) <- c("A","B")
Edit: edited in response to a comment from Matthew Dowle (the author of data.table).
Answer 5: (score: 1)
You can use awk:
awk '{ myarray[$2]++ } END { for ( key in myarray ) { print key ": " myarray[key] } }' FILE
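The one-liner above counts every row, so a repeated "Tiger Animal" line is counted twice and an "A B" header line is counted as well. A minimal sketch of a variant that counts each (B, A) pair only once is shown below. The function name count_distinct_awk, the seen and distinct arrays, and the header skip are assumptions made for illustration, not part of the original answer.

```shell
#!/bin/sh
# count_distinct_awk (hypothetical helper): reads rows "A B" on stdin and
# prints each B value with its number of distinct A values, in one awk pass.
count_distinct_awk() {
    awk '
        # Assumption: the first line is the "A B" header, so skip it.
        NR == 1 { next }
        # seen[] is 0 the first time a (B, A) pair appears; count it once.
        !seen[$2, $1]++ { distinct[$2]++ }
        END { for (key in distinct) print key, distinct[key] }
    ' | sort
}

# Example: Tiger appears twice under Animal but is counted once.
printf 'A B\nTiger Animal\nParrot Bird\nLion Animal\nTiger Animal\n' | count_distinct_awk
# prints:
# Animal 2
# Bird 1
```

The trailing sort is there because awk does not guarantee any particular order for "for (key in array)".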
The second one is a bit trickier... (http://ideone.com/xdKcs)
awk '{ myarray[$2]++ ; myarray2[$2, $1]++ }
END { for ( key in myarray ) { print key ": " myarray[key] }
print
print "Duplicates: "
for (key in myarray2) {
split(key,sep,SUBSEP)
if (myarray2[sep[1], sep[2]]>1)
{ print sep[1] ": " sep[2] " " myarray2[sep[1], sep[2]]
}}}' FILE
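For comparison, the duplicate report alone can also be sketched with sort and uniq -c instead of a second awk array. The function name list_duplicates and the header skip are hypothetical additions; the output format loosely follows the "B: A count" lines printed by the script above.

```shell
#!/bin/sh
# list_duplicates (hypothetical helper): reads rows "A B" on stdin and prints
# "B: A count" for every (A, B) pair that occurs more than once.
# Steps: drop the header, emit "B A", group identical rows, count each group,
# and keep only the groups with a count above 1.
list_duplicates() {
    tail -n +2 |
        awk '{ print $2, $1 }' |
        sort |
        uniq -c |
        awk '$1 > 1 { print $2 ": " $3, $1 }'
}

printf 'A B\nTiger Animal\nParrot Bird\nLion Animal\nTiger Animal\n' | list_duplicates
# prints:
# Animal: Tiger 2
```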
Answer 6: (score: 1)
Here is a way to do it using the plyr package in R.
mydf = read.table(textConnection(
"Tiger Animal
Parrot Bird
Lion Animal
Elephant Animal
Crow Bird
Horse Animal
Man Human
Dog Animal
Tiger Animal"))
library(plyr)
ddply(mydf, .(V2), summarize, V3 = length(V1))
V2 V3
1 Animal 6
2 Bird 2
3 Human 1
ddply(mydf, .(V2, V1), summarize, V3 = length(V1))
V2 V1 V3
1 Animal Dog 1
2 Animal Elephant 1
3 Animal Horse 1
4 Animal Lion 1
5 Animal Tiger 2
6 Bird Crow 1
7 Bird Parrot 1
8 Human Man 1
EDIT. Adding the animal names in each category:
ddply(mydf, .(V2), summarize,
V3 = length(V1),
V4 = do.call("paste", as.list(unique(V1))))
V2 V3 V4
1 Animal 6 Tiger Lion Elephant Horse Dog
2 Bird 2 Parrot Crow
3 Human 1 Man
Answer 7: (score: 1)
If you are more comfortable with SQL, you can use the sqldf package in R to solve this:
anm <- data.frame(A = c("Tiger", "Parrot", "Lion", "Elephant", "Crow", "Horse", "Man", "Dog", "Tiger"),
B = c("Animal", "Bird", "Animal", "Animal", "Bird", "Animal", "Human", "Animal", "Animal"))
library(sqldf)
sqldf("select B,count(distinct A) tot from anm group by B")
sqldf("select B,A,count(*) num from anm group by B,A HAVING num > 1")
Answer 8: (score: 1)
In Perl (strict and warnings implied):
my ( %uniq, %count_for );
# here $fh = some input source
while ( <$fh> ) {
s/^\s+//; # trim left
s/\s*$//; # trim right (and chomp)
# This split allows for spaces between words in a single column
# allows also for tab-delimited record
my @cols = split /(?:\t|\s{2,})/;
# Normalize the text and test for uniqueness:
#
# By these manipulations:
# Tiger Animal
# matches
# Tiger Animal
# for any column irregularities
# Post-increment records the row, so later repeats are skipped.
next if $uniq{join('-',@cols)}++;
# count occurrence.
$count_for{$cols[1]}++;
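}
# Print each column-B value with its number of distinct column-A entries.
print "$_ $count_for{$_}\n" for sort keys %count_for;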
Answer 9: (score: 0)
#!/usr/bin/env perl
use strict;
use warnings;
use File::Slurp qw(slurp);
exit unless $ARGV[0];
my @data = slurp($ARGV[0]);
my (%h);
for (@data) {
chomp;
map { next if /^(A|B)$/; $h{$_}++ } split ' ', $_;
}
map { print $_, ": ", $h{$_}, "\n" } keys %h;
Usage:
$ perl script.pl columns.txt