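Suppose I have a data frame with three columns: the first identifies a feature (a color, for example), the second identifies a group, and the third says whether the feature is present in that group (1) or absent from it (0):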
> d<-data.frame(feature=c("red","blue","green","yellow","red","blue","green","yellow"), group=c(rep("a",4),rep("b",4)),is_there=c(0,1,1,0,1,1,1,0))
> d
  feature group is_there
1     red     a        0
2    blue     a        1
3   green     a        1
4  yellow     a        0
5     red     b        1
6    blue     b        1
7   green     b        1
8  yellow     b        0
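Now I would like to summarize how many features are present only in group a, only in group b, and in both groups. In addition, I need to extract the names of the features that are present in both groups. How can I do this? I imagine a function like crossprod might help, but I can't figure it out.
The output would look something like:
feature
red     1
blue    2
green   2
yellow  0
or:
feature a b
red     0 1
blue    1 1
green   1 1
yellow  0 0
Either way, I need a better overview of a very large data file (the original has hundreds of features across roughly 10 groups).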
Answer 0 (score: 2)
It sounds like table is what you want. First we subset the rows so that the is_there column equals 1 and drop the third column. Then we call table on that subset.
> ( tab <- table(d[d$is_there == 1, -3]) )
#         group
# feature  a b
#   blue   1 1
#   green  1 1
#   red    0 1
#   yellow 0 0
A table is a matrix-like object, so we can operate on it just as we would on a matrix.
Looking at group a:
> tab[,"a"] ## vector of group "a"
#   blue  green    red yellow
#      1      1      0      0
> tab[,"a"][ tab[,"a"] > 0 ] ## present in group "a"
#  blue green
#     1     1
> names(tab[,"a"][ tab[,"a"] > 0 ]) ## "feature" present in group "a"
# [1] "blue" "green"
The same works for group b.
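To get from this table to the counts the question asks for, something along these lines might work. It is only a sketch: the names in_a, in_b, counts and both_names are made up for illustration, and the columns are wrapped in factor() explicitly because on R 4.0 and later data.frame no longer converts strings to factors, in which case "yellow" would drop out of the table.

```r
# Rebuild the example data. factor() keeps "yellow" as a level even though
# it never has is_there == 1.
d <- data.frame(
  feature  = factor(c("red", "blue", "green", "yellow",
                      "red", "blue", "green", "yellow")),
  group    = factor(c(rep("a", 4), rep("b", 4))),
  is_there = c(0, 1, 1, 0, 1, 1, 1, 0)
)
tab <- table(d[d$is_there == 1, -3])

# Logical vectors, one entry per feature (illustrative names)
in_a <- tab[, "a"] > 0
in_b <- tab[, "b"] > 0

# Number of features in each category
counts <- c(only_a = sum(in_a & !in_b),
            only_b = sum(!in_a & in_b),
            both   = sum(in_a & in_b))
counts

# Names of the features present in both groups
both_names <- rownames(tab)[in_a & in_b]
both_names
```

For the example data this should report no feature only in a, one feature only in b (red), and two features in both (blue and green).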
Answer 1 (score: 1)
tbl <- table(d$feature[!!d$is_there], d$group[!!d$is_there])
rowSums(tbl)
#  blue  green    red yellow
#     2      2      1      0
tbl
#        a b
#blue    1 1
#green   1 1
#red     0 1
#yellow  0 0
If you want to label each feature as present in none, only one, or both of the groups:
d1 <- as.data.frame(matrix(rep(c("none", "only", "both")[rowSums(tbl)+1],
each=2), ncol=2, byrow=TRUE, dimnames=dimnames(tbl)),
stringsAsFactors=FALSE)
d1[!tbl & rowSums(tbl)==1] <- ""
d1
#          a    b
#blue   both both
#green  both both
#red         only
#yellow none none
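Since the real data has around 10 groups rather than 2, the rowSums idea can be pushed a little further. The following is a rough sketch, not part of the original answer: n_groups and present are illustrative names, and factor() is used explicitly so that features that never occur still show up with a count of 0.

```r
# Example data from the question, with explicit factors
d <- data.frame(
  feature  = factor(c("red", "blue", "green", "yellow",
                      "red", "blue", "green", "yellow")),
  group    = factor(c(rep("a", 4), rep("b", 4))),
  is_there = c(0, 1, 1, 0, 1, 1, 1, 0)
)
present <- d$is_there == 1
tbl <- table(d$feature[present], d$group[present])

# In how many groups does each feature occur?
n_groups <- rowSums(tbl > 0)
n_groups

# How many features occur in 0, 1, ..., ncol(tbl) groups?
table(factor(n_groups, levels = 0:ncol(tbl)))

# Features that occur in every group
names(n_groups)[n_groups == ncol(tbl)]
```

Nothing here depends on there being exactly two groups, so the same lines should carry over to a table with more columns.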
Answer 2 (score: 1)
Try the following code:
with(d, tapply(is_there, list(feature, group), sum))
#       a b
#blue   1 1
#green  1 1
#red    0 1
#yellow 0 0
Answer 3 (score: 0)
Would this do the trick?
> tapply(d$feature[d$is_there==1],d$group[d$is_there==1], table)
$a
  blue  green    red yellow
     1      1      0      0
$b
  blue  green    red yellow
     1      1      1      0
Answer 4 (score: 0)
Take the following data frame:
myd <- data.frame(
feature=c("red","blue","green","yellow","red","blue","green","yellow"),
group=c(rep("a",4),rep("b",4)),
is_there=c(0,1,1,0,1,0,1,0))
To get a factor that tells you everything at once, you can try the following code:
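library(reshape2)
res <- acast(myd, feature ~ group, fun.aggregate=sum, value.var="is_there")
# Per feature this evaluates to 3*a - b: 3 = only in a, 2 = both, 0 = nowhere, -1 = only in b
where <- factor(
  rowSums(res) - 2*(res[,"b"] - res[,"a"]),
  levels=c(-1,0,2,3),
  labels=c("group2","nowhere","both","group1")
)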
This gives:
> res
       a b
blue   1 0
green  1 1
red    0 1
yellow 0 0
> where
   blue   green     red  yellow
 group1    both  group2 nowhere
Levels: group2 nowhere both group1
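From here, extracting the features that are present in both groups is trivial.
Note that any other solution that gives you the matrix res works equally well (the tapply solution will be faster).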
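For completeness, here is roughly what that extraction could look like. This is a sketch under a couple of assumptions: res is rebuilt by hand so the snippet does not need reshape2, and the coding is computed per feature with rowSums, which yields the levels -1, 0, 2 and 3 used above.

```r
# The matrix res from above, typed in by hand (features in rows, groups in columns)
res <- matrix(c(1, 1, 0, 0,    # column a
                0, 1, 1, 0),   # column b
              ncol = 2,
              dimnames = list(c("blue", "green", "red", "yellow"),
                              c("a", "b")))

# Per-feature code 3*a - b, written as row sum minus twice the difference
where <- factor(
  rowSums(res) - 2 * (res[, "b"] - res[, "a"]),
  levels = c(-1, 0, 2, 3),
  labels = c("group2", "nowhere", "both", "group1")
)

# How many features fall into each category
table(where)

# Names of the features present in both groups
names(where)[where == "both"]
```

With this data only green is present in both groups, so the last line should return just that one name.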