Question

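I want to get a table with the top-10 absolute and relative frequencies of a variable within the levels of another factor variable. I have a data frame with 3 columns: the 1st is a factor variable, the 2nd is the variable I need to count, and the 3rd is a logical variable used as a constraint. (The real database has more than 4 million observations.)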

dtf<-data.frame(c("a","a","b","c","b"),c("aaa","bbb","aaa","aaa","bbb"),c(TRUE,FALSE,TRUE,TRUE,TRUE))
colnames(dtf)<-c("factor","var","log")
dtf

  factor var   log
1      a aaa  TRUE
2      a bbb FALSE
3      b aaa  TRUE
4      c aaa  TRUE
5      b bbb  TRUE

So I need to find the top absolute and relative frequencies of "var", where "log" == TRUE, within each level of "factor".

I tried this for the absolute frequencies (in the real database I extract the top ten; here I get 2 rows):

library(plyr)  # provides ldply

t1 <- tapply(dtf$var[dtf$log == TRUE], dtf$factor[dtf$log == TRUE],
             function(x) head(sort(table(x), decreasing = TRUE), n = 2L))
# Returns array of lists: list of factors containing list of top frequencies
t2 <- lapply(t1, ldply)  # the call name is missing in the post; lapply is assumed here
# Split list inside by id and freq
t3 <- do.call(rbind, lapply(t2, data.frame))
# Returns dataframe of top "var" values and corresponding freq for each group in "factor"
# Factor variable's labels are saved as row.names in t3

The following function finds the relative frequencies for the whole database, not grouped by factor:

getrelfreq <- function(x) {
  v <- table(x)
  v_rel <- v / nrow(dtf[dtf$log == TRUE, ])
  head(sort(v_rel, decreasing = TRUE), n = 2L)
}

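But I have a problem with the relative frequencies, because I need to divide each absolute frequency by the number of rows of "var" within EACH factor level, not by the TOTAL number of rows of "var" where "log" == TRUE. I don't know how to do this inside the tapply loop so that the denominator is different for each factor level. I would also like to apply both functions in a single tapply loop rather than generating many tables and merging the results, but I don't know how to put the two functions together.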

Please help :)

Answer 1

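If I understand correctly, you can do something like what I have written below. Use dcast to get the frequency of each var within each factor, then use rowSums() to add them up to get the absolute frequency of each var across all factors. You can use prop.table to compute the relative frequency of each factor within each var. Note that I made a slight change to your example data so you can follow what happens at each stage (I added another log == TRUE row for factor b with var 'bbb'). Try this: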

#Data frame (note 2 values for 'bbb' for factor 'b' when log == TRUE)
dtf<-data.frame(c("a","a","b","c","b","b"),c("aaa","bbb","aaa","aaa","bbb","bbb"),c(TRUE,FALSE,TRUE,TRUE,TRUE,TRUE))
colnames(dtf)<-c("factor","var","log")
dtf
#     factor var   log
#1      a aaa  TRUE
#2      a bbb FALSE
#3      b aaa  TRUE
#4      c aaa  TRUE
#5      b bbb  TRUE
#6      b bbb  TRUE


library(reshape2)

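# Find frequency of each var across each factor using dcast
# (summing the TRUE values of log counts the rows in each cell)
mydat <- dcast( dtf[dtf$log==TRUE , ] , var ~ factor , value.var = "log" , fun.aggregate = sum )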
#  var a b c
#1 aaa 1 1 1
#2 bbb 0 2 0

# Use rowSums to find absolute frequency of each var across all groups
mydat$counts <- rowSums( mydat[,-1] )
# Order by decreasing frequency and just use first 10 rows
head( mydat[ order( mydat$counts , decreasing = TRUE ) , ] , 10 )
#  var a b c counts
#1 aaa 1 1 1      3
#2 bbb 0 2 0      2


# Relative proportions for each var across the factors
data.frame( var = mydat$var , round( prop.table( as.matrix( mydat[,-c(1,ncol(mydat))]) , 1 ) , 2 ) )
#  var    a    b    c
#1 aaa 0.33 0.33 0.33
#2 bbb 0.00 1.00 0.00
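
The proportions above are taken across factors for each var. The question also asks for the share of each var within each factor, with a denominator that changes per factor, computed in one pass together with the absolute counts. A minimal base-R sketch of that idea follows. The helper name top_freq_by_group and the column names abs_freq and rel_freq are made up for illustration and do not come from any package; it has not been tried on data of the size mentioned in the question.

```r
# Example data from the answer above (only log == TRUE rows are counted)
dtf <- data.frame(
  factor = c("a", "a", "b", "c", "b", "b"),
  var    = c("aaa", "bbb", "aaa", "aaa", "bbb", "bbb"),
  log    = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of any package).
# For each level of `group`, returns the n most frequent values of `x`
# with their absolute count and their share of that group's rows.
top_freq_by_group <- function(x, group, n = 10L) {
  pieces <- lapply(split(x, group), function(v) {
    counts <- sort(table(v), decreasing = TRUE)
    top <- head(counts, n)
    data.frame(
      var      = names(top),
      abs_freq = as.integer(top),
      rel_freq = as.numeric(top) / length(v),  # denominator is per group
      stringsAsFactors = FALSE
    )
  })
  # Stack the per-group tables and record which group each row came from
  out <- do.call(rbind, pieces)
  out$factor <- rep(names(pieces), vapply(pieces, nrow, integer(1)))
  rownames(out) <- NULL
  out[, c("factor", "var", "abs_freq", "rel_freq")]
}

keep <- dtf$log
res <- top_freq_by_group(dtf$var[keep], dtf$factor[keep], n = 2L)
print(res)
```

For factor b, which has three rows with log == TRUE, this gives 'bbb' a relative frequency of 2/3 and 'aaa' 1/3, while factors a and c each have a single row and so a relative frequency of 1. Setting n = 10L would give the top ten per factor on the real data.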

R: relative frequency in R by factor
